Wednesday, May 31, 2017

Database proof-of-concept & migration best practices

This post answers some of the most common questions I receive from customers regarding best practices for database proof-of-concept (POC) and migration projects.

Notes

  • This document discusses best practices for database proof-of-concept projects with MySQL-compatible databases. I’ll be making some references to products offered by Amazon Web Services, because it’s the service environment I work with on a daily basis. However, the guide is written in a generic enough way that you can most likely use it for any database POC, regardless of the database engine (MySQL or not) and provider (AWS or not). I’ll often use the terms “new database” or “new product” to refer to the product you’re evaluating.
  • Topics and challenges mentioned in the recommendations below do overlap. A review of a POC plan is not complete unless recommendations from all areas are considered and applied together. No loose ends, no blind spots.
  • Recommendations and ideas found in this document are motivated by the most common issues affecting customer projects. Many of the recommendations require substantial effort on the customer’s side, which is why the POC plan review should be one of the first steps in the project and not a final step. In other words, the document is intended as an instruction for how to approach things correctly rather than a simple checklist for verifying that things have already been done correctly. If used merely as a checklist in the final stages of the project, it will reveal issues and gaps that are too serious to be resolved at that stage.
  • This document alone is not enough to guide you through the difficult task of building a well-architected database platform. Some of the recommendations will likely require assistance from your service provider or a respected MySQL consultant (e.g. Percona).

Use Case Evaluation

Recommendations

Carefully evaluate the new database in the context of your intended use case. Review the range of data products offered by your service provider and ensure that your choice is optimal.

For example, if you’re shopping for an OLTP database, you’re looking for a product that offers best performance in the following scenarios:
  • High concurrency (dozens, hundreds or thousands of simultaneous clients).
  • Large volume of low-latency queries (milliseconds – seconds).
  • Short, real-time transactions.
  • Highly selective query patterns (index lookups, selecting/modifying individual rows or small groups of rows).

You need not and should not obsess over a single product offered by your provider. Different types of workloads might be best served by different products. For example, in AWS, you can use Aurora for OLTP workloads, Redshift can help with analytics/reporting, S3 will do well for object/large BLOB storage and Elasticsearch is an option for a document repository and full text search.

Even if the new product isn’t the best fit for your workload, it might still provide certain benefits (automation, manageability, HA, durability) that can be attractive. Also, you might be able to adjust your workload patterns to leverage the new database’s performance characteristics. For example, reporting workloads with long-running, fat queries may be split into smaller sub-problems and executed in parallel on an OLTP engine with pretty good results. Ultimately though, you should ensure that your use case is more or less aligned with what the product was designed for.
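
To make the “split a fat query into smaller sub-problems” idea concrete, here is a minimal Python sketch; the table, columns, endpoint and credentials are hypothetical, and the PyMySQL driver is assumed:

    # Sketch: split a month-long aggregation into per-day chunks, run the chunks
    # in parallel, then combine the partial results on the client side.
    import datetime
    from concurrent.futures import ThreadPoolExecutor

    import pymysql

    DB = dict(host="aurora-cluster.example.com", user="report",
              password="...", database="sales")

    def daily_total(day):
        # Each chunk aggregates a single day and uses its own connection.
        conn = pymysql.connect(**DB)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT COALESCE(SUM(amount), 0) FROM orders "
                    "WHERE order_date >= %s AND order_date < %s",
                    (day, day + datetime.timedelta(days=1)),
                )
                return cur.fetchone()[0]
        finally:
            conn.close()

    days = [datetime.date(2017, 5, 1) + datetime.timedelta(days=i) for i in range(31)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        print("monthly total:", sum(pool.map(daily_total, days)))

Whether this actually beats one large query depends on the indexing of order_date and on how much spare capacity the cluster has, which is exactly the kind of thing a POC should measure.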

In summary, it is critical that the project needs and goals (both performance and non-performance) are well understood before a migration attempt is made. This is true for any migration and any database product.

Comments and Common Mistakes

The most common mistake is to commit to a particular product without a full understanding of the pros and cons behind all product options available to you. There most certainly exists a product or a mix of products that fits your needs, but if you’re up against a tight deadline, you might be tempted (or pushed by your supervisors) to go for the first available solution and “hope for the best”. If things do not work well because your product choice was fundamentally incorrect, blaming the deadline or the product itself isn’t going to be fair, and certainly won’t make things better.

No single database product is a “silver bullet” that can be used as a one-size-fits-all solution. At best, such an assumption leads to unexpected engineering effort (making things work according to expectations). At worst, it leads to customer frustration, abandoned migrations and missed project deadlines.

Questions

Example questions for project owners:
  1. Do you clearly understand your current performance metrics and the upper limit for what’s considered “acceptable performance”?
  2. What type of workload will you run in the new database? Does it align with the database’s performance characteristics?
  3. If you have workload components that don’t fully align with the new database’s performance characteristics, can you use it as part of a hybrid solution? For example, Aurora (OLTP database) can be paired with Amazon Redshift (OLAP) to handle mixed OLAP+OLTP workloads.
  4. If none of the workload components align with the new database’s performance characteristics (e.g. the new database is OLTP but your workload is pure OLAP), have you evaluated other data products offered by your provider to identify choices that might be more optimal? It might still be a good idea to use the candidate database if the lack of alignment in one area (e.g. performance) is offset by benefits in other areas such as durability and availability. Will you be open to making workload adjustments (table/query design) to try and make better use of the new database’s performance architecture?

Having made the initial assessment of the new database’s viability for the project, have you requested help from your provider’s support teams to validate your findings and ensure you’re not missing any optimization opportunities? When dealing with large POCs and migration projects, it’s almost always dangerous to go Solo (you should take a Wookiee with you).

Compatibility and Feature Evaluation

Recommendations

Compatibility is rarely an issue during migrations between drop-in compatible products. For example, when migrating from MySQL, you can use Amazon Aurora with little or no modification to your existing application code. Even for homogeneous migrations, your solutions might depend on certain features specific to the current database product, but in many cases the new product may be able to achieve the same goal using different means.

When investigating potential compatibility issues, it’s important to keep in mind the underlying goal rather than the feature/setting you’re looking for. The focus should be on achieving the goal rather than on using one particular feature or another. When in doubt, try to come up with a clear description of your goals and seek assistance from your provider’s support representative.

Take your time to build knowledge about the new product and the underlying platform. Review service documentation, talk to support experts and take note of all features that may help you achieve cost savings, performance gains or simplify your database operations. Your data stack likely employs solutions designed specifically for the current database product, and engineered around its limitations. Stay open-minded: the features offered by the new product might help you simplify some of these solutions/workarounds or even make them unnecessary. For example, if you’re currently using a self-managed database, you likely have a lot of custom code that’s responsible for all the operational heavy lifting (configuration changes, backups, restores, node provisioning, high availability etc.). Inability to run this code on the new provider’s platform may look like a compatibility problem, but perhaps you simply do not need to run that code because of automation built into the service?
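
To make the “maybe you don’t even need that code” point concrete: on a managed service such as Amazon RDS, an on-demand backup that used to be a home-grown dump/copy/verify pipeline can shrink to a single API call, and routine daily backups are handled by the service itself. A minimal boto3 sketch for a hypothetical RDS instance (region and identifiers are made up):

    # Sketch: take an on-demand snapshot of a managed RDS instance.
    # The region and instance identifier below are hypothetical.
    import datetime

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H-%M")
    rds.create_db_snapshot(
        DBInstanceIdentifier="orders-prod",
        DBSnapshotIdentifier="orders-prod-manual-" + stamp,
    )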

For best results, identify a senior individual from your engineering team and task him/her with becoming an all-around expert in the new database. Having reviewed all available knowledge materials and having established a relationship with the provider’s support team, he/she would become the go-to person and a technical lead for project owners in your company. For larger projects, this person would also be the point of contact for your provider’s support teams or external consultants.

Comments and Common Mistakes

Project owners often obsess over the features of their existing product rather than the goals and challenges they want to solve. When migrating a database, an engineer might insist on using a particular feature or implementing a given operational procedure, without verifying if it’s still the best solution, or without checking if this feature/procedure is even necessary in the new platform.

It is natural that your data stack was developed with and around the features/limitations of your database product. It is however incorrect to try to apply the same solutions when moving to a different product. Always start with the goal and work backwards i.e. understand your actual needs first, then pick the right solutions out of all available options. This is the only way you can take advantage of technological advancements offered by the new product. If you stick to the same solutions forever, you’re inhibiting your own innovation.

Questions

Example questions for project owners:
  1. Is your current database drop-in compatible with the new database? If yes, your migration path will be easier but you’re still not done.
  2. Having evaluated/tested the features of the new database, have you identified any features that are used in your current platform but appear to be unavailable/unsupported in the new product? If yes, what goals are you trying to achieve through these features? There’s likely more than one way to achieve your goals.
  3. Have you allocated engineering resources to research and understand the features of the new database engine and its underlying platform? Can you use any of these features to improve your design, simplify database management, increase performance/availability or reduce cost?
  4. For each feature that you choose and each solution you deploy (including legacy solutions ported from the existing database), can you easily describe the motivation behind your choice and explain why it was the best one?
  5. Have you requested assistance from your provider’s support representative? Architectural guidance, use-case guidance and a fresh perspective from a 3rd party will be very helpful even if your current design appears to be free of issues.

Data and Workload Design

Recommendations

Implement a regular design review covering the following areas:
  • Design of the data stack (products, solutions and the way they’re connected) as well as access patterns used by applications to interact with the data.
  • Design of tables and data structures: data organization, column types, index choices, data maintenance and lifecycle.
  • Design of your database interaction code, i.e. queries, stored programs and application logic that interacts with the database. Ensure that you fully leverage your server’s performance characteristics and make efficient use of the data structures (tables); see the sketch after this list.
  • Regular jobs and ETLs: design, scheduling and execution of database jobs (where “jobs” are database operations that are not part of the real-time operational traffic, but might impact operational performance).
  • Planning for scalability: dealing with organic data/traffic growth as well as traffic spikes if your use case expects them. See also Sizing of Database Resources below.
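
As a tiny illustration of what the schema/query part of such a review looks like in practice, the sketch below runs EXPLAIN for a handful of queries and flags full table scans. The table, queries and connection details are hypothetical; this doesn’t replace a review with a MySQL expert, but it catches the most obvious problems early:

    # Sketch: run EXPLAIN for the queries you care about and flag full table scans.
    import pymysql

    conn = pymysql.connect(host="db.example.com", user="dba", password="...",
                           database="app", cursorclass=pymysql.cursors.DictCursor)

    queries = [
        "SELECT * FROM orders WHERE customer_id = 42 ORDER BY order_date DESC LIMIT 10",
        "SELECT COUNT(*) FROM orders WHERE status = 'shipped'",
    ]

    with conn.cursor() as cur:
        for q in queries:
            cur.execute("EXPLAIN " + q)
            for row in cur.fetchall():
                # access type "ALL" means a full table scan; "key" is the chosen index
                if row["type"] == "ALL":
                    print("FULL SCAN:", q)
                else:
                    print("uses index", row["key"], "->", q)
    conn.close()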

Such a review should be run regularly as part of routine platform maintenance and must be run during all migrations between different database products. In either case, assistance from a respected database expert (e.g. Percona consultants for MySQL) is highly recommended.

Comments and Common Mistakes

Most database users don’t want this to be true, but… the greatest performance gains are achieved not through configuration changes (parameters), but through correct architecture and schema/query design.

A well-planned database migration project will require you to gather a whole lot of information about the workload, performance expectations, data/traffic growth predictions, and the current design of your data stacks and applications. This is a perfect occasion to conduct research into architectural improvements, table design and query optimization opportunities. If you take this task seriously, you will be rewarded with performance and cost benefits for months and years to come.

These are the most common mistakes made by project owners:
  1. For homogeneous migrations from a MySQL-compatible to a MySQL-compatible database: assuming that the current MySQL schemas and queries are already optimal and migrating them without any review. This assumption is typically not true.
  2. For heterogeneous migrations: assuming that design decisions and assumptions made for one database engine will hold for another database engine. This assumption is literally never true.
  3. Assuming that application design and development can be driven by application developers alone, without assistance from database experts. This assumption is also never true, even though its consequences may not be immediately visible. 

In my experience, almost all databases are affected by major design issues such as sub-optimal data stack architecture, incorrect table or query design. Such issues may not immediately push the performance/cost past acceptable thresholds, but they do always lead to lower performance, reduced scalability and higher operating costs in the long term. The consequences of neglecting your “database hygiene” may not impact you immediately, but when they do, they will be difficult and costly to remove.

Questions

Example questions for project owners:
  1. If you are migrating from a non-MySQL-compatible database to a MySQL-compatible product, have you researched and evaluated the differences between the current and the new database? You can safely assume that your existing schemas (likely also queries) are not optimal and will require attention from an expert. You wouldn’t deploy major code changes without a code review, so why do that with your database?
  2. When in doubt, seek help from a 3rd party consultant specializing in the database engine you’re migrating to. Not all database knowledge is universal: even if you have DBAs experienced in your current database product, they will need help! Would you drop a JavaScript developer into a C++ environment and expect him to make all the correct decisions?
  3. Even if it’s a homogeneous migration (MySQL-compatible to MySQL-compatible), have you conducted a schema and query design review with a MySQL expert? Have you applied schema/query design changes recommended by the review?
  4. Can you explain the decision behind each design choice and each query pattern? Can you easily describe how each design choice takes advantage of the new product’s performance characteristics?
  5. For each part of the application/solution that has not been reviewed, what is the potential impact if this application/solution fails to deliver the expected performance or availability? Is the risk acceptable?
  6. Are database experts routinely involved in your application design and development? How do you ensure that the design decisions made by application developers are correct?

Sizing of Database Resources

Recommendations

Keep the following ideas in mind when planning your database platform:
  • The resources (database instances, clusters) should be sized to provide adequate performance without waste. Avoid under-utilized instances and unused read replicas. Pay for what you need, use what you paid for.
  • Database instances should have enough spare capacity to handle sudden traffic spikes. Mission-critical databases should not run close to 100% CPU and memory utilization even if you don’t expect any spikes.
  • Design your platform so that provisioning and removal of capacity is easy and non-intrusive. Use read replicas for easy scaling of read-only workloads and ensure that applications can tolerate platform topology changes (provisioning and removal of instances) with minimum interruption and without the need for manual reconfiguration; see the sketch after this list.
  • Leverage the service’s built-in failure recovery mechanisms. If the platform’s automatic recovery mechanisms are based on DNS, make sure that your applications do not delay recovery e.g. due to DNS caching or due to misconfigured connection pools that cannot quickly close and recycle broken connections.
  • Think ahead and be ready for long-term growth, in terms of both data size and traffic volume. Know your performance figures and expectations, be aware of your growth patterns, regularly review and update your scaling plans.
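
One way to keep topology changes non-intrusive on the application side is to make the list of reader endpoints configuration-driven instead of hard-coded, and to keep connection timeouts short so that dead instances are abandoned quickly. A minimal sketch with hypothetical endpoints and config path (with Aurora, pointing readers at the cluster reader endpoint can remove even this much code):

    # Sketch: route read-only queries across a configurable list of replica
    # endpoints, so adding or removing replicas is a config change rather than
    # an application change. Endpoint names and the config path are hypothetical.
    import json
    import random

    import pymysql

    def reader_connection(config_path="/etc/app/db_endpoints.json"):
        with open(config_path) as f:
            readers = json.load(f)["readers"]   # e.g. ["replica-1.example.com", ...]
        return pymysql.connect(host=random.choice(readers), user="app",
                               password="...", database="app",
                               connect_timeout=2)

The short connect_timeout, together with a retry on a fresh connection, is what lets the application ride out instance replacements and DNS-based failovers instead of hanging on dead sockets.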

Comments and Common Mistakes

Common mistakes include:
  • Dealing with short traffic surges and long-term growth in a reactive rather than proactive manner, e.g. resizing or launching additional instances when existing resources are already overloaded.
  • Incorrect design leading to unequal resource utilization of database clusters e.g. traffic not evenly distributed across replicas; master instances busy with read-only traffic while the replicas are under-utilized or not used at all.
  • Incorrect application design where applications do not respond well to platform topology changes (e.g. adding or removing instances) and database traffic is not automatically re-balanced across the available resources.
  • Lack of proactive capacity planning, delaying the discussion about scaling plans until performance issues are already happening or are imminent.

Questions

Example questions for project owners:
  1. For each of your database clusters, can you easily answer the following questions?
    • Is there a written record of the cluster’s purpose, topology (node count and placement), node sizing, expected HA/performance metrics and a future scaling plan?
    • Why use this particular topology and node sizes?
    • Why not use fewer nodes or smaller nodes?
    • Why not use more nodes or larger nodes?
    • Are all cluster nodes equally utilized?
  2. How do you handle unexpected traffic spikes? Imagine that your traffic reaches 150% of peak within the next 10 minutes and the spike lasts for 60 minutes. Are you prepared to handle such a spike?
  3. If you wanted to handle twice the amount of traffic 14 days from now, what would be your plan?
  4. If your most important production clusters were to experience a failover right now, how long would it take your applications to recover? Note that any answer to this question is considered a “guess” unless tested (a simple measurement sketch follows this list). Can you rely on guesswork for application uptime?
  5. If you wanted to double the number of replicas in one of your production clusters, does that involve any reconfiguration or maintenance on application side? Why or why not?
  6. Do all your clusters use a multi-location deployment for high availability (e.g. Multi-AZ in Amazon RDS)?
  7. Do all your clusters use a durable storage layer with data replicated in multiple locations?
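
Question 4 above is easy to turn from a guess into a measurement: probe the writer endpoint in a loop while you trigger a failover in a test environment, and record how long the application-visible outage lasts. A minimal sketch with a hypothetical endpoint and credentials, PyMySQL assumed:

    # Sketch: measure application-visible downtime during a controlled failover
    # test. Start the probe, then trigger the failover from the console or CLI.
    import time

    import pymysql

    def probe():
        try:
            conn = pymysql.connect(host="cluster.example.com", user="app",
                                   password="...", database="app",
                                   connect_timeout=2)
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
            conn.close()
            return True
        except pymysql.err.MySQLError:
            return False

    outage_started = None
    while True:
        ok, now = probe(), time.time()
        if not ok and outage_started is None:
            outage_started = now
        if ok and outage_started is not None:
            print("outage lasted %.1f seconds" % (now - outage_started))
            break
        time.sleep(0.5)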

Performance Evaluation

Recommendations

Query performance is often viewed as the most important factor determining the success of a database migration project. It’s also a critical part of the Day 2 experience with your shiny new toy. You will need to allocate time for some (often a lot of) testing and research before you become fully comfortable with the performance and scalability of the product. Before you can make any comparisons, you need to understand what your current performance metrics are and what “performance” even means in your particular context. Sometimes, figuring that out is already a non-trivial task.

High-level recommendations are as follows:
  • Develop a full and thorough understanding of the critical performance metrics that matter for you. When you use the word “fast”, do you mean low latency for individual queries or high total throughput of the entire database cluster? Some workloads are described in terms of the number of operations executed per second/minute/hour/day, while others only care about individual queries being as fast as possible.
  • For all critical performance metrics, find out what their values are in your current deployment. You need solid data points in order to make meaningful comparisons.
  • Before commencing any tests, conduct a thorough design review of your data structures (schemas) and queries under the supervision of a database expert. Identify and remove all major issues that would impact your performance results. If you decide not to conduct such review, you must assume that your designs and workloads are unoptimized and will yield sub-optimal results.
  • Always evaluate performance using real-world tests and not synthetic benchmarks (e.g. sysbench, TPC-C).
  • The test clients should be located in close proximity to the database cluster. For best results, use clients in the same network location as the cluster (e.g. same VPC and Availability Zone in case of AWS databases), but also don’t do anything you wouldn’t do in production (e.g. you may not want to run the production application in just one Availability Zone). OLTP database workloads are extremely sensitive to network latency: do not run benchmarks from remote clients (e.g. on-prem server, office desktop/laptop) unless that is exactly how you will be accessing the database in production.
  • The test query mix must be representative of production workload. It is not only important to use the right queries (derived from production workload) but also to retain the right proportions between queries. For example: if query X appears 100 times more often than query Y in production, this proportion should be retained during benchmarks. The size and structure (schema) of the test data set should be identical to that of the production data set. For example: if a production database holds 1TB of data spread across 500 tables, a benchmark on 10 tables with 10MM rows will yield incorrect results.
  • The test query mix must be large and diverse enough to accurately simulate the scale and randomness of production workload. For example, if the production database serves tens of millions of unique queries per day, it might not make much sense to pick 1% of the queries and execute them in a single-threaded loop, as it would greatly reduce the randomness of test workload.
  • In order to fully reveal the database’s performance potential, benchmarks should be executed in a multi-threaded fashion. That said, thread concurrency should not be artificially higher than what you expect in production. For example, if your production application executes queries on 100 parallel threads, your tests should also use 100 threads so that the benchmark results are realistic (see the harness sketch after this list).
  • The dependency between workload volume and database performance is usually non-linear. If you need to understand how the database will behave in different scenarios (e.g. under baseline workload, during a spike, with 50% more traffic), you should run multiple tests instead of extrapolating the numbers from just a single benchmark. This is especially important during scalability tests.
  • Test runs must be long enough to measure database performance in warm, steady state. You might need to run the workload for many minutes or even a few hours before steady state is reached.
  • Repeat your tests multiple times and confirm that results are stable and reproducible. You can’t begin to trust your performance metrics if they’re not in statistical control.
  • If possible, use the service-provided default database parameters, which should already be pre-tuned. Only introduce parameter changes if they have meaningful and measurable positive impact on performance without negative effects on stability. In particular, do not migrate performance-related configuration parameters from your current database setup without verifying that they bring measurable benefits in the new setup.
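
To tie the query-mix, concurrency, warm-up and repeatability points together, here is a stripped-down harness sketch. The queries, weights, concurrency and connection details are hypothetical; in a real POC the mix and weights would be derived from a production query log and the data set would be a full-size copy of production:

    # Sketch: weighted query mix, fixed client concurrency, warm-up period, and
    # per-run latency/throughput statistics, repeated several times.
    import random
    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor

    import pymysql

    DB = dict(host="cluster.example.com", user="bench", password="...", database="app")

    # Weights preserve production proportions (query X appearing 100x more often
    # than query Y keeps that 100:1 ratio here).
    QUERIES, WEIGHTS = zip(*[
        ("SELECT * FROM orders WHERE id = %s", 100),
        ("SELECT COUNT(*) FROM orders WHERE customer_id = %s", 1),
    ])

    CONCURRENCY = 100        # match production concurrency, don't inflate it
    RUN_SECONDS = 1800       # long enough to reach warm, steady state
    WARMUP_SECONDS = 300     # discard measurements taken while caches warm up

    def worker(deadline, warmup_until):
        conn = pymysql.connect(**DB)
        latencies = []
        with conn.cursor() as cur:
            while time.time() < deadline:
                sql = random.choices(QUERIES, weights=WEIGHTS)[0]
                start = time.perf_counter()
                cur.execute(sql, (random.randint(1, 10000000),))
                cur.fetchall()
                if time.time() >= warmup_until:
                    latencies.append(time.perf_counter() - start)
        conn.close()
        return latencies

    def run_once():
        now = time.time()
        with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
            results = pool.map(lambda _: worker(now + RUN_SECONDS, now + WARMUP_SECONDS),
                               range(CONCURRENCY))
        lat = [x for thread_lat in results for x in thread_lat]
        return {"throughput_qps": len(lat) / (RUN_SECONDS - WARMUP_SECONDS),
                "p50_ms": statistics.median(lat) * 1000,
                "stdev_ms": statistics.stdev(lat) * 1000}

    # Repeat the whole run; the numbers should be stable before you trust them.
    for i in range(3):
        print("run", i + 1, run_once())

Purpose-built tools do this better, but whatever you use should expose the same knobs: query mix and proportions, concurrency, warm-up and run length.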

When comparing one database product to another, make it a fair comparison. Configure all test resources exactly the way you would configure them for production workload and remember that performance differences between products may often be explained by the durability and HA characteristics of these products. Review all the differences before drawing final conclusions and find an acceptable balance between “compromises made” and “performance achieved”. 

Comments and Common Mistakes

Tests based solely on synthetic benchmarks are a quick and easy way to assess the performance of database products. However, they cannot be used as the only tool during real-world performance evaluations. Results gathered from synthetic benchmarks are simply not representative of the real-world performance.

It might be tempting to test the database using a handful of queries cherry-picked from the production workload. For example, one could decide to run a test using one or two slowest queries (out of many thousands or millions), assuming that if these queries work well, then everything else will too. This is a false assumption.

The performance of database workloads depends greatly on the client-server communication latency. This is especially true for OLTP databases that execute large numbers of short transactions. When test clients are located far from the server, benchmark results may be severely impacted by network communication overhead.

Even if the tested query mix is representative of the production workload, the tests are often executed in a single-threaded fashion e.g. by sequentially replaying queries from a log. In such situation, the server will only have a single work item to process at a time. Consequently, server resources will remain under-utilized and benchmark results are likely to be much lower than expected. 

Last but not least, performance evaluation projects are notorious for doing “apples to oranges” comparisons. For example:
  • You may be evaluating a product that comes with high availability, lag-free replication and built-in data mirroring against another product that has none of these features.
  • One or both products may be configured differently than you would configure them in production (e.g. backups disabled, no data mirroring).
  • You may be comparing some but not all of the critical performance metrics. For example, you may be able to squeeze more throughput out of your master instance at the expense of crash recovery performance, but is that what you want in production?

It is okay if your project requires making compromises in certain areas in order to achieve top performance, but you need to be aware of these compromises.

Example excerpt from a customer exchange (edited for brevity):
  • Customer: “This new MySQL deployment is slower than the reference platform I’ve set up.”
  • Me, after checking both setups: “Your reference setup is a single cloud instance with no data mirroring, no backups and no replicas. Is this the kind of configuration you’re running in production?”
  • Customer: “No, but this gives me best performance during the benchmark.”

Another example:
  • Customer: “The new database with networked/mirrored storage volumes is 2x slower than my current MySQL in Amazon EC2.”
  • Me, after checking both setups: “Your current MySQL is running on EC2 i2.4xlarge instances with local ephemeral storage and the data isn’t mirrored or otherwise persistent in durable storage. Do you accept the possibility of partial or full data loss?”
  • Customer: “We don’t accept data loss, we use this setup for maximum performance.”
  • Me: “The performance in your case comes at the expense of durability. You’re using EC2 ephemeral storage, which has very low IO latency but it’s local to the EC2 instance and not mirrored anywhere. When the instance fails, the data stored on ephemeral volumes is lost.”
  • Customer: “No, I already told you: we do not accept data loss.”

Questions

Example questions for project owners:
  1. What types of workloads do you intend to run on the database (e.g. 24/7 OLTP traffic, occasional reporting queries)?
  2. For each workload type:
    • What performance metrics are most important to you? Throughput? Latency? Latency variance? Other?
    • What are the values of these metrics in your current database?
    • For each metric, what is the goal and tolerance?
  3. How is your current database platform designed? Are durability, availability and manageability intentionally engineered into the system? If not, your performance results are most likely paid for in durability/HA.
  4. Do you fully understand the characteristics and components of your workload? Do you know them well enough to design and execute a real-world benchmark that closely resembles production operations?
  5. Have you investigated your table design, query design and data access patterns with a MySQL expert? Have you removed any issues that were identified? If not, you might have unrealized optimization opportunities leading to lower application performance, higher cost or both.

Migration

Recommendations

No single migration technique will satisfy all use cases. When planning a migration, take your time to carefully review all available migration options and choose one that provides best results with minimum effort. Do not commit to the first available, best-known or most widely advertised migration method because again, no single method can serve all use cases equally well.

Once you select the migration method, prepare a step-by-step plan in the form of a checklist. For each step, describe (see the example after this list):
  • The purpose of the step.
  • Time to complete the step.
  • Entry criteria for the purposes of coordination and safety (e.g. do not execute the step if job XYZ is running on the database).
  • Actions, listed in a clear and prescriptive manner (e.g. ready-to-run commands).
  • Success criteria, success validation steps.
  • Abort criteria.
  • Rollback procedure: prescriptive list of compensating actions necessary to revert the step.
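
If you want the checklist to be machine-readable (for example, to generate per-team views or to drive tooling), each step can be captured as a small structured record. A hypothetical example of a single step, with made-up hosts and criteria:

    # Sketch: one migration step captured as a structured record. Every field
    # listed above has to be filled in; all values here are hypothetical.
    step = {
        "id": 12,
        "purpose": "Make the old primary read-only before the cutover",
        "estimated_minutes": 5,
        "entry_criteria": ["nightly ETL job is not running",
                           "replication lag to the new database is below 1 second"],
        "actions": ["mysql -h old-primary.example.com -e 'SET GLOBAL read_only = ON'"],
        "success_criteria": ["SELECT @@global.read_only returns 1"],
        "abort_criteria": ["replication lag exceeds 60 seconds"],
        "rollback": ["mysql -h old-primary.example.com -e 'SET GLOBAL read_only = OFF'"],
        "owner": "dba-team",
    }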

If the migration requires cooperation between multiple teams, prepare per-team checklists so that each team isn’t overwhelmed by, or lost among, the other teams’ steps. Always ensure that you have a master list for coordination, and that its execution is supervised by a person with enough context on the entire migration.

Run a test migration and ensure that there are no steps missing from the checklist(s). Repeat the tests until you’re fully comfortable with all the steps, including rollback procedures.

Comments and Common Mistakes

There are two common mistakes related to choosing the migration method:
  1. Selecting the best-known or most widely advertised migration solution based on an assumption that it will provide best results in all cases.
  2. Choosing a legacy migration method that is well known but provides sub-optimal results, compared to other available migration methods. Examples:
    • Performing a migration using the legacy mysqldump program instead of using a more modern tool such as Percona Xtrabackup or mydumper/myloader.
    • Performing self-managed dump-and-restore migrations while the database can be more easily migrated using managed tools offered by your service provider.

Migrations are often poorly planned and executed:
  • Many migrations are executed without a plan and without end-to-end testing of the migration procedure.
  • Migration decisions are often made without verifying that the product choices are correct in the first place. See the earlier sections on Compatibility, Feature and Performance Evaluation for details.
  • Migration plans, if they exist, often assume the success of every migration step and do not describe rollback procedures.

Questions

Example questions for project owners:
  1. Have you considered more than one migration method for your project?
  2. If you are not sure what migration methods are available, have you requested help from your service provider’s support representative?
  3. If you have already chosen the migration method, can you easily explain the specific benefits it brings into your project, compared to other available methods? In other words, can you defend your choice?
  4. Do you have a step-by-step plan for the migration?
  5. Are you familiar with the migration plan? Exercise: pick a random step from the migration plan, write it down on a post-it note, take the note to the engineering team executing the migration. Without looking at the overall plan, can they list all the steps that precede and follow the selected step?
  6. Can you safely roll back from any step in your migration plan? Exercise: pick a random step from the migration plan, assume that it fails, describe partial rollback procedure to the previous step, describe full rollback procedure to step zero.
  7. Are there any steps in your migration plan that you don’t like (due to difficulty, risk or any other reason)? Have you discussed them with your provider’s support representative?
