Monday, March 21, 2016

Demonstrating crash recovery improvements in Amazon Aurora

Following earlier posts on replication and query cache improvements, this article focuses on a feature that can make Aurora more suitable for mission-critical enterprise applications: near-instantaneous crash recovery.

While real-life performance is tricky to measure and may be evaluated differently in the context of different applications, downtime is a fairly straightforward thing to define and you most certainly want to avoid it regardless of what kind of service you're running. Let's take Aurora for a spin and see how it can make your life easier in this area.


This article doesn't pretend to be even remotely close to a proper technical description of the MySQL startup and crash recovery process. Let's just recap the basics in order to give you a better understanding of how the tests I'll run relate to real-life situations.

The following three operations are major contributors to crash recovery time in InnoDB:
  1. Tablespace discovery/validation, which reads header pages of all InnoDB tablespaces (*.ibd files) in order to build a mapping between tablespace IDs and their names on disk (< 5.7.5) and validate the tablespaces. If you think MySQL 5.7 doesn't do that anymore, I encourage you to read this.
  2. Redo phase to persist changes that were logged in the redo log but were not yet written out to data files.
  3. Undo phase to roll back uncommitted changes.
What's most important is that the database engine remains unavailable for user connections until the second step (redo) is completed.

The unpleasant thing about crash recovery is that it's extremely difficult (often impossible) to predict how much time it may take. Depending on your hardware and server activity, it may be several seconds or several hours. Moreover, the critical crash recovery procedures are single-threaded, which results in server resources not being fully utilized during engine startup.

Tablespace Discovery and Validation

For a bit more background on the topic, please have a look at "MySQL 5.6 and 5.7 crash recovery performance with large number of tables".

Remember that this step is only likely to delay your recovery if your database contains a large number of tables (tens of thousands or more) and the tables were created on a server running with the "innodb_file_per_table" variable set to 1 (the default).
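If you're wondering whether this applies to you, a quick way to check is to count the *.ibd files under the data directory, since that's what the engine has to scan. A small sketch (the path is an assumption; adjust it for your installation):

```python
from pathlib import Path

def count_tablespace_files(datadir):
    # Each file-per-table tablespace is one *.ibd file whose header page
    # must be read during tablespace discovery at startup.
    return sum(1 for _ in Path(datadir).rglob("*.ibd"))

# Assumption: the default data directory on most Linux packages.
datadir = Path("/var/lib/mysql")
if datadir.exists():
    print(f"{count_tablespace_files(datadir)} tablespace files to discover")
```

If the number is in the tens of thousands or more, the discovery/validation step is worth worrying about.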

Redo Logs and Performance in MySQL

In a nutshell, whenever the database makes a change to a piece of data, it immediately writes a change record to a logfile on disk, and it does so ahead of applying (flushing) the change to the actual data files. So the log data is written out as soon as changes are made, but the modified/created database pages are kept in memory (in the buffer pool) for performance reasons: they may still be modified many times by future transactions, so flushing them to disk every time they're changed would incur huge overhead.
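The write-ahead idea can be sketched as a toy model (this illustrates the general mechanism only, not InnoDB's actual implementation):

```python
# Toy write-ahead logging model: change records go to the log immediately,
# while dirty pages stay in memory until a checkpoint flushes them.

class ToyEngine:
    def __init__(self):
        self.redo_log = []      # stands in for the on-disk log file
        self.buffer_pool = {}   # page_id -> contents, kept in memory
        self.data_file = {}     # page_id -> contents, flushed lazily

    def modify(self, page_id, value):
        # 1. Log the change first -- this is the durable part.
        self.redo_log.append((page_id, value))
        # 2. Update the in-memory page; the data file is NOT touched yet.
        self.buffer_pool[page_id] = value

    def checkpoint(self):
        # Background flushing: write dirty pages out so the corresponding
        # log records can be safely overwritten later.
        self.data_file.update(self.buffer_pool)

    def crash_recover(self):
        # Redo phase: re-apply logged changes missing from the data file.
        for page_id, value in self.redo_log:
            self.data_file[page_id] = value

engine = ToyEngine()
engine.modify("page1", "a")
engine.modify("page1", "b")    # same page modified twice, flushed zero times
assert engine.data_file == {}  # nothing has reached the data file yet
engine.crash_recover()
assert engine.data_file["page1"] == "b"
```

Note how the page was modified twice but written to the "data file" zero times before the crash; the log alone was enough to recover it.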

When the database crashes, it may be that not all changes were yet flushed to the data files. In that case, the database will simply re-do these changes based on what it finds in the redo logs.

The redo logfiles are of fixed size and you can think about them as a circular buffer:
  • Changes are written out to the redo logs to allow for crash recovery, as described above.
  • Database must not overwrite a record in the redo log unless the associated database pages are flushed to disk. Overwriting a record associated with unflushed pages would make it impossible to recover changes made to these pages in case of a crash.
  • To "make room" in the redo log files, a database background thread is continuously flushing changes to the datafiles so that the associated locations in the redo log can be safely overwritten. The flushing can happen continuously ("fuzzy checkpointing") or the engine may decide to flush more aggressively if it starts running out of usable log space.
Per the above (very simplified) description, we're looking at the following issues:
  1. When changes are introduced, a gap will exist between the state recorded in the logs and the contents of the data files. The maximum size of this "gap" is determined by how large the redo log files are. In practice, the gap can grow to roughly 80% of the log size, at which threshold MySQL starts flushing data aggressively.
  2. If the redo logfiles are small, the database can't accept many changes ahead of flushing pages to disk. Flushing therefore becomes the ultimate bottleneck for write-bound workloads, but at least the gap stays narrow, which translates to shorter crash recovery times.
  3. If the redo logfiles are large, the database can accept writes much faster, at the expense of widening the gap and lengthening crash recovery.
In MySQL, the size of each redo log file is controlled by the innodb_log_file_size variable, and by default the engine uses two log files of the configured size (innodb_log_files_in_group). If you set the variable to 256MB, MySQL has 512MB of redo log space to work with.
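For example, a my.cnf fragment for the configuration just described (MySQL 5.7 syntax) might look like this:

```ini
[mysqld]
# Two log files of 256MB each = 512MB of total redo log space.
innodb_log_file_size      = 256M
innodb_log_files_in_group = 2
```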

How Aurora Does It

Aurora doesn't use traditional file-based block storage; instead, it employs a custom log-structured storage layer. In essence, the database engine simply sends change records to the storage layer and doesn't need to care about any of the machinery described above. It doesn't need to flush pages from the buffer pool or worry about the "gap"; it just keeps sending logs, and that's it.

Because data durability tasks are abstracted away at the storage layer (and handled by unicorns and stardust), Aurora also doesn't need to spend time on tablespace discovery when it starts up.

If you were persistent enough to go through all the previous paragraphs, the thing you probably want to know at this point is "does it help me or not?".

Let me show you!

As usual, the formal part goes first. Below are the specs of resources used during all tests:

Workload

  • Sysbench 0.5 using "oltp.lua" test
  • Write-only
  • Table count/size: 1 table, 10M rows
  • Thread count: 1024
  • Test duration: depending on the test
Resources: RDS MySQL

  • Version: 5.7.10 in Amazon RDS
  • Instance class: db.r3.4xlarge
  • Storage: 2TB, 20K Provisioned IOPS
  • Single-AZ, backups disabled (binary logs disabled)
  • Configuration: default except innodb_log_file_size as required by the tests

Resources: EC2 MySQL

  • Version: 5.7.11 from RHEL package
  • Instance class: h1.4xlarge
  • OS: Amazon Linux (latest)
  • Storage: 1TB, local SSD (see instance class description)

Resources: Aurora

Resources: Clients
  • Instance class: c3.4xlarge
  • Instance count: 1
  • Location: same Availability Zone, VPC and Subnet as the servers

The Test

Tablespace Discovery and Validation

The test procedure is very simple:
  • Create 1 million empty InnoDB tables.
  • Insert a few rows into one of the tables, then kill the engine to trigger crash recovery at next startup.
Startup time was measured using timestamps found in the engine error logs, with the "ready for connections" message marking the end of the startup procedure.
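As a sketch of that measurement, the difference between two error-log timestamps can be computed like this (the log lines below are illustrative, not taken from my actual runs; MySQL 5.7 error logs use this ISO-8601 timestamp format):

```python
from datetime import datetime

# Two illustrative error-log lines; in a real measurement you would
# grep them out of the actual error log.
start_line = "2016-03-21T10:00:07.123456Z 0 [Note] InnoDB: Starting crash recovery."
ready_line = "2016-03-21T10:04:19.654321Z 0 [Note] /usr/sbin/mysqld: ready for connections."

def log_timestamp(line):
    # The timestamp is the first whitespace-delimited field.
    return datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%fZ")

downtime = log_timestamp(ready_line) - log_timestamp(start_line)
print(f"startup took {downtime.total_seconds():.0f} seconds")
```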

In the chart below, I'm using the following labels:
  • RDS MySQL (crash, cold): Crash recovery on RDS MySQL with cold OS cache (after reboot).
  • EC2 MySQL (crash, cold): Crash recovery on EC2 MySQL with cold OS cache (after reboot).
  • EC2 MySQL (crash, warm): Crash recovery on EC2 MySQL with warm OS cache.
  • EC2 MySQL (clean): Normal startup with 1 million tables, no crash recovery.
  • Aurora (crash): Crash recovery on Aurora.
Here goes:

  • In MySQL, crash recovery with a large number of tables does take time; in fact, even a clean startup is not instantaneous (though it's not slow either).
  • In Aurora, six seconds was all I needed to see "ready for connections" in the logs.

Undo (Rolling Back Uncommitted Transactions)

This test was also quite simple:
  • Create a table with one integer Primary Key and one VARCHAR(32) column.
  • Insert 100 million rows into the table with no values for the character column (i.e. the column values were initially empty for all 100 million rows).
  • Perform a clean shutdown and startup.
  • Begin a transaction, then update all 100 million rows in the table with 32-byte character values: "UPDATE ... SET varchar_col = md5('something')".
  • Wait for the UPDATE to finish, then kill the engine without committing the transaction.
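For illustration, the test boils down to something like the following SQL (the table and column names here are my placeholders):

```sql
-- Illustrative version of the undo test:
CREATE TABLE undo_test (
  id          INT UNSIGNED NOT NULL PRIMARY KEY,
  varchar_col VARCHAR(32)
) ENGINE=InnoDB;

-- ... load 100M rows with varchar_col left NULL, restart cleanly, then:
START TRANSACTION;
UPDATE undo_test SET varchar_col = md5('something');
-- Kill the server here, before COMMIT, to force undo at next startup.
```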
The test was performed on RDS MySQL and Aurora. The chart below uses the following labels:
  • Downtime: Time required for "ready for connections" message to appear in the logs.
  • Recovery time incl. undo: Time required to roll back the interrupted transaction, per error logs.

  • Downtime in MySQL is just over 2 minutes.
  • Yet again, Aurora boots up in just six seconds. It also manages to cut rollback time by 8 minutes.

Redo Log Size and Crash Recovery

Since Aurora doesn't need to flush pages from the buffer pool, this scenario is only for MySQL to demonstrate how the "gap" in redo logs influences your recovery time.

The test procedure is as follows:
  • Create a single sysbench table with 10M rows.
  • Configure the MySQL engine to use a 16GB log file size.
  • Launch a sysbench test against the database.
  • During the test, measure the gap by calculating the difference between "Log sequence number" and "Pages flushed up to" as reported by SHOW ENGINE INNODB STATUS.
  • Kill the engine when redo logs develop a certain gap.
  • Measure crash recovery time.
The procedure was repeated several times; each time, I allowed the engine to reach a bigger gap before killing it and measuring crash recovery time.
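To sketch the gap measurement: the two LSN values can be pulled out of the status output with a couple of regular expressions (the status fragment below uses made-up numbers):

```python
import re

# A trimmed fragment of SHOW ENGINE INNODB STATUS output (illustrative
# numbers); in a live test you would capture this from the server.
status = """
---
LOG
---
Log sequence number 9845123456789
Log flushed up to   9845123400000
Pages flushed up to 9836123456789
Last checkpoint at  9836000000000
"""

lsn = int(re.search(r"Log sequence number\s+(\d+)", status).group(1))
flushed = int(re.search(r"Pages flushed up to\s+(\d+)", status).group(1))
gap_bytes = lsn - flushed
print(f"checkpoint gap: {gap_bytes / 2**30:.2f} GiB")
```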

Pictures anyone?

  • In this particular test focused only on redo, the recovery time is clearly proportional to the size of the gap.
  • Recovery took almost 7 minutes for a gap size of ~9GB.
What's the lesson here?
  • Can you make your logs bigger to improve performance? Yes, you can; however, performance does not increase linearly with the log size.
  • Also note that if you're writing all the time, you're only extending your "boost" time but you may still eventually run out of clean redo space. Unfortunately, increasing log file size is not a silver bullet that could remove all your performance problems.
  • If you're optimizing for performance, it's okay to make your logs large but only if you can tolerate longer recovery (downtime) in case of engine crash.

Miscellaneous Notes

Note that all MySQL scenarios presented here were in fact best-case ones. I was working with a large instance, quite a bit of storage (2TB, 20K IOPS), and warm storage volumes. Your Mileage May Vary, especially if you're running smaller instances or restoring MySQL from an EBS snapshot that suffers a first-touch read penalty.

Closing Thoughts

Crash recovery times can be controlled and managed in (expensive) enterprise databases such as Oracle and SQL Server. You don't typically see such features in other classes of products but that doesn't mean MySQL users don't care about downtime, right?

If you love MySQL like I do but crash recovery gives you the willies, I encourage you to play with both MySQL and Aurora and choose what works best for you.

As usual, I also encourage you to run your own tests using your own data set and workload. Synthetic tests such as the above may yield interesting ideas but they're never as good as real-life ones.
