Tuesday, November 17, 2015

MySQL 5.6 and 5.7 crash recovery performance with large number of tables

It goes without saying that crash recovery of a busy MySQL server (and of many other RDBMSs, for that matter) is not an extremely quick process. In the MySQL context, one of the worst-case scenarios is a server used for multi-tenant application hosting, i.e. a MySQL instance containing hundreds or thousands of schemas and (tens or hundreds of) thousands of tablespaces. In such a scenario, the server may spend a considerable amount of time in the tablespace discovery phase, during which MySQL builds a mapping between tablespace IDs and the names of the actual tablespace files on disk.

MySQL 5.7 promises to put an end to tablespace discovery. The documentation lists the following improvements introduced in versions 5.7.5 and up:
  • Elimination of file system scans prior to redo log application. The MLOG_FILE_NAME redo log record provides the information necessary to identify tablespaces that have changed since the last checkpoint.
  • Only *.ibd files modified since the last checkpoint are accessed.
DBAs have long been excited about this, and with the recent GA announcement of MySQL 5.7 we can finally test recovery speed in a production-ready MySQL 5.7 version.

The test

Environment and conditions:

  • All tests performed on Amazon EC2 m4.2xlarge instance with 500GB / 5000 IOPS of provisioned storage.
  • MySQL versions were compiled, installed and tested on the same machine (Debian 7.9, HVM image).
  • Tested MySQL versions: 5.6.27 and 5.7.9, both compiled from source without debugging.
  • All tests were performed five times; the final result is the average of the five runs.
  • Performance Schema was disabled during all tests.

Testing set:

A schema with 200K empty tables:
create schema discovery;
create table discovery.table_1 (id int primary key auto_increment);
create table discovery.table_2 (id int primary key auto_increment);
...
create table discovery.table_200000 (id int primary key auto_increment);
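Nobody types 200K CREATE TABLE statements by hand; a trivial loop along the following lines produces such a statement file (the script and the create_200k.sql file name are illustrative assumptions, not the exact tooling used in the test):

```shell
# Sketch: generate the 200K-table schema script (file name is an assumption)
{
  echo "create schema discovery;"
  for i in $(seq 1 200000); do
    echo "create table discovery.table_$i (id int primary key auto_increment);"
  done
} > create_200k.sql
# Load it in one session, e.g.: mysql < create_200k.sql
```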
A script with 100 single-row INSERT statements to run just before killing the server (unclean shutdown simulation):
insert into discovery.table_1 values (null);
insert into discovery.table_2 values (null);
...
insert into discovery.table_100 values (null);
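The INSERT script can be generated the same way; insert_100.sql is the file name used in the scenarios below (the generation loop itself is a sketch, not necessarily how it was originally produced):

```shell
# Sketch: generate the 100-row INSERT script used just before killing the server
for i in $(seq 1 100); do
  echo "insert into discovery.table_$i values (null);"
done > insert_100.sql
```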

Scenario (cold startup):

Logical flow:
  1. Start the server,
  2. Flush OS buffers and caches,
  3. Execute the INSERT script and kill MySQL immediately,
  4. Start the server and measure total startup time (until "ready for connections" message).
mysql@sandbox:$ sudo ./run_mysql.sh {version} # just a simple wrapper script for running various MySQL versions/compilations
mysql@sandbox:$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
mysql@sandbox:$ mysql < insert_100.sql && pkill -9 mysqld
mysql@sandbox:$ sudo ./run_mysql.sh {version}
(get startup time from logs) 
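Extracting the startup time from the logs boils down to subtracting two error-log timestamps. A minimal sketch, assuming MySQL 5.7-style ISO-8601 UTC timestamps (the two values below are hypothetical, not measured results):

```shell
# Sketch: compute startup duration from two error-log timestamps.
# The timestamp format and the sample values are assumptions
# (MySQL 5.7 writes ISO-8601 UTC timestamps to the error log).
t_start="2015-11-17T10:00:00Z"   # hypothetical: first log line after restart
t_ready="2015-11-17T10:21:30Z"   # hypothetical: "ready for connections" line
elapsed=$(( $(date -u -d "$t_ready" +%s) - $(date -u -d "$t_start" +%s) ))
echo "startup took ${elapsed}s"
```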

Scenario (warm startup):

Logical flow:
  1. Gracefully shut down the server after the "cold" test (the recovery done during the "cold" test serves as a pre-warming step for the "warm" test),
  2. Execute the INSERT script and kill MySQL immediately,
  3. Start the server and measure total startup time (until "ready for connections" message).
mysql@sandbox:$ pkill mysqld
mysql@sandbox:$ mysql < insert_100.sql && pkill -9 mysqld
mysql@sandbox:$ sudo ./run_mysql.sh {version}
(get startup time from logs) 


Notes:

  • There is no separate pre-warming step here: the "cold" test itself prewarms all IBD files. If you would like to perform manual prewarming, it can be something as trivial as "SELECT 1 FROM {all_tables}", or an even simpler operation such as a disk dump (dd) of the first page of each IBD file.
  • Only 100 single-row statements are used to generate write traffic before killing MySQL. This is hardly representative of a typical production workload but also ensures that redo/undo recovery phases are very short (I only wanted to test tablespace discovery here).
  • The purpose of the "cold startup" test is to simulate a scenario in which the server is started from the same storage but on a cold system. If you think that cold tests are overly pessimistic, read this.
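The dd-based prewarming mentioned above can be sketched as follows. The helper function name, the data directory path, and the default 16KB InnoDB page size are all assumptions:

```shell
# Sketch: prewarm the first 16KB page of each IBD file under a given datadir.
# prewarm_ibd is a hypothetical helper; assumes the default 16KB InnoDB page size.
prewarm_ibd() {
  find "$1" -name '*.ibd' -print0 | while IFS= read -r -d '' f; do
    dd if="$f" of=/dev/null bs=16k count=1 2>/dev/null
  done
}
# Example invocation (assumed datadir): prewarm_ibd /var/lib/mysql
```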


Startup time after crash with 200K tables can be presented as follows (average over 5 tests):

The numbers are not impressive, to say the least. But why is that? Wouldn't you expect MySQL 5.7 to be significantly faster in this scenario? A quick look at the process stacks and the source code seems to explain everything.

First of all, in the tested scenario MySQL 5.7 still spent all its time opening and checking IBD files:

Process stack from MySQL 5.7 doing recovery

0x7fa5580dda43: _fini + 0x7fa4ffffdd03 (7ffc45766968, a0a511b68, 7ffc45766af0, ea64000, a, 0) + 190
0x010370be: _ZL17os_file_read_pageR9IORequestiPvmmPmb.constprop.124 + 0x8e (7ffc00000000, 1)
0x01037564: _Z35os_file_read_no_error_handling_funcR9IORequestiPvmmPm + 0x14 (bd7e9ced, a, 0, 117312e, 2100000200, ffffffff00000000) + 1a0
0x011fa1ea: _ZN8Datafile15read_first_pageEb + 0x5ea (eac7310, 176d2e8, 0, 7fa55850c1c8, 0, 7ffc45766cf0) + 1a0
0x011fac02: _ZN8Datafile19validate_first_pageEPm + 0x652 (7ffc45767001, 0, 100000200, 7ffc00000000, 38d88a8, 53d9588) + 160
0x011fb505: _ZN8Datafile14validate_to_ddEmm + 0x55 (fffffffffffffffe, 23e8000, 1, fffffffffffffffe, 1, ffffffffffffffff) + bf0
0x011e227a: _Z12fil_ibd_openbb10fil_type_tmmPKcS1_ + 0x60a (a68c658, 7ffc457692b0, 1007ffc45767e90, 1, 0, 186c2) + 1b20
0x011c5332: _Z39dict_check_tablespaces_and_store_max_idb + 0x1482 (0, 2, 4010000000000000, 4060000000000000, 0, 0) + 74c0
0x010e6900: _Z34innobase_start_or_create_for_mysqlv + 0x40c0 (c2c5c8896cb13356, b8, 7f0100000000, f6ded7, 19, 1fc) + 160
0x00fc2cc4: _ZL13innobase_initPv + 0xba4 (2, 38866b0, 4, 3874800)
0x00935b3e: _Z24ha_initialize_handlertonP13st_plugin_int + 0x4e (7ffc45770fc0, 1e6e560, 7ffc45770fc0, 3874800)
0x00db193c: _ZL17plugin_initializeP13st_plugin_int + 0x4c (0, daf9c0, 0, 7ffc00000001, 7fa500000000, 7fa558501410) + 3a0
0x00db64e8: _Z11plugin_initPiPPci + 0x758 (7fa5585029a0, 7fa5582e5f28, 7ffc457716d0, 7fa5582f410a, 37651b0, f45771350) + 540
0x008e8e65: _Z11mysqld_mainiPPc + 0x8f5 (0, 7ffc457718f8, 200000000, 8c61e0, 0, 482ec90af4e549ce) + ffff8003ba88e7b0
Secondly, the code makes it quite clear that pretty much the same work still has to be done; only this time it is called "tablespace validation" rather than "tablespace discovery":

MySQL 5.6.27 source:

(at storage/innobase/log/log0recv.cc:2961)

 recv_needed_recovery = TRUE;

 ib_logf(IB_LOG_LEVEL_INFO, "Database was not shutdown normally!");
 ib_logf(IB_LOG_LEVEL_INFO, "Starting crash recovery.");
  "Reading tablespace information from the .ibd files...");


MySQL 5.7.9 source:

(at storage/innobase/srv/srv0start.cc:2272)
 /* The 'validate' flag indicates that when a tablespace
 is opened, we also read the header page and validate
 the contents to the data dictionary. This is time
 consuming, especially for databases with lots of ibd
 files.  So only do it after a crash and not forcing
 recovery.  Open rw transactions at this point is not
 a good reason to validate. */
 bool validate = recv_needed_recovery
  && srv_force_recovery == 0;

(at storage/innobase/dict/dict0load.cc:1497)
/** Check each tablespace found in the data dictionary.
Look at each general tablespace found in SYS_TABLESPACES.
Then look at each table defined in SYS_TABLES that has a space_id > 0
to find all the file-per-table tablespaces.

In a crash recovery we already have some tablespace objects created from
processing the REDO log.  Any other tablespace in SYS_TABLESPACES not
previously used in recovery will be opened here.  We will compare the
space_id information in the data dictionary to what we find in the
tablespace file. In addition, more validation will be done if recovery
was needed and force_recovery is not set.

We also scan the biggest space id, and store it to fil_system.
@param[in] validate true if recovery was needed */
void
dict_check_tablespaces_and_store_max_id(
 bool validate)

Closing notes

  • This test was by no means an attempt to estimate recovery speed in real-life scenarios. The sole purpose of the article is to present the performance of InnoDB tablespace discovery, which is typically not the longest recovery step anyway.
  • I measured that crash recovery with 200K tables left 6.2GB of memory used by the OS page cache. The results would have been even more interesting if the test had been conducted on a smaller instance.
  • Bug report created to track this can be found here: Bug #79025.
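The OS cache figure quoted above can be observed via /proc/meminfo; a minimal before/after check, assuming a Linux system (the delta between the two readings approximates how much IBD data recovery pulled into the page cache):

```shell
# Sketch (Linux only): sample the page-cache size before and after recovery.
# Run once before starting mysqld and once after startup completes;
# the difference approximates the IBD data cached during recovery.
grep '^Cached:' /proc/meminfo
```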
