Thursday, June 15, 2017

Quick look: DDL performance on MySQL vs Aurora with and without Fast DDL

In this part of the "quick look" series, I evaluate the performance of DDL operations in Amazon Aurora and vanilla MySQL. I'll use a few examples to demonstrate how fast DDLs run and what impact they can have on a regular workload. I'll also take Aurora's Fast DDL feature for a spin.

Introduction

DDL operations are a well-known source of pain in MySQL-based environments. They are not very fast, they're not fully online (non-blocking), and they're difficult to run on top of a regular workload without serious performance degradation. MySQL has not seen much improvement in this area for quite some time, which is why we have tools such as pt-online-schema-change, now considered an industry standard rather than a workaround.

Amazon Aurora tries to shake things up a little bit with the Fast DDL feature, which (as of this writing) allows you to add a nullable column at the end of a table nearly instantaneously. You can read more about it here.

Of course, Fast DDL will cover more and more scenarios over time, but for the foreseeable future there will be situations where you still need to run a "regular" DDL operation. With that in mind, let's look at what Aurora has to offer outside of Fast DDL and whether it's any better than regular MySQL.
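For reference, the eligibility rule looks roughly like this (a sketch; `t1` and the column names are examples, and the exact eligibility conditions are as documented by AWS at the time of writing):

```sql
-- Eligible for Fast DDL (as of this writing): a nullable column
-- added at the end of the table.
ALTER TABLE t1 ADD COLUMN notes varchar(64) NULL;

-- Not eligible: NOT NULL columns, columns added mid-table, index
-- changes, etc. These still run as regular DDL operations.
ALTER TABLE t1 ADD COLUMN notes varchar(64) NOT NULL AFTER id;
ALTER TABLE t1 ADD INDEX idx_notes (notes);
```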

Test configuration

Server configuration (MySQL):
  • MySQL 5.7.17 in Amazon RDS
  • Compute class: r3.8xlarge
  • Storage: 2TB, 20K Provisioned IOPS
  • Configuration (platform): backups disabled, Single-AZ (demonstrates best-case performance)
  • Configuration (database): innodb_online_alter_log_max_size=2GB, other parameters set to Amazon RDS defaults
Server configuration (Aurora):
  • Aurora release 1.13
  • Compute class: r3.8xlarge
  • Configuration (platform): backups enabled by default, 6-way storage replication enabled by default, single-node cluster
  • Configuration (database):
    • innodb_online_alter_log_max_size=2GB, other parameters set to Amazon RDS defaults
    • aurora_lab_mode=1 for tests with Fast DDL
Client configuration:
  • Hardware: Amazon EC2 c4.4xlarge
  • OS: Amazon Linux 2017.03.0 (HVM)
  • Client software: Sysbench 0.5
  • Location: same AWS Region, Availability Zone and Subnet as the server
Test configuration:
  • Test mode: oltp.lua read/write, 64 threads, fixed transaction rate (500 transactions per second)
  • Data set: 1 table, 100MM rows (26GiB total)
  • All DDL operations performed multiple times to confirm reproducibility
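For reference, the workload described above can be produced with an invocation along these lines (a sketch; the Lua script path, endpoint and credentials are placeholders, and flag names follow Sysbench 0.5 conventions):

```shell
# Fixed-rate read/write OLTP test against a single 100MM-row table.
# --tx-rate caps throughput at 500 transactions per second;
# --report-interval=1 emits per-second TPS/latency data points.
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua \
  --mysql-host=<endpoint> --mysql-user=<user> --mysql-password=<pass> \
  --mysql-db=sbtest --oltp-tables-count=1 --oltp-table-size=100000000 \
  --num-threads=64 --tx-rate=500 --max-time=3600 --max-requests=0 \
  --report-interval=1 run
```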

Scenario 1: Regular DDL - Small table with no workload

Let's start with something basic and run three DDL operations against a small table, with no other workload.

Here's the table definition:
mysql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `s1` varchar(32) DEFAULT NULL,
  `s2` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=110875262 DEFAULT CHARSET=latin1
1 row in set (0.01 sec)
Approach used to generate 110MM rows of random data:
mysql> insert into t1 values (null, md5(rand()), md5(rand()));
Query OK, 1 row affected (0.03 sec)

mysql> insert into t1 select null, md5(rand()), md5(rand()) from t1;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into t1 select null, md5(rand()), md5(rand()) from t1;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> insert into t1 select null, md5(rand()), md5(rand()) from t1;
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

(... repeat with LIMIT until 110MM rows are generated ...)
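The final doubling step can be capped with LIMIT so the table lands near the target row count; the last insert might look something like this (a sketch, with an illustrative row count):

```sql
-- Doubling alone overshoots or undershoots 110MM; cap the final
-- self-insert with LIMIT to top the table up to the target size.
INSERT INTO t1 SELECT null, md5(rand()), md5(rand()) FROM t1 LIMIT 10000000;
```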
Row count and data size:
mysql> select count(*) from t1;
+-----------+
| count(*)  |
+-----------+
| 110000000 |
+-----------+
1 row in set (17.85 sec)

mysql> select table_name, round(data_length/1024/1024/1024, 2) data_gb from information_schema.tables where table_schema = 'test';
+------------+---------+
| table_name | data_gb |
+------------+---------+
| t1         |    9.88 |
+------------+---------+
1 row in set (0.00 sec)
Finally, a record of DDL operations and their timing:
## Aurora

mysql> alter table t1 add index s1 (s1);
Query OK, 0 rows affected (6 min 34.85 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table t1 add column s3 varchar(32) not null;
Query OK, 0 rows affected (9 min 3.40 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table t1 drop column s3;
Query OK, 0 rows affected (8 min 51.56 sec)
Records: 0  Duplicates: 0  Warnings: 0

## MySQL

mysql> alter table t1 add index s1 (s1);
Query OK, 0 rows affected (10 min 57.37 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table t1 add column s3 varchar(32) not null;
Query OK, 0 rows affected (15 min 8.73 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table t1 drop column s3;
Query OK, 0 rows affected (15 min 1.76 sec)
Records: 0  Duplicates: 0  Warnings: 0
Visualization of results:



Notes:
  • DDL performance is far from ideal even on a small table.
  • Aurora was significantly faster for all three operations.

Scenario 2: Regular DDL - Small table with read/write workload

In this exercise, I do the following:
  • Run a single-table sysbench read/write workload at a fixed rate of 500 transactions/s, which translates to ~6800 reads/s and 2000 writes/s.
  • Wait for the database to warm up by running the test for 600 seconds prior to executing an ALTER command.
  • While the test is running, execute ALTER command to add a column to the table.
  • Measure the time needed to complete the ALTER operation, as well as its impact on sysbench latency and transaction rate.
ALTER commands and outputs:
## MySQL

mysql> alter table sbtest.sbtest1 add column s1 varchar(32) default null;
Query OK, 0 rows affected (55 min 32.44 sec)
Records: 0  Duplicates: 0  Warnings: 0

## Aurora

mysql> alter table sbtest.sbtest1 add column s1 varchar(32) default null;
Query OK, 0 rows affected (26 min 12.77 sec)
Records: 0  Duplicates: 0  Warnings: 0
Results:

                  Aurora      MySQL
ALTER duration    00:26:12    00:55:32
TPS stddev        30          353 (~11.8x)
Latency stddev    38 ms       967 ms (~25.4x)

TPS and latency in MySQL (orange dashed area indicates ALTER activity):




You might notice a gap right before the ALTER finishes. Let's zoom into that:



The transaction rate dropped to zero and remained there for ~15 seconds. This is the last step of the InnoDB Online DDL process, when the engine locks the table and applies all row changes that accumulated since the ALTER started. The longer the ALTER runs, and the more changes you introduce, the longer this final merge phase takes. That's why InnoDB Online DDL isn't fully online. You can learn more here.
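One way to make the behavior explicit is to name the desired algorithm and lock mode in the statement itself; if InnoDB can't satisfy the request, the ALTER errors out immediately instead of silently falling back to something more intrusive (a sketch):

```sql
-- Request an in-place, non-blocking ALTER explicitly. Note that even
-- with LOCK=NONE, the final apply of the online DDL change log still
-- takes an exclusive lock briefly, which is the ~15s gap seen above.
ALTER TABLE sbtest.sbtest1
  ADD COLUMN s1 varchar(32) DEFAULT NULL,
  ALGORITHM=INPLACE, LOCK=NONE;
```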

TPS and latency in Aurora (same scale on both axes):



Latency in Aurora (zoom-in):


Notes:

  • Under traffic, the difference in ALTER duration between Aurora and MySQL is even more apparent.
  • The real difference lies in consistency. In MySQL, the ALTER takes quite a toll on sysbench results. In Aurora, the workload remains largely unaffected.

Scenario 3: Aurora Fast DDL - Small table with read/write workload

In this exercise, I do the following:
  • Enable aurora_lab_mode on the test cluster.
  • Run a single-table sysbench read/write workload at a fixed rate of 500 transactions/s, which translates to ~6800 reads/s and 2000 writes/s.
  • Wait for the database to warm up by running the test for 600 seconds prior to executing an ALTER command.
  • While the test is running, execute 10 ALTER commands to add 10 columns to the table. Sleep 10 seconds between each ALTER.
  • Measure the impact of ALTER on sysbench latency. I do not measure ALTER duration, because the Fast DDL operation runs in milliseconds, by design.
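The ten ALTERs can be driven by a small loop along these lines (a sketch; the endpoint and credentials are placeholders):

```shell
# Add 10 nullable columns one at a time, sleeping between DDLs.
# With aurora_lab_mode=1, each eligible ALTER should return in
# milliseconds rather than minutes.
for i in $(seq 1 10); do
  mysql -h <endpoint> -u <user> -p<pass> sbtest \
    -e "ALTER TABLE sbtest1 ADD COLUMN c$i varchar(32) NULL;"
  sleep 10
done
```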
Latency graph with DDL executions indicated by orange markers:


Notes:
  • Sysbench 95th percentile latency spikes to ~80-170 ms at the exact time the DDLs run. The spike typically lasts only a single data point (1s).
  • Once all DDLs are done, latency goes back to previous levels. The database keeps modifying pages lazily on demand (read more here) with very little impact.

Summary

We need more Fast DDL! What's available so far looks quite impressive, but it's only the beginning of a larger effort before all popular DDL scenarios are covered. In the interim, test your larger schema changes before applying them in production. For modifications not covered by Fast DDL, see whether you can execute them with a regular ALTER or whether you're still better off using the de facto industry-standard Percona tools.
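For changes outside Fast DDL's reach, a pt-online-schema-change run looks roughly like this (a sketch; the DSN values are placeholders, and only the core flags are shown):

```shell
# Copy-and-swap schema change that stays online for reads and writes;
# the tool creates a shadow table, backfills it in chunks, keeps it in
# sync with triggers, then atomically swaps it in.
pt-online-schema-change \
  --alter "ADD COLUMN s1 varchar(32) DEFAULT NULL" \
  --execute D=sbtest,t=sbtest1,h=<endpoint>,u=<user>,p=<pass>
```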
