Monday, December 14, 2015

Understanding and profiling MySQL execution with Callgrind, Pstack and Perf

You may sometimes hear complaints about MySQL not providing good enough tools for profiling and execution analysis. A few years ago I would have agreed with such opinions, but thankfully MySQL developers have made huge efforts to improve the situation in recent major versions. MySQL DBAs now have some great native diagnostic tools at their disposal... which is totally not what this article is about :)

Native MySQL tooling (whatever it might be) is just the tip of the iceberg, and if you want to be a better troubleshooter, SysAdmins are the first people you should talk to. Their toolboxes are full of awesomeness, and the tools they use have one significant advantage over MySQL tools: they can analyze server execution holistically, regardless of the MySQL version you may be using.

In this article, we will have a look at three OS-level tools: pstack, perf and callgrind (Valgrind tool).


Introduction

Tools described here can be used to profile and trace server execution. Here's my subjective list of applicable use cases: 
  • Working with old MySQL versions that do not provide sophisticated diagnostic tools,
  • Diagnosing MySQL components that are not well instrumented by performance schema,
  • Low level troubleshooting and performance analysis,
  • Profiling database workload in test environments,
  • Learning MySQL server internals.

Issue identification

In this article I'm going to use the three aforementioned tools to diagnose a busy MySQL server. Right now I'm not going to tell you what the issue is; all you need to know is that the server is doing "something" and is chewing up CPU and IO.


MySQL is a multi-threaded application so first let's try to figure out what kind of usage pattern we're looking at: multi-threaded work from multiple clients or a single, heavy operation. To that end, I'll launch pidstat and check CPU usage on individual server threads:
root@sandbox:~/tmp# pidstat -p `pidof mysqld` -t 2
Linux 3.2.0-4-amd64 (sandbox)  12/13/2015  _x86_64_ (2 CPU)

04:23:14 PM      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
04:23:16 PM      1904         -   43.00    5.50    0.00   48.50     0  mysqld
04:23:16 PM         -      1904    0.00    0.00    0.00    0.00     0  |__mysqld
04:23:16 PM         -      1905    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1906    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1907    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1908    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1909    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1910    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1911    0.00    0.50    0.00    0.50     1  |__mysqld
04:23:16 PM         -      1912    0.00    0.00    0.00    0.00     0  |__mysqld
04:23:16 PM         -      1913    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1914    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1916    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1917    0.00    0.00    0.00    0.00     0  |__mysqld
04:23:16 PM         -      1918    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1919    0.00    0.50    0.00    0.50     1  |__mysqld
04:23:16 PM         -      1920    0.00    0.00    0.00    0.00     0  |__mysqld
04:23:16 PM         -      1921    1.00    1.00    0.00    2.00     0  |__mysqld
04:23:16 PM         -      1922    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1923    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1924    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      1925    0.00    0.00    0.00    0.00     1  |__mysqld
04:23:16 PM         -      2008   41.50    4.50    0.00   46.00     1  |__mysqld
04:23:16 PM         -      2079    0.00    0.00    0.00    0.00     0  |__mysqld
^C
The observation we can already make is that our server is probably bogged down by a single heavy statement: nearly all CPU usage comes from thread 2008.
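As a side note, the eyeballing above can be scripted. Here's a small sketch that picks the busiest thread out of pidstat's per-thread lines with awk; the input is canned from the run above, and the column positions assume the pidstat output format shown there:

```shell
# Canned per-thread lines from the pidstat run above; on a live system
# you would feed awk the TID lines of a real pidstat run instead.
cat > pidstat.txt <<'EOF'
04:23:16 PM         -      1921    1.00    1.00    0.00    2.00     0  |__mysqld
04:23:16 PM         -      2008   41.50    4.50    0.00   46.00     1  |__mysqld
04:23:16 PM         -      2079    0.00    0.00    0.00    0.00     0  |__mysqld
EOF

# Column 4 is the TID and column 8 the total %CPU; keep the maximum.
awk '$8 > max { max = $8; tid = $4 } END { print "hottest TID: " tid " (" max "% CPU)" }' pidstat.txt
```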

Pstack

Let's now use pstack to obtain some information about the activity on thread 2008. Pstack produces the current call stack of a given process ID (or thread ID, in this case). It does not aggregate, summarize or otherwise report process activity over time, but it is still very useful as it executes relatively quickly and can be used against any existing thread with little performance penalty.

The usual technique is to obtain several stacks from the interesting process and look for common patterns. This is not super accurate, but it can often help you get your bearings in a matter of seconds.

Here goes our stack:
root@sandbox:~/tmp# pstack 2008

2008: bin/mysqld --defaults-file=my.cnf
(cut)
0x00993f49: _Z21sync_array_wait_eventP12sync_array_tm + 0xa9 (1e, 8, a55, cb23e0, 2, 3bf) + 50
0x0099513e: _Z19rw_lock_x_lock_funcP9rw_lock_tmPKcm + 0x22e (0, 41213cc000, 7f9b82ba5546, 7f9bf1141600, 0, 1) + b0
0x009e67b8: _Z16buf_page_get_genmmmmP11buf_block_tmPKcmP5mtr_t + 0x1028 (cb23e0, a55, 7f9b1ac9aaf0, 7f9baea00e40, 0, 0) + 450
0x009cae3c: _Z27btr_cur_search_to_nth_levelP12dict_index_tmPK8dtuple_tmmP9btr_cur_tmPKcmP5mtr_t + 0x2cc (0, cb23e0, a55, 7f9b1ac9aaf0, 100, c0) + 6d0
0x0095cb4c: _Z27row_ins_sec_index_entry_lowmmP12dict_index_tP16mem_block_info_tS2_P8dtuple_tmP9que_thr_t + 0x23c (0, 90f9e08, 0, 15, 6c60dfca4, 90fa498) + 30
0x0095eec1: _Z23row_ins_sec_index_entryP12dict_index_tP8dtuple_tP9que_thr_t + 0x101 (90f9e08, 90f9ad8, 9100ca8, 90fa408, 90f9ad8, 0) + 80
0x0095f278: _Z12row_ins_stepP9que_thr_t + 0x238 (90f9ad8, 90fd7d0, 90ffa58, 90fef68, 90f9ad8, 90f9c98) + 50
0x0096a8b0: _Z20row_insert_for_mysqlPhP14row_prebuilt_t + 0x2f0 (0, 1c6a, b8be00, 40, 7f9b1ac9b1e0, 90f4710) + 20
0x008e8cb8: _ZN11ha_innobase9write_rowEPh + 0x148 (100000003, 7f9c38cea8c0, 90fd9e0, 7f9c38bdec40, 405f3769914, b057d0) + 30
0x005e600c: _ZN7handler12ha_write_rowEPh + 0x6c (0, 0, 90f8dc8, 90f8da8, 0, 90eb410) + 40
0x006dc74b: _Z12write_recordP3THDP5TABLEP9COPY_INFOS4_ + 0x7b (130e760, 130e720, 5c, 0, 0, 0) + 660
0x0081b02c: _Z10mysql_loadP3THDP12sql_exchangeP10TABLE_LISTR4ListI4ItemES8_S8_15enum_duplicatesbb + 0x1cdc (0, 0, 0, 0, 0, 0) + c20
0x006f1772: _Z21mysql_execute_commandP3THD + 0x18d2 (7f9b1ac9c610, 21fa400, 3, 90eb230, 7f9b1ac9c860, 21fbd00)
0x006f47e0: _Z11mysql_parseP3THDPcjP12Parser_state + 0x330 (0, 0, 0, 0, 0, 0) + 7f0
0x006f5574: _Z16dispatch_command19enum_server_commandP3THDPcj + 0xd04 (21fa400, 7f9cd4d2ae40, 21fa400, 1)
0x006c95fd: _Z24do_handle_one_connectionP3THD + 0x1ed (905a340, 21fa400)
0x006c9652: handle_one_connection + 0x42 (0, 0, 13085c0, 7f9b1ac9d9c0)
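A quick aside: the frame names in this dump are mangled C++ symbols. If a frame is hard to read, c++filt (part of GNU binutils) will demangle it for you:

```shell
# Demangle the topmost frame from the stack above.
echo '_Z21sync_array_wait_eventP12sync_array_tm' | c++filt
# -> sync_array_wait_event(sync_array_t*, unsigned long)
```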
I pulled multiple stacks and I noticed the following:
  • A "mysql_load" function appears in every stack, which means we're looking at a LOAD DATA statement running on the server,
  • Nearly all stacks include function "row_ins_sec_index_entry" (called in "write_record"). This tells us that the majority of the time is being spent updating secondary index trees.
At this point we already know quite a lot about our workload: the server is executing a LOAD DATA INFILE statement and is busy inserting rows into a table. The table is probably heavily indexed.
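The "pull several stacks and look for common patterns" step can itself be scripted; this is the classic "poor man's profiler" trick. Here's a sketch, with the live sampling loop shown as a comment and a few canned frames standing in for real pstack output:

```shell
# On the live server I would sample the thread like this:
#   for i in $(seq 1 10); do pstack 2008; sleep 1; done > stacks.txt
# Canned frames stand in for the real samples here.
cat > stacks.txt <<'EOF'
0x0081b02c: _Z10mysql_loadP3THD + 0x1cdc
0x0095eec1: _Z23row_ins_sec_index_entryP12dict_index_t + 0x101
0x0081b02c: _Z10mysql_loadP3THD + 0x1cdc
0x0095eec1: _Z23row_ins_sec_index_entryP12dict_index_t + 0x101
0x0081b02c: _Z10mysql_loadP3THD + 0x1cdc
EOF

# Frames that recur across many samples are where the thread spends its time.
grep -o '_Z[A-Za-z0-9_]*' stacks.txt | sort | uniq -c | sort -rn
```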

Perf

Pstack is awesome but using it for profiling is like trying to watch a movie with your eyes closed 99% of the time. Now we're going to take a look at our server using another tool, perf. Perf is a dynamic tracing tool capable of recording information about CPU events and low-level program execution. Perf can be invoked and terminated ad hoc, and it can investigate an already running process, which is what we're going to do next:
root@sandbox:~/tmp# perf record -g -a -p `pidof mysqld` -o perf.out
^C
[ perf record: Woken up 14 times to write data ]
[ perf record: Captured and wrote 3.863 MB perf.out (~168786 samples) ]
This sample command will:
  • perf record -p `pidof mysqld`: Start capturing perf data from an already running mysqld process,
  • -g: Record call-graph information for producing call stacks (who calls what in terms of MySQL source code functions)
  • -a: Record activity on all CPUs,
  • -o perf.out: write data to a file called "perf.out".
Perf keeps recording until interrupted with CTRL+C. In this case I had it running for a few minutes.

Output files from "perf record" can be analyzed with "perf report". Here's a screenshot from "perf report -g -i perf.out":


Notes:
  • It may look a bit cryptic if you're not too familiar with MySQL sources, but the point is that perf can give you information about actual function calls and report how much CPU time they consumed (relative to other calls).
  • The view above is the default one, with all items folded.
  • The view is interactive, you can fold/unfold items and zoom into threads if you wish.
Another example of "perf report" view with selected items unfolded:


As you can see, perf is able to report call traces (thanks to the -g switch we used) so you can understand not only which functions are taking the most time but also where they were called from.

The above view tells us that the server is busy doing two main operations:
  • Flushing buffer pool pages ("buf_flush_list", see the second unfolded item),
  • Inserting into clustered and secondary index trees ("row_ins_sec_index_entry_low", "row_ins_clust_index_entry_low" in multiple places). We already knew that inserts into index trees may be an issue and perf confirmed that for us.
If the above view is difficult to read, you can also instruct "perf report" to produce more user-friendly reverse call graphs (using the -G switch of "perf report"):


Additional notes on perf:
  • The tool can do real-time profiling on a live system if you invoke it as "perf top". The visual output format is pretty much the same, just that you don't need to record data up front.
  • If program sources are available, you can even use perf to annotate function calls with the relevant source code portions.

Callgrind

Callgrind is a tool from the Valgrind instrumentation framework and it's not the first Valgrind tool I'm describing. In a recent blog post I wrote about Massif (a heap profiler) so you may want to bookmark it for later: profiling MySQL memory usage with Valgrind / Massif.

Unlike pstack and perf, valgrind/callgrind is not a tool you could use for ad-hoc analysis on a production server. Valgrind is essentially a "processor simulator", which has two main implications:
  • The program you're diagnosing has to be explicitly launched within Valgrind,
  • Simulation comes with a severe performance penalty.
Still, there are many situations where you may want to profile a statement or perhaps a piece of your workload in a controlled way. So how do we use callgrind?

This is how I run MySQL normally:
bin/mysqld --defaults-file=my.cnf
And this is how I run it in callgrind:
valgrind --tool=callgrind --callgrind-out-file=/home/mysql/callgrind/profile-5.6.27 bin/mysqld --defaults-file=my.cnf
Once we have a process (or multiple processes) running in callgrind, the "callgrind_control" tool can be used to list or control these processes. For example:

To list all processes running under valgrind:
mysql@sandbox:~/callgrind$ callgrind_control
PID 3003: bin/mysqld --defaults-file=my.cnf

To zero performance counters for process 3003 (useful for selective profiling): 
mysql@sandbox:~/callgrind$ callgrind_control -z 3003
PID 3003: bin/mysqld --defaults-file=my.cnf
sending command  to pid 3003
  OK.

Once the profiled program finishes (or is interrupted), you can use "callgrind_annotate" to read the output file and produce a textual summary of function calls and their costs. There's also a KCacheGrind GUI program that can be used for a more user-friendly presentation of results. I'm going to use it here.

I executed the same LOAD DATA INFILE statement in MySQL running under callgrind. Once the statement finished, I stopped the server and opened the callgrind output file in KCacheGrind to generate a visual call graph. The tool has many other options; see its web page for details.

Here's the root of the graph:


Observations:
  • Nearly 70% of the execution cost is associated with our client connection running the LOAD DATA statement,
  • The rest of the CPU time was spent in two internal server threads: the main server thread (performed a log buffer sync 588 times) and the page cleaner thread (performed a flush operation 765 times). A separate page cleaner thread was introduced in MySQL 5.6.2, before that all page flushing would have been done by the main server thread.
Going deeper into the user thread, we run into the "mysql_load" function responsible for handling the LOAD DATA statement. We already saw that function with the previous tools, but here it is graphed nicely:


Observations:
  • The "write_record" function was called 500,000 times. Indeed, the input file had 0.5M records in it.
  • There were 2.5M invocations of the "read_field" function, which means that we read five fields per row of input. That's about right because the table I'm using has five columns.
  • Reading fields from the input file does take some time (~15%) but we need to keep digging into "write_record" to find out where most of MySQL execution effort is going (~52%).
Further down the call graph:


Here we go! Something we already predicted with pstack and profiled with perf!
  • We recorded roughly half a million "row_ins_clust_index_entry_low" calls responsible for inserting clustered (primary) index entries. We already know the input file had 0.5M records so this is not a surprise.
  • We recorded a whopping 2 million calls on "row_ins_sec_index_entry_low", the function responsible for inserting secondary index records. We can now calculate that the table has four secondary indexes: 2M calls for 0.5M table records = 4 calls per record.
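The arithmetic behind these observations is worth making explicit; plugging the call counts from the graphs into a couple of divisions:

```shell
rows=500000               # records in the input file ("write_record" calls)
read_field_calls=2500000  # "read_field" invocations
sec_index_calls=2000000   # "row_ins_sec_index_entry_low" invocations

echo "fields per row:    $(( read_field_calls / rows ))"   # 5, one per column
echo "secondary indexes: $(( sec_index_calls / rows ))"    # 4 secondary keys
```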

Summary

Using pidstat, pstack, perf and callgrind/KCacheGrind we did the following:
  • Identified the workload pattern (a single heavy statement),
  • Identified the type of statement (LOAD DATA INFILE),
  • Identified a likely bottleneck (secondary keys slowing down the import),
  • Confirmed that secondary key inserts do in fact account for the majority of the execution time,
  • Learned the number of rows in the input file and the number of secondary indexes on the table.
We did all that without even seeing the table or the table structure. In fact we didn't even log into MySQL. The tools we used are capable of doing much more cool stuff but I'm afraid this article is already getting too long :)

We're done here so let me show you the table:
mysql> show create table import.table1\G
*************************** 1. row ***************************
       Table: table1
Create Table: CREATE TABLE `table1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `s1` varchar(255) DEFAULT NULL,
  `s2` varchar(255) DEFAULT NULL,
  `s3` varchar(255) DEFAULT NULL,
  `s4` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `s1` (`s1`),
  KEY `s2` (`s2`),
  KEY `s3` (`s3`),
  KEY `s4` (`s4`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
And the statement:
load data infile '/home/mysql/tests/callgrind/table1_500K.dat' into table import.table1;
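In case you want to reproduce something similar at home, the input file is trivial to fake. This is not the file I actually used, just a hypothetical generator producing the same shape: 500K rows of five tab-separated fields, with \N letting AUTO_INCREMENT assign the id:

```shell
# Generate 500,000 rows of five tab-separated fields; \N in the first
# column makes LOAD DATA let AUTO_INCREMENT pick the id values.
awk 'BEGIN {
    OFS = "\t";
    for (i = 1; i <= 500000; i++)
        print "\\N", "s1-" i, "s2-" i, "s3-" i, "s4-" i;
}' > table1_500K.dat

wc -l table1_500K.dat
```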

See? This was just a very simple example but still, everything we found was true. I hope that this article, although very generic, was still interesting enough to trigger your curiosity. Now go have some fun in your own environment :)
