Tuesday, September 3, 2013

Tracking Down High IO Wait in Linux


Tracking down performance issues in large distributed systems is inherently complicated. Why is the application slow? Where is the bottleneck? In my experience, one of the more insidious culprits is known as high IO wait. A place where, in the words of Dr. Seuss, everyone is just waiting.

The first indication of a high IO wait issue is usually the system load average. On Linux, the load average reflects the number of processes using or waiting to use the CPU and, importantly, processes that are in uninterruptible sleep. As a rough rule of thumb, one CPU core at full utilization contributes one to the load average. So, for a quad-core machine, a load average of four would mean that the machine has just enough resources to handle the work it needs to do, but only barely. On the same quad-core system, a load average of eight would mean that the server could have kept up if it had eight cores instead of four, but as it stands it is overloaded. Maybe.
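To check the load averages themselves, uptime (or cat /proc/loadavg) is enough. The numbers below are invented purely for illustration; on a quad-core machine they would describe roughly twice as much demand as the box can service:
 [username@server~]$ uptime   # illustrative only, the numbers below are made up
  14:32:01 up 41 days,  3:12,  2 users,  load average: 8.04, 7.96, 7.88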

If the system is showing a high load average, but the CPU system and user utilization is low, it is time to start looking at IO wait. IO wait shows up in the system load on Linux because one or more processes is blocked waiting for disk or network input or output to finish before it can continue. These processes can be found using ps aux and looking for processes with a state of “D” (uninterruptible sleep).
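As a quick sketch, assuming the default ps aux column layout where STAT is the eighth field, something like the following will list only the processes currently in uninterruptible sleep (running it a few times in a row shows which ones stay stuck):
 [username@server~]$ ps aux | awk 'NR==1 || $8 ~ /^D/'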

Finding processes waiting for IO is one thing; verifying that they are the cause of the high IO wait is a separate task. For this, start out with iostat -x 1, which will show your physical storage devices and how heavily each one is being used.
 [username@server~]$ iostat -x 1
 
 Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
 cciss/c0d0        0.08     5.94  1.28  2.75    17.34    69.52    21.60     0.11   26.82   4.12   1.66
 cciss/c0d0p1      0.00     0.00  0.00  0.00     0.00     0.00     5.30     0.00    8.76   5.98   0.00
 cciss/c0d0p2      0.00     0.00  0.00  0.00     0.00     0.00    58.45     0.00    7.79   3.21   0.00
 cciss/c0d0p3      0.08     5.94  1.28  2.75    17.34    69.52    21.60     0.11   26.82   4.12   1.66
From the output above, it is clear that the device /dev/cciss/c0d0p3 is spending far too long waiting. However, we do not have that particular device mounted; it is actually an underlying device for LVM. If you are using LVM for your disks, and you should be, the output of iostat might be a bit confusing. LVM uses the device mapper subsystem to map the mounted filesystems back to the physical devices, so iostat might show several device paths like /dev/dm-0 and /dev/dm-1. Since the output of df -h does not show the device mapper paths, and instead prints the LVM path, the easiest way to get the LVM devices is to add the -N flag to iostat.
 [username@server~]$ iostat -xN 1
 
 Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
 
 vg1-root          0.00     0.00  0.09  3.01     0.85    24.08     8.05     0.08   24.69   1.79   0.55
 vg1-home          0.00     0.00  0.05  1.46     0.97    11.69     8.36     0.03   19.89   3.76   0.57
 vg1-opt           0.00     0.00  0.03  1.56     0.46    12.48     8.12     0.05   29.89   3.53   0.56
 vg1-tmp           0.00     0.00  0.00  0.06     0.00     0.45     8.00     0.00   24.85   4.90   0.03
 vg1-usr           0.00     0.00  0.63  1.41     5.85    11.28     8.38     0.07   32.48   3.11   0.63
 vg1-var           0.00     0.00  0.55  1.19     9.21     9.54    10.74     0.04   24.10   4.24   0.74
 vg1-swaplv        0.00     0.00  0.00  0.00     0.00     0.00     8.00     0.00    3.98   1.88   0.00
Parts of the output from the iostat commands above were clipped for brevity. Each of the filesystems listed shows an unacceptable level of IO wait, seen in the column labeled “await”. The /usr and /opt filesystems show slightly higher wait times than the others, so let's start with /opt. Running the command fuser -vm /opt will show us a list of processes accessing the filesystem, along with each process owner.
 root@server:/root > fuser -vm /opt
 
                      USER        PID ACCESS COMMAND
 /opt:                db2fenc1   1067 ....m db2fmp
                      db2fenc1   1071 ....m db2fmp
                      db2fenc1   2560 ....m db2fmp
                      db2fenc1   5221 ....m db2fmp
I clipped the output again for brevity, but on this server there are 112 DB2 processes currently accessing /opt. Looks like we found our culprit. On this server, the database is configured to use the much faster SAN storage, and only the operating system is supposed to be using the local disk. Seems like a call to the DBA is in order to see what can be done about this setup.
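If you want a quick count like the 112 above, the verbose fuser output can be piped through grep. This is only a sketch: it assumes the DB2 fenced processes all show up as db2fmp, as in the clipped output above, and it merges stderr into stdout because fuser prints its verbose table there:
 root@server:/root > fuser -vm /opt 2>&1 | grep -c db2fmp
 112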
One last note about LVM and the device mapper. The output of iostat -xN will show your logical volumes, but it is worth knowing that the device mapper mappings can also be seen by running ls -lrt /dev/mapper. The sixth column of the output, the minor device number, corresponds to the dm- number shown in iostat and other device commands.
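For example, on a system laid out like the one above, the listing might look something like this (the output is invented for illustration); the minor number in the sixth column is what ties vg1-root to dm-0, vg1-home to dm-1, and so on:
 root@server:/root > ls -lrt /dev/mapper
 brw-rw---- 1 root disk 253, 0 Sep  3 09:12 vg1-root
 brw-rw---- 1 root disk 253, 1 Sep  3 09:12 vg1-home
 brw-rw---- 1 root disk 253, 2 Sep  3 09:12 vg1-opt
 ...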
Sometimes there is nothing that can be done at the operating system or application layer, and the only remaining option is to look into getting faster disk. Luckily, we are at a stage where fast disk access via SANs or SSDs keeps coming down in price. A final caveat: this approach works sometimes, for some problems. As always, a deep understanding of your environment, including the physical, virtual, and application layer architecture, is the best tool for diagnosing problems. If there are other methods or ways I could improve this workflow, I would love to hear about them in the comments.
