Tales of a Stressed Kernel
July 27, 2013
Spending most of our time (as developers) in the high-level world, it's easy to occasionally forget the true nature of our systems and how fragile they really are. Well, perhaps fragile is not the right word here, more like intricate or perhaps chaotic - you cannot fully predict the effect of one operation on the rest of the system. In this case, a simple file copy got our machines to freeze for many long seconds... Huh?
Our machines are given loads of RAM but no swap space, to ensure deterministic memory access times. The memory is pre-allocated to different processes based on a configuration file, with a small portion reserved for the kernel. Memory is allocated from /dev/shm (mounted as tmpfs), where it's also exposed as files.
When a process crashes we may need some of those /dev/shm files for debugging, so we have a tool that runs whenever there's a crash to collect system info. Among other things, it used to copy some of these shared-memory files to disk (using a simple shutil.copy()). No surprise there. But every once in a while, and strongly correlated with times when this collector was running, some time-critical processes timed out when writing to the disks for no apparent reason, leading to catastrophic results. We spent about a month trying to pinpoint what was taking so many system resources. Many things came to mind: CPU consumption? /proc files that kept the kernel busy? Running SCSI or other hardware commands? Jamming the IO bus with data from the copy?
The mystery was solved just last Thursday, when I realized copying files from /dev/shm caused all sorts of system-wide hiccups -- not only timeouts when writing to other disks. It seemed that existing processes did get runtime, but no new processes could be forked. Sometimes running ls took ~20 seconds. Other times simple (non-file-system) tools like date hung for a while. Once it was apparent the problem was system-wide, the explanation was quite obvious: the kernel ran out of memory. Since there's no swap space, there was nothing it could do, and kernel threads just blocked until there was enough room for their allocations.
But why would the kernel get so low on memory? After all, we're copying files from memory (tmpfs) to the disk. Well, that seems like a bug:
I think I've finally figured this out. It's a kernel bug -- I'm guessing that under normal circumstances, the "cached" column in the free command "doesn't count" towards how much memory the system thinks it's using. After all, it's just cached copies of stuff that should be elsewhere, and if you run out of memory, you can safely dump that, right? Unfortunately, /dev/shm is counted under cached rather than used memory (as I discovered in an earlier post).
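You can see this accounting yourself in /proc/meminfo: tmpfs pages are counted in Cached (and also broken out separately in the Shmem field on kernels that report it), even though, unlike ordinary page-cache pages, they can't simply be dropped under memory pressure. A minimal sketch of reading that out -- the parser is mine and the sample values below are made up for illustration; on a real box you'd feed it `open("/proc/meminfo").read()`:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:   1234 kB' lines into {key: kB}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields

# Made-up sample values; use open("/proc/meminfo").read() on a live system.
sample = """\
MemTotal:       16384000 kB
MemFree:         1024000 kB
Cached:          4096000 kB
Shmem:           3900000 kB
"""
info = parse_meminfo(sample)
# Shmem (tmpfs) pages are included in Cached, yet are NOT reclaimable,
# so "free-looking" memory can be anything but:
truly_reclaimable = info["Cached"] - info["Shmem"]
```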
Simplistically, a file copy is a simple loop that reads a chunk of data from the source file and writes it to the destination file, until everything has been transferred. When we read from a file, the kernel needs to allocate a kernel-space buffer and copy it to userland. And when we write it back to the destination file, the kernel first copies the userland buffer into kernel-space and links it to the device's queue (to be flushed at the driver's discretion). The write() call returns as soon as the kernel places the buffer into the queue, so data might "pile up" there for some time before actually being written out, depleting kernel memory.
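In Python terms, a minimal version of what shutil.copy() boils down to (this is my own illustrative sketch, not its actual source) looks like this:

```python
CHUNK = 1 << 20  # copy 1 MiB per iteration

def naive_copy(src_path, dst_path):
    """A plain read/write copy loop.

    Each read() copies a chunk from a kernel buffer into userland, and
    each write() copies it right back into kernel memory as dirty pages.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)   # kernel buffer -> userland copy
            if not chunk:
                break
            # userland -> kernel buffer; write() returns once the data is
            # queued, long before it actually reaches the disk
            dst.write(chunk)
```

Every chunk thus transiently exists in three places -- the tmpfs source, the userland buffer, and the destination's dirty page cache -- and only the last of these drains at disk speed.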
Ugggh. A simple copy brought our system to a halt. The solution was just as simple -- we don't want (nor do we need) kernel buffers here. The source file already resides in memory. Instead of read()ing it, we can just mmap() the whole of it. And as for the destination file, we open it with O_DIRECT, so as not to use kernel buffers along the way. I christened this new tool
As Linus Torvalds famously put it: "The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."
Well, Linus, at least you were kind enough to let it stay :)