Tales of a Stressed Kernel
July 27, 2013

Spending most of our time (as developers) in the high-level world, it’s easy to occasionally forget the true nature of our systems and how fragile they really are. Well, perhaps fragile is not the right word here, more like intricate or perhaps chaotic - you cannot fully predict the effect of one operation on the rest of the system. In this case, a simple file copy got our machines to freeze for many long seconds… Huh?

Our machines are given loads of RAM but no swap space, to ensure deterministic memory access times. The memory is pre-allocated to different processes based on a configuration file, with a small portion reserved for the kernel. Memory is allocated from /dev/shm (mounted as tmpfs), where it’s also exposed as files.

When a process crashes we may need some of those /dev/shm files for debugging, so we have a tool that runs whenever there’s a crash to collect system info. Among other things, it used to copy some of these shared-memory files to disk (using a simple cp or shutil.copy()). No surprise there. But every once in a while, and strongly-correlated to times when this collector was running, some time-critical processes timed-out when writing to the disks for no apparent reason, leading to catastrophic results. We’ve spent about a month trying to pin-point what’s taking so many system resources. Many things came to mind: CPU consumption? Accessing special /proc files that got the kernel busy? Running SCSI or other hardware commands? Jamming the IO bus with data from the copy?

The mystery was solved just last Thursday, when I realized copying files from /dev/shm caused all sorts of system-wide hiccups – not only timeouts when writing to other disks. It seemed that existing processes did get runtime, but no new processes could be forked. Sometimes running ls took ~20 seconds. Other times simple (non-file system) tools like date hanged for a while. When it was apparent it’s system-wide, the explanation was quite obvious: the kernel ran out of memory. Since there’s no swap space, there’s nothing it could do and kernel threads just blocked until there was enough room for their allocations.

But why would the kernel get so low on memory? After all, we’re copying files from memory (tmpfs) to the disk. Well, that’s seems like a bug:

I think I’ve finally figured this out. It’s a kernel bug – I’m guessing that under normal circumstances, the “cached” column in the free command “doesn’t count” towards how much memory the system thinks it’s using. After all, it’s just cached copies of stuff that should be elsewhere, and if you run out of memory, you can safely dump that, right? Unfortunately, /dev/shm is counted under cached rather than used memory (as I discovered in an earlier post).


Simplistically, file copy is a simple loop that reads a chunk of data from the source file and writes it to the destination file, until it transfers everything. When we read from a file, the kernel needs to allocate a kernel-space buffer and copy it to userland. And when we write it back to the destination file, the kernel first copies the userland buffer into kernel-space and links is to the device’s queue (to be evicted at the driver’s discretion). The write() call returns as soon as the kernel places the buffer into the queue, so it might “pile up” there for some time before actually being evicted, depleting kernel memory.

Ugggh. A simple copy brought our system to a halt. The solution was just as simple – we don’t want (and neither do we need) to use kernel buffers here. The source file already resides in memory. Instead of read()ing it, we can just mmap() the whole of it. And as for the destination file, we open it with O_DIRECT, so as not to use kernel buffers along the way. I christened this new tool mmapcp.

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances.

– Linus

Well, Linus, at least you were kind enough to let it stay :)