Finding memory management errors in the Nanos Unikernel

If you are a programmer, you probably have had the experience of trying to debug a crash or data corruption that seems like it should be impossible. Your unit tests pass and you've inspected the code and you can't find any errors. You use the debugger to see that the data is loaded correctly, but your watchpoint isn't breaking on the corruption, or perhaps the memory is modified so often that it's not practical to use watchpoints. You might even find where the memory is getting corrupted but it's not actually the root cause of the problem. Worst of all, if the problem is hard to reproduce then your bug report might be open a while, frustrating you and your users.

Bugs with these symptoms often fall into the class of memory management problems. There are several different root causes that have very similar symptoms but they all are caused by an error in the way memory is used.

  • Buffer overrun: something writes data outside of the bounds of a buffer, possibly into an adjacent one. This is usually caused by incorrect or missing bounds checking by whatever is writing into the buffer.
  • Use after free: The buffer memory has been freed but a reference to the buffer still exists somewhere. Accessing this stale reference might result in a crash or it might overwrite another buffer that reused the free memory.
  • Use of uninitialized memory: memory is read before it is initialized to a default value. This is often an easy one to find, but it can hide if the memory allocation is usually, but not always, zeroed.

While there are programming languages or tools that either solve or identify these problems, Nanos is a kernel written in C with its own memory allocators and cannot use many of these tools or libraries which are usually made for userspace programs. In fact, since the operating system interacts with hardware (virtual or not), some of these errors can still happen regardless of language and tooling if, for example, the hardware is programmed incorrectly! The operating system also has to deal with various bootloaders and firmware that could initialize hardware differently or even incorrectly, further complicating when and where these types of bugs manifest. So how does Nanos deal with these types of problems? First, it works to prevent some of the problems by using useful abstractions, such as a buffer data structure that gets used where many C programs use plain arrays. This buffer abstraction is a resizeable array with a length that has operations for access or modification. Other data structures are built on top of the buffer so they all benefit from the buffer operations for always getting the correct buffer pointer and buffer size.

However, even with good data structures and programming practices, errors can still occur. Recently, we added a troubleshooting tool to Nanos to help us identify these problems in the form of a debug heap wrapper. What exactly does that mean? Well, one of Nanos' great features is the way memory heaps are handled. Generally speaking, a heap is a large block of memory from which dynamic memory allocations are made, memory that is shared by all the threads in a program. In a typical C program, the "malloc" and "free" function calls get the memory from, or return it to, a single heap, provided by the operating system. Since Nanos is the operating system, it has to handle how memory is allocated on its own. While many operating systems only use a single heap for any kernel allocations, Nanos has a simple heap interface that lets it use any number of unique heap implementations.

Let's look at the source code to see how this works.

struct heap {
        struct table metadata;
        u64 (*alloc)(struct heap *h, bytes b);
        void (*dealloc)(struct heap *h, u64 a, bytes b);
        void (*destroy)(struct heap *h);
        bytes (*allocated)(struct heap *h);
        bytes (*total)(struct heap *h);
        bytes pagesize;

As you can see, the heap struct consists mainly of function pointers. The alloc and dealloc pointers are the most important and are equivalent to malloc and free in the standard library. Note that the dealloc call takes the length of the allocation as an argument, so a heap is not required to keep per-allocation metadata. The destroy pointer tears down the heap, and the allocated and total function pointers are used for accounting.

Each kernel subsystem is initialized with a pointer to which heap to use, which means that different parts of the kernel can allocate memory differently, depending on the needs of the subsystem. Additionally, because these heaps use a generic interface and are passed as parameters, we can make a wrapper that does some extra work for allocation or deallocation, yet still uses the parent heap (whatever it may be) for the actual allocations. This is exactly what our new debug heap wrapper does: it still gets the memory from the heap that it wraps, but it writes some extra information and patterns to that memory when it is allocated, and it checks this information when the memory is deallocated.

Next, let's look at a little snippet from pagecache initialization to see how we can use the heap interface.

void init_pagecache(heap general, heap contiguous, heap physical, u64 pagesize)
    pagecache pc = allocate(general, sizeof(struct pagecache));

    pc->total_pages = 0;
    pc->page_order = find_order(pagesize);
    pc->h = general;
    pc->contiguous = contiguous;
    pc->physical = physical;
    pc->zero_page = allocate_zero(contiguous, pagesize);

The init_pagecache function is called during Nanos initialization, and it takes three different kinds of heaps. The general heap is for small, utility allocations such as that allocation of the pagecache structure. The contiguous heap is for generating virtual addresses and the physical heap is for physical addresses. Each of these heap pointers is saved in the pagecache structure for use by other pagecache functions.

Now, consider what a heap wrapper would look like—it should return a heap pointer and it should take as an argument a parent heap pointer that it will wrap around. That's pretty much what our debug heap wrapper prototype looks like, with the addition of a metadata heap pointer for its own use and a padding size for alignment requirements:

heap mem_debug(heap m, heap p, u64 padsize);

So, if we wanted to debug only the miscellaneous allocations in the pagecache, we only need to change one line of code.

        pc->total_pages = 0;
        pc->page_order = find_order(pagesize);
-       pc->h = general;
+       pc->h = mem_debug(general, general, 0);
        pc->contiguous = contiguous;
        pc->physical = physical;
We pass a 0 for padsize because we don't care about alignment in this case. Note that memory still comes from the same heap as before, but the mem_debug wrapper is performing additional checks and operations on that memory because its alloc and dealloc functions are called now.

So what does the debug heap wrapper actually do? For the alloc function, the debug heap wrapper asks the parent heap for the requested size plus some extra memory to go on either side of the requested allocation. This extra space is used to store a header and create a red zone that can be used to detect overruns.

The header stores, among other things, the requested length of the original request and the return address of the function that called for the allocation. The red zone space is filled with a recognizable pattern (the number 0xcafefade) so that it can be easily identified when looking at memory in a debugger. Finally, a different pattern (0xfeedbeef) is written over the requested memory area so that if some code tries to use this memory without initializing it first, then it should error or crash upon use. The uninitialized pattern also shows up in memory while debugging which can help identify the problem.

When the user of the memory finishes and deallocates the memory, the wrapper performs the additional checks before returning the memory to the heap. The header is checked to make sure it still looks correct. In Nanos, deallocation calls require the length of the original allocation request and that argument gets compared to the length saved in the header to make sure they are identical. The red zone areas are then checked to make sure the pattern in them has not been overwritten. If any of these checks fail, the wrapper forces Nanos to crash with a message identifying which check failed. Finally, the last step on deallocation is to write a special pattern (0xdeaddead) over the entire area. If there is any code that still has a reference to this memory, it should error or crash with this identifiable pattern as the culprit, identifying it as a possible use-after-free condition.

Let's look at how this works in practice. Here's an example of a real crash that happened when using the debug heap wrapper to check the miscellaneous heaps in the pagecache. See if you can spot the problem.

cpu: 0
interrupt: 000000000000000d (General protection fault)
    frame: ffff800000160000
error code: 0000000000000000

    rax: ac1cf6b1de570170
    rbx: ffff800001002f40
    rcx: ffff8000021fffe0
    rdx: 0000000000000000
    rsi: 0000000000000000
    rdi: deaddeaddeaddead
    rbp: ffff80000015fdb8
    rsp: ffff80000015fc68
    r8: ffffffff800d7c70        (bootstrap_region + 0000000000001bf0/0000000000200000)
    r9: 0000000000000040
    r10: ffffffff800d7c70        (bootstrap_region + 0000000000001bf0/0000000000200000)
    r11: 0000000000000000
    r12: ffff800001002640
    r13: ffff800000a29e40
    r14: 0000000000000001
    r15: ffff800000a00000
    rip: ffffffff8001c390        (pagecache_write_sg_finish + 00000000000005a0/0000000000000662)
rflags: 0000000000010046
    ss: 0000000000000010
    cs: 0000000000000008
    ds: 0000000000000000
    es: 0000000000000000
fsbase: 0000000000000000
gsbase: 0000000000000000

frame trace:
ffff80000015fdc0:   ffffffff8002d8dc    (merge_join + 000000000000002c/0000000000000077)
ffff80000015fde0:   ffffffff8001ba73    (pagecache_write_sg + 0000000000000493/00000000000005ca)
ffff80000015ff20:   ffffffff800448fc    (fallocate + 000000000000021c/000000000000028f)
ffff80000015ff80:   ffffffff80062dca    (syscall_debug + 000000000000026a/00000000000005f8)
ffff80000015ffd0:   ffffffff800633a0    (configure_syscalls + 0000000000000000/00000000000002e1)

You probably noticed that the rdi register contains "deaddeaddeaddead," and that means that this instruction, at the address in the rip register, was trying to access memory that had been freed. In this particular case, the code was trying to jump to a function pointer stored in structure which had just been freed right before the offending instruction. We didn't notice the problem before because no error was occurring—yet. We just saved a lot of future headache with very little effort!

As you probably guess, performing these checks for every allocation and deallocation can be costly in terms of performance. In other operating systems or programs that use a single heap, a feature like this can bring performance to a crawl. However, the design of Nanos heaps gives us another advantage: since each subsystem is initialized with its own heap, we can just add the wrapper to only the subsystems we care about debugging and it will not affect the performance of the rest of the operating system.

The design of Nanos makes finding memory management errors much easier than many other systems. We can create a heap wrapper that only requires implementing two functions, allocate and deallocate, that provides checks for multiple kinds of programming errors related to memory management: buffer overruns, use of uninitialized memory, and use of memory after deallocation. We can also target which subsystems we want to debug to avoid causing slowdowns in unrelated systems and keep overall performance high. All of this is accomplished in less than 200 lines of code! This is another great example of how the design principles of Nanos enable a small team to tackle a complicated software project like an operating system with speed and quality.

Deploy Your First Open Source Unikernel In Seconds

Get Started Now.