We recently added virtio balloon support to firecracker. We've had it forever with qemu, but one of our customers pointed out that we were missing support for it in firecracker. As part of this work we ended up implementing a new memory allocator that uses the buddy technique.
Why Use a Custom Allocator Anyways?
The short answer is that we're an operating system, so there isn't one unless we create one. You can't use malloc to implement malloc. We already had quite a few different heaps available on our system, but we found that the id heap implementation we were using wasn't aggressive enough with memory reclamation when inflating the balloon under firecracker.
While your application-level allocations can be tended to via garbage collection, reference counting, or RAII, when dealing with heaps at the operating system level you're going to be dealing with raw memory. Even though some people seem to think that you can completely eradicate this through "memory safe languages", at the end of the day operating on raw memory is how computers actually work.
The operations involved in obtaining blocks from and returning them to the free storage lists are very fast, making this scheme particularly appropriate for list structure operations and for other situations involving many sizes of blocks which are fixed in size and location. This is in fact the storage bookkeeping method used in the Bell Telephone Laboratories Low Level List Language (LLLLLL or L⁶, pronounced "Lsixth").
- Kenneth Knowlton, October, 1965 / Communications of the ACM
Questions That Come Up When Writing a Custom Allocator
An allocator has many different design properties that influence the type of allocator you might end up writing - and to be clear, we're not inventing a new one here, we're implementing an existing type. The questions include the size of the allocations, their lifetime, and how they will be used, and the answers can all heavily influence which type you pick.
For instance, you might be wondering: if you have a custom allocator, how do you know where the free chunks are in your heap? You have to track them somehow. Even if you know where the free chunks are, how do you pick amongst the many different candidates? Space efficiency and time complexity are key here. Similarly, when freeing memory, how do you know how much space to mark as "free" at a given time? If you haven't personally called free in a while, let me remind you what the declaration in section 3 of the man pages looks like:
void free(void *ptr);
You'll notice that no size gets passed in here, so you need to know, from the pointer alone, how much memory needs to be freed. One way is to include a header before the chunk of memory we are allocating that describes its size.
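To make that concrete, here's a minimal sketch of the header trick in Go. The names alloc and sizeOf are made up for this example, and a real malloc would do this with raw heap memory rather than Go slices - it's just meant to show how the size can be recovered from the pointer alone.

package main

import (
    "fmt"
    "unsafe"
)

// headerSize is the number of bytes reserved in front of every chunk to
// record how big the allocation is.
const headerSize = unsafe.Sizeof(uintptr(0))

// alloc grabs size+header bytes, writes the size into the header and returns
// a pointer just past it.
func alloc(size uintptr) unsafe.Pointer {
    buf := make([]byte, headerSize+size)
    hdr := (*uintptr)(unsafe.Pointer(&buf[0]))
    *hdr = size
    return unsafe.Pointer(&buf[headerSize])
}

// sizeOf walks back headerSize bytes to read the stored size, which is what a
// real free() would do before returning the chunk to its free storage.
func sizeOf(p unsafe.Pointer) uintptr {
    return *(*uintptr)(unsafe.Add(p, -int(headerSize)))
}

func main() {
    p := alloc(48)
    fmt.Println(sizeOf(p)) // 48
}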
Similarly, there are questions to consider when it comes to picking where an allocation goes. You could do "first fit", where you find the first hole that is large enough, or you could go for "best fit", which results in the smallest leftover hole after allocation.
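As a toy illustration of the difference (hypothetical helper names, not code from nanos), here's a first-fit and a best-fit scan over a list of free hole sizes:

package main

import "fmt"

// firstFit returns the index of the first hole that is big enough.
func firstFit(holes []int, size int) int {
    for i, h := range holes {
        if h >= size {
            return i
        }
    }
    return -1
}

// bestFit returns the index of the smallest hole that still fits the request.
func bestFit(holes []int, size int) int {
    best := -1
    for i, h := range holes {
        if h >= size && (best == -1 || h < holes[best]) {
            best = i
        }
    }
    return best
}

func main() {
    holes := []int{64, 16, 32, 128}
    fmt.Println(firstFit(holes, 24)) // 0: the 64-byte hole comes first
    fmt.Println(bestFit(holes, 24))  // 2: the 32-byte hole leaves the smallest remainder
}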
Security
Malloc implementations have to pay close attention to security as well. You wouldn't want to free space that is already free (a double free), as it could corrupt the underlying heap. It could, for example, coalesce two regions together, allowing an attacker to overwrite a neighboring region. You also wouldn't want an issue where you start writing below the heap into your bss.
Types of Memory Allocators
There are many, many different types of memory allocators that one could choose to implement and they all have varying tradeoffs - much like why we have so many different heaps in nanos today.
Two common properties worth comparing when choosing an allocator are fragmentation and the cost of allocation.
Fragmentation
When you think of fragmentation you might think of disk fragmentation; however, the same thing occurs with memory. You can categorize fragmentation as external and internal. External fragmentation describes holes left between separate allocations, while internal fragmentation is the gap between the amount of memory requested and the amount actually allocated. So if you have a fixed-size allocator that hands out 16k chunks but you only use 10k of one, you get internal fragmentation. Why allocate like that, you ask? Many allocators have minimum fixed-size allocations. On the flip side, with external fragmentation, you could have a heap with a handful of 16k chunks in use but 8k of unusable space sitting between chunk one and chunk two.
-----------------------------
| heap |
-----------------------------
-----------------------------
| used | free |
-----------------------------
-----------------------------
| used | free | used |
-----------------------------
-----------------------------
| used | used | free | used |
-----------------------------
You can see how external fragmentation happens in this diagram.
Efficiency
You can compare allocators on efficiency as well. Generally speaking we need/want something fast. An allocator that has to look for free chunks might, for example, be better served by a binary search than a linear search.
Allocator Types
Bump Allocator
The simplest, most straightforward implementation might be the bump allocator. It just advances a pointer whenever a new request for memory comes in. A major downside of this approach is that there is no freeing of memory though. Sometimes this is referred to as an arena or linear allocator.
A simple implementation of this might look like the following:
package main

import (
    "fmt"
    "unsafe"
)

// Bump is a simple bump (arena/linear) allocator: it hands out memory from
// buf by advancing an offset and never frees individual allocations.
type Bump struct {
    buf  []byte
    off  uintptr // current offset into buf
    size uintptr // total capacity of buf
}

// Alloc returns a pointer to size bytes from the arena, or nil when the
// request would run past the end of the buffer.
func (b *Bump) Alloc(size uintptr) unsafe.Pointer {
    if size == 0 || b.off+size > b.size {
        return nil
    }
    p := unsafe.Pointer(&b.buf[b.off])
    b.off += size
    return p
}

func main() {
    sz := 1024
    buf := make([]byte, sz)
    b := &Bump{
        buf:  buf,
        size: uintptr(sz),
    }

    alloc1 := b.Alloc(10)
    fmt.Printf("alloc1: %v\n", alloc1)

    alloc2 := b.Alloc(20)
    fmt.Printf("alloc2: %v\n", alloc2)
}
Freelist Allocator
Then we have the freelist allocator - it is backed by a linked list, but the linked list is part of the allocations themselves. Each node can hold information such as whether the chunk is free or allocated and its size.
As you might be able to guess from its name, it is simply a linked list of free memory - i.e., a "freelist". A simple struct might look like this:
type FreeList struct {
    sz  int
    nxt *FreeList
}
This is fast like a bump allocator but can actually do the all-important task of freeing memory, because despite how cheap RAM is today it is still finite.
One of the downsides of freelists, since the allocations can be of varying size, is that you can build up quite a bit of fragmentation over time.
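Here's a minimal sketch of the idea (illustrative only, not the nanos freelist): free ranges of a fixed arena are tracked as nodes in a linked list, and allocation does a first-fit scan. A real freelist embeds its nodes in the free memory itself; this version keeps them as ordinary Go structs to keep the example short.

package main

import "fmt"

// freeNode describes one free range of the arena: where it starts and how long it is.
type freeNode struct {
    off  int
    size int
    next *freeNode
}

type FreeListAlloc struct {
    head *freeNode
}

func NewFreeListAlloc(size int) *FreeListAlloc {
    return &FreeListAlloc{head: &freeNode{off: 0, size: size}}
}

// Alloc does a first-fit scan and carves the request out of the first range
// that is large enough, returning its offset or -1 if nothing fits.
func (f *FreeListAlloc) Alloc(size int) int {
    var prev *freeNode
    for n := f.head; n != nil; prev, n = n, n.next {
        if n.size < size {
            continue
        }
        off := n.off
        n.off += size
        n.size -= size
        if n.size == 0 { // unlink ranges that are now empty
            if prev == nil {
                f.head = n.next
            } else {
                prev.next = n.next
            }
        }
        return off
    }
    return -1
}

// Free pushes a range back onto the head of the list. There is no coalescing
// of adjacent ranges here, which is one reason freelists fragment over time.
func (f *FreeListAlloc) Free(off, size int) {
    f.head = &freeNode{off: off, size: size, next: f.head}
}

func main() {
    f := NewFreeListAlloc(256)
    a := f.Alloc(64)
    b := f.Alloc(32)
    fmt.Println(a, b) // 0 64
    f.Free(a, 64)
    fmt.Println(f.Alloc(16)) // 0: the freed range is reused
}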
Slab Allocator
Next up we have the slab allocator. This can be built on top of your freelist and is sometimes actually built on top of a buddy allocator (like in linux). A slab allocator usually manages sets of slabs for varying object sizes, but each individual slab hands out objects of a single fixed size, whereas in our buddy system we can hand out different sizes.
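A toy sketch of a single slab might look like this (hypothetical names, not the linux or nanos API): one backing block is divided into equally sized object slots, with a stack of free slot indices giving O(1) alloc and free. A full slab allocator keeps one such cache per object size.

package main

import "fmt"

type Slab struct {
    objSize int
    mem     []byte // backing block, e.g. handed out by a lower-level allocator
    free    []int  // indices of free object slots, used as a stack
}

func NewSlab(objSize, nobj int) *Slab {
    s := &Slab{objSize: objSize, mem: make([]byte, objSize*nobj)}
    for i := nobj - 1; i >= 0; i-- {
        s.free = append(s.free, i)
    }
    return s
}

// Alloc pops a free slot and returns that object's bytes, or nil if the slab is full.
func (s *Slab) Alloc() []byte {
    if len(s.free) == 0 {
        return nil
    }
    i := s.free[len(s.free)-1]
    s.free = s.free[:len(s.free)-1]
    return s.mem[i*s.objSize : (i+1)*s.objSize]
}

// Free pushes an object's slot index back onto the free stack. A real slab
// allocator would derive the index from the object's address instead.
func (s *Slab) Free(i int) {
    s.free = append(s.free, i)
}

func main() {
    s := NewSlab(64, 4) // four 64-byte objects
    obj := s.Alloc()
    fmt.Println(len(obj)) // 64
}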
Buddy Allocator
The allocator that we added here is the buddy allocator. Buddy allocators use a binary tree that splits memory into powers of two, along with a bitset to track state. Allocation recursively walks the tree to find space, and when freeing memory the allocator checks whether the block's buddy is also free so the two can be coalesced. This has the added advantage that you can split requests up into different sized chunks.
Our buddy tree follows a basic binary tree, dividing by two down to the granular size. We round up if we need to allocate something that doesn't fit into the current size. So the tree might look something like this:
-------
| 256 |
-------
------- -------
| 128 | | 128 |
------- -------
------ ------ ------ ------
| 64 | | 64 | | 64 | | 64 |
------ ------ ------ ------
------ ------ ------ ------ ------ ------ ------ ------
| 32 | | 32 | | 32 | | 32 | | 32 | | 32 | | 32 | | 32 |
------ ------ ------ ------ ------ ------ ------ ------
The buddy allocator has higher scalability than some of the other methods we've described since it doesn't rely on things like a linear search to select a free area. This also, by extension, means less CPU time.
Both the linux kernel and the well-regarded jemalloc use the buddy system as well.
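To show the mechanics, here's a minimal binary buddy sketch in Go (illustrative only, not our implementation): the arena is a power of two in size, order o manages blocks of minSize<<o bytes, and a block's buddy is found by XORing its offset with its block size, which is what makes coalescing on free so cheap. To stay short it keeps per-order sets of free offsets instead of the tree-plus-bitset described above, and all names are made up for the example.

package main

import "fmt"

const (
    minSize   = 32  // granularity: the smallest block handed out
    arenaSize = 256 // total managed space, a power of two
    maxOrder  = 3   // minSize << maxOrder == arenaSize
)

type Buddy struct {
    free [maxOrder + 1]map[int]bool // free block offsets, indexed by order
}

func NewBuddy() *Buddy {
    b := &Buddy{}
    for o := range b.free {
        b.free[o] = map[int]bool{}
    }
    b.free[maxOrder][0] = true // start with one block spanning the whole arena
    return b
}

// orderFor rounds a request up to the smallest order whose block size fits it.
func orderFor(size int) int {
    for o := 0; o <= maxOrder; o++ {
        if minSize<<o >= size {
            return o
        }
    }
    return -1
}

// Alloc returns the offset of a block big enough for size, or -1.
func (b *Buddy) Alloc(size int) int {
    want := orderFor(size)
    if want < 0 {
        return -1
    }
    // Find the smallest order at or above want that has a free block.
    o := want
    for o <= maxOrder && len(b.free[o]) == 0 {
        o++
    }
    if o > maxOrder {
        return -1
    }
    var off int
    for off = range b.free[o] {
        break
    }
    delete(b.free[o], off)
    // Split the block down to the requested order, freeing each upper half.
    for o > want {
        o--
        b.free[o][off+minSize<<o] = true
    }
    return off
}

// Free returns a block of the given size, coalescing with its buddy whenever
// the buddy is also free.
func (b *Buddy) Free(off, size int) {
    o := orderFor(size)
    for o < maxOrder {
        buddy := off ^ (minSize << o)
        if !b.free[o][buddy] {
            break
        }
        delete(b.free[o], buddy)
        if buddy < off {
            off = buddy
        }
        o++
    }
    b.free[o][off] = true
}

func main() {
    b := NewBuddy()
    a1 := b.Alloc(32) // carves a 32-byte block out of the 256-byte arena
    a2 := b.Alloc(60) // rounded up to 64
    fmt.Println(a1, a2)
    b.Free(a1, 32)
    b.Free(a2, 64)
    fmt.Println(b.Alloc(256)) // everything coalesced back: offset 0 again
}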
Other variations of buddy allocators can include:
- fibonacci buddy system
- binary buddy system
- weighted buddy system
- tertiary buddy system
Other Well-Known Allocators
Other well-known allocators include the following:
- Doug Lea's malloc (dlmalloc)
- jemalloc
- mimalloc
- snmalloc
- rpmalloc
- hoard
Our Implementation
Our implementation resulted in lower memory fragmentation, higher memory re-use (one of the original goals), and higher scalability. For our implementation we also used an MRU policy: when there are multiple minimum-sized memory areas available to satisfy an allocation request, the allocator can select the most recently used area.
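One simple way to get that kind of MRU behaviour - shown here as a hedged sketch of the general idea rather than the actual nanos code - is to keep the free minimum-sized areas on a LIFO stack, so an allocation pops the most recently freed (i.e. most recently used) area first:

package main

import "fmt"

type mruFree struct {
    areas []int // offsets of free minimum-sized areas, most recent on top
}

// Free pushes an area onto the stack, making it the most recently used.
func (m *mruFree) Free(off int) {
    m.areas = append(m.areas, off)
}

// Alloc pops the most recently used free area, if any.
func (m *mruFree) Alloc() (int, bool) {
    if len(m.areas) == 0 {
        return 0, false
    }
    off := m.areas[len(m.areas)-1]
    m.areas = m.areas[:len(m.areas)-1]
    return off, true
}

func main() {
    var m mruFree
    m.Free(0)
    m.Free(4096)
    fmt.Println(m.Alloc()) // 4096 true: the most recently freed area comes back first
}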
The feature request that prompted this new allocator work was actually much smaller and far less complicated to "get working", but it ended up exposing the need to refactor how we were allocating and deallocating memory within the balloon. The end result is that everything else became much faster and used less memory, so it was a win-win all around.
This is a perfect example of the type of deep work that we like to do with our customers. If you've been playing around with ops and were thinking "if only they added support for X", reach out - we might be able to make it happen faster than you think, and it too might lead to other very beneficial results.