Memory Ballooning, Unikernels, QEMU, and Firecracker Oh My!

You might've seen the new streaming memory metrics in ops desktop recently and wondered how they were made. They were done with a balloon driver.

We've had a balloon driver for quite some time but never really hooked it up to local ops runs, as most of our users deploy to the big clouds rather than running their own infrastructure, and it isn't really something you would use day to day in dev. As for the memory stats collection itself, many users and customers in the cloud default to something like CloudWatch metrics, Radar, or Datadog, so they never really needed it for metrics purposes. We ended up adding it for local instances because we wanted to show nice little graphs in ops desktop that let us track memory utilization by the second (or at even finer granularity if we really wanted to).

In ops we use QMP, so we can set up our stats polling using these commands:

// devid is the index of the anonymous balloon device in QEMU's QOM tree;
// the qom-set call tells it to refresh guest memory stats every 2 seconds.
commands := []string{
    `{ "execute": "qmp_capabilities" }`,
    `{ "execute": "qom-set", "arguments": { "path": "/machine/peripheral-anon/device[` + devid + `]", "property": "guest-stats-polling-interval", "value": 2}}`,
    `{ "execute": "qom-get", "arguments": { "path": "/machine/peripheral-anon/device[` + devid + `]", "property": "guest-stats" } }`,
}
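
For context, here's a minimal sketch of the plumbing that could push those commands over QEMU's QMP unix socket. This is not ops' actual implementation; the socket path and device index are assumptions for illustration (QMP sends a greeting banner that has to be read before issuing commands):

package main

import (
    "bufio"
    "fmt"
    "net"
)

func main() {
    // Assumption: QEMU was started with -qmp unix:/tmp/qmp.sock,server,nowait
    // and the balloon is the first anonymous device, so devid is "0".
    devid := "0"

    conn, err := net.Dial("unix", "/tmp/qmp.sock")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    reader := bufio.NewReader(conn)

    // QMP greets us with a banner before accepting any commands.
    greeting, err := reader.ReadString('\n')
    if err != nil {
        panic(err)
    }
    fmt.Print(greeting)

    commands := []string{
        `{ "execute": "qmp_capabilities" }`,
        `{ "execute": "qom-set", "arguments": { "path": "/machine/peripheral-anon/device[` + devid + `]", "property": "guest-stats-polling-interval", "value": 2}}`,
        `{ "execute": "qom-get", "arguments": { "path": "/machine/peripheral-anon/device[` + devid + `]", "property": "guest-stats" } }`,
    }

    for _, cmd := range commands {
        if _, err := conn.Write([]byte(cmd + "\n")); err != nil {
            panic(err)
        }
        reply, err := reader.ReadString('\n')
        if err != nil {
            panic(err)
        }
        fmt.Print(reply) // the qom-get reply carries the guest memory stats
    }
}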

Then later on, if we want to change the size of the balloon, we can issue something like this (the value is the target guest memory size in bytes; 536870912 is 512MB):

{ "execute": "balloon", "arguments": { "value": 536870912 } }

What is a balloon driver?

First off - what actually is a balloon driver? It essentially allows you to dynamically add memory to or remove memory from a guest unikernel without stopping, pausing, or migrating the guest. Rather than the guest being stuck with a static amount of RAM, you can still assign it, say, 2G when you boot the instance up, but if your app only needs 256MB the guest can tell the host the rest is free to use. When the balloon expands, the pages inside it are unmapped from the guest, reclaiming that memory for the host (e.g. the cloud's system) to use as it sees fit.

This also allows clouds to perform memory oversubscription, since you can keep the balloon inflated and only deflate it when the guest actually needs the memory.

Now, depending on the application, this can be easier said than done. For instance, languages with garbage collectors, or applications that evade the GC, might not see the same benefits as applications with tighter control over their memory, so your mileage may vary. But if you're a platform provider, whether public or just an internal engineering organization at scale, you might get a lot out of it.

To give you a concrete example: if you accept a file upload on an HTTP endpoint, write it to disk in one function, and clean every resource up correctly, you might see RAM usage climb for a few minutes and then drop once the GC has marked and swept the large upload allocations away. However, if you allocate something on the heap and it never goes out of scope, it might never get reclaimed.
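
To make that concrete, here's a hypothetical Go sketch (not from our codebase) contrasting the two behaviors: the first handler streams the upload straight to disk, so its temporary allocations become garbage as soon as the request finishes; the second buffers the body and keeps a reference in a package-level slice, so that memory stays live and can never be handed back to the host:

package main

import (
    "io"
    "net/http"
    "os"
)

// Anything appended here stays reachable forever, so the GC can never free it.
var retained [][]byte

// uploadToDisk streams the request body to a temp file; memory use stays flat
// and whatever was allocated along the way is reclaimable after the request.
func uploadToDisk(w http.ResponseWriter, r *http.Request) {
    f, err := os.CreateTemp("", "upload-*")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer f.Close()
    if _, err := io.Copy(f, r.Body); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.WriteHeader(http.StatusCreated)
}

// uploadAndRetain buffers the whole body in memory and holds on to it; this
// heap memory never goes out of scope and so never shrinks.
func uploadAndRetain(w http.ResponseWriter, r *http.Request) {
    buf, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    retained = append(retained, buf)
    w.WriteHeader(http.StatusCreated)
}

func main() {
    http.HandleFunc("/upload", uploadToDisk)
    http.HandleFunc("/retain", uploadAndRetain)
    http.ListenAndServe(":8080", nil)
}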

With the stats command in ops we can watch this growth:

➜  ~ ops instance stats
+-------+-----------------+---------------+
|  PID  |      NAME       |    MEMORY     |
+-------+-----------------+---------------+
| 52524 | memg-1733677598 | 46mb / 2143mb |
+-------+-----------------+---------------+
➜  ~ ops instance stats
+-------+-----------------+----------------+
|  PID  |      NAME       |     MEMORY     |
+-------+-----------------+----------------+
| 52524 | memg-1733677598 | 806mb / 2143mb |
+-------+-----------------+----------------+

Or, here is how ops desktop consumes it, through JSON:

➜  ~ ops instance stats --json | jq
[
  {
    "ID": "52524",
    "Name": "memg-1733677598",
    "Status": "Running",
    "Created": "13 seconds ago",
    "PrivateIps": [
      "192.168.68.130"
    ],
    "PublicIps": [],
    "Ports": [
      ""
    ],
    "Image": "/Users/eyberg/.ops/images/memg",
    "FreeMemory": 2097,
    "TotalMemory": 2143
  }
]
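
As an aside, parsing that output is straightforward. Here's a rough sketch (not ops desktop's actual code) that shells out to `ops instance stats --json` and computes used memory from the FreeMemory and TotalMemory fields, which are reported in MB:

package main

import (
    "encoding/json"
    "fmt"
    "os/exec"
)

// instanceStats mirrors the fields we care about from `ops instance stats --json`.
type instanceStats struct {
    ID          string
    Name        string
    FreeMemory  int // MB
    TotalMemory int // MB
}

func main() {
    out, err := exec.Command("ops", "instance", "stats", "--json").Output()
    if err != nil {
        panic(err)
    }

    var instances []instanceStats
    if err := json.Unmarshal(out, &instances); err != nil {
        panic(err)
    }

    for _, inst := range instances {
        used := inst.TotalMemory - inst.FreeMemory
        fmt.Printf("%s: %dmb / %dmb\n", inst.Name, used, inst.TotalMemory)
    }
}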

Memory reclamation typically isn't that useful for end users running on the public clouds, but if you're orchestrating your own cloud using something like Firecracker it can be very helpful.

You might find it weird that the host can't just use memory that isn't in use. After all, doesn't it know what is in use and what is not? Not really, and that is by design. Remember that the virtual machine is a representation of a machine: as far as the hypervisor is concerned the entire thing is just one black box consuming 2G of RAM. That is why you typically have to run a "guest agent" on the clouds to get memory stats and the like. There are non-technical reasons as well. You don't really want Amazon or Google just poking at your sensitive database workloads, do you? Seeing all those passwords in clear text? In fact this is such a large concern that we now have even newer technologies such as SEV-SNP (Secure Nested Paging) that go above and beyond what normal hypervisors do today to thwart a malicious hypervisor snooping in on your workloads. You might trust Google, but how about the people who broke into Google?

SSL Added and Removed Here! :)

How Does Firecracker Ballooning Differ from QEMU's?

Like many things in our world, different deployment targets typically mean handling things differently. You might've seen this when deploying to ARM-based instances and discovering that we use UEFI there, or where we have to use different networking drivers such as gVNIC to talk to certain Google Cloud instances.

Firecracker differs from QEMU in that it explicitly doesn't support PCI. This is done to get a faster boot time, but it means a lot of things have to work differently. For ballooning it means we need to use the MMIO (memory-mapped I/O) transport.

A lot of people might know that, as of today, GPU workloads aren't run on Firecracker because they require PCI support, which is part of why projects like Cloud Hypervisor exist. Ballooning is another feature that has to be handled differently because of that.

GPU support and memory ballooning aren't the only things the lack of PCI affects on Firecracker. You also can't do device hotplugging; IRQs are passed to the kernel at boot time. For things like GPU PCI passthrough there's also an implication that you'd need to pin memory, which breaks the ability to do memory oversubscription. While there are tracking issues in Firecracker discussing optionally enabling this, you can understand the reluctance here.

To use the balloon on Firecracker you must configure it like so:

 "balloon": {
    "amount_mib": 0,
    "deflate_on_oom": false,
    "stats_polling_interval_s": 1
  },

From there you can get stats:

#!/bin/sh

curl --unix-socket /tmp/firecracker.socket -i \
    -X GET 'http://localhost/balloon' \
    -H 'Accept: application/json'

You can also inflate or deflate the balloon by sending a PATCH request. Note that, unlike QEMU's balloon command above which takes the target guest memory in bytes, amount_mib here is the target size of the balloon itself in MiB, i.e. how much memory to take away from the guest:

#!/bin/sh

curl --unix-socket /tmp/firecracker.socket -i \
    -X PATCH 'http://localhost/balloon' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{
        \"amount_mib\": "512"
    }"

PCI vs MMIO

It should be noted that PCI is not the end-all-be-all here; it is just one option. What is quite interesting is that many embedded systems in the past, which obviously includes ARM devices, didn't support PCI either because of power, size, and other constraints. Thus the concept of a device tree can be found on systems like these.

Firecracker has a handful of drivers through virtio: virtio-blk, virtio-net, virtio-vsock, and virtio-balloon. Virtio itself actually has three transport layers you can use:

  • PCI bus
  • MMIO (Firecracker uses this)
  • channel I/O

MMIO differs from PCI in a variety of ways. One is that PCI does device discovery, while with MMIO the guest OS needs to be told where a device's registers and interrupts live (for instance via kernel command line parameters or a device tree).

MMIO reads/writes trigger a VM exit, which is one contributing factor to why we say Firecracker trades a faster boot time for a slower runtime. Virtio over PCI is generally much faster, on block for reads/writes and on net for tx/rx. The lines of code involved in virtio-pci vs virtio-mmio are also substantially different. Having said that, there is interest in improving the MMIO transport, and there is a whole slew of options for talking to GPUs in a different manner, so only time will tell which direction this ends up going.

For now - armed with this new knowledge of how to operate memory balloons - you can go build that internal PaaS using Firecracker and unikernels that you've been dreaming of. Perhaps in the next post we'll go into more detail on some of the optimizations we did - stay tuned.
