Measuring Unikernel Cold Boot Time With Perf

A while back ago CDNs would market on their TTFB - that is the 'time to first byte'. It sounds like a really important number and the lower you can make it the more impressive it looks. At the end of the day though it doesn't really matter cause a single image can be tens of kilobytes or even megabytes in size. A single page can have 80 some odd requests and transfer multiple megabytes in resources. So while it is interesting it's a bit of a vanity metric.

Boot time is one of those vanity metrics too. This came about because of all the various serverless vendors taking flack from their users having to "warm" their workloads to get around "large cold boot times". (Pro tip: skip serverless and go straight to unikernels.)

Cold boot time is a vanity metric simply because there are so few workloads that can actually make use of a lower number. It doesn't matter how fast your kernel can boot if your app takes longer. In fact the majority of apps will have "real cold boot times" measured in seconds (eg: setup a db connection, load some static assets, etc.) and that doesn't even account for the fact that most workloads want to run for more than a few seconds anyways (like hours or days), be it a persistent webserver or even a recurring one-shot cronjob.

I recently took a un-optimized Nanos workload and observed it booting in around 60ms. Does it really make a difference if I shaved another 20ms off or added 20ms on?

I know it won't make a lick of difference if I'm trying to boot elasticsearch, or any of the new fancy python ai frameworks because it simply won't be able to take advantage of it.

Remember that $96,000 vercel bill?

Some engineers might be dreaming of a function-as-a-service situation where low boot times would be useful but I think the larger dev community is starting to catch on to the fact that 'serverless' can turn out to be really expensive. Remember that $96,000 vercel bill? No thank you. Even if your workload can run in 50-100ms chunks - does it make sense to spawn for each chunk or just keep a persistent server around? Most cloud providers will give you a full always-on vm, running a unikernel for ~$4/month and will scale infinitely better than many of the serverless providers can. That's before you take into account their free credits which might mean you pay *nothing*.

Finally, when it comes to kernels, if you are trying to go for a super low number (such as <10ms) you are probably starting to carve out a ton of essential capabilities that normal workloads might have so it would be wise to figure out what that list actually is. We're talking how many levels of page tables, what drivers being used for what cloud, how much ram and disk you have allocated, kaslr, pci-passthrough, etc.

I Still Want to Measure the Cold Boot Time

Ok. Then let's detail how we can do that. A naive approach might start with something like

time ops run myprogram

However, there are multiple problems with this approach. First off you are measuring both the building and the running the of the image as 'ops run' combines the two operations together. We can safely assume the image is already built, and again, to reiterate - the larger that image the longer it will take to boot. Have a node.js application clocking in at 5 gig of disk with thousands of files? You aren't going to reap from a fast cold start.

What if we just did a:

time ops instance create

Again, this isn't really measuring what we want. Ops itself typically doesn't run prod workloads - the clouds do. Ops will fork off a qemu instantiation to do the actual running of things so that's what we want to be measuring.

First off you're going to need to obtain perf. No, not the one from apt-get - that is missing libtracevent, a vital component in what we want. Make your way to and download the source for whatever linux kernel you are running. Then walk into the tools/perf directory and run make there. You can verify your version is correct by looking at:

perf version --build-options

Next we are going to tag nanos with a few writes to the i/o port for perf to pick up to indicate our beginning and end time:

eyberg@venus:~/go/src/$ git diff
diff --git a/platform/pc/boot/stage1.s b/platform/pc/boot/stage1.s
index 0e36f15f..0069967e 100644
--- a/platform/pc/boot/stage1.s
+++ b/platform/pc/boot/stage1.s
@@ -8,6 +8,9 @@ init:
        bits 16
        org base

+   mov al, 10
+   out 0xf4, al
         ;; serial setup
        mov ax, 0x00e3          ; AH = 0, AL = 9600 baud, 8N1
        xor dx, dx
diff --git a/src/unix/thread.c b/src/unix/thread.c
index 1a36438e..d1ca9537 100644
--- a/src/unix/thread.c
+++ b/src/unix/thread.c
@@ -1,6 +1,7 @@
 #include <unix_internal.h>
 #include <ftrace.h>
 #include <gdb.h>
+#include <x86_64/io.h>

 BSS_RO_AFTER_INIT thread dummy_thread;

@@ -215,6 +216,14 @@ static void thread_cputime_update(thread t)

 static void thread_pause(context ctx)
+static boolean initialized = false;
+    if (!initialized) {
+        out8(0xf4, 200);
+        initialized = true;
+    }
     if (shutting_down & SHUTDOWN_ONGOING)
     thread t = (thread)ctx;

Note that there are many different optimizations that we could consider here such as booting under firecracker instead - of course if you do that you can't use GPUs without resorting to hacks. Remember when I mentioned earlier that to get to the really small boot times you have to start carving out a bunch of stuff you may want/need?

In this example here i'm writing to 0xf4 the values of 10 and 200 which translate to 0xa and 0xc8 respectively. I'm using this small program as a test. If you use an interpreted language that is linked to twenty different libraries your results are going to be pretty different.

#include <stdio.h>
#include <unistd.h>

int main() {




Next we want to run it with 'ops run' so it will create the image for us and we can do 'ps aux | grep qemu' to get the qemu command we want to plug into this next script:


perf kvm record -a -e kvm:kvm_entry -e kvm:kvm_pio -e sched:sched_process_exec \
            -o $PERF_DATA &

sleep 5

qemu-system-x86_64 -machine q35 -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x3 \
  -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x3.0x1 \
  -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x3.0x2 \
  -device virtio-scsi-pci,bus=pci.2,addr=0x0,id=scsi0 \
  -device scsi-hd,bus=scsi0.0,drive=hd0 -vga none -device isa-debug-exit -m 2G -device virtio-rng-pci \
  -machine accel=kvm:tcg -cpu host -cpu max \
  -drive file=/home/eyberg/.ops/images/main,format=raw,if=none,id=hd0 \
  -device virtio-net,bus=pci.3,addr=0x0,netdev=n0,mac=12:99:76:9d:0d:6a \
  -netdev user,id=n0 -display none -serial stdio

kill $PERF_PID

Then we'll convert our results to plain text:

perf script -i >

and now you can see your values:

eyberg@venus:~/qemu-boot-time$ grep 0xf4
 qemu-system-x86 3554013 [002] 16464943.610700:              kvm:kvm_pio: pio_write at 0x60 size 1 count 1 val 0xf4
 qemu-system-x86 3554013 [002] 16464943.623384:              kvm:kvm_pio: pio_write at 0xf4 size 1 count 1 val 0xa
 qemu-system-x86 3554013 [002] 16464943.692615:              kvm:kvm_pio: pio_write at 0xf4 size 1 count 1 val 0xc8

(16464943.692615 - 16464943.623384) * 1000 = ~69ms which seems about right to me (on this machine i'm on)

So now you know how to measure the boot time of the nanos unikernel. Who knows, maybe there are use-cases for making ultra-low boot times - if you have any ideas - let us know!

Deploy Your First Open Source Unikernel In Seconds

Get Started Now.