Nanos on the 64-bit RISC-V Architecture


A New and Open Computer Architecture

The vast majority of microprocessors in the world today are either Intel processors descended from the 8086 of 1978 or ARM processors, which have been around since the late eighties. Both types of processors have applications where they are well suited, and both have evolved dramatically over the years. However, they also have downsides. Intel processors, while extremely powerful, are complex, expensive, and power-hungry. ARM processors, on the other hand, are very efficient due to their simpler design and history. This makes them perfect for small, low-power computing applications, yet licensing fees must be paid for each IP core, which makes them expensive to scale to large numbers. Both architectures also bear the engineering weight of their past, making it more difficult to adapt to the changing computing landscape as more computing moves to the cloud.

In 2010, a group at UC Berkeley created and freely published the design for a new processor architecture, dubbed RISC-V, denoting the fifth generation of academic cooperative processor design. While the design itself was not novel, it had the benefit of over 40 years of RISC history to learn from and, most importantly, it was open and unencumbered by patents and royalties. Over the next several years, the project quickly gained attention as the specification matured and expanded beyond academic purposes as major industry players took notice. The open design and a desire for this architecture to fill a wide variety of applications led to the project taking on a modular approach. The ISA is separated into pieces: a base set of instructions that all processors must support, plus optional extensions that add new instructions and functionality. A small microcontroller design might just use the base instruction set, while an application-specific processor design could include a proprietary extension to handle a special task. To meet some common architectural needs, the RISC-V committee worked with the industry to develop a number of approved standard extensions and continues to work on more, including the recently approved Hypervisor extension. Since the open architecture is just a specification, anyone is free to design a new extension or create an implementation of a RISC-V platform that suits their needs. Users will eventually be able to switch between different RISC-V hardware vendors without having to change their software toolchain, enabling rapid exploration of which RISC-V platform works best for them. An open architecture enables choice, competition, and innovation that has been missing from the existing CPU market.


RISC-V and Nanos

Now that we know a little about RISC-V as an architecture, you might be wondering what it has to do with a unikernel designed for cloud environments. There is very little actual RISC-V hardware on the market at the moment, as the architecture is still very new and rapidly evolving, and it takes time to create a new platform (remember, RISC-V is a specification, not an implementation). However, the RISC-V committee recently ratified a new Hypervisor specification, creating a processor mode designed to efficiently handle multiple RISC-V virtual machines. It is too new to be in any hardware or even emulators yet, but it offers the future possibility of hardware explicitly designed for the cloud and operating systems like Nanos.

If there is no hardware to use, how can we explore and learn about it? Fortunately, Linux and QEMU have had RISC-V support for several years and there is a lot of documentation on how to make that happen. If you are interested in running Linux on QEMU for program development or experimenting, you can check out the Debian port for RISC-V. However, we do not actually need a RISC-V Linux VM to try out Nanos on this new architecture; we can do it with QEMU and a cross-compiler.

Before we get too far into what packages are required, we should be more specific about what platform we are targeting for Nanos. Nanos is a 64-bit OS, so we need the 64-bit RISC-V architecture, known as riscv64. We need a general purpose computer with instructions for floating point, atomic, and fencing operations. RISC-V extensions are denoted by letters, but the committee has designated one letter, 'G' to refer to a general purpose CPU with the above mentioned extensions, so we specifically require a riscv64g processor. A computer is more than just a processor though, so we want a target platform that fits our virtualization needs. QEMU provides a platform for use with virtual machines called virt. The virt platform contains simple virtualized hardware in addition to a RISC-V CPU, and it lets us use virtio for network and disk devices which Nanos already has drivers for. The other platform devices are very easy to use.
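
As a rough illustration of the naming scheme, the sketch below expands 'G' into its single-letter parts and checks whether an ISA string describes a 64-bit general purpose CPU. The helper names are hypothetical, and the sketch only understands single-letter extensions; the sample string "rv64imafdcsu" is the one OpenSBI reports for the QEMU virt CPU.

```python
# 'G' is shorthand for the base integer ISA (I) plus the standard
# M (multiply/divide), A (atomics), F and D (floating point) extensions.
G_LETTERS = set("imafd")


def parse_isa(isa):
    """Split a string like 'rv64imafdc' into (xlen, set of extension letters)."""
    isa = isa.lower()
    if not isa.startswith("rv"):
        raise ValueError("not a RISC-V ISA string: " + isa)
    rest = isa[2:]
    digits = ""
    while rest and rest[0].isdigit():
        digits += rest[0]
        rest = rest[1:]
    if not digits:
        raise ValueError("missing register width in: " + isa)
    letters = set(rest)
    if "g" in letters:
        # Expand the shorthand so callers only deal with individual letters.
        letters |= G_LETTERS
    return int(digits), letters


def is_general_purpose_64(isa):
    """True if the ISA string names a 64-bit CPU covering everything in 'G'."""
    xlen, letters = parse_isa(isa)
    return xlen == 64 and G_LETTERS <= letters


if __name__ == "__main__":
    for name in ("rv64imafdcsu", "rv64gc", "rv64imac", "rv32imac"):
        print(name, is_general_purpose_64(name))
```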


Preparing the environment

The riscv64 spec has changed a lot in the last few years, so we need a relatively new version of gcc to be able to compile properly. The following steps use Ubuntu 20.04, so Debian or older versions of Ubuntu may not have new enough packages. If this is the case you will have to build your own RISC-V toolchain.


Cross-Compiler

The first thing we need is our gcc cross-compiler and binutils. On a recent Ubuntu system you can install these with:

sudo apt-get update && sudo apt-get install gcc-riscv64-linux-gnu

Building QEMU

Running a riscv64 Nanos on QEMU requires a very new version, 6.0.0 or newer. If your Debian or Ubuntu is new enough, you might not need to build QEMU and can install the 'qemu-system-misc' package instead.

Install dependencies for the QEMU build:

sudo apt install ninja-build libglib2.0-dev libpixman-1-dev

Now let's pull a QEMU tree, with only one level of commit history, and check out the v6.0.0 tag:

git clone --depth 1 --branch v6.0.0 https://gitlab.com/qemu-project/qemu.git qemu-v6.0.0
cd qemu-v6.0.0/
mkdir build
cd build

Configure and build. Note that you may wish to add other targets to "--target-list", separated by commas.

../configure --target-list=riscv64-softmmu --prefix=/usr/local
make
sudo make install

Your QEMU build is now installed. Be sure that /usr/local/bin is in your $PATH before proceeding.


Building Nanos for the "virt" QEMU machine type

In a native build of the Nanos kernel, or when staging a program executable using ops, common dependencies like shared libraries and configuration files are pulled from the host system. This isn't going to work when building on a host that's a different architecture or OS than the target.

When cross-building for another architecture, we'll need a path to source such dependencies. The NANOS_TARGET_ROOT environment variable supplies this path to the Nanos makefiles. You can use a root image of your own Linux/riscv64 installation or download and use the minimal Debian riscv64 root image that we provide for our CI tests:

wget https://storage.googleapis.com/testmisc/riscv64-target-root.tar.gz
mkdir riscv64-target-root
cd riscv64-target-root
sudo tar --exclude=dev/* -xzf ../riscv64-target-root.tar.gz
export NANOS_TARGET_ROOT=`pwd`

Now we're ready to clone a Nanos tree and build it for the virt QEMU machine type. In the Nanos source, you'll see a 'virt' platform and a 'riscv-virt' platform. The 'virt' platform refers to the Nanos ARM port which also uses a similar QEMU machine with virtualized devices, but we want the 'riscv-virt' platform. We use the PLATFORM variable to indicate the target platform, which also implies the target architecture (ARCH). The build will check the host architecture and, if it differs from that of the target, automatically set CROSS_COMPILE to "$ARCH-linux-gnu-". CROSS_COMPILE can be overridden to a different prefix if necessary: if you had to build your own RISC-V toolchain, you will need to set CROSS_COMPILE to "riscv64-unknown-linux-gnu-". TARGET specifies the test program to build; we'll start with "hw", which is a simple hello world program written in C.

git clone http://github.com/nanovms/nanos nanos-riscv
cd nanos-riscv
make PLATFORM=riscv-virt TARGET=hw

We can run the instance under QEMU with emulation using the 'run-noaccel' make target.

make PLATFORM=riscv-virt TARGET=hw run-noaccel
[...]
qemu-system-riscv64 -machine virt -m 1G  -kernel /home/justin/src/nanos/output/platform/riscv-virt/bin/kernel.img -display none  -serial stdio -drive if=none,id=hd0,format=raw,file=/home/justin/src/nanos/output/image/disk.raw -device virtio-blk-pci,drive=hd0 -no-reboot   -device virtio-net,netdev=n0 -netdev user,id=n0,hostfwd=tcp::8080-:8080,hostfwd=tcp::9090-:9090,hostfwd=udp::5309-:5309 -object filter-dump,id=filter0,netdev=n0,file=/tmp/nanos.pcap -cpu rv64
OpenSBI v0.9                                                                                                                                                                                                                       
   ____                    _____ ____ _____                                                                                                                                                                                        
  / __ \                  / ____|  _ \_   _|                                                                                                                                                                                       
 | |  | |_ __   ___ _ __ | (___ | |_) || |                                                                                                                                                                                         
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |                                                                                                                                                                                         
 | |__| | |_) |  __/ | | |____) | |_) || |_                                                                                                                                                                                        
  \____/| .__/ \___|_| |_|_____/|____/_____|                                                                                                                                                                                       
        | |                                                                                                                                                                                                                        
        |_|                                                                                                                                                                                                                        
                                                                                                                                                                                                                                   
Platform Name             : riscv-virtio,qemu                                                                                                                                                                                      
Platform Features         : timer,mfdeleg                                                                                                                                                                                          
Platform HART Count       : 1                                                                                                                                                                                                      
Firmware Base             : 0x80000000                                                                                                                                                                                             
Firmware Size             : 100 KB                                                                                                                                                                                                 
Runtime SBI Version       : 0.2                                                                                                                                                                                                    
                                                                                                                                                                                                                                   
Domain0 Name              : root                                                                                                                                                                                                   
Domain0 Boot HART         : 0                                                                                                                                                                                                      
Domain0 HARTs             : 0*                                                                                                                                                                                                     
Domain0 Region00          : 0x0000000080000000-0x000000008001ffff ()                                                                                                                                                               
Domain0 Region01          : 0x0000000000000000-0xffffffffffffffff (R,W,X)                                                                                                                                                          
Domain0 Next Address      : 0x0000000080200000                                                                                                                                                                                     
Domain0 Next Arg1         : 0x00000000bf000000                                                                                                                                                                                     
Domain0 Next Mode         : S-mode                                                                                                                                                                                                 
Domain0 SysReset          : yes                                       
                                                                                                                                                                                                                                   
Boot HART ID              : 0                                                                                                                                                                                                      
Boot HART Domain          : root                                                                                                                                                                                                   
Boot HART ISA             : rv64imafdcsu                                                                                                                                                                                           
Boot HART Features        : scounteren,mcounteren,time                                                                                                                                                                             
Boot HART PMP Count       : 16                                                                                                                                                                                                     
Boot HART PMP Granularity : 4                                                                                                                                                                                                      
Boot HART PMP Address Bits: 54                                                                                                                                                                                                     
Boot HART MHPM Count      : 0                                                                                                                                                                                                      
Boot HART MHPM Count      : 0                                                                                                                                                                                                      
Boot HART MIDELEG         : 0x0000000000000222                                                                                                                                                                                     
Boot HART MEDELEG         : 0x000000000000b109          
en1: assigned 10.0.2.15
hello world!
args:
   hw
   poppy

And we can demonstrate some connectivity with a little Go-based webserver:

make PLATFORM=riscv-virt TARGET=webg run-noaccel
[...]
qemu-system-riscv64 -machine virt -m 1G  -kernel /home/justin/src/nanos/output/platform/riscv-virt/bin/kernel.img -display none  -serial stdio -drive if=none,id=hd0,format=raw,file=/home/justin/src/nanos/output/image/disk.raw -device virtio-blk-pci,drive=hd0 -no-reboot   -device virtio-net,netdev=n0 -netdev user,id=n0,hostfwd=tcp::8080-:8080,hostfwd=tcp::9090-:9090,hostfwd=udp::5309-:5309 -object filter-dump,id=filter0,netdev=n0,file=/tmp/nanos.pcap -cpu rv64
[...skipping the opensbi messages...]
en1: assigned 10.0.2.15
Server started on port 8080
en1: assigned FE80::5054:FF:FE12:3456

...and then hit it with some requests using ApacheBench:

$ ab -dSqln 100 http://127.0.0.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient).....done


Server Software:        
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /
Document Length:        Variable

Concurrency Level:      1
Time taken for tests:   0.399 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      12790 bytes
HTML transferred:       1090 bytes
Requests per second:    250.66 [#/sec] (mean)
Time per request:       3.989 [ms] (mean)
Time per request:       3.989 [ms] (mean, across all concurrent requests)
Transfer rate:          31.31 [Kbytes/sec] received

Connection Times (ms)
              min   avg   max
Connect:        0     0    0
Processing:     2     4   89
Waiting:        1     3   80
Total:          2     4   89

The performance numbers aren't very impressive, but this is a fully emulated virtual machine of a brand new architecture!


Technical Details

There is already a lot of information available about general usage of a RISC-V processor but not so much about the details of getting an operating system running on the architecture. Here are some of the technical particulars involved in getting Nanos running on the QEMU RISC-V virt machine. The focus is more on how RISC-V solves specific architectural problems rather than how an operating system works in general.

Bootstrapping

One of the biggest steps in porting an operating system to a new architecture is getting the machine from the initial boot state to some state that will allow your software to run. This mostly means that the CPU and memory must be set up in a certain way such that the kernel code can access all the hardware features and memory it needs to complete initialization. If you are familiar with the traditional x86 booting process, you probably already know that the CPU boots in a 16-bit mode compatible with its early eighties ancestors. Software must configure and move the CPU through a few modes in order for it to be able to run a modern 64-bit operating system. RISC-V, thankfully, is not nearly as complicated. However, there are a few different ways we can handle it.

Before going over our choices, we need to understand RISC-V privilege modes. Most computer architectures have some concept of hierarchical hardware protection, often called rings. These rings allow some software to have hardware privileges that other software does not, providing protection that lets a normal user process run without accidentally destabilizing the system or damaging hardware. The software running at one privilege level must request help from a higher privilege level in order to gain access to protected resources. In RISC-V there are four possible privilege modes, from highest to lowest: machine, hypervisor (new), supervisor, and user. The RISC-V specification allows a hardware platform to implement particular subsets of those four modes, so not all modes will be available on every platform. The QEMU virt machine we are using implements three modes: machine, supervisor, and user. Conventionally, a regular program will run in user mode, and the operating system will run in supervisor mode. The processor, however, boots into machine mode.

We need special software to switch from machine mode to supervisor mode. We could write this ourselves, as the steps to switch to supervisor mode are pretty simple. However, because some interrupts are always handled in machine mode, we would also need to write code to handle those. We will cover interrupts in more detail later, but for now understand that we need a machine mode interrupt handler that is independent from the operating system's interrupt handler in supervisor mode, and it needs to, in turn, notify the operating system about the interrupt that just happened. The operating system then has to communicate with machine mode to be able to acknowledge the interrupt. In short, we cannot just hand control of the machine over to supervisor mode; there must always be some code running in machine mode to provide some basic services, and why write that ourselves when something already exists to do that for us?

What we need for machine mode is basically a BIOS, and there is an official open-source solution called OpenSBI, or Open Supervisor Binary Interface. The eponymous interface is a well-defined set of services provided by the BIOS for the software in a lower privilege mode, which in this case is Nanos. In the future, when the hypervisor mode is supported, the BIOS could boot hypervisor software into hypervisor mode, which can then in turn boot operating system images into supervisor mode. For our purposes, however, the effective result is that OpenSBI will do some initialization, switch to supervisor mode, and then jump to a particular location in memory. We could have it load a bootloader like U-Boot, and this is in fact what usually happens for RISC-V Linux machines. Our use case is much simpler, particularly since we will take a shortcut to get our kernel image into memory. U-Boot is very useful if we need to read an image from disk into memory, but we are going to use the -kernel flag of QEMU to point to our kernel image and let QEMU load the ELF file into the VM's memory when the machine is created. QEMU tells the OpenSBI dynamic firmware to jump to our _start symbol.

Nanos First Steps

We are now in supervisor mode with the program counter pointing to our _start symbol, but the absolute addresses in our binary will not work yet. If we tried to call a function or access a variable that requires an absolute address, we would generate an address exception, halting execution, because the table that translates virtual addresses to physical addresses does not exist yet. The Nanos kernel, like most kernels, is linked such that the code starts at a very high virtual address. For Nanos on RISC-V we define that to be 0xffffffff80000000. The high address is used so that the kernel can define an address range that a user program cannot access, allowing the kernel to access its own memory as well as the user program's memory when necessary. The QEMU RISC-V virt platform starts physical memory at address 0x80000000, and we've chosen our physical load address to be 0x80200000. Therefore, we need to create a memory map such that the virtual address 0xffffffff80000000 maps to physical address 0x80200000. We create a 39-bit page table (more about that later) that identity maps the first 3GB of address space to match the existing address layout, as well as adding our new kernel virtual address mapping. We tell the system to use our new page table by setting the special satp register, set our stack register to point at an unused memory region, and then jump to a C function called start in our newly set up virtual address space.
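
The address arithmetic above can be sketched in a few lines. This is only a model of the calculation, not the Nanos boot code: the function names and the root table address in the usage line are hypothetical, while the field positions (9-bit table indexes above a 12-bit page offset, satp MODE in bits 63:60 with the value 8 for Sv39) come from the privileged specification.

```python
PAGE_SHIFT = 12        # 4KB pages
INDEX_BITS = 9         # 512 entries per page table level
SATP_MODE_SV39 = 8     # satp MODE field (bits 63:60) value selecting Sv39

KERNEL_VIRT_BASE = 0xffffffff80000000
KERNEL_PHYS_BASE = 0x80200000


def sv39_indices(va):
    """Return (VPN[2], VPN[1], VPN[0]): the index used at each table level."""
    return tuple((va >> (PAGE_SHIFT + INDEX_BITS * level)) & 0x1ff
                 for level in (2, 1, 0))


def kernel_virt_to_phys(va):
    """Translate a kernel virtual address using the fixed boot-time offset."""
    return va - KERNEL_VIRT_BASE + KERNEL_PHYS_BASE


def make_satp(root_table_pa, asid=0):
    """Compose a satp value: MODE | ASID | physical page number of the root."""
    return (SATP_MODE_SV39 << 60) | (asid << 44) | (root_table_pa >> PAGE_SHIFT)


if __name__ == "__main__":
    print(sv39_indices(KERNEL_VIRT_BASE))             # top-level slot of the kernel
    print(hex(kernel_virt_to_phys(0xffffffff80001000)))
    print(hex(make_satp(0x80400000)))                 # hypothetical root table address
```

The kernel base lands in slot 510 of the top-level table, which is why a single extra top-level entry is enough to reach the kernel mapping.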

This start function does not do much except print some debugging messages and then call another function called init_mmu, which does the proper setup of memory and memory ranges with a new page table. This may seem redundant on the surface, as we just set up a page table. Why would we not call init_mmu from our _start entry point? Remember that the addresses of variables and functions are located in our kernel virtual address space, and sometimes we will need to access them by their absolute address, which is not possible when we first enter _start. The simple page table there lets us access code and variables to set up a more complex page table in init_mmu. After setting up a new 48-bit page table, the code jumps to a function that sets up the heaps and stack that the kernel will use from now on.

RISC-V Page Tables

The RISC-V architecture allows for a simple and flexible page table system. According to the most recent specification, a 64-bit RISC-V platform can support as many as three types of page tables, called Sv39, Sv48, and Sv57 (Sv32 is the equivalent for 32-bit platforms). The numbers correspond to how many virtual address bits each type has, and each type effectively adds a new level of page tables on top of the smaller type. The RISC-V page tables have a convenient property in that any level can be a leaf. For example, the top level of an Sv39 page table can map 512GB of memory with a single 4KB page, assuming that each map entry can be aligned to a gigabyte. This property is what lets us map all of the memory we needed in _start with only two 4KB pages: a top-level page that identity maps the low addresses, with an additional entry pointing to a second-level page table that maps the high kernel virtual addresses. That second page table is also 4KB in size and maps the kernel address space in 2MB pieces. The total amount of memory we mapped was 4GB with only 8KB worth of page tables. The init_mmu function sets up a 48-bit page table in order to get more addressable space for machines with very large amounts of memory, but it does come at the cost of slightly more expensive virtual address lookups.
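
To make the two-page layout concrete, here is a small model that fills in both tables as Python lists and checks the 4GB and 8KB figures. It is a sketch under assumptions: the function names, the second-level table address, and the permission bits are illustrative rather than taken from the Nanos source. The entry layout (flag bits in the low byte, physical page number starting at bit 10) follows the privileged specification.

```python
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3
PTE_A, PTE_D = 1 << 6, 1 << 7
GB, MB = 1 << 30, 1 << 20
ENTRIES = 512          # 512 eight-byte entries fill one 4KB page


def leaf(pa, perms=PTE_R | PTE_W | PTE_X):
    """Leaf entry: at least one of R/W/X set. A and D are preset here so the
    model does not depend on hardware updating them."""
    return ((pa >> 12) << 10) | perms | PTE_A | PTE_D | PTE_V


def pointer(table_pa):
    """Non-leaf entry: valid with R/W/X clear, pointing at the next level."""
    return ((table_pa >> 12) << 10) | PTE_V


def build_boot_tables(second_table_pa,
                      kernel_virt=0xffffffff80000000,
                      kernel_phys=0x80200000):
    top = [0] * ENTRIES
    second = [0] * ENTRIES
    for i in range(3):                       # identity map the first 3GB
        top[i] = leaf(i * GB)                # one 1GB leaf per entry
    top[(kernel_virt >> 30) & 0x1ff] = pointer(second_table_pa)
    for i in range(ENTRIES):                 # kernel window in 2MB pieces
        second[i] = leaf(kernel_phys + i * 2 * MB)
    return top, second


def mapped_bytes(top, second):
    top_leaves = sum(1 for e in top if e & PTE_V and e & (PTE_R | PTE_W | PTE_X))
    second_leaves = sum(1 for e in second if e & PTE_V)
    return top_leaves * GB + second_leaves * 2 * MB


if __name__ == "__main__":
    top, second = build_boot_tables(0x80401000)   # hypothetical table address
    print(mapped_bytes(top, second) // GB, "GB mapped")
    print((len(top) + len(second)) * 8, "bytes of page tables")
```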

Device Discovery

The QEMU RISC-V virt machine uses a DeviceTree blob to store information about the machine. This is how most embedded or SoC platforms discover what devices are on the platform. Nanos currently does not have the ability to parse the blob, but since we are only targeting one platform we can just hard code the device locations for now. The biggest disadvantage of not parsing the devicetree blob is that we do not know how much memory is in the machine so we just assume 1GB of memory until we can parse it. We enumerate the PCI configuration space to discover and configure the virtio PCI devices that we had QEMU include in our VM. This process is the same as on other platforms, except for one quirk in which QEMU reports that the PCI devices are MSI-capable, but the RISC-V virt platform currently lacks a programmable interrupt controller that can handle it. The PCI devices are "hard-wired" to certain interrupt numbers depending on their PCI bus address.
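
As an illustration of what enumerating configuration space involves, the sketch below computes memory-mapped (ECAM) configuration addresses and scans bus 0 for present devices. Several things here are assumptions: the ECAM base is the value we believe QEMU's riscv virt machine uses, the function names are hypothetical, and the config-space read is a stand-in dictionary rather than real MMIO. The vendor ID 0x1af4 and the device IDs 0x1000 and 0x1001 are the standard ones for transitional virtio network and block devices.

```python
ECAM_BASE = 0x30000000    # assumption: PCIe ECAM window on QEMU's riscv virt


def ecam_address(bus, device, function, offset=0, base=ECAM_BASE):
    """Address of a config register: 4KB per function, 8 functions per
    device, 32 devices per bus."""
    if not (0 <= bus < 256 and 0 <= device < 32
            and 0 <= function < 8 and 0 <= offset < 4096):
        raise ValueError("bus/device/function/offset out of range")
    return base + ((bus << 20) | (device << 15) | (function << 12) | offset)


def enumerate_bus0(read_config32):
    """Return (device, vendor_id, device_id) for every device on bus 0."""
    found = []
    for device in range(32):
        reg0 = read_config32(ecam_address(0, device, 0))
        vendor_id, device_id = reg0 & 0xffff, reg0 >> 16
        if vendor_id != 0xffff:          # all ones means nothing is there
            found.append((device, vendor_id, device_id))
    return found


# Stand-in for MMIO reads: two virtio devices in slots 1 and 2.
FAKE_CONFIG = {
    ecam_address(0, 1, 0): (0x1001 << 16) | 0x1af4,   # virtio-blk
    ecam_address(0, 2, 0): (0x1000 << 16) | 0x1af4,   # virtio-net
}


def fake_read(addr):
    return FAKE_CONFIG.get(addr, 0xffffffff)


if __name__ == "__main__":
    for dev, vendor, ident in enumerate_bus0(fake_read):
        print("slot %d: vendor %#06x device %#06x" % (dev, vendor, ident))
```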

Interrupt and Exception Handling

Processing a trap in supervisor mode on RISC-V involves looking at the scause, sepc, and stval control registers. The scause register distinguishes between two kinds of traps: interrupts and exceptions. The register's most significant bit tells us which kind it is. The architecture allows you to pick between having a single trap handler or multiple trap handlers with a vectored mode. For Nanos we use the direct mode meaning a single trap handler. This handler will interpret the exception code based on the previously mentioned interrupt type bit. For interrupts, the code is a combination of three types of interrupts and the privilege mode which at this time can only be machine or supervisor. The three types of interrupts recognized by the cpu are software, timer, and external interrupts.

Software interrupts are triggered by writes to an interrupt controller's MMIO space. Timer interrupts are triggered by an external timing device, and external interrupts are triggered by any other type of external interrupt, such as a PCI device. It is important to note that on RISC-V, all traps are taken in machine mode by default and handled by the machine mode trap handler. In order for supervisor mode to receive some of these traps, the machine mode software must delegate exceptions and interrupts through two special registers, medeleg and mideleg. The BIOS does this for us, but some traps cannot be delegated because of platform limitations. In particular for the QEMU virt platform, the timer interrupt and machine external interrupt cannot be delegated to supervisor mode. The timer we use has memory mapped registers that let supervisor mode set the timer, but the interrupt that fires after timer expiration is handled by machine mode, and thus by OpenSBI. The OpenSBI trap handler will directly set the supervisor timer interrupt to trigger our trap handler. Since supervisor mode does not have access to the machine level interrupt, we have to use an environment call (ecall) to OpenSBI to tell it to reset the timer interrupt. The software and timer interrupts are handled by a device called the CLINT (Core Local Interruptor) and external interrupts are handled by another device called the PLIC (Platform-Level Interrupt Controller).
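
The delegation registers are plain bit masks, one bit per cause code, so the values OpenSBI printed in the boot log earlier (MIDELEG 0x222 and MEDELEG 0xb109) can be decoded by hand or with a short script. The following is a sketch with hypothetical helper names; the cause names follow the privileged specification.

```python
INTERRUPT_NAMES = {
    1: "supervisor software interrupt",
    3: "machine software interrupt",
    5: "supervisor timer interrupt",
    7: "machine timer interrupt",
    9: "supervisor external interrupt",
    11: "machine external interrupt",
}

EXCEPTION_NAMES = {
    0: "instruction address misaligned",
    1: "instruction access fault",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    5: "load access fault",
    6: "store/AMO address misaligned",
    7: "store/AMO access fault",
    8: "environment call from U-mode",
    9: "environment call from S-mode",
    11: "environment call from M-mode",
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}


def set_bits(value):
    """Positions of the bits that are set in a 64-bit register value."""
    return [bit for bit in range(64) if (value >> bit) & 1]


def delegated(value, names):
    """Names of the causes whose delegation bit is set."""
    return [names.get(bit, "reserved (%d)" % bit) for bit in set_bits(value)]


MIDELEG = 0x222     # from the OpenSBI boot log
MEDELEG = 0xb109    # from the OpenSBI boot log

if __name__ == "__main__":
    print("interrupts:", delegated(MIDELEG, INTERRUPT_NAMES))
    print("exceptions:", delegated(MEDELEG, EXCEPTION_NAMES))
```

Only the supervisor-level interrupt bits (1, 5, and 9) are set in the logged mideleg value; the machine timer and machine external bits (7 and 11) are absent, which matches the limitation described above.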

The interpretation of interrupt causes in the scause register

Interrupt Bit   Exception Code   Description
1               0                Reserved
1               1                Supervisor software interrupt
1               2                Reserved
1               3                Machine software interrupt
1               4                Reserved
1               5                Supervisor timer interrupt
1               6                Reserved
1               7                Machine timer interrupt
1               8                Reserved
1               9                Supervisor external interrupt
1               10               Reserved
1               11               Machine external interrupt
1               12–15            Reserved
1               ≥16              Designated for platform use
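
A single direct-mode handler starts by splitting scause into the interrupt bit and the code, then looks the code up in the table above. The sketch below models that first step; the function names are hypothetical and the dispatch itself is left out.

```python
XLEN = 64
INTERRUPT_BIT = 1 << (XLEN - 1)     # most significant bit of scause

INTERRUPT_NAMES = {
    1: "Supervisor software interrupt",
    3: "Machine software interrupt",
    5: "Supervisor timer interrupt",
    7: "Machine timer interrupt",
    9: "Supervisor external interrupt",
    11: "Machine external interrupt",
}


def decode_scause(scause):
    """Return (is_interrupt, code) for a raw scause value."""
    is_interrupt = bool(scause & INTERRUPT_BIT)
    code = scause & (INTERRUPT_BIT - 1)
    return is_interrupt, code


def describe_interrupt(code):
    """Map an interrupt code to its description from the table."""
    if code in INTERRUPT_NAMES:
        return INTERRUPT_NAMES[code]
    if code >= 16:
        return "Designated for platform use"
    return "Reserved"


if __name__ == "__main__":
    for raw in (0x8000000000000005, 0x8000000000000009, 13):
        is_interrupt, code = decode_scause(raw)
        kind = describe_interrupt(code) if is_interrupt else "exception %d" % code
        print(hex(raw), "->", kind)
```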

The CLINT on the QEMU RISC-V virt platform is modeled after a real interrupt controller created by SiFive, a maker of RISC-V hardware. The PLIC is a very simple interrupt controller standard governed by the RISC-V committee. Both devices are controlled through memory mapped registers, which lets the operating system in supervisor mode configure and control the devices directly, with some caveats. In Nanos, we do not need to use software interrupts now, although we will when we add RISC-V SMP support. The timer interrupt is easily controlled by reading a time value from one register, adding the desired delta, and then calling SBI to set the timer alarm for that value.
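
The timer arithmetic can be sketched as follows. This models only the values involved, not the ecall instruction itself, and it rests on assumptions: a 10 MHz timebase (which we believe is what QEMU's virt machine advertises), the SBI v0.2 TIME extension rather than the legacy call, and hypothetical function names. The extension ID is simply the ASCII string "TIME" read as a big-endian number.

```python
TIMEBASE_HZ = 10_000_000                          # assumption: QEMU virt timebase
SBI_EXT_TIME = int.from_bytes(b"TIME", "big")     # SBI TIME extension ID
SBI_TIME_SET_TIMER = 0                            # function ID of sbi_set_timer


def deadline_ticks(now_ticks, delta_ns, hz=TIMEBASE_HZ):
    """Absolute timer value for 'delta_ns from now', rounded up to a tick."""
    ticks = (delta_ns * hz + 999_999_999) // 1_000_000_000
    return now_ticks + max(ticks, 1)


def set_timer_registers(now_ticks, delta_ns):
    """Register contents an SBI set-timer ecall would carry:
    a7 = extension ID, a6 = function ID, a0 = absolute deadline."""
    return {
        "a7": SBI_EXT_TIME,
        "a6": SBI_TIME_SET_TIMER,
        "a0": deadline_ticks(now_ticks, delta_ns),
    }


if __name__ == "__main__":
    # One millisecond from a current time reading of 1000 ticks.
    print(set_timer_registers(1000, 1_000_000))
```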

External interrupts are more complicated. The PLIC can handle up to 1023 interrupt sources with support for interrupt priorities. These sources can be individually enabled or disabled, which is completely independent of the CPU's external interrupt enable bits. The priority registers can be used in combination with the priority threshold registers to mask sources with priorities less than or equal to the threshold. We do not use priorities in Nanos. The interrupt enable and priority threshold registers are set by context. A context in PLIC terms is a combination of a RISC-V Hart (technically an execution unit, but for now think of it as a CPU) and a privilege mode. Since both of these can be variable in number depending on the platform, the PLIC cannot predefine values. Therefore each platform must define how a context gets mapped. Since Nanos is currently running on a single-CPU VM in supervisor mode, we are interested in context 1. Context 0 is wired to machine mode.
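
The per-context register layout is easier to follow with the addresses written out. The sketch below uses the register offsets from the PLIC specification; the base address is what we believe QEMU's riscv virt machine uses, the hart-to-context rule generalizes the "context 0 is machine, context 1 is supervisor" wiring described above, and the function names are hypothetical.

```python
PLIC_BASE = 0x0c000000        # assumption: PLIC base on QEMU's riscv virt
PRIORITY_BASE = 0x000000      # one 32-bit priority register per source
ENABLE_BASE = 0x002000        # one enable bitmap per context
ENABLE_STRIDE = 0x80
CONTEXT_BASE = 0x200000       # threshold and claim/complete per context
CONTEXT_STRIDE = 0x1000
THRESHOLD_OFFSET = 0x0
CLAIM_OFFSET = 0x4


def context_for(hart, supervisor):
    """Two contexts per hart: machine mode first, then supervisor mode."""
    return 2 * hart + (1 if supervisor else 0)


def priority_addr(source):
    return PLIC_BASE + PRIORITY_BASE + 4 * source


def enable_addr_and_bit(context, source):
    """Address of the 32-bit enable word and the bit mask within it."""
    addr = PLIC_BASE + ENABLE_BASE + context * ENABLE_STRIDE + 4 * (source // 32)
    return addr, 1 << (source % 32)


def threshold_addr(context):
    return PLIC_BASE + CONTEXT_BASE + context * CONTEXT_STRIDE + THRESHOLD_OFFSET


def claim_addr(context):
    return PLIC_BASE + CONTEXT_BASE + context * CONTEXT_STRIDE + CLAIM_OFFSET


if __name__ == "__main__":
    ctx = context_for(hart=0, supervisor=True)
    addr, bit = enable_addr_and_bit(ctx, 33)
    print("context", ctx)
    print("enable word %#x, bit mask %#x" % (addr, bit))
    print("threshold %#x, claim/complete %#x" % (threshold_addr(ctx), claim_addr(ctx)))
```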

One important note is that this platform does not delegate external machine interrupts to supervisor mode. If an interrupt is enabled for a machine mode context, the supervisor mode will never be notified even if the delegation bit is set. Also note that the PLIC is not programmable: devices are wired to a particular source number on the PLIC. The PLIC wires context 0 to the machine external interrupt bit and context 1 to the supervisor external interrupt bit of CPU 0. When we enumerate PCI devices, we have to be able to transform the PCI bus address to a source number so that we know which PLIC source corresponds to which PCI device. The QEMU RISC-V virt platform assigns four PCI devices to source numbers 32-35 using a simple transformation of the bus, device, and function numbers. The final piece of the PLIC is that a pending interrupt for a particular context is serviced by reading the claim register, which returns the ID of the highest priority pending interrupt and also clears the pending interrupt bit. When the device interrupt is handled, the handler writes the ID back into the claim register to signal completion of the interrupt.
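
The claim and complete handshake can be modeled in software to show the order of operations. This is a toy model, not driver code: the class and function names are hypothetical, and the slot-to-source mapping is an assumption based on the conventional PCI interrupt pin rotation rather than a formula confirmed by the text. Two details do come from the PLIC specification: a claim returns 0 when nothing is pending, and ties in priority go to the lowest source ID.

```python
PCIE_IRQ_BASE = 32      # first PLIC source used for PCI devices
PCIE_IRQ_COUNT = 4      # sources 32-35


def pci_to_plic_source(slot, pin=0):
    """Assumed mapping: rotate the interrupt pin by the device slot number."""
    return PCIE_IRQ_BASE + (slot + pin) % PCIE_IRQ_COUNT


class PlicContextModel:
    """Toy model of one PLIC context (for example, hart 0 supervisor mode)."""

    def __init__(self, threshold=0):
        self.threshold = threshold
        self.priority = {}        # source -> priority (0 means never fires)
        self.enabled = set()
        self.pending = set()
        self.in_service = set()

    def raise_irq(self, source):
        self.pending.add(source)

    def claim(self):
        """Model a read of the claim register."""
        candidates = [s for s in self.pending
                      if s in self.enabled
                      and self.priority.get(s, 0) > self.threshold]
        if not candidates:
            return 0              # source 0 means "no interrupt"
        best = min(candidates, key=lambda s: (-self.priority.get(s, 0), s))
        self.pending.discard(best)     # claiming clears the pending bit
        self.in_service.add(best)
        return best

    def complete(self, source):
        """Model a write of the ID back to the claim register."""
        self.in_service.discard(source)


if __name__ == "__main__":
    plic = PlicContextModel()
    for slot in (0, 1):
        source = pci_to_plic_source(slot)
        plic.priority[source] = 1
        plic.enabled.add(source)
        plic.raise_irq(source)
    while True:
        source = plic.claim()
        if source == 0:
            break
        print("handling source", source)
        plic.complete(source)
```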

To recap: the RISC-V CPU can be configured to service traps at various privilege levels. The OpenSBI BIOS sets up the CPU to delegate most traps to supervisor mode for the OS trap handler. This includes most of the exceptions such as page faults or invalid instructions, and some of the interrupts. There are three types of interrupts: software, timer, and external. These can be globally controlled by the CPU with the mie/sie registers which have an enable bit per interrupt type. The CPU has a corresponding set of pending interrupt registers mip/sip. The timer and software interrupts are handled by a device called the CLINT, which wires the timer interrupt only to the machine mode timer interrupt pending. This requires the OS in supervisor mode to use ecalls to OpenSBI to set or clear the timer alarm. The external interrupts are handled by the PLIC, which is wired to both the machine and supervisor external interrupt pending bits. The OS must use the correct context when setting up external interrupts so that the supervisor external interrupt for the desired CPU is asserted.

Summary

This discussion covers most of the confusing, ambiguous, or poorly documented aspects we found while porting Nanos to the QEMU RISC-V virt platform. Two other devices that Nanos uses are not covered here. The serial port (NS16550 type) is the same kind of serial port used on the PC. The Goldfish RTC has a trivial interface and is not worth commenting on, other than to say that the alarm feature of the device is not wired to anything and cannot be used. To learn more about RISC-V, the best place to start is the RISC-V website. From there you can find the two volumes of the main ISA specification, one describing the general architecture and instructions, and the other describing the privileged instructions and registers required for operating systems and other non-user software. QEMU is the easiest way to try RISC-V, and hopefully this guide has shown you how easy it is to explore this new architecture. We will continue to improve RISC-V support in Nanos, including SMP, devicetree, and integration with the new hypervisor mode as it evolves. Porting Nanos to RISC-V was a great way to learn it, and we are really excited about the possibility of running Nanos on a native RISC-V hypervisor in the future!
