GPU-accelerated Computing with Nanos Unikernels

Nowadays, Graphics Processing Units are used for much more than just video rendering: their computing power can be harnessed to run various workloads, including computing tasks that have traditionally been executed on general-purpose CPUs. Heterogeneous computing libraries such as OpenCL include GPUs in the set of computing devices on which user programs can be run. Many of today's digital services rely on artificial intelligence and machine learning algorithms that require a massive amount of computing power to be trained and executed, and server-grade GPUs are increasingly being used as additional computing devices in server farms. In this landscape, it's no wonder major public cloud providers such as GCP and AWS started offering VM instances equipped with GPUs.

In May 2022, NVIDIA™, a world leader in the GPU market, announced the release of an open-source version of their GPU kernel modules for Linux. This move, long awaited by the open-source community, paves the way for supporting NVIDIA GPUs in operating systems that have been unable to make use of these powerful devices. Given the increasing relevance of GPU use in public and private clouds, and the fact that both GCP and AWS include NVIDIA GPUs in their cloud offerings, we set out to port the NVIDIA Linux kernel drivers to Nanos: the result is a klib that can be inserted in a unikernel image and allows applications running on Nanos to make use of the GPU attached to a VM instance.

The GPU Driver Porting Work

As one can see by browsing the source code in the NVIDIA GitHub repository, the GPU drivers are made up of a massive amount of code: 2600+ source files. The majority of these files (around 2100) are in the OS-independent part of the code base, while the rest is Linux-specific code. We were only interested in a subset of the Linux drivers, specifically those (NVIDIA core and UVM) that are needed by any application that makes use of the NVIDIA CUDA parallel computing libraries; the code in these drivers consists of around 270 source files. Since NVIDIA released this code under a permissive MIT license, we decided to port it to Nanos by forking their GitHub repository and modifying the Linux-specific parts, instead of rewriting everything from scratch.

Given the large amount of code involved, adding this code to the main Nanos kernel binary was not an option: it would have made the kernel binary 4-5 times larger than it is now! Luckily, we have klibs, which offer a convenient way to add new functionalities to a running kernel without adding code to the kernel itself, so implementing the GPU driver as a klib was a natural choice. However, including this klib in the set of core klibs that are shipped with the kernel would have considerably increased the compilation time of the kernel source tree, as well as the size of the kernel release tarballs, so we opted for an out-of-tree klib, which is built separately from the kernel and its in-tree klibs.

As expected, the OS-independent part of the NVIDIA drivers required little effort to be used in Nanos (mostly some changes to the Makefile and header includes, which were needed to ensure the out-of-tree klib is built without including standard header files and with the same set of compiler and linker flags as the core klibs). The bulk of the work was on the Linux-specific part of the code, where heavy use of Linux kernel functionalities and data structures required non-trivial changes. In addition, the Nanos kernel itself needed to undergo some changes, to provide new functionalities that are used by the GPU klib.

As previously mentioned, the drivers that we wanted to port to Nanos are the NVIDIA core and UVM drivers. These are implemented on Linux by two separate kernel modules; since both of them are needed by a typical application that makes use of the CUDA libraries, we merged them into a single Nanos klib binary file, whose entry point initializes both the core and UVM drivers. We needed to ensure that no external tools were involved in setting up the system for use by a given application. For example, creating device nodes such as /dev/nvidiactl, /dev/nvidia0 and /dev/nvidia-uvm (which are the main interface between the low-level CUDA libraries and the kernel) has to be done by the klib itself during its initialization, instead of requiring external tools such as nvidia-modprobe, so the creation of device nodes was added to the driver initialization functions. In addition, we added to the Nanos kernel support for device major and minor numbers, which can now be specified when creating a device node and retrieved by the application via syscalls.
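As a rough illustration of how an application might read back the major and minor numbers of a device node, the sketch below uses the standard `stat` call together with the glibc `major()` and `minor()` macros. The helper name `device_numbers` is hypothetical, and the snippet assumes a Linux-compatible `stat` interface; it is not taken from the Nanos or NVIDIA sources.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>  /* major(), minor() (glibc) */
#include <sys/types.h>

/* Hypothetical helper: fetch the device numbers of a character device node.
 * Returns 0 on success, -1 if the path cannot be stat'ed or is not a
 * character device. */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode))
        return -1;
    *maj = major(st.st_rdev);   /* st_rdev holds the node's device ID */
    *min = minor(st.st_rdev);
    return 0;
}
```

An application could call, for example, `device_numbers("/dev/nvidiactl", &maj, &min)` and check the return value before using the results.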

An important new feature that is required by the NVIDIA libraries and had to be implemented in the kernel is the ability to invoke the mmap syscall on the /dev/nvidia* device nodes. We implemented this by adding an mmap callback to the generic file descriptor internal kernel structure, so that any file descriptor can now have its custom mmap implementation. Some of the existing code was refactored to use this new callback, which resulted in a cleaner interface between the generic mmap code and the file descriptor-specific parts.
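The idea of a per-descriptor mmap callback can be sketched roughly as follows. All names here (`struct fdesc`, `generic_mmap`, `dev_mmap`, `struct dev_region`) are hypothetical and heavily simplified; the actual Nanos structure and callback signatures differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified file descriptor structure. */
struct fdesc;
typedef void *(*fd_mmap_fn)(struct fdesc *f, size_t len, uint64_t offset);

struct fdesc {
    fd_mmap_fn mmap;   /* NULL when the descriptor has no custom mmap */
    void *priv;        /* descriptor-specific data */
};

/* Generic mmap path: dispatch to the descriptor's callback if present. */
void *generic_mmap(struct fdesc *f, size_t len, uint64_t offset)
{
    if (!f || !f->mmap)
        return NULL;   /* stands in for an error such as ENODEV */
    return f->mmap(f, len, offset);
}

/* Example descriptor-specific callback: "maps" part of a device region. */
struct dev_region {
    unsigned char *base;
    size_t size;
};

void *dev_mmap(struct fdesc *f, size_t len, uint64_t offset)
{
    struct dev_region *r = f->priv;

    if (offset > r->size || len > r->size - offset)
        return NULL;   /* requested range exceeds the device region */
    return r->base + offset;
}
```

With this shape, the generic code does not need to know anything about the device: it only checks whether a callback is present and delegates to it.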

Another area that required porting work is the interface between the driver and the kernel PCI subsystem (NVIDIA GPUs are PCI devices): things like registering a PCI driver and its probe routine with the kernel, reading from and writing to the PCI configuration space of a device, and setting up MSI-X interrupts had to be adapted to the PCI driver interface in Nanos.

Other parts that required modifications include:

memory allocation functions: code in various places that uses the Linux memory heap functionalities had to be changed to interface with Nanos kernel heaps instead; for example, Linux memory caches have been replaced by Nanos "objcache" heaps
virtual address space management and MMU (Memory Management Unit) page table setup, i.e. mapping physical addresses to virtual addresses, for either internal use by the GPU driver (kernel mappings), or direct use by the application (user mappings)
kernel threads: use of Linux kernel threads has been replaced by a simple implementation of kernel threads leveraging the "kernel contexts" that are already being used in Nanos to do things like suspending and resuming specific execution contexts (for example when acquiring and releasing a mutex)
handling of deferred work and interrupt "bottom half" routines: in Nanos we have lock-free asynchronous queues where work items to be executed asynchronously can be enqueued and later dequeued for execution
waitqueues, by which application threads waiting for access to shared resources or for occurrence of asynchronous events can be suspended and resumed: in Nanos we have the "blockq" structure, which is heavily used in the Unix-specific parts of the kernel to implement many "blocking" syscalls, i.e. syscalls that can potentially suspend the application threads that invoke them
doubly-linked lists (list structures in Nanos), by which generic data items can be linked to one another in a linear list so that from any given item it is possible to retrieve the next and previous items in the list
scatter-gather lists (sg_list structures in Nanos), that describe a set of non-contiguous address ranges used e.g. to perform a copy between CPU system memory and GPU memory
red-black trees, for efficiently handling ordered data sets so that the computation time needed to insert, retrieve and delete an item from a set increases sub-linearly with the number of items in the set: in Nanos we have our own implementation of red-black trees
radix trees, used in Linux for handling non-ordered data sets: in Nanos we have the table interface, so use of radix trees in the GPU driver has been converted to use tables instead
locking and synchronization primitives: Linux mutexes have been replaced by Nanos mutexes, while for semaphores we didn't have a suitable implementation in Nanos, so we derived from the existing mutex code a simple implementation of semaphores that is able to suspend and resume execution contexts; Linux read-write semaphores have been replaced by Nanos read-write spinlocks
atomic operations, to allow safe concurrent access and modification to memory locations from multiple parallel execution threads
bitmap manipulation functions: even though Nanos has its own bitmap implementation, we couldn't use it because it wraps the bitmap memory in a containing structure; instead, we implemented an extended set of functions that operate directly on bitmap memory and can be used to get and set specific bits in the bitmap, either atomically (i.e. in a multithread-safe manner) or non-atomically
kernel timers, i.e. data structures used to schedule work items to be executed at a specific time in the future
time retrieval functions, i.e. functions used to get the current timestamp
logging functions
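To give a flavor of what "functions that operate directly on bitmap memory" can look like, here is a minimal sketch with both non-atomic and atomic variants. The function names are hypothetical, the bitmap is assumed to be a plain array of 64-bit words, and the atomic variant relies on the GCC/Clang `__atomic_fetch_or` builtin; the actual klib code may be organized differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Non-atomic variants: the caller must guarantee exclusive access. */
void bit_set(uint64_t *map, unsigned long bit)
{
    map[bit / BITS_PER_WORD] |= UINT64_C(1) << (bit % BITS_PER_WORD);
}

void bit_clear(uint64_t *map, unsigned long bit)
{
    map[bit / BITS_PER_WORD] &= ~(UINT64_C(1) << (bit % BITS_PER_WORD));
}

bool bit_test(const uint64_t *map, unsigned long bit)
{
    return (map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1;
}

/* Atomic variant: sets the bit and returns its previous value, so that
 * only one of several concurrent callers observes "false". */
bool bit_test_and_set_atomic(uint64_t *map, unsigned long bit)
{
    uint64_t mask = UINT64_C(1) << (bit % BITS_PER_WORD);
    uint64_t old = __atomic_fetch_or(&map[bit / BITS_PER_WORD], mask,
                                     __ATOMIC_SEQ_CST);

    return (old & mask) != 0;
}
```

Because these functions take a bare pointer to the bitmap words, they can be applied to memory embedded in any driver structure, without a wrapping bitmap object.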

Some parts of the Linux driver code could be removed altogether, because they don't apply to a single-process unikernel environment, or are not needed otherwise. For example, we could avoid porting the Linux code that deals with multiple address spaces (i.e. anything that operates on the Linux "mm_struct"), because on Nanos there can only be a single user process and thus a single user-level address space. Likewise, copying between user memory and kernel memory (usually done when executing an ioctl syscall) is not needed in Nanos because our kernel is always able to directly access user memory, so we could replace these copies with a simple validation of the memory address ranges passed by the application to the kernel.
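A range validation of this kind might look like the sketch below. The bounds `USER_BASE` and `USER_LIMIT` and the function name `validate_user_range` are assumptions made for illustration; the real limits are kernel- and architecture-specific, and an actual kernel may perform additional checks (such as verifying that the range is mapped).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bounds of the user address space (illustrative values only). */
#define USER_BASE  UINT64_C(0x0000000000001000)
#define USER_LIMIT UINT64_C(0x0000800000000000)

/* Returns true if [addr, addr + len) lies entirely within user space. */
bool validate_user_range(uint64_t addr, uint64_t len)
{
    if (addr < USER_BASE || addr >= USER_LIMIT)
        return false;
    /* Written as a subtraction so that addr + len cannot overflow. */
    return len <= USER_LIMIT - addr;
}
```

An ioctl handler could call such a check on the argument buffer and then access it in place, instead of copying it into a kernel buffer first.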

Trying to run applications that use the NVIDIA libraries revealed the need to implement another kernel feature that is not related to the GPU driver but is nonetheless required by the libraries: the SOCK_SEQPACKET type of Unix domain sockets. In Nanos we already had an implementation of Unix domain sockets for the SOCK_STREAM and SOCK_DGRAM types, so we could easily add the new socket type with small modifications and additions to the existing code.
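For readers unfamiliar with this socket type, the sketch below shows its defining property using the standard POSIX socket API: like SOCK_STREAM it is connection-oriented, but message boundaries are preserved, so each receive call returns at most one message. The function name is hypothetical and the example is unrelated to how the NVIDIA libraries actually use these sockets.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends two messages over a SOCK_SEQPACKET socket pair and returns the
 * size of the first message received, or -1 on error. */
ssize_t seqpacket_first_message(char *buf, size_t buflen)
{
    int sv[2];
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) != 0)
        return -1;
    if (send(sv[0], "hello", 5, 0) == 5 && send(sv[0], "world!", 6, 0) == 6)
        n = recv(sv[1], buf, buflen, 0);  /* returns one whole message */
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

With a SOCK_STREAM socket pair, the same receive call could return both messages concatenated; with SOCK_SEQPACKET it returns only the first one.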

In the future, some of the functionalities that needed to be implemented and that are now part of the GPU klib code could be re-used in the main Nanos kernel: for example, kernel threads, semaphores and bitmap manipulation functions could be incorporated into the kernel if and when the need arises. Thus, the development work we have done on the GPU klib could be beneficial for future developments of the main kernel as well.

Using the GPU in Your Application

If you want to use an NVIDIA GPU in your unikernel application, below are the steps to get started:

install the Ops orchestration tool
download the main kernel code from the Nanos repository
download the GPU driver code from the Nanos NVIDIA GPU klib repository
build the klib by going to the directory where the GPU driver code is located and running the following command (after replacing the file path with the actual location where you downloaded the kernel code)
```
make NANOS_DIR=/path/to/nanos/kernel/source
```
The resulting klib binary file is located at kernel-open/_out/Nanos_x86_64/gpu_nvidia
copy the klib binary file to ~/.ops/nightly/klibs/ (first create this folder if it doesn't exist), which is the folder where Ops looks for klibs when creating an image from the nightly kernel build
download the NVIDIA 64-bit Linux drivers from the NVIDIA website; the driver version must match the version of the Linux kernel driver from which the Nanos klib has been derived, which can be found in the version.mk file in the klib source code; at the time of this writing, the klib driver version is 515.65.01
extract the contents of the NVIDIA Linux driver package by executing the downloaded file with the '-x' command line option:
```
~/Downloads/NVIDIA-Linux-x86_64-515.65.01.run -x
```
move to the directory containing the target filesystem for your unikernel application, and create a directory named "nvidia" with a subfolder named after the driver version, e.g. 515.65.01
copy the GSP firmware file from the NVIDIA drivers extracted earlier to the above directory:
```
cp ~/Downloads/NVIDIA-Linux-x86_64-515.65.01/firmware/gsp.bin nvidia/515.65.01/
```

Now the files needed to initialize the GPU and make it available to the application are in place. The rest of the steps depend on the actual application you want to run: it may be a TensorFlow program, a CUDA application, an OpenCL application, or anything else that uses, directly or indirectly, the NVIDIA lower-level libraries (which can be found in the NVIDIA driver package extracted in the above steps). If you use the CUDA toolkit, remember that the toolkit version must be compatible with the driver version; the driver download page on the NVIDIA website indicates which CUDA toolkit version is compatible with a given driver version. For example, the driver version 515.65.01 is compatible with the CUDA toolkit version 11.7.

In order to create a suitable unikernel image and then spin up a VM instance from that image, the Ops configuration file must include the NVIDIA GPU klib among the klibs to be inserted in the image, and the "nvidia" directory created above (containing the GSP firmware file) must be included in the image filesystem. In addition, the instance type chosen for running the image must support NVIDIA GPUs, and a GPU must be attached to the instance. The following configuration is an example using a Google Cloud "n1-standard-1" instance equipped with a Tesla T4 GPU:

```
{
  "CloudConfig": {
    "ProjectID": "my-proj",
    "Zone": "us-west1-b",
    "BucketName": "my-bucket",
    "Flavor": "n1-standard-1"
  },
  "RunConfig": {
    "GPUs": 1,
    "GPUType": "nvidia-tesla-t4"
  },
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"]
}
```

Before creating the image, amend the above configuration to include your application and its dependencies. Then, when creating the image, add the '-n' command line option so that Ops takes the Nanos kernel from the latest nightly build (the running kernel must match the kernel source that was referenced by the NANOS_DIR make variable when building the klib, and nightly builds are created from the master branch of the Nanos repository). Below are example command lines to create an image and then spin up an instance on GCP:

```
ops image create my_program -t gcp -c config.json -n
ops instance create my_program -t gcp -c config.json
```

After the instance creation is complete, if you look at the instance console logs you should see lines similar to the following:

```
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  515.65.01  Release Build  (user@hostname)  Fri Oct 21 18:54:35 CEST 2022
Loaded the UVM driver, major device number 0.
```

The above lines indicate that the GPU klib was loaded successfully, and the GPU attached to your instance is available for your application to use.
