Extending io_uring Support in the Nanos Unikernel

Intro

Windows NT 3.1 was released on my tenth birthday (... a long time ago) and came with asynchronous I/O. Linux was stuck with POSIX AIO or Linux AIO for a very long time, until recently, when Jens Axboe gave us io_uring.

We initially put basic io_uring support into Nanos almost four years ago but have been steadily adding more of it as newer applications make use of it. In particular, we added support for send, accept and recv, which among other things lets you write things like webservers on top of io_uring.

Wait - you can use io_uring for sockets? Yes. io_uring started out as a mechanism for storage I/O, but sockets can be worked with as well.

First off - What Is IO_URING?

IO_URING is an asynchronous I/O API.

IO_URING is from the mind of Jens Axboe - you might know of the tool 'fio', or the flexible I/O tester, that he wrote. We used that so much that we ended up making some patches to turn its webserver into a multi-threaded version so we could test volume IOPS on the various cloud instances.

Async what? First, it might be helpful to review what distinguishes asynchronous I/O from synchronous I/O and what the options were before io_uring came along.

Normally when you call a syscall such as write(2), your program blocks until the call returns. Even if it is a small write, there is a chance that the operating system will switch your program out for another program while this happens, and of course in a single-threaded, single-process program, such as those written in most scripting languages, you can't do any more work until it returns.

Multiple threads can work around this; however, they generally only scale with the number of hardware threads or vCPUs available, and if you are using one of those languages without proper threads (which is a substantial number in 2024) you are just SOL.

Before io_uring there were the POSIX AIO and Linux AIO APIs; however, Linux AIO had many limitations and can still block under certain circumstances.

Linux AIO also did not support sockets (until fairly recently) and mandated O_DIRECT, which, while useful for certain applications, bypasses the page cache and so can in many circumstances slow things down considerably. If you didn't use O_DIRECT, it simply behaved synchronously, kind of defeating the point.

"What do you think? Do you think it might be possible to aim for a generic "do system call asynchronously" model instead?"
- Linus Torvalds, 2016

So at its core, io_uring allows you to queue up work for the kernel in a non-blocking manner, letting your program continue doing other useful work in the meantime.

We've had non-blocking sockets for a while through the use of things like select and epoll, but nothing comparable for storage I/O. Typically in the past, if you wanted something like that, you would rely on a thread pool. This is why libraries like libuv, which Node.js uses, expose an interface to hide this.

While there have been some security concerns in the past few years, we see that more as "this is new and of course you are going to see a lot of issues" than "this is fundamentally flawed". At the end of the day, io_uring brings far too many benefits not to keep working on it.

How Does io_uring Work?

IO_URING makes use of non-blocking ring buffers that are shared between the user application and the kernel.

The program puts I/O requests on a submission queue and reaps their results from a completion queue; the entries on these queues are called SQEs and CQEs respectively. This unique, elegant and simple interface allows several things. One, you can submit and complete an async operation in a single syscall. Two, you can make multiple I/O requests with one syscall, as opposed to a syscall for each individual request. Three, you can even link arbitrarily long chains of these requests together.

One really neat example of this is a file copy where each read and its matching write are linked together and submitted in a single syscall. Normally you would read from one file and then write to the other in separate syscalls. Check out this example, taken from the liburing repo:

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include "liburing.h"

#define QD  64
#define BS  (32*1024)

struct io_data {
  size_t offset;
  int index;
  struct iovec iov;
};

static int infd, outfd;
static int inflight;

static int setup_context(unsigned entries, struct io_uring *ring)
{
  int ret;

  ret = io_uring_queue_init(entries, ring, 0);
  if (ret < 0) {
    fprintf(stderr, "queue_init: %s\n", strerror(-ret));
    return -1;
  }

  return 0;
}

static int get_file_size(int fd, off_t *size)
{
  struct stat st;

  if (fstat(fd, &st) < 0)
    return -1;
  if (S_ISREG(st.st_mode)) {
    *size = st.st_size;
    return 0;
  } else if (S_ISBLK(st.st_mode)) {
    unsigned long long bytes;

    if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
      return -1;

    *size = bytes;
    return 0;
  }

  return -1;
}

static void queue_rw_pair(struct io_uring *ring, off_t size, off_t offset)
{
  struct io_uring_sqe *sqe;
  struct io_data *data;
  void *ptr;

  ptr = malloc(size + sizeof(*data));
  data = ptr + size;
  data->index = 0;
  data->offset = offset;
  data->iov.iov_base = ptr;
  data->iov.iov_len = size;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_readv(sqe, infd, &data->iov, 1, offset);
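  /* link this read to the next SQE: the write below won't start until the read completes */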
  sqe->flags |= IOSQE_IO_LINK;
  io_uring_sqe_set_data(sqe, data);

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_writev(sqe, outfd, &data->iov, 1, offset);
  io_uring_sqe_set_data(sqe, data);
}

static int handle_cqe(struct io_uring *ring, struct io_uring_cqe *cqe)
{
  struct io_data *data = io_uring_cqe_get_data(cqe);
  int ret = 0;

  data->index++;

  if (cqe->res < 0) {
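    /* when a read fails, its linked write completes with -ECANCELED; requeue the pair */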
    if (cqe->res == -ECANCELED) {
      queue_rw_pair(ring, data->iov.iov_len, data->offset);
      inflight += 2;
    } else {
      printf("cqe error: %s\n", strerror(-cqe->res));
      ret = 1;
    }
  }

  if (data->index == 2) {
    void *ptr = (void *) data - data->iov.iov_len;

    free(ptr);
  }
  io_uring_cqe_seen(ring, cqe);
  return ret;
}

static int copy_file(struct io_uring *ring, off_t insize)
{
  struct io_uring_cqe *cqe;
  off_t this_size;
  off_t offset;

  offset = 0;
  while (insize) {
    int has_inflight = inflight;
    int depth;

    while (insize && inflight < QD) {
      this_size = BS;
      if (this_size > insize)
        this_size = insize;
      queue_rw_pair(ring, this_size, offset);
      offset += this_size;
      insize -= this_size;
      inflight += 2;
    }

    if (has_inflight != inflight)
      io_uring_submit(ring);

    if (insize)
      depth = QD;
    else
      depth = 1;
    while (inflight >= depth) {
      int ret;

      ret = io_uring_wait_cqe(ring, &cqe);
      if (ret < 0) {
        printf("wait cqe: %s\n", strerror(-ret));
        return 1;
      }
      if (handle_cqe(ring, cqe))
        return 1;
      inflight--;
    }
  }

  return 0;
}

int main(int argc, char *argv[])
{
  struct io_uring ring;
  off_t insize;
  int ret;

  if (argc < 3) {
    printf("%s: infile outfile\n", argv[0]);
    return 1;
  }

  infd = open(argv[1], O_RDONLY);
  if (infd < 0) {
    perror("open infile");
    return 1;
  }
  outfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (outfd < 0) {
    perror("open outfile");
    return 1;
  }

  if (setup_context(QD, &ring))
    return 1;
  if (get_file_size(infd, &insize))
    return 1;

  ret = copy_file(&ring, insize);

  close(infd);
  close(outfd);
  io_uring_queue_exit(&ring);
  return ret;
}

This example uses liburing, which most people should probably just use, but we'll show the next example without it. Under the covers, you first call io_uring_setup to set up the initial queues. Then you fill in your SQEs. Then you call io_uring_enter, and once the kernel is done you find your completions in the shared buffer. io_uring_enter doesn't have to block: unless you ask it to wait for a minimum number of completions, it returns right away and you can reap completions in-line whenever it suits you.
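To make that flow concrete, here is a minimal sketch of a single asynchronous read using liburing (error handling is elided and the file path is just a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include "liburing.h"

int main(void)
{
    struct io_uring ring;
    char buf[4096];

    io_uring_queue_init(8, &ring, 0);                  /* wraps io_uring_setup */

    int fd = open("/etc/hostname", O_RDONLY);          /* placeholder file */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);  /* fill in the SQE */
    io_uring_sqe_set_data(sqe, buf);                   /* tag the request */
    io_uring_submit(&ring);                            /* wraps io_uring_enter */

    /* ... the program is free to do other work here ... */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                    /* reap the completion */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);                     /* mark the CQE consumed */

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}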

The SQ buffer is writeable only by the application, while the CQ buffer is writeable only by the kernel.

CQEs are what you get back, and they are quite straightforward, so let's show those first:

  • user_data - the data we attached to the request, passed back to us
  • res - the result code of the operation
  • flags
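
In C, the base definition from linux/io_uring.h is exactly that small:

struct io_uring_cqe {
    __u64 user_data;  /* the sqe->user_data value, passed back to us */
    __s32 res;        /* result code for this event */
    __u32 flags;
};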

SQEs are a bit more complex than CQEs but essentially contain:

  • opcode - the operation to perform (eg: which syscall equivalent)
  • fd - the file descriptor to operate on
  • ioprio - the I/O priority of the request
  • offset - the offset at which the operation should take place
  • addr - the address of the buffer (or iovec array) for the operation
  • len - byte count or number of vectors
  • flags

There's actually a lot more to this struct but this is the important part.
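
For reference, here is an abridged sketch of the struct from linux/io_uring.h (the real header has several more union members and padding fields):

struct io_uring_sqe {
    __u8  opcode;     /* type of operation for this sqe */
    __u8  flags;      /* IOSQE_ flags */
    __u16 ioprio;     /* I/O priority for the request */
    __s32 fd;         /* file descriptor to do IO on */
    __u64 off;        /* offset into file (a union with addr2) */
    __u64 addr;       /* pointer to buffer or iovecs */
    __u32 len;        /* buffer size or number of iovecs */
    union {           /* per-opcode flags */
        __u32 msg_flags;
        __u32 accept_flags;
        /* ... */
    };
    __u64 user_data;  /* data to be passed back at completion time */
    /* ... remaining union members and padding omitted ... */
};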

What is really interesting is that any syscall could potentially be made async via io_uring, as its focus is really on transport. We now have 27 ops supported.

Let's take a look at this socket example taken from our test suite (keep in mind this doesn't use liburing, so there is more ceremony here than what you might normally write):

#include <stdio.h>
#include <stdbool.h>
#include <limits.h>
#include <string.h>
#include <stdint.h>
#include <assert.h>
#include <unistd.h>
#include <errno.h>

#include <sys/param.h>
#include <arpa/inet.h>

#include <sys/mman.h>
#include <sys/socket.h>

#include <sys/syscall.h>
#include <linux/io_uring.h>

#ifndef SYS_io_uring_setup
#define SYS_io_uring_setup      425
#endif
#ifndef SYS_io_uring_enter
#define SYS_io_uring_enter      426
#endif
#ifndef SYS_io_uring_register
#define SYS_io_uring_register   427
#endif

#define BUF_SIZE        8192

typedef unsigned char u8;
typedef u8 boolean;

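/* The rings are shared with the kernel, so fence our updates: SQE contents
 * must be globally visible before we publish a new tail, and we must not
 * read a CQE before observing the kernel's tail update. */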
static inline __attribute__((always_inline)) void write_barrier(void)
{
    asm volatile("sfence" ::: "memory");
}

static inline __attribute__((always_inline)) void read_barrier(void)
{
    asm volatile("lfence" ::: "memory");
}

struct iour {
    struct io_uring_params params;
    int fd;
    uint8_t *rings;
    struct io_uring_sqe *sqes;
    uint32_t *sq_head;
    uint32_t *sq_tail;
    uint32_t sq_mask;
    uint32_t *sq_array;
    uint32_t *cq_head;
    uint32_t *cq_tail;
    uint32_t cq_mask;
    struct io_uring_cqe *cqes;
};

static int iour_init(struct iour *iour, unsigned int entries)
{
    iour->fd = syscall(SYS_io_uring_setup, entries, &iour->params);
    if (iour->fd < 0)
        return iour->fd;

    assert(iour->params.features & IORING_FEAT_SINGLE_MMAP);

    /* Exploit the single mmap feature and map both SQ and CQ rings with a
     * single syscall. */
    uint32_t sqring_size = iour->params.sq_off.array +
            iour->params.sq_entries * sizeof(uint32_t);
    uint32_t cqring_size = iour->params.cq_off.cqes +
            iour->params.cq_entries * sizeof(struct io_uring_cqe);
    iour->rings = mmap(0, MAX(sqring_size, cqring_size), PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_POPULATE, iour->fd, IORING_OFF_SQ_RING);
    assert(iour->rings != MAP_FAILED);

    iour->sq_head = (uint32_t *)(iour->rings + iour->params.sq_off.head);
    iour->sq_tail = (uint32_t *)(iour->rings + iour->params.sq_off.tail);
    iour->sq_mask = *(uint32_t *)(iour->rings + iour->params.sq_off.ring_mask);
    iour->sq_array = (uint32_t *)(iour->rings + iour->params.sq_off.array);
    iour->cq_head = (uint32_t *)(iour->rings + iour->params.cq_off.head);
    iour->cq_tail = (uint32_t *)(iour->rings + iour->params.cq_off.tail);
    iour->cq_mask = *(uint32_t *)(iour->rings + iour->params.cq_off.ring_mask);
    iour->cqes = (struct io_uring_cqe *)(iour->rings + iour->params.cq_off.cqes);
    iour->sqes = mmap(0, iour->params.sq_entries * sizeof(struct io_uring_sqe),
        PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, iour->fd,
        IORING_OFF_SQES);
    assert(iour->sqes != MAP_FAILED);

    assert(iour->params.sq_entries >= entries);
    assert(*iour->sq_head == 0 && *iour->sq_tail == 0);
    assert(iour->sq_mask == iour->params.sq_entries - 1);
    assert(*(uint32_t *)(iour->rings + iour->params.sq_off.flags) == 0);
    assert(*(uint32_t *)(iour->rings + iour->params.sq_off.dropped) == 0);
    assert(iour->params.cq_entries >= entries);
    assert(*iour->cq_head == 0 && *iour->cq_tail == 0);
    assert(iour->cq_mask == iour->params.cq_entries - 1);
    assert(*(uint32_t *)(iour->rings + iour->params.cq_off.overflow) == 0);

    /* Use some non-trivial ordering of SQEs */
    for (int i = 0; i < iour->params.sq_entries; i++)
        iour->sq_array[i] = iour->params.sq_entries - 1  - i;

    return 0;
}

static struct io_uring_sqe *iour_get_sqe(struct iour *iour)
{
    assert(*iour->sq_tail >= *iour->sq_head);
    assert(*iour->sq_tail - *iour->sq_head <= iour->params.sq_entries);
    if (*iour->sq_tail == *iour->sq_head + iour->params.sq_entries)
        return NULL;
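    /* SQ ring entries are indices into the SQE array */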
    return &iour->sqes[iour->sq_array[*iour->sq_tail & iour->sq_mask]];
}

static void iour_setup_txrx(struct iour *iour, boolean tx, int fd,
                            uint8_t *buf, uint32_t len, int flags,
                            uint64_t user_data)
{
    struct io_uring_sqe *sqe = iour_get_sqe(iour);

    assert(sqe);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = tx ? IORING_OP_SEND : IORING_OP_RECV;
    sqe->fd = fd;
    sqe->off = 0;
    sqe->addr = (uint64_t)buf;
    sqe->len = len;
    sqe->msg_flags = flags;
    sqe->user_data = user_data;
    sqe->buf_index = 0;
    write_barrier();
    (*iour->sq_tail)++;
}

static void iour_setup_sqe(struct iour *iour, uint8_t opcode, int fd,
                           uint64_t addr, uint32_t len, uint64_t offset,
                           uint64_t user_data)
{
    struct io_uring_sqe *sqe = iour_get_sqe(iour);

    assert(sqe);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = opcode;
    sqe->fd = fd;
    sqe->addr = addr;
    sqe->len = len;
    sqe->off = offset;
    sqe->user_data = user_data;
    write_barrier();
    (*iour->sq_tail)++;
}

static void iour_setup_read(struct iour *iour, int fd, uint8_t *buf,
                            uint32_t len, uint64_t offset,
                            uint64_t user_data)
{
    iour_setup_sqe(iour, IORING_OP_READ, fd, (uint64_t)buf, len, offset,
        user_data);
}

static int iour_exit(struct iour *iour)
{
    return close(iour->fd);
}

static struct io_uring_cqe *iour_get_cqe(struct iour *iour)
{
    struct io_uring_cqe *cqe;

    read_barrier();
    if (*iour->cq_tail == *iour->cq_head)
        return NULL;
    assert(*iour->cq_tail > *iour->cq_head);
    assert(*iour->cq_tail - *iour->cq_head <= iour->params.cq_entries);
    cqe = &iour->cqes[*iour->cq_head & iour->cq_mask];
    (*iour->cq_head)++;
    return cqe;
}

static void iour_setup_accept(struct iour *iour, int fd,
                              struct sockaddr *addr, socklen_t *addr_len,
                              int flags, uint64_t user_data)
{
    struct io_uring_sqe *sqe = iour_get_sqe(iour);

    assert(sqe);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_ACCEPT;
    sqe->fd = fd;
    sqe->addr = (uint64_t)addr;
    sqe->addr2 = (uint64_t)addr_len;
    sqe->accept_flags = flags;
    sqe->user_data = user_data;
    write_barrier();
    (*iour->sq_tail)++;
}

static int iour_submit(struct iour *iour, unsigned int count,
                       unsigned int min_complete)
{
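    /* IORING_ENTER_GETEVENTS: wait until at least min_complete completions are available */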
    return syscall(SYS_io_uring_enter, iour->fd, count, min_complete,
        IORING_ENTER_GETEVENTS, NULL);
}

int main() {
    struct iour iour;
    int tx_fd, rx_fd;
    struct sockaddr_in addr;
    uint8_t read_buf[BUF_SIZE], write_buf[BUF_SIZE];
    struct io_uring_cqe *cqe;

    memset(&iour.params, 0, sizeof(iour.params));
    assert(iour_init(&iour, 1) == 0);

    tx_fd = socket(AF_INET, SOCK_DGRAM, 0);
    assert(tx_fd > 0);

    rx_fd = socket(AF_INET, SOCK_DGRAM, 0);
    assert(rx_fd > 0);

    addr.sin_family = AF_INET;
    addr.sin_port = htons(1234);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    assert(connect(tx_fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);
    assert(bind(rx_fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);

    /* start an asynchronous read when there is nothing to be read */
    iour_setup_read(&iour, rx_fd, read_buf, BUF_SIZE, 0, 0);
    assert(iour_submit(&iour, 1, 0) == 1);
    assert(iour_get_cqe(&iour) == NULL);

    /* do a write and wait for the asynchronous read to complete */
    for (uint64_t i = 0; i < BUF_SIZE; i += sizeof(i))
        memcpy(write_buf + i, &i, sizeof(i));
    assert(write(tx_fd, write_buf, BUF_SIZE) == BUF_SIZE);
    for (int retry = 0; retry < INT_MAX; retry++) {
        cqe = iour_get_cqe(&iour);
        if (cqe)
            break;
    }
    assert(cqe && (cqe->user_data == 0) && (cqe->res == BUF_SIZE));
    for (uint64_t i = 0; i < BUF_SIZE; i += sizeof(i))
        assert(memcmp(read_buf + i, &i, sizeof(i)) == 0);

    iour_setup_txrx(&iour, true, 0, write_buf, BUF_SIZE, 0, 0); /* non-socket file descriptor */
    assert(iour_submit(&iour, 1, 1) == 1);
    cqe = iour_get_cqe(&iour);
    assert(cqe && (cqe->user_data == 0) && (cqe->res == -ENOTSOCK));

    iour_setup_txrx(&iour, true, tx_fd, write_buf, BUF_SIZE, 0, 0);
    assert(iour_submit(&iour, 1, 1) == 1);
    cqe = iour_get_cqe(&iour);

    assert(cqe && (cqe->user_data == 0) && (cqe->res == BUF_SIZE));
    iour_setup_txrx(&iour, false, rx_fd, read_buf, BUF_SIZE, 0, 0);
    assert(iour_submit(&iour, 1, 1) == 1);
    cqe = iour_get_cqe(&iour);
    assert(cqe && (cqe->user_data == 0) && (cqe->res == BUF_SIZE));

    close(tx_fd);
    close(rx_fd);
    tx_fd = socket(AF_INET, SOCK_STREAM, 0);
    assert(tx_fd > 0);
    rx_fd = socket(AF_INET, SOCK_STREAM, 0);
    assert(rx_fd > 0);
    assert(bind(rx_fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);
    assert(listen(rx_fd, 1) == 0);

    /* start an asynchronous accept when there is no peer waiting to be connected */
    iour_setup_accept(&iour, rx_fd, NULL, NULL, 0, 0);
    assert(iour_submit(&iour, 1, 0) == 1);
    assert(iour_get_cqe(&iour) == NULL);

    /* do a connect and wait for the asynchronous accept to complete */
    assert(connect(tx_fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);
    for (int retry = 0; retry < INT_MAX; retry++) {
        cqe = iour_get_cqe(&iour);
        if (cqe)
            break;
    }

    assert(cqe && (cqe->user_data == 0));
    close(rx_fd);
    rx_fd = cqe->res;
    assert(rx_fd > 0);

    assert(write(tx_fd, write_buf, 1) == 1);
    assert(read(rx_fd, read_buf, 1) == 1);

    close(tx_fd);
    close(rx_fd);
    assert(iour_exit(&iour) == 0);
}

io_uring is definitely one of the larger, more consequential changes made to Linux in a long time and, as we pointed out earlier, a "catching up" type of change. The io_uring story in Linux is not done yet, and we'll keep adding more support to Nanos as our users and customers request it. If you have an application that uses io_uring and runs on Nanos, or you would like it to run on Nanos - let us know!
