Red-black trees (rbtrees) are data-structures widely used in the kernel. For example, this data-structure is one of the building blocks of the Linux scheduler. In this article, we roughly overview general concepts about trees and then we focus on what makes rbtrees very interesting. In the context of kernel development, we present which data-structures are better described by using rbtrees. We finish this article by presenting how rbtrees are used in NanoVMs and its API. If you want to get a deeper understanding about rbtrees, we recommend you to follow this link.
Trees are a widely used data-structure in which nodes are hierarchically distributed. Each tree has a root node. The root node has zero or more child nodes. Each child node has zero or more child nodes. In general, each node can contain any data as a key.
Among tree types, one interesting type is the Binary Tree in which each node has two or less children. In particular, Binary Search Trees (bst) have the property that, for every node, the key of the node is greater than all the keys of the left subtree and less than all the keys of the right subtree. Thus, the lookup operation over a bst has the same complexity as the binary search algorithm, which is O(lg n) where n is the number of elements in the tree. Note that this is better than the lookup operation over a linked-list, which is O(n).
We highlight tree common operations over a tree:
- Insert/remove an element
- In-order traversal
- Pre-order traversal
- Post-order traversal
Let’s present the different ways to traverse a tree. When traversing a tree in-order, the left subtree is visited first, then the root node, and finally the right tree. In a bst, this operation allows visiting the nodes in a sorted way. When traversing in a pre-order way, the tree is traversed by first visiting the root node, then the left subtree, and finally the right subtree. This traversing is usually used to find an element in the tree by looking first at the root node and exiting when it is the node that we look for. Finally, when traversing in a post-order way, the tree is traversed by first visiting the left subtree, then the right subtree and finally the root node.
A rbtree is a bst in which nodes contain an extra bit that represents a color. Each node is identified with either the black or the red color. This bit is used to keep the tree balanced after insertions and deletions operations. Let’s explain why self-balancing is important. For example, let’s say that we insert the following keys in a bst: (1), (2) and (3). If we insert first the (1), and then, (2) and (3), the complexity is O(n). However, if we first insert (2), and then, (1) and (3), the complexity is O(lg n). Thus, bst worst-case execution is O(n). Conversely, rbtrees are balanced after each operation thus the worst-case execution of all operations is always O(lg n). This makes operations over rbtrees very stable in terms of execution time. Also, the extra color only requires an extra bit thus rbtrees consume almost the same amount of memory as bst.
Red Black Trees as a kernel data structure
We have shown that operations over a rbtree are very stable in terms of time complexity. This makes this kind of tree good for time-critical components like a scheduler. For example, in the Linux kernel, the Completely Fair Scheduler (CFS) is implemented by relying on a rbtree. The scheduler keeps a rbtree in which each node represents a schedulable entity like a process or a thread. The rbtree is sorted by using the execution time of each schedulable entity. In this tree, the leftmost node is the task that has the lowest spent execution time. When the scheduler is called, it simply picks up the leftmost node as the next new task for scheduling. The complexity of the algorithm that inserts nodes is O(log n). Choosing the next entity to run is made in constant time because the leftmost node is always cached.
NanoVMs also leverages rbtree properties to keep different kernel data structures. In the current implementation, each parent node has an array of two pointers that point to each child node. When defining a new tree, the developer has to provide a function that will be used to compare nodes. We can think of this function as the comparator for qsort(). For rbtree, the comparator function takes as parameters the two nodes to compare and also the offset in which the node is defined. This offset is used to get the structure that contains the node, i.e, the context. For example, if the node is defined as a field in a thread, the function moves the pointer to the required offset to get the beginning of the thread structure. Any field in the thread can be used to compare nodes. For example, if we want to have a rbtree in which nodes are sorted by the id, we can simply use the tid field from the thread structure. This is the Thread rbtree that we present in the following section.
The Threads Red-Black Tree
The kernel in NanosVM keeps a rbtree that stores all the threads in the system. When a thread is created, a node is inserted in the tree. This rbtree defines as a comparator the function thread_tid_compare, which is used to compare the tid of each thread.
To create the rbtree, the function allocate_rbtree is used. For example, this is where the rbtree is allocated and initialized:
p->threads = allocate_rbtree(h, closure(h, thread_tid_compare), closure(h, tid_print_key)
The function requires to define a handler, i.e., thread_id_compare, which is used to compare nodes. The function thread_tid_compare is used to choose between two threads by comparing their thread id. When a thread is created, the following line inserts it into the rbtree:
For example, If you would like to count the user time that all threads have consumed, you can simply travers the rbtree and accumulate the utime of each thread. This is done by the following code:
rbtree_traverse(p->threads, RB_INORDER, stack_closure(count_thread_time, &utime, true));
This function traverses the tree in ORDER, i.e., RB_INORDER and applies the function count_thread_time to each node. This function returns in utime the amount of utime of all threads. This operation is O(n).
In this article, we have presented an overall overview of rbtrees. We have first presented some general concepts like a binary tree and a bst. We have used these concepts to introduce rbtree, which are self-balanced binary search trees. We have shown that operations over a rbtree are more stable in terms of execution time than in a bst. This makes rbtrees interesting for time-critical components. This article is just a rough overview on rbtrees. We strongly recommend you to read more to have a better picture on benefits of rbtree and other potential use-cases.
Matias E. Vara Larsen