General-purpose operating systems use a filesystem (fs) to store data persistently, for example on a hard disk. Generally speaking, data is stored in files, and files belong to directories, so most filesystems are organized hierarchically. The fs stores file contents in blocks on a physical device. When a user accesses a file, the fs locates the blocks that belong to that file and returns their content to the user. The fs needs metadata to support these operations. For example, it keeps a bitmap of the free blocks, which can be allocated to new files. There are different ways to organize data on disk, and hence different filesystems, e.g., FAT, ext2/ext3, NTFS, etc. In this article, we look at what a filesystem is and how the fs is implemented in the Nanos kernel.
The Virtual FileSystem
To get a better idea of how a fs is implemented, let’s start with Unix/Linux. The design of the Unix operating system, and therefore of Linux, is centered on its filesystem. In other words, everything is represented as a file, e.g., a block device, a char device, a network device, a socket, etc. Linux is based on the Virtual FileSystem (VFS), in which each filesystem is described in terms of blocks, superblocks, inodes and files. The VFS is a kernel software layer that handles all system calls related to the standard Unix filesystem. Its main objective is to provide a common interface to several kinds of filesystems. Thus, a user application can write to a file without knowing which fs or which block driver is behind it. Both the fs and the block driver are implemented as drivers for the VFS, and this layer dispatches each call to the right method depending on the file. A filesystem developer only has to provide the right fs interface and the right block driver interface and plug them into the VFS. The lowest layer is the buffer-cache, which allows accessing the device at the block level and also keeps a cache of the most used blocks. Broadly speaking, when the user opens a file, the VFS looks up the handler for the corresponding path and calls the open() method for that file, which returns a file descriptor (fd). A later write() on that fd is routed to the write() method that corresponds to the file. The write handler ends up translating the position in the file into a sector address that the buffer-cache understands.
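The dispatch described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Linux VFS API: the names `FileOps`, `MemFs`, `Vfs` and the mount-prefix lookup are all hypothetical, and the "filesystem" is just an in-memory dictionary.

```python
class FileOps:
    """Interface a filesystem driver offers to the VFS layer (hypothetical)."""

    def open(self, path):
        raise NotImplementedError

    def write(self, handle, data):
        raise NotImplementedError


class MemFs(FileOps):
    """Toy filesystem that keeps file contents in a dictionary."""

    def __init__(self):
        self.files = {}

    def open(self, path):
        self.files.setdefault(path, bytearray())
        return path  # the path doubles as the driver-private handle

    def write(self, handle, data):
        self.files[handle] += data
        return len(data)


class Vfs:
    """Routes open()/write() to whichever driver owns the path."""

    def __init__(self):
        self.mounts = {}   # mount prefix -> driver
        self.fds = {}      # fd -> (driver, driver-private handle)
        self.next_fd = 3   # 0, 1 and 2 are conventionally taken

    def mount(self, prefix, driver):
        self.mounts[prefix] = driver

    def open(self, path):
        candidates = [p for p in self.mounts if path.startswith(p)]
        if not candidates:
            raise FileNotFoundError(path)
        prefix = max(candidates, key=len)  # longest matching mount wins
        driver = self.mounts[prefix]
        handle = driver.open(path[len(prefix):])
        fd = self.next_fd
        self.next_fd += 1
        self.fds[fd] = (driver, handle)
        return fd

    def write(self, fd, data):
        driver, handle = self.fds[fd]
        return driver.write(handle, data)
```

The application only ever sees `Vfs.open()` and `Vfs.write()`; which driver ends up handling the data is decided by the mount table.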
Such a level of generality may not be needed for a unikernel. For example, unikernels may not need several filesystems or the whole filesystem may remain static during the execution of the VM. Also, we may not require different users, groups or permissions. Cloud providers usually expose a few block devices so only one block driver could be enough. These requirements may call for a solution that is different from a general purpose fs. In the following, we present the filesystem that has been written for the Nanos kernel.
Logfs and extents
First of all, let’s introduce two concepts: the log-structured filesystem and the extent-based filesystem. A log-structured fs stores changes to the fs as log records. The log is append-only, it is kept in memory, and it may contain both data and metadata. Writing a file creates a few log records, for example, “assign block X to file Y” and “write ‘hello’ into block X”. To read a file, it has to be reconstructed from the log. When the in-memory log fills up, it is flushed to disk. By then, some of the records may be out of date, so they do not need to be written to disk, and the log can be compacted by removing them. When the log reaches the end of the disk, it is cleaned by identifying free space that can be reused.
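A minimal sketch of these ideas, assuming a log of simple key/value records. The `Log` class and the record keys are hypothetical and are not taken from any real filesystem: `replay()` reconstructs the current state from the records, and `compact()` drops records that a later record has made obsolete.

```python
class Log:
    """Append-only log of (key, value) records (illustrative only)."""

    def __init__(self):
        self.records = []

    def append(self, key, value):
        # Existing records are never modified; changes are only appended.
        self.records.append((key, value))

    def replay(self):
        """Rebuild the current state; later records override earlier ones."""
        state = {}
        for key, value in self.records:
            state[key] = value
        return state

    def compact(self):
        """Keep only the most recent record for each key, in log order."""
        seen = set()
        kept = []
        for key, value in reversed(self.records):
            if key not in seen:
                seen.add(key)
                kept.append((key, value))
        kept.reverse()
        self.records = kept


log = Log()
log.append("block_of:Y", "X")    # "assign block X to file Y"
log.append("data:X", "hell")     # superseded by the next record
log.append("data:X", "hello")    # "write 'hello' into block X"
```

Replaying the log before and after `compact()` yields the same state; compaction only removes records that no longer contribute to it.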
In the context of storing files on disk, an extent indicates which blocks belong to a file. It is a pointer-length pair: the pointer is a disk sector address, and the length tells how many consecutive blocks, starting at that address, are part of the extent. This differs from the direct pointers in the inodes of the extended filesystems, like ext2 or ext3, where each pointer refers to a single block.
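To make the difference concrete, the sketch below collapses a list of per-block pointers into extents. The `Extent` type and the `to_extents` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    start: int   # address of the first block of the run
    length: int  # number of consecutive blocks in the run

    def blocks(self):
        """Expand the extent back into individual block addresses."""
        return range(self.start, self.start + self.length)


def to_extents(block_pointers):
    """Merge runs of consecutive block addresses into extents."""
    extents = []
    for block in block_pointers:
        last = extents[-1] if extents else None
        if last is not None and last.start + last.length == block:
            # The block directly follows the previous run: grow that extent.
            extents[-1] = Extent(last.start, last.length + 1)
        else:
            extents.append(Extent(block, 1))
    return extents
```

Six direct pointers such as 100, 101, 102, 103, 200, 201 become two extents, (100, 4) and (200, 2). The more contiguous the file is on disk, the fewer entries the metadata needs.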
Compared with a traditional filesystem, a log-structured fs achieves better write performance by avoiding seeks: the log is always written to disk sequentially. In the Nanos kernel, we leverage that property and rely on a log-structured fs to provide fs access to applications running on top of the unikernel.
The Tuple FileSystem
The Tuple FileSystem (TFS) is the fs implemented in the Nanos kernel. The original motivation was to get a bare minimum of capabilities working first, and TFS is still a work in progress. In the following, we present the main characteristics of this fs.
The fs is represented as a hierarchical set of tuples, and any update to this structure is recorded in a log, in the style of a log-structured fs. A tuple can represent a directory or a file. The root tuple also stores information such as paths, environment variables, and debugging flags. The tuple-based metadata allows us to add properties to files in a flexible way. At mount time, this hierarchical structure is created, and from then on it is always kept in memory. Each update to the structure is translated into a log record.
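As a rough illustration, and not the actual Nanos data structures or on-disk format, the tuple hierarchy can be pictured as nested dictionaries in which every update also appends a record to a log. The helper names and the keys used here (`environment`, `children`, `filelength`) are assumptions made for this sketch.

```python
def set_path(root, path, value, log):
    """Set a value in the nested tuple tree and record the change."""
    *parents, leaf = path
    node = root
    for name in parents:
        node = node.setdefault(name, {})  # create intermediate tuples
    node[leaf] = value
    log.append((tuple(path), value))      # one log record per update


def rebuild(log):
    """Recreate the in-memory tree by replaying the log, as at mount time."""
    root = {}
    for path, value in log:
        set_path(root, list(path), value, [])  # scratch log, discarded
    return root


root, log = {}, []
set_path(root, ["environment", "HOME"], "/root", log)
set_path(root, ["children", "hello.txt", "filelength"], 5, log)
set_path(root, ["children", "hello.txt", "filelength"], 11, log)
```

After these calls the tree holds the latest file length, while the log still contains all three records, one of which is now obsolete.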
When a tuple is updated, a record is created in the log. The log contains metadata about the directory structure, file lengths, and manifest-type configuration, as well as the collection of extents that map file offsets to storage offsets. Note that the file data itself does not go into the log; the log only keeps changes to file metadata. For example, when a new extent is added to a file, that change is stored in the log. Any range between zero and the file length that is not covered by an extent is considered a gap and reads as zeros. This gives support for sparse files.
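The way gaps read as zeros can be sketched as follows. The representation of an extent as a `(file_offset, storage_offset, size)` triple and the `read_file` helper are assumptions for illustration, and the sketch assumes every extent lies entirely within the storage buffer.

```python
def read_file(extents, storage, length):
    """Assemble a file of `length` bytes from its extents.

    extents: iterable of (file_offset, storage_offset, size) triples.
    storage: bytes-like object standing in for the disk.
    """
    out = bytearray(length)  # zero-filled, so uncovered ranges read as zeros
    for file_off, storage_off, size in extents:
        n = min(size, length - file_off)  # clip extents at the file length
        if n > 0:
            out[file_off:file_off + n] = storage[storage_off:storage_off + n]
    return bytes(out)


storage = b"xxxxhelloWORLD"
extents = [(0, 4, 5), (10, 9, 5)]   # bytes 5..9 of the file have no extent
data = read_file(extents, storage, 15)
```

Here `data` starts with `hello`, continues with five zero bytes for the gap, and ends with `WORLD`; no storage is consumed for the gap.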
On disk, data is stored in sectors, grouped into logical units called "blocks". Blocks are allocated from an id heap that lives in memory. There is no file allocation table or any other kind of allocator structure on storage itself; everything is allocated through the id heap (which in turn uses the bitmap allocator). The same storage area (volume) is shared between file storage and log extensions.
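A minimal sketch of an in-memory bitmap allocator, assuming a simple first-fit policy. The `BitmapAllocator` class is hypothetical and is not the Nanos id heap interface; it only shows how a bitmap can hand out runs of consecutive blocks.

```python
class BitmapAllocator:
    """First-fit allocator over a bitmap of blocks (illustrative only)."""

    def __init__(self, nblocks):
        self.used = [False] * nblocks  # one flag per block, kept in memory

    def alloc(self, count):
        """Return the start of `count` consecutive free blocks, or None."""
        run = 0
        for i, in_use in enumerate(self.used):
            run = 0 if in_use else run + 1
            if run == count:
                start = i - count + 1
                for j in range(start, i + 1):
                    self.used[j] = True
                return start
        return None

    def free(self, start, count):
        """Mark `count` blocks starting at `start` as free again."""
        for j in range(start, start + count):
            self.used[j] = False
```

Because allocations are runs of consecutive blocks, each successful `alloc()` maps naturally onto one extent.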
Log updates are buffered in memory for only a second before being written to disk. As the fs operates, tuples are updated, so records in the log may become obsolete. Log compaction occurs once the ratio of the total number of entries in the log to the number of obsolete entries falls below 2:1, that is, once more than half of the entries are obsolete. During compaction, obsolete records are removed and the “time zero” of the log structure is moved forward.
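The compaction trigger boils down to a small piece of arithmetic. The function below is a hypothetical helper, not the Nanos implementation, and uses integer arithmetic to avoid dividing by zero when nothing is obsolete.

```python
def should_compact(total_entries, obsolete_entries):
    """True once total:obsolete drops below 2:1."""
    # total / obsolete < 2  is equivalent to  total < 2 * obsolete
    return total_entries < 2 * obsolete_entries
```

With 10 entries in the log, 5 obsolete entries give exactly 2:1 and do not trigger compaction, while 6 obsolete entries do.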
In this article, we have presented some basic notions of a filesystem, like files and directories, as well as the metadata needed to organize them, like inodes and blocks. In particular, we have talked about log-structured filesystems and extents. We have shown that the main characteristic of a log-structured fs is that all updates are kept as log records. This differs from a traditional filesystem, in which changes update metadata structures, like an inode or a superblock, that are quickly written to disk. In a log-structured fs, updates are batched in a log and flushed to disk periodically. We have shown that the Nanos kernel leverages these features and implements TFS, a minimal fs that relies on tuples and log records and provides a simple fs abstraction. In the future, Nanos may be extended to support a VFS, thus enabling different filesystems to plug into the kernel. For example, this could allow support for a filesystem that performs better for large files.
Matias E. Vara Larsen