Refactoring Nginx Worker Processes to be Multi-Threaded

We've supported running nginx as a unikernel, as-is with no patches, for a long time, but it still uses 1990s-style worker processes, so naturally we cut patches to convert those workers to threads and instantly got a free performance boost.

Let's step back in time, though, to understand why nginx works this way. Nginx was released in 2004, but work on it started around 2002. Back then the dominant webserver on Linux was Apache.

While the Intel Pentium 4 was released in 2000, it wasn't until 2002 that the Northwood cores brought Hyper-Threading to those processors.

"The beaver is out of detox" — Linus Torvalds, announcing Linux 2.6.0

Then in late December of 2003 (almost 2004) Linux 2.6 landed with NPTL, and threads as we know them in Linux today finally arrived.

It wasn't just the lack of consumer SMP-capable processors or of NPTL that was problematic at the time, though. We certainly needed those, but they were just enablers.

The programming idioms used in webservers before this time reflected the architecture and libraries available, and the reality of that period. CGI, in its infinite number of permutations, was a very common method of serving traffic. In its simplest form, an HTTP request would come in and the webserver would quite literally fork/exec a whole new program to serve each request, including database setup and all the other associated startup costs. For those not around during this period, you can imagine how bad this was from both a performance and a security viewpoint.

This would later evolve into pre-forking a handful of persistent worker processes and shuttling data along either unix sockets or a local TCP connection. Even though this style of infrastructure solved the backend scaling problem, there was still the problem of the front-end, and so it was common, on systems that could support it, to have multiple worker processes in front as well. That is what the 'worker_processes' variable in the nginx config files still does to this day.

Enter unikernels

Unikernels are single-process by nature and so architecturally are not allowed to pre-fork processes on their own. This is by design. To scale them you must either pre-spawn as many unikernels as you would have pre-forked processes, or rely on the software in question's ability to use threads. If you are using something like Go, Rust, or Java, you can scale pretty far on one vm simply because you have access to threads. Modern machines have hundreds of hardware threads available, and we only see this trend continuing.

Writing code to use multiple processes when the underlying language can use multiple threads is problematic and wasteful on many fronts. With large heaps, spawning new processes (such as for taking a database dump, as some applications do) can be very expensive and can also cause issues in GC-enabled languages. Multiple processes are inherently unsafe from a security point of view, and that's ignoring the sheer amount of code you'll see in various scripting languages wrapping command after command after command in highly unsafe calls. Finally, the complexity introduced by multi-process designs is very high: signaling, shared memory (because the processes can't access the same heap), dealing with existing resources such as connections that can't be carried into a newly created process, and so on.

It's a paradigm that worked for us in the past, but with newer languages such as Go and Rust, along with a wide collection of existing language support, it simply isn't advisable any longer. Even the scripting languages are starting to jump ship, and those are inherently single-process/single-threaded by design; with the embrace of isolates I feel those languages will eventually go full bore here, although arguably that's a much tougher task because of how those languages actually work.

For instance, it's fairly easy to make a multi-threaded webserver that runs something like PHP, but when you want to run WordPress on it you now have to go in and rewrite large portions of WordPress, because codebases like that assume a single-threaded environment and have never had to deal with issues such as global variables everywhere. I'm sure your comp sci professor told you in CS53 to avoid globals, but unfortunately that's just not real life.

One only needs to peruse some of the pull requests from projects like RoadRunner:

"You are not able to pass more than one request through the WP. Since WP treats the request as a global instance it will cache its values inside the config and other functions. This makes impossible to reuse already initiated application. This is caused by the fact that WP does not use any request/response abstraction."

or by looking at swoole:

"I'd like to know if its possible to run WordPress on Swoole?

No, there are a lot of global variables and other problems in WordPress"

Back to nginx. So why can't we scale with threads here, since it's written in C? Well, as I'm about to show you, we absolutely can, and sometimes it doesn't even take that much code to do so.

Early on, Nginx distinguished itself with an event-driven architecture built on epoll, but it would still spawn multiple frontend listener processes.

Refactoring Nginx

We decided to look at the worker_processes setting first, as that is the main setting stubbed out in the default configs of any nginx package you might find in the ops package repository. You might have noticed that nginx also has a 'thread_pool' setting, for which Nginx notes a 9X improvement, although that only offloads a few blocking calls such as read() and sendfile().
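For reference, the two knobs look like this in an nginx.conf (the port matches the package invocation later in this post; the rest is an illustrative fragment, not a complete config):

```nginx
# worker_processes controls the pre-forked frontend workers.
worker_processes  4;

# thread_pool/aio only offload blocking file I/O (read, sendfile)
# to threads *within* each worker -- the accept path stays per-process.
thread_pool default threads=32 max_queue=65536;

http {
    server {
        listen 8084;
        location / {
            aio       threads=default;
            sendfile  on;
        }
    }
}
```

So thread_pool helps with slow disk reads, but it doesn't touch the multi-process frontend model — that is what our patches change.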

One of the major problems with code written to work as multiple processes rather than multiple threads is the gratuitous use of global/static variables. You'll see this a lot in projects like this, and it's one of the main problems with scripting languages as well. If you are working in a codebase built around multiple processes, chances are there isn't any thread usage, or if there is, it's very isolated, since mixing the two can be very problematic.

So what you'll find is that in a multi-process program, having globals/statics everywhere isn't a huge issue, at least when it comes to shared state, since there is none: each new process gets its own copies. That's actually slightly a lie, though, because then you start using things like shared memory between processes, shuttling data back and forth over unix sockets, or issuing signals to other processes to "do something". When you convert these processes to threads, however, all of that state becomes shared and you have to think about it, and that's what makes people shy away from ever doing the conversion on legacy codebases. This was actually brought up on the postgres mailing list recently, and we even got in on the action and cut a unikernel for postgres.

Luckily for us, there is a simple replacement: we can prefix these variables with the __thread storage-class keyword, indicating that we want each thread to get its own copy of the variable.

I should state that while this lets us quickly convert nginx over, it doesn't mean that someone wanting to further enhance the code wouldn't be better served refactoring it under these new assumptions. Simply replacing one pattern with another almost never yields optimal architecture; typically in cases like this you'll want to rethink how the architecture works. For instance, instead of just plugging in __thread everywhere, why not encapsulate that state explicitly? If you end up going down this route with nginx or something else, please let us know!

Another change we made: in a multi-process configuration there is typically a master process that handles coordination amongst the child processes, so we removed the signaling for that. We ended up adding a new ngx_master_cycle to facilitate copying the contents of ngx_cycle_t.

Nginx used to pass data through unix sockets, but that isn't necessary in a multi-threaded setup, so we removed those and instead added support for getting the channel fds from the ngx_processes array, which we left as-is.

We got rid of the setuid/setgid calls for a few reasons: one, they utilize xidcmd, which can be racy, and two, we stub them anyway.

You can apply the patches yourself against a recent version of nginx (1.25.2 as of this writing):

git apply 0001-ngx_spawn_process-disable-non-blocking-operation-on-.patch
git apply 0002-Worker-processes-convert-to-worker-threads.patch

Or you can just run the package we built:

ops pkg load eyberg/nginx-mt:1.25.2 --nanos-version=042cf2f -p 8084

I should also state, as people have already asked: I wouldn't expect Nginx or F5 to pull this into their mainline. Nginx has been around for almost 20 years and so has a considerably large existing userbase, and for a portion of those users these changes would not work. That's ok.

This also points to our way of thinking. Having a new kernel that embraces certain characteristics doesn't change the fact that there is a large swath of software out there that is decades old and may or may not conform to newer patterns. A lot of software is just like the roads and bridges that should be repaired. Going in and making these changes allows broader community adoption for existing brown-field software while at the same time encouraging others to consider making more unikernel-native software. Not only do we get immediate improvements for a lot of existing software by provisioning it as unikernels; software explicitly made for unikernels can express itself in ways that older software can't.
