One of our customers reported crashes with a Go payload and suspected they were tied to our lack of support for MADV_DONTNEED in madvise. I thought this was weird, as we use Go internally for several of our services (the unikernel repo, for instance, is a Go unikernel running on Google Cloud), and we've been monitoring these systems for a very long time. We very much eat our own dogfood, and we run things on a number of different platforms to do so, as we support north of 14 different clouds/hypervisors. Turns out they were right.
What is madvise
Let's back up first though - what is madvise to begin with?
madvise lets an application tell the kernel what access pattern to expect for a particular region of memory. For instance, it can tell the kernel that a region is no longer needed and can be discarded, or that the region will be needed soon, so the kernel can proactively bring it into memory using read-ahead.
There are around five advice values available in POSIX's posix_madvise, but Linux has an additional 21 types of advice you can give.
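As a rough illustration of the "expected soon" case, here is a minimal sketch that maps a file and hints that it will be read front to back. The helper name sum_file_with_hints is made up for this example, and since advice is only a hint, the sketch ignores madvise failures rather than treating them as errors.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* expose madvise() on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: sums the bytes of a file after hinting the kernel
 * about how the mapping will be read. Returns -1 if the mapping fails. */
long sum_file_with_hints(int fd, size_t len) {
    assert(len > 0);
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Hints only: the read loop below is correct whether or not they succeed. */
    (void)madvise(p, len, MADV_SEQUENTIAL); /* expect a front-to-back scan */
    (void)madvise(p, len, MADV_WILLNEED);   /* start read-ahead now */

    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];

    munmap(p, len);
    return sum;
}
```

Call it with an open file descriptor and the file's length. The result is the same with or without the hints; only the paging behavior may differ.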
What is MADV_DONTNEED
When you call madvise with MADV_DONTNEED on a region of private anonymous memory, the pages are dropped immediately, so the next time you read from the same memory you'll see zeros. You can see this in action by running this snippet of code with nanos 0.1.54. If you set --nanos-version to 0.1.53, the final assert fails: we didn't support MADV_DONTNEED there, so the 1 is still set.
#define _DEFAULT_SOURCE /* expose madvise() and MAP_ANONYMOUS on glibc */
#include <stdio.h>
#include <assert.h>
#include <sys/mman.h>

typedef unsigned char u8;

int main() {
    size_t map_len = 4 * 1024;
    /* PROT_READ is needed too: the assert below reads from the mapping */
    u8 *addr = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(addr != MAP_FAILED);
    addr[0] = 1; /* dirty the page */
    /* ask the kernel to drop the page */
    assert(madvise(addr, map_len, MADV_DONTNEED) == 0);
    /* a kernel that honors MADV_DONTNEED hands back a zero-filled page */
    assert(addr[0] == 0);
    munmap(addr, map_len);
    return 0;
}
You might be wondering how MADV_DONTNEED compares to calling free. They are totally different things used for different purposes; it is only in the context of this particular bug that their relationship got muddied. free hands memory back to the allocator so that future calls to malloc might re-use it. Memory that was initially provisioned via malloc stays under malloc's management and is not necessarily returned to the system. MADV_DONTNEED can be used internally by various allocators, but in general it discards the pages in the region, and the next time the region is accessed it will either reload with up-to-date contents if it is file mapped or come back as zero-filled pages if it is anonymous mapped.
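The file-mapped case can be sketched as follows, assuming Linux semantics for a private file mapping. The helper name reread_after_dontneed is hypothetical. It dirties a page, drops it with MADV_DONTNEED, and returns whatever the next read sees.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* expose madvise() on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: dirties the first byte of a private, file-backed
 * mapping, drops the page with MADV_DONTNEED, and returns what the next
 * read of that byte sees. Returns -1 on failure. */
int reread_after_dontneed(int fd, size_t len) {
    assert(len > 0);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 'X'; /* lands in a private copy-on-write page; the file is untouched */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int seen = p[0]; /* faults the page back in from the file */
    munmap(p, len);
    return seen;
}
```

On a kernel that honors the advice, the private modification is gone and the returned byte is whatever the file holds at offset 0. Nothing comparable happens with free, which never changes what a later read of a file mapping returns.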
// mallocgc should be an internal detail,
// but widely used packages access it using linkname.
// Notable members of the hall of shame include:
// - github.com/bytedance/gopkg
// - github.com/bytedance/sonic
// - github.com/cloudwego/frugal
// - github.com/cockroachdb/cockroach
// - github.com/cockroachdb/pebble
// - github.com/ugorji/go/codec
- mallocgc, Go runtime
Why Does Go use MADV_DONTNEED
Since Go is a garbage-collected language, the end developer generally doesn't deal with malloc/free. Internally, Go uses mallocgc, which is derived from tcmalloc.
Internally, this allocator stores pages in spans. Crucially, each span, which is a run of in-use pages managed by the heap, has a flag called 'needzero'. If it is false, the span's memory is already zeroed; if it is true, objects get zeroed as they are allocated. Herein lies our problem.
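To make the flag concrete, here is a toy model of a span in C. It is only a sketch of the idea: the type and field names (toy_span, next_free, and so on) are invented for illustration and are far simpler than what the Go runtime actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: names are hypothetical, not the Go runtime's. */
typedef struct {
    unsigned char *base; /* start of the span's memory */
    size_t elem_size;    /* size of each object slot */
    size_t nelems;       /* number of slots in the span */
    size_t next_free;    /* index of the next unused slot */
    bool needzero;       /* true: memory may be dirty, zero on allocation */
} toy_span;

/* Hands out the next slot, zeroing it only when the span says it must. */
void *toy_span_alloc(toy_span *s) {
    assert(s != NULL && s->elem_size > 0);
    if (s->next_free >= s->nelems)
        return NULL; /* span is full */
    unsigned char *obj = s->base + s->next_free * s->elem_size;
    s->next_free++;
    if (s->needzero)
        memset(obj, 0, s->elem_size);
    return obj;
}
```

The saving is the skipped memset when needzero is false. The cost is that the flag must be right: if the memory is dirty while needzero is false, the caller gets an object that is not zeroed.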
Go has an interesting history with MADV_DONTNEED. Go 1.11 used MADV_DONTNEED. Go 1.12 switched to MADV_FREE, but then Go 1.16 switched back to MADV_DONTNEED.
MADV_FREE lets the kernel decide when to reclaim pages, but MADV_DONTNEED tells the kernel to drop the pages immediately. Using madvise in this manner is essentially a faster and cheaper method of disposing of unused memory, which is why Go uses it.
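The choice between the two can be sketched as a small helper. This is an illustration under assumptions, not the Go runtime's code: the name release_range and the prefer_lazy parameter are made up, and the fallback relies on the kernel reporting EINVAL for advice it does not know.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* expose madvise() on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: tells the kernel a page-aligned range is unused.
 * With prefer_lazy set it tries MADV_FREE first and falls back to
 * MADV_DONTNEED when the kernel rejects it. Returns 0 on success. */
int release_range(void *addr, size_t len, bool prefer_lazy) {
    assert(addr != NULL && len > 0);
#ifdef MADV_FREE
    if (prefer_lazy) {
        if (madvise(addr, len, MADV_FREE) == 0)
            return 0; /* pages are reclaimed whenever the kernel gets to it */
        if (errno != EINVAL)
            return -1;
        /* EINVAL: MADV_FREE is not supported here, fall through */
    }
#else
    (void)prefer_lazy; /* headers predate MADV_FREE */
#endif
    return madvise(addr, len, MADV_DONTNEED); /* pages are dropped now */
}
```

After a successful MADV_FREE, a read may still see the old contents until the kernel actually reclaims the page, so only the MADV_DONTNEED path guarantees zeros on the next read of anonymous memory.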
What was wrong
You might know that in Go all variables are initialized to a zero value. This works, but it also assumes that the allocator is dealing with zeroed pages to begin with.
Basically, instead of zeroing pages when memory is unmapped, the pages get zeroed when they are mapped. The problem is that, without madvise support, the unmapping path was never actually getting called: the Go runtime doesn't immediately unmap memory, for performance reasons. It was just assuming that MADV_DONTNEED was zeroing the pages when it was not. Underneath, the needzero flag was not being set, so stale contents were handed back out as if they were zeroed.
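The failure mode can be reproduced in miniature. The sketch below is a model, not the actual runtime or kernel code: advise_ignored stands in for a kernel that accepts MADV_DONTNEED without acting on it (an assumption about the old behavior), and all function names are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* expose madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

typedef int (*advise_fn)(void *addr, size_t len);

/* A kernel that honors MADV_DONTNEED. */
int advise_real(void *addr, size_t len) {
    return madvise(addr, len, MADV_DONTNEED);
}

/* Assumed model of a kernel that reports success but leaves the page alone. */
int advise_ignored(void *addr, size_t len) {
    (void)addr;
    (void)len;
    return 0;
}

/* Dirties a page, "releases" it, then reuses it the way an allocator would
 * when it trusts the release to have zeroed the page (needzero left false).
 * Returns the first byte the new owner sees, or -1 on failure. */
int reuse_after_release(advise_fn advise) {
    assert(advise != NULL);
    size_t len = 4096;
    unsigned char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    page[0] = 0x5A; /* an old object lived here */
    if (advise(page, len) != 0) {
        munmap(page, len);
        return -1;
    }

    int first = page[0]; /* no memset: the allocator assumes zeroed memory */
    munmap(page, len);
    return first;
}
```

With advise_real the new owner sees a zero byte. With advise_ignored it sees the old 0x5A, which is exactly the kind of value a Go program would never expect in a freshly initialized variable.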
Moral of the Story
I guess the lesson here is that when initializing your variables you need to understand the platform you are on, the version of the language runtime you are using, the allocator you're using, and the various assumptions being made about when pages do or don't get zeroed. Oh yeah, and always put in observability for your prod workloads.
