How can static memory allocation work for a system like TigerBeetle?

I have seen it mentioned in various talks around TigerBeetle that Zig's lack of hidden memory allocation is part of the reason why it works well for building TigerBeetle.

And they mention that at startup, all the needed memory is allocated, and no further allocation occurs while the system is running.

But I do not understand how this can work.

There is no way to know beforehand the size and number of messages that the system will have to work with during operation, so how can it be possible to allocate all the memory it will ever need?


The best answer is probably to read the source code to get a feeling for how it is done. Another good option would be this episode of IronBeetle:

Here's a TL;DR:

Fundamentally, TigerBeetle is a database. So the job is to take in a single message over the network, process it, write the results to disk, and repeat many, many times.

Of course, for performance reasons, we want to overlap processing of several messages. But it is relatively straightforward to limit this concurrency: in TigerBeetle, there's a vsr_pipeline_queue_prepares_max constant for this. It can of course be the case that someone tries to send you more than that number of messages. Or someone doesn't read the messages you are sending, clogging up your send buffers. But, as this is a distributed system anyway, all the code is prepared for the possibility of lost messages, with appropriate retry loops. So, if that happens, just drop some messages on the floor.
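
To make that concrete, here's a minimal sketch of the idea: a fixed-capacity pipeline that drops messages instead of allocating when it's full. The names and sizes are made up for illustration; this isn't TigerBeetle's actual code.

```zig
const std = @import("std");

// Hypothetical constant in the spirit of vsr_pipeline_queue_prepares_max:
// at most this many messages are in flight at once.
const pipeline_prepares_max = 8;

const Message = struct { bytes: [64]u8 };

// A fixed-capacity pipeline: its backing storage never grows.
const Pipeline = struct {
    prepares: [pipeline_prepares_max]Message = undefined,
    count: usize = 0,

    // Returns false when the pipeline is full; the caller simply drops the
    // message and relies on the sender's retry loop to resend it later.
    fn push(self: *Pipeline, message: Message) bool {
        if (self.count == pipeline_prepares_max) return false;
        self.prepares[self.count] = message;
        self.count += 1;
        return true;
    }
};

test "messages beyond the limit are dropped, not allocated" {
    var pipeline = Pipeline{};
    var accepted: usize = 0;
    var i: usize = 0;
    while (i < 100) : (i += 1) {
        if (pipeline.push(.{ .bytes = undefined })) accepted += 1;
    }
    try std.testing.expectEqual(@as(usize, pipeline_prepares_max), accepted);
}
```

The point is that "too much work" is never a reason to allocate; it's a reason to say no and let the retry machinery handle it.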

Then, it’s generally a good idea to limit the size of a single message. You wouldn’t want to send several gigs as a single item, lest you overflow the receiver’s memory. So there’s another constant, message_size_max, which defines that. If the client wants to send more data than that, they gotta split their request in two. Which is always possible to do, because a single transfer is just 128 bytes and the client packs a bunch of them into a single message anyway.
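
As a toy illustration of the splitting math (the constants below are made up, not TigerBeetle's real values):

```zig
const std = @import("std");

// Illustrative numbers only.
const message_size_max = 1024 * 1024; // cap on a single network message, in bytes
const transfer_size = 128; // a single transfer is 128 bytes

const transfers_per_message_max = message_size_max / transfer_size;

// A client with more transfers than fit in one message simply sends
// several messages, each of them below the cap.
fn messages_needed(transfer_count: usize) usize {
    return std.math.divCeil(usize, transfer_count, transfers_per_message_max) catch unreachable;
}

test "large batches split into bounded messages" {
    try std.testing.expectEqual(@as(usize, 1), messages_needed(100));
    try std.testing.expectEqual(@as(usize, 2), messages_needed(transfers_per_message_max + 1));
}
```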

Given these two constraints, there's only a constant amount of things that have to be in memory for processing, and you can then derive an upper bound on the size of all internal data structures.
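
In Zig, that derivation can literally be a comptime computation followed by a single allocation at startup. A rough sketch, again with invented constants:

```zig
const std = @import("std");

// Illustrative constants, not TigerBeetle's real values.
const message_size_max = 1024 * 1024;
const pipeline_prepares_max = 8;
const clients_max = 32;

// Because both the size and the number of in-flight messages are bounded,
// the worst-case memory for message buffers is known at compile time.
const messages_max = pipeline_prepares_max + clients_max;
const message_memory_max = messages_max * message_size_max;

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // One allocation at startup; afterwards, message buffers are only
    // recycled from this pool, never allocated or freed.
    const buffers = try allocator.alloc(u8, message_memory_max);
    defer allocator.free(buffers);

    std.debug.print("reserved {} bytes for all messages, forever\n", .{buffers.len});
}
```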

What is unlimited is the size of the file on disk. And, if you don’t want processing to slow to a crawl, you want to keep some sort of index in memory that describes what’s there on disk. But the ratio there is very generous: you need a couple of bits of RAM to address a megabyte of storage. So this becomes just another runtime parameter on the CLI: the maximum size of the data file you can process. If, over a long period of time, the data file grows bigger than that, you can restart the process, allocating more memory for addressing the disk, if you have spare RAM.
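
A sketch of that sizing, with a hypothetical flag name and made-up block/index sizes (the real ratio is far more generous than this one; the numbers are just for shape):

```zig
const std = @import("std");

// Made-up numbers: say one 64-bit block address in RAM covers 64 KiB on disk.
const block_size = 64 * 1024;
const index_entry_size = @sizeOf(u64);

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // In a real CLI this would come from a flag (e.g. a hypothetical
    // --data-file-size-max); here it is hard-coded for the sketch.
    const data_file_size_max: usize = 1024 * 1024 * 1024; // 1 GiB

    const block_count_max = data_file_size_max / block_size;
    const index_size = block_count_max * index_entry_size;

    // Sized for the *maximum* data file, not the current one. If the data
    // file ever needs to outgrow the limit, you restart with a larger limit.
    const index = try allocator.alloc(u8, index_size);
    defer allocator.free(index);

    std.debug.print("{} bytes of index cover up to {} bytes on disk\n", .{ index.len, data_file_size_max });
}
```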

But you shouldn’t have spare RAM, because you can use it for caches. Which is straightforward: another CLI flag for “I want this many gigs of cache”, which allocates, at startup, that many gigs of cache!
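
Something along these lines (a toy sketch, not the real CLI):

```zig
const std = @import("std");

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // Hypothetical flag handling: take the cache size in GiB as the first
    // command-line argument, defaulting to 1 GiB.
    var args = std.process.args();
    _ = args.next(); // skip the program name
    const cache_gib: usize = if (args.next()) |arg|
        try std.fmt.parseInt(usize, arg, 10)
    else
        1;

    // The whole cache is allocated up front; from then on it is only read
    // and overwritten, never resized.
    const cache = try allocator.alloc(u8, cache_gib * 1024 * 1024 * 1024);
    defer allocator.free(cache);

    std.debug.print("cache: {} bytes, allocated once at startup\n", .{cache.len});
}
```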

And I think that’s basically it? Messages are bounded, and there are only a “large comptime constant” of them at any one time, so you can compute the upper bound of RAM that is needed for processing. Data on disk is not bounded, and you need some in-memory index over it, but you can pass a runtime argument (which can be increased at restart) for how large you want your data file to be. Finally, the rest of your memory can be caches, which is yet another CLI flag.


And I assume there is no fear that such an upper bound on the memory size artificially limits the throughput?

Memory is the more flexible parameter here, as it is easy to change the size of the cache. The amount of latent concurrency is more rigid: there are only so many IO operations that the disk can do, and only 5 other replicas to talk to.

So we choose the constants to exploit all available concurrency, and derive the memory requirement from that, not the other way around.
