Video from Andrew Kelley on Hytradboi 2025:
This video came at the perfect time for me! Just a few days before I saw it, I was trying to apply the same techniques to the data model of a music player I'm working on. I decided to finally give Richard Fabian's Data-Oriented Design a read, which made me realize that the "Programming Without Pointers" methodology is very reminiscent of first normal form.
I'm a little confused on one point. Take the disease locations example from the video: how would one go about dynamically adding and removing locations from an already-existing disease? My instinct after watching the video was to supplement Location.Slice with a capacity field, turning the locations array into an ad-hoc allocator. This means you end up with a bunch of "dead" subarrays within locations (which you could clean up with a periodic compaction step), unless you add infrastructure to track free regions, at which point it's really starting to look like an allocator!
Alternatively, following 1NF, I think you'd replace the Location.Slice approach with a top-level std.ArrayList(struct { diseaseId: u32, location: Location }) field. Unfortunately, finding all the locations associated with a given disease then requires a search through the entire array.
How do you solve this?
Would anyone have any further resources on programming in the style of the relational model? The DOD book is a little light on practical examples, at least so far (I just finished chapter 7).
I think sometimes remove just isn't needed; in those cases, this organization makes things super easy.
If you need remove, I think there are a bunch of different options that aren't necessarily much harder.
One is to rewrite all the data, without holes, to a new instance. That is basically a copy: rebuild the data structure without the elements that should be removed, then discard the old one. (This is essentially data-structure-specific garbage collection, without having to track things in a complicated, general way.)
For the diseases example, you can also swap-remove a disease (if order/index stability isn't needed), and when adding a new one, you could use the ArrayList in a way where the unused capacity slice contains valid off+len slices, so the unused capacity slice effectively becomes your free-list. (Entries that don't correspond to removed locations would have len 0.)
Because the diseases array indexes into the locations array, and the former is likely indexed externally while the latter probably is not, you have the full information available and the freedom to reorganize internally: you could write an in-place compaction method that moves locations forward to fill gaps created by removed diseases and their locations.
If you need index stability for the diseases, you could mark removed diseases with a zero length and add a new free-list entry, basically leaving the disease value there but marked as a tombstone; you can then reuse the offset field to index the next tombstone. That way you just need a sentinel value that means end of list (for example maxInt(Index)) and a first_tombstone index to get a list of indices that can be reused for newly added disease entries.
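A minimal sketch of that tombstone free-list in Zig, assuming a managed std.ArrayList; the names (Disease, first_tombstone, sentinel) are illustrative, not from the talk:

```zig
const std = @import("std");

const Index = u32;
const sentinel = std.math.maxInt(Index); // "end of free-list" marker

const Disease = struct {
    off: Index, // offset into locations; for a tombstone, index of the next tombstone
    len: Index, // 0 marks a tombstone
};

const Diseases = struct {
    entries: std.ArrayList(Disease),
    first_tombstone: Index = sentinel,

    /// Mark a disease as removed, keeping its slot so external indices stay valid.
    fn remove(self: *Diseases, i: Index) void {
        self.entries.items[i] = .{ .off = self.first_tombstone, .len = 0 };
        self.first_tombstone = i;
    }

    /// Reuse a tombstone slot if one exists, otherwise append a new entry.
    fn add(self: *Diseases, d: Disease) !Index {
        if (self.first_tombstone != sentinel) {
            const i = self.first_tombstone;
            self.first_tombstone = self.entries.items[i].off;
            self.entries.items[i] = d;
            return i;
        }
        try self.entries.append(d);
        return @intCast(self.entries.items.len - 1);
    }
};
```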
For removing, you can remove the locations and compact them in place, change the len, and create a new disease entry that describes the unused memory, adding it to the unused capacity slice that serves as the free-list.
For growing the locations, you can go through the free-list (which you could sort periodically), find an entry that is big enough, copy the data over, add the new data, and swap the indices (of the disease and the disease in the free-list), potentially doing the removal step for the remainder that isn't used (creating a free-list entry for it).
I agree that it has some similarity to allocators. More specifically, it is like a memory pool for variable-sized slices that allows you to relocate the slices, because you know where the metadata is stored and you can distinguish between data that is identified by an id externally and data that is "internally" managed (data that can be moved around, because the data structure holds all the valid indices into the locations array).
I think another approach that can be useful (especially if there is some limited number of expected slice sizes) is to instead store every size of slice as its own group, which turns the index into a handle that can be used to find the right size group and then the index within that group.
I used something similar to that in my HAMT and StencilArrays implementations. For data structures with a nested, tree-like structure, it can make it easier to replace and reuse nodes, because all nodes that share a size are part of the same memory block, which makes it easier to reuse memory (without copying other memory around for compaction purposes).
I think it could cut down on the amount of wasted memory and needed reorganization if you have a lot of adding and removing. But I haven't investigated that in detail, and it would vary based on how the data structure is used.
Also, any idea why he uses a non-exhaustive enum(u32) to represent String instead of a u32? Is it just to define associated constants (Table, TableContext, ...)? It's around 5 min into the video.
To help prevent user error.
Because in Zig, every declaration of an enum type is a distinct type, while every u32 is the same u32.
Using an enum(u32) means that you can rely on the type system to stop you from passing a handle into the wrong function: you'll get a compile error saying the function doesn't accept the type you gave it.
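For example (DiseaseId and LocationId are hypothetical names, not from the talk):

```zig
const DiseaseId = enum(u32) { _ };
const LocationId = enum(u32) { _ };

fn locationsOf(id: DiseaseId) void {
    _ = id; // look up locations for this disease...
}

test "distinct handle types" {
    const d: DiseaseId = @enumFromInt(3);
    locationsOf(d); // ok
    // locationsOf(@as(LocationId, @enumFromInt(3))); // compile error:
    // expected type 'DiseaseId', found 'LocationId'
}
```

Both types wrap the same 32 bits, so this safety is free at runtime.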
Ok makes sense, thanks!
I was nodding along in agreement throughout the video. A couple of years ago, after recovering from a decades-long "Clean Code" mind-virus infestation, I was tinkering with putting all shared application state into a single nested struct, eliminating most dynamic allocation, and "mutable shared globals considered harmful" be damned.
It works surprisingly well despite the obvious downsides (all code has write-access to all data).
…one example is here in my 8-bit emulators: an entire emulator state is a single struct without pointers, and this struct can be "snapshotted" (via menu => System => Save Snapshot: Tiny Emulators)
Of course, since then I keep thinking about a language that would assign filesystem-like access rights to such a globally shared, nested structure (because that struct is pretty much like a file system, e.g. you have "filesystem locations" like app.gfx.buffers vs app.physics.bodies).
…e.g. there needs to be a way to grant specific code locations (code being likewise organized in a hierarchy of modules with functions as leaf nodes) specific access rights into the data tree (e.g. the module gfx has full read/write access to app.gfx.*, a handful of other modules have read-only access, but all other code has neither read nor write access)…
…and so on…
Revenge of Fortran
I thought about it again when I saw this PR. It seems to be a recurrent pattern, but I'm not sure I fully understand the benefits.
I get that instead of a raw u32 you get type checking, because each enum is unique. But doesn't that mean you have to use @enumFromInt and @intFromEnum every time you want to use it to access an array?
I've seen multiple people who seem to think avoiding built-ins makes code better, but that is highly dependent on which built-in is being used. @constCast is highly suspect code, but it is sometimes needed; that's why it exists, just try a better way first :3
However casting between an enum and an integer is very, very normal.
Perhaps it's because it's verbose, but well, that's just Zig. Perhaps one day Zig will let you index with enums directly, or perhaps it won't.
In fact, I wasn't saying this because I want to avoid built-ins; I use them quite often. It's more about the potential overhead of adding a builtin call. I don't know what code is generated when built-ins are compiled, but I guess each adds at least one instruction?
I was more wondering: is preventing user mistakes by using an enum worth the extra operations (builtin calls) over conveying the purpose of a parameter with an alias like const NodeIndex = u32?
Edit: I checked this simple example and I'm more confused than before, haha. I'm not the most familiar with Godbolt and assembly, but the enum version seems to generate fewer instructions: godbolt. I really don't know how builtins are handled under the hood.
@enumFromInt and @intFromEnum are no-ops. const Index = enum(usize) { _ }; and const Index = usize; compile to the same code.
I don't really know assembly, but in your example the produced code isn't exactly the same; maybe the two versions are equivalent, though.
What I do is this:
pub const Handle = enum(u32) {
    _,

    fn toIndex(handle: Handle) usize {
        return @as(usize, @intFromEnum(handle));
    }

    fn fromIndex(idx: usize) Handle {
        return @enumFromInt(idx);
    }
};
Mostly to make the intent clear. The builtins shouldn't really do anything complicated aside from reinterpreting the number as a different type.
Edit: might it be a good idea to @truncate the index in the second method?
I don't think @truncate is a good idea: if some code creates an index bigger than u32, truncate would silently wrap it back to some valid but wrong index. It is better to get a safety-check error in safe build modes instead.
I would keep the index u32 instead of usize; that way, other code can convert to a u32 index and back without needing to worry. If the index gets coerced/widened to usize somewhere, then that code needs to check that the index is still valid before going back to a handle anyway.
They are the same. One of the functions is tail-calling the other one.
It's particularly funny that the raw-integer function calls the enum function.
You can do that right now with an EnumArray; they're nice, I use them a lot. It doesn't have to be a monotonic, dense enum either; the type will comptime-construct a mapping for sparse values.
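For instance, with a hypothetical sparse enum:

```zig
const std = @import("std");

// Sparse values: EnumArray builds a dense mapping at comptime.
const Color = enum(u8) { red = 1, green = 40, blue = 200 };

test "index an EnumArray directly with enum values" {
    var counts = std.EnumArray(Color, u32).initFill(0);
    counts.set(.green, 7); // no @intFromEnum at the call site
    try std.testing.expectEqual(@as(u32, 7), counts.get(.green));
}
```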
I am aware; I was talking about the use of the builtins, which EnumArray does use. I probably should have mentioned that, though.