How to reason about linked lists and arraylists

markus · May 15, 2025, 9:21pm

Hello,
ever since I started getting into low-level programming I have pretty much only used ArrayList because of its nice properties.
Whenever I considered using a linked list, I found that I could emulate its properties with an ArrayList, e.g. getting stable pointers by using indices that I track in a second ArrayList.
I honestly don't know what's more performant, and that's why I'm asking how you guys reason about the two structures.

To ask my question more precisely: what's the point where you say “Time for a linked list”?

squeek502 · May 15, 2025, 10:11pm

When I know I’m not going to be searching the list often or at all, but I know I’m going to be inserting into it/deleting from it often (specifically inserting/deleting at arbitrary points, as opposed to appending/popping from the end).
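To make that trade-off concrete, here is a minimal sketch of a singly linked list where inserting or removing next to a node you already hold is O(1), while finding a node by value still means walking the list. The `Node` type and the `insertAfter`, `removeAfter` and `find` helpers are hypothetical names made up for this illustration, not anything from `std`.

```zig
const std = @import("std");

/// Hypothetical node type for this sketch.
const Node = struct {
    value: u32,
    next: ?*Node = null,
};

/// O(1): link `new` directly after `node`; no other node is touched.
fn insertAfter(node: *Node, new: *Node) void {
    new.next = node.next;
    node.next = new;
}

/// O(1): unlink and return the node that follows `node`, if any.
fn removeAfter(node: *Node) ?*Node {
    const removed = node.next orelse return null;
    node.next = removed.next;
    removed.next = null;
    return removed;
}

/// O(n): searching has to follow the `next` pointers one by one.
fn find(head: ?*Node, value: u32) ?*Node {
    var cur = head;
    while (cur) |n| : (cur = n.next) {
        if (n.value == value) return n;
    }
    return null;
}

test "insert and remove in the middle without moving other nodes" {
    var a = Node{ .value = 1 };
    var b = Node{ .value = 2 };
    var c = Node{ .value = 3 };
    insertAfter(&a, &c); // 1 -> 3
    insertAfter(&a, &b); // 1 -> 2 -> 3
    try std.testing.expect(find(&a, 3).? == &c);
    try std.testing.expect(removeAfter(&a).? == &b); // 1 -> 3
    try std.testing.expect(a.next.? == &c);
}
```

If the program calls something like `find` far more often than `insertAfter`, the O(n) walk dominates, which is the situation the PR below ran into.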

Relevant link:

Relevant Zig PR where a linked list was not the right data structure due to a lot of searching:

github.com/ziglang/zig

GeneralPurposeAllocator: Considerably improve worst case performance

master ← squeek502:gpa-optim-treap

opened 07:08AM - 03 Oct 23 UTC

squeek502

+197 -134

Before this PR, GeneralPurposeAllocator could run into incredibly degraded performance in scenarios where the bucket count for a particular size class grew to be large. For example, if exactly `slot_count` allocations of a single size class were performed and then all of them were freed except one, then the bucket for those allocations would have to be kept around indefinitely. If that pattern of allocation were done over and over, then the bucket list for that size class could grow incredibly large, and to find a particular bucket, the entire (doubly linked) list would have to be scanned linearly.

This allocation pattern has been seen in the wild: https://github.com/Vexu/arocc/issues/508#issuecomment-1738275688

In that case, the length of the bucket list for the `128` size class would grow to tens of thousands of buckets and cause Debug runtime to balloon to ~8 minutes whereas with the c_allocator the Debug runtime would be ~3 seconds.

To address this, there are three different changes happening here:

1. `std.Treap` is used instead of a doubly linked list for the lists of buckets. This takes the time complexity of `searchBucket` [used in resize and free] from `O(n)` to `O(log n)`, but increases the time complexity of insert from `O(1)` to `O(log n)` [before, all new buckets would get added to the head of the list]. This is still a huge win because search happens way more often than insertion of new buckets. Note: Any data structure with `O(log n)` or better search/insert/delete would also work for this use-case.
2. If the 'current' bucket for a size class is full, the list of buckets is never traversed and instead a new bucket is allocated. Previously, traversing the bucket list could only find a non-full bucket in specific circumstances, and only because of a separate optimization that is no longer needed (before, after any resize/free, the affected bucket would be moved to the head of the bucket list to allow `searchBucket` to perform better on average). Now, the current_bucket for each size class only changes when either (1) the current bucket is emptied/freed, or (2) a new bucket is allocated (due to the current bucket being full or null). Because each bucket's `alloc_cursor` only moves forward (i.e. slots within a bucket are never re-used), we can therefore always know that any bucket besides the current_bucket will be full, so traversing the list in the hopes of finding an existing non-full bucket is entirely pointless.
3. Size + alignment information for small allocations has been moved into the Bucket data instead of keeping it in a separate HashMap. This offers an improvement over the HashMap since whenever we need to get/modify the length/alignment of an allocation it's extremely likely we will already have calculated any bucket-related information necessary to get the data.

The first change is the most relevant and accounts for most of the benefit here. Also note that the overall functionality of `GeneralPurposeAllocator` is unchanged.

In the degraded `arocc` case, these changes bring Debug performance from ~8 minutes to ~20 seconds.

```
Benchmark 1: test-master.bat
  Time (mean ± σ):     481.263 s ±  5.440 s    [User: 479.159 s, System: 1.937 s]
  Range (min … max):   477.416 s … 485.109 s    2 runs

Benchmark 2: test-optim-treap.bat
  Time (mean ± σ):     19.639 s ±  0.037 s    [User: 18.183 s, System: 1.452 s]
  Range (min … max):   19.613 s … 19.665 s    2 runs

Summary
  'test-optim-treap.bat' ran
   24.51 ± 0.28 times faster than 'test-master.bat'
```

Note: Much of the time taken on Windows in this particular case is related to gathering stack traces. With `.stack_trace_frames = 0` the runtime goes down to 6.7 seconds, which is a little more than 2.5x slower compared to when the c_allocator is used.

These changes may or may not introduce a slight performance regression in the average case:

Here's the standard library tests on Windows in Debug mode:

```
Benchmark 1 (10 runs): std-tests-master.exe
  measurement    mean ± σ           min … max          outliers   delta
  wall_time      16.0s  ± 30.8ms    15.9s  … 16.1s     1 (10%)    0%
  peak_rss       42.8MB ± 8.24KB    42.8MB … 42.8MB    0 ( 0%)    0%
Benchmark 2 (10 runs): std-tests-optim-treap.exe
  measurement    mean ± σ           min … max          outliers   delta
  wall_time      16.2s  ± 37.6ms    16.1s  … 16.3s     0 ( 0%)    💩+  1.3% ±  0.2%
  peak_rss       42.8MB ± 5.18KB    42.8MB … 42.8MB    0 ( 0%)      +  0.1% ±  0.0%
```

And on Linux:

```
Benchmark 1: ./test-master
  Time (mean ± σ):     16.091 s ±  0.088 s    [User: 15.856 s, System: 0.453 s]
  Range (min … max):   15.870 s … 16.166 s    10 runs

Benchmark 2: ./test-optim-treap
  Time (mean ± σ):     16.028 s ±  0.325 s    [User: 15.755 s, System: 0.492 s]
  Range (min … max):   15.735 s … 16.709 s    10 runs

Summary
  './test-optim-treap' ran
    1.00 ± 0.02 times faster than './test-master'
```

---

Here are some more benchmark results using a very targeted benchmark that intentionally only does worst-case allocation patterns:

<details>
<summary>Benchmark code</summary>

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.assert(gpa.deinit() == .ok);
    const allocator = gpa.allocator();

    const alloc_size = 128;
    const slot_count = @divExact(std.mem.page_size, std.math.ceilPowerOfTwoAssert(usize, alloc_size));
    const rounds = 5000;

    var unfreed_slices: [rounds][]u8 = undefined;
    var i: usize = 0;
    while (i < rounds) : (i += 1) {
        unfreed_slices[i] = try allocator.alloc(u8, alloc_size);
        for (0..(slot_count - 1)) |_| {
            const slice = try allocator.alloc(u8, alloc_size);
            allocator.free(slice);
        }
    }

    for (&unfreed_slices) |slice| {
        allocator.free(slice);
    }
}
```

</details>

#### On Linux:

Debug:

```
Benchmark 1 (3 runs): ./gpa-degen-master
  measurement         mean ± σ           min … max          outliers   delta
  wall_time           3.66s  ± 65.5ms    3.59s  … 3.70s     0 ( 0%)    0%
  peak_rss            41.9MB ± 2.36KB    41.9MB … 41.9MB    0 ( 0%)    0%
  cpu_cycles          14.5G  ± 280M      14.2G  … 14.6G     0 ( 0%)    0%
  instructions        25.1G  ± 559M      24.5G  … 25.5G     0 ( 0%)    0%
  cache_references    98.7M  ± 518K      98.1M  … 99.1M     0 ( 0%)    0%
  cache_misses        13.6M  ± 156K      13.4M  … 13.7M     0 ( 0%)    0%
  branch_misses       56.8M  ± 1.64M     55.0M  … 58.2M     0 ( 0%)    0%
Benchmark 2 (9 runs): ./gpa-degen-optim-treap
  measurement         mean ± σ           min … max          outliers   delta
  wall_time           617ms  ± 5.66ms    607ms  … 624ms     0 ( 0%)    ⚡- 83.2% ±  1.2%
  peak_rss            41.9MB ± 1.81KB    41.9MB … 41.9MB    2 (22%)      -  0.0% ±  0.0%
  cpu_cycles          1.73G  ± 12.9M     1.71G  … 1.75G     0 ( 0%)    ⚡- 88.0% ±  1.3%
  instructions        2.79G  ± 12.1M     2.78G  … 2.82G     0 ( 0%)    ⚡- 88.9% ±  1.5%
  cache_references    38.7M  ± 502K      37.9M  … 39.3M     0 ( 0%)    ⚡- 60.8% ±  0.8%
  cache_misses        195K   ± 10.1K     184K   … 215K      0 ( 0%)    ⚡- 98.6% ±  0.8%
  branch_misses       4.39M  ± 25.9K     4.36M  … 4.43M     0 ( 0%)    ⚡- 92.3% ±  1.9%
```

ReleaseFast:

```
Benchmark 1 (27 runs): ./gpa-degen-master-release
  measurement         mean ± σ           min … max          outliers   delta
  wall_time           187ms  ± 4.41ms    178ms  … 195ms     0 ( 0%)    0%
  peak_rss            20.7MB ± 2.22KB    20.7MB … 20.7MB    0 ( 0%)    0%
  cpu_cycles          705M   ± 13.2M     680M   … 728M      0 ( 0%)    0%
  instructions        115M   ± 17.8      115M   … 115M      1 ( 4%)    0%
  cache_references    29.5M  ± 230K      29.1M  … 30.0M     0 ( 0%)    0%
  cache_misses        12.8M  ± 15.0K     12.8M  … 12.8M     0 ( 0%)    0%
  branch_misses       35.4K  ± 338       35.2K  … 36.4K     1 ( 4%)    0%
Benchmark 2 (195 runs): ./gpa-degen-optim-treap-release
  measurement         mean ± σ           min … max          outliers   delta
  wall_time           25.6ms ± 2.80ms    20.3ms … 31.9ms    0 ( 0%)    ⚡- 86.3% ±  0.7%
  peak_rss            20.9MB ± 2.05KB    20.9MB … 20.9MB    0 ( 0%)      +  1.0% ±  0.0%
  cpu_cycles          35.2M  ± 2.55M     30.8M  … 46.5M     2 ( 1%)    ⚡- 95.0% ±  0.3%
  instructions        40.3M  ± 912K      38.3M  … 43.4M     1 ( 1%)    ⚡- 65.0% ±  0.3%
  cache_references    1.23M  ± 186K      851K   … 1.78M     3 ( 2%)    ⚡- 95.8% ±  0.3%
  cache_misses        12.3K  ± 508       11.2K  … 14.1K     6 ( 3%)    ⚡- 99.9% ±  0.0%
  branch_misses       52.5K  ± 2.95K     48.9K  … 57.8K     0 ( 0%)    💩+ 48.3% ±  3.2%
```

#### On Windows:

Debug:

```
Benchmark 1 (3 runs): gpa-degen-master.exe
  measurement    mean ± σ           min … max          outliers   delta
  wall_time      4.47s  ± 165ms     4.29s  … 4.62s     0 ( 0%)    0%
  peak_rss       44.5MB ± 2.36KB    44.5MB … 44.5MB    0 ( 0%)    0%
Benchmark 2 (9 runs): gpa-degen-optim-treap.exe
  measurement    mean ± σ           min … max          outliers   delta
  wall_time      562ms  ± 3.21ms    557ms  … 567ms     0 ( 0%)    ⚡- 87.4% ±  2.5%
  peak_rss       44.5MB ± 2.05KB    44.5MB … 44.5MB    0 ( 0%)      -  0.0% ±  0.0%
```

ReleaseFast:

```
Benchmark 1 (9 runs): gpa-degen-master-release.exe
  measurement    mean ± σ           min … max          outliers   delta
  wall_time      564ms  ± 44.9ms    497ms  … 603ms     0 ( 0%)    0%
  peak_rss       23.5MB ± 2.05KB    23.5MB … 23.5MB    0 ( 0%)    0%
Benchmark 2 (120 runs): gpa-degen-optim-treap-release.exe
  measurement    mean ± σ           min … max          outliers   delta
  wall_time      41.8ms ± 1.55ms    38.5ms … 46.1ms    0 ( 0%)    ⚡- 92.6% ±  1.4%
  peak_rss       23.7MB ± 18.6KB    23.7MB … 23.9MB    1 ( 1%)      +  0.9% ±  0.1%
```

---

Various notes:

- A memory pool is used for the `Treap.Node`s. This has two slightly weird things:
  + Because the GPA doesn't have an `init` function and is directly instantiated instead, the memory pool can't use `backing_allocator` and instead always uses the page_allocator
  + The number of allocated `Node`s will always stay at the peak number of `Node`s necessary, meaning that e.g. if a program needs 5000 buckets at one point, then all 5000 of those nodes will live for the rest of the program even if all memory in the buckets is freed (but those 5000 nodes will also be re-used whenever a new node is needed).
- I initially used a [skip list](https://en.wikipedia.org/wiki/Skip_list) implementation that I wrote for this because I wasn't aware of `std.Treap`, but `std.Treap` slightly outperformed it in my benchmarks and provides all the same benefits.

(note also that the changes in that PR have been obsoleted after that allocator was rewritten to avoid the need to search for buckets entirely)

Sze · May 15, 2025, 11:12pm

Linked lists are nice for freelists (although I prefer to use indices in freelists too) or with intrusive data structures where an item can be part of multiple different collections. So basically scenarios where no search is necessary, just adding items or removing/processing them.
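As a rough illustration of a freelist that uses indices instead of pointers, here is a minimal sketch. The `Pool` type, its field names and the `capacity` of 4 are assumptions made up for this example; free slots are chained through a parallel `next_free` array, so acquiring and releasing a slot never searches.

```zig
const std = @import("std");

const capacity = 4; // arbitrary size chosen for the sketch

/// Hypothetical fixed-size pool; free slots are chained by index.
const Pool = struct {
    values: [capacity]u32 = undefined,
    next_free: [capacity]?usize = undefined,
    free_head: ?usize = null,

    fn init() Pool {
        var pool = Pool{};
        // Build the initial chain 0 -> 1 -> 2 -> 3 -> null.
        var i: usize = 0;
        while (i < capacity) : (i += 1) {
            pool.next_free[i] = if (i + 1 < capacity) i + 1 else null;
        }
        pool.free_head = 0;
        return pool;
    }

    /// O(1): pop the first free slot, or return null when the pool is full.
    fn acquire(self: *Pool, value: u32) ?usize {
        const index = self.free_head orelse return null;
        self.free_head = self.next_free[index];
        self.values[index] = value;
        return index;
    }

    /// O(1): push the slot back onto the front of the free chain.
    fn release(self: *Pool, index: usize) void {
        self.next_free[index] = self.free_head;
        self.free_head = index;
    }
};

test "released slots are reused, most recent first" {
    var pool = Pool.init();
    const a = pool.acquire(10).?;
    const b = pool.acquire(20).?;
    pool.release(a);
    try std.testing.expect(pool.acquire(30).? == a);
    try std.testing.expect(pool.values[b] == 20);
}
```

Because the handles are indices, the backing storage could be moved or grown without invalidating them, which is one reason to prefer indices over raw pointers here.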

But I think it is always good to think about and measure what the linked list actually gives you and consider whether you could get better performance by avoiding it.

Linked lists can be pretty terrible, so be careful when using them and convince yourself that you aren’t thrashing your cache etc.

But overall I would say it really depends on algorithmic details and then it is more a question of what is the right data-structure from a whole bunch of different data-structures.

kj4tmp · May 16, 2025, 12:07am

If you have real-time constraints, a linked list may be appropriate if you would like to be able to grow, shrink, or insert into the list in deterministic constant time (on the order of microseconds), or avoid syscalls when growing the list (by using some pre-allocated memory pool).

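Real-time pop / append can be accomplished with a fixed memory pool and an array list, but not insert, because inserting in the middle of an array has to shift every later element, which is O(n). A linked list additionally gives you O(1) insert once you hold a pointer to the neighboring node.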
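To show where the non-constant cost comes from, here is a small sketch of inserting into a fixed buffer. `insertAt` is a hypothetical helper written for this post (it is not the `std.ArrayList` implementation), and it returns how many existing items had to be moved so the cost is visible.

```zig
const std = @import("std");

/// Insert `value` at `index` among the first `len` items of `buf`.
/// Returns how many existing items had to be shifted.
/// Hypothetical helper; assumes len < buf.len and index <= len.
fn insertAt(buf: []u32, len: usize, index: usize, value: u32) usize {
    std.debug.assert(len < buf.len and index <= len);
    // Shift the tail one slot to the right, starting from the end.
    var i = len;
    while (i > index) : (i -= 1) {
        buf[i] = buf[i - 1];
    }
    buf[index] = value;
    return len - index;
}

test "inserting near the front moves almost everything" {
    var buf = [_]u32{ 10, 20, 30, 40, 0 };
    const moved = insertAt(&buf, 4, 1, 15);
    try std.testing.expect(moved == 3);
    try std.testing.expect(std.mem.eql(u32, &buf, &[_]u32{ 10, 15, 20, 30, 40 }));
}
```

Appending is the special case `index == len`, where nothing moves. That is why append and pop can meet a fixed time budget while insert at an arbitrary position cannot.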

Applications with real-time constraints are things like audio / video capture, industrial controls / networking.

mnemnion · May 16, 2025, 2:30am

Probably the most interesting property of a linked list is that splits and joins are both O(1), and neither of them allocates. There are times when this property is useful.
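A minimal sketch of what O(1) joins and splits can look like, assuming a singly linked list that tracks both head and tail. The `List` and `Node` types and the `join` / `splitAfter` names are hypothetical and written for this example; neither operation allocates or visits more than a couple of nodes.

```zig
const std = @import("std");

const Node = struct {
    value: u32,
    next: ?*Node = null,
};

/// Hypothetical list that remembers both ends.
const List = struct {
    head: ?*Node = null,
    tail: ?*Node = null,

    fn append(self: *List, node: *Node) void {
        node.next = null;
        if (self.tail) |t| {
            t.next = node;
        } else {
            self.head = node;
        }
        self.tail = node;
    }

    /// O(1) join: moves every node of `other` to the end of `self`.
    fn join(self: *List, other: *List) void {
        const other_head = other.head orelse return;
        if (self.tail) |t| {
            t.next = other_head;
        } else {
            self.head = other_head;
        }
        self.tail = other.tail;
        other.* = .{}; // `other` is now empty
    }

    /// O(1) split: everything after `node` becomes a new list.
    /// Assumes `node` is currently part of `self`.
    fn splitAfter(self: *List, node: *Node) List {
        const rest = List{
            .head = node.next,
            .tail = if (node.next != null) self.tail else null,
        };
        node.next = null;
        self.tail = node;
        return rest;
    }
};

test "join and split only touch the boundary nodes" {
    var n1 = Node{ .value = 1 };
    var n2 = Node{ .value = 2 };
    var n3 = Node{ .value = 3 };
    var n4 = Node{ .value = 4 };
    var a = List{};
    var b = List{};
    a.append(&n1);
    a.append(&n2);
    b.append(&n3);
    b.append(&n4);

    a.join(&b); // a: 1 -> 2 -> 3 -> 4, b: empty
    try std.testing.expect(a.tail.? == &n4 and b.head == null);

    const rest = a.splitAfter(&n2); // a: 1 -> 2, rest: 3 -> 4
    try std.testing.expect(a.tail.? == &n2 and rest.head.? == &n3);
}
```

The split is only O(1) if you already hold the node to split at; finding that node by value or position is still a linear walk.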

They’re also good when the data shape is mostly a list, but not entirely. Consider something like an undo/redo buffer: just a bunch of commands, they can be undone, and anything undone can be redone using the same struct. You can store this in an array, but if you do, you’re stuck: once the user does something new after an undo, all of the redo capability is lost, because you have to start overwriting it.

With a linked list, you can just split, turn it “sideways”, and make the alternate history a node in the main-line list. Of course, at that point, strictly speaking, you have a tree. But not really a tree, because your alternate history sticking off the side is data, you don’t reach it with the usual list traversal. That kind of flexibility can come in handy.

This also has the nice property that your place in the tree is an object, rather than the combination of an aggregate and an index. For this kind of complex system, that can be a lifesaver.

Linked lists are nice when the data is only a list sometimes. Let’s say you have macros. Just a list of commands (this is an array for our purposes), you send them off to whatever executes the commands, things happen, it finishes the macro.

Ok, but, what if the macro calls a macro?

Again, you can have a stack of macros here, implemented as an ArrayList or whatever we’re calling it, but now you’re always dealing with a stack, working off the top of it, when what you wanted to be doing is playing back a macro. Really you want the macro itself to be an ?*Macro, so that you only execute it when there is one. This is less elegant with an array, although I’ll admit that checking if the array is empty does accomplish the same thing. The point is that macro-calls-macro is this inconvenient corner case which you don’t want to be constantly handling, the normal case is just one.

So you can add a prev field to the macro struct. If your macro calls a macro, the current macro gets put in that prev field of the new macro, and life goes on. Each time the command loop returns to the macro executor, it’s just executing a macro, and your code only has to deal with it as a stack when it has to.

Whenever you reach the end of a macro, you just check if there’s a stack, and pop it back in place if so. Since your prev field is a ?*Macro, and your execution is storing macros as a ?*Macro, you just assign prev and the next round of execution does whatever it needs to do.
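Here is a rough sketch of that macro stack, under some assumptions of mine: a command is a single byte, and the `Macro` and `Executor` types with their `call` and `next` functions are hypothetical names for this illustration. The stack only exists through the `prev` field, so the common single-macro case never touches it.

```zig
const std = @import("std");

/// Hypothetical macro: a list of one-byte commands plus a playback position.
const Macro = struct {
    commands: []const u8,
    pos: usize = 0,
    prev: ?*Macro = null, // the macro that called this one, if any
};

const Executor = struct {
    current: ?*Macro = null,

    /// Start `macro`; if one is already running, park it in `prev`.
    /// Note: this sketch does not guard against a macro calling itself.
    fn call(self: *Executor, macro: *Macro) void {
        macro.pos = 0;
        macro.prev = self.current;
        self.current = macro;
    }

    /// Return the next command, popping finished macros back to their caller.
    fn next(self: *Executor) ?u8 {
        while (self.current) |m| {
            if (m.pos < m.commands.len) {
                const cmd = m.commands[m.pos];
                m.pos += 1;
                return cmd;
            }
            // End of this macro: resume whatever called it (possibly nothing).
            self.current = m.prev;
            m.prev = null;
        }
        return null;
    }
};

test "a nested macro runs, then the outer macro resumes" {
    var outer = Macro{ .commands = "ab" };
    var inner = Macro{ .commands = "xy" };
    var exec = Executor{};

    exec.call(&outer);
    try std.testing.expect(exec.next().? == 'a');
    exec.call(&inner); // 'a' triggered another macro
    try std.testing.expect(exec.next().? == 'x');
    try std.testing.expect(exec.next().? == 'y');
    try std.testing.expect(exec.next().? == 'b'); // back in `outer`
    try std.testing.expect(exec.next() == null);
}
```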

Oh, and by the way, you might want to check that stack when an incoming macro arrives, just in case you’re executing a copy of it already. Vim doesn’t ^_^.

Linked lists are good for that kind of semi-principled ad-hoccery.

markus · May 16, 2025, 7:22am

I genuinely found all your answers really interesting and helpful, so instead of marking one as a solution I'll just mark this message, indirectly referring to all of your answers here.

markus · May 16, 2025, 7:25am

The idea of needing something that isn't super fast most of the time, but is always reasonably fast and consistently stays under some time constraint, makes a lot of sense to me!

markus · May 16, 2025, 7:25am

I’ll definitely read some of the links you provided, I appreciate the help!

IntegratedQuantum · May 16, 2025, 6:55pm

I think in most cases you can either just use an ArrayList, if you are fine with reusing items in LIFO order, or you can use a resizable circular buffer queue if you want FIFO.

I guess the one nice thing about linked lists, for a free list, is that you can place the nodes directly into the free memory blocks instead of having an external list. It is also easier to use lock-free algorithms with linked lists, but most of the time a mutex should be fast enough anyway.
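One way to sketch the "nodes live inside the free blocks" idea is with a union: a slot either holds a live item or, while it is free, the link to the next free slot. `Item`, `Slot` and `SlotPool` are hypothetical names for this example and the pool size of 4 is arbitrary; a real allocator would work on raw memory instead.

```zig
const std = @import("std");

const Item = struct { id: u64, weight: f64 };

/// A slot is either a live item or, while free, a link in the free list.
const Slot = union {
    item: Item,
    next_free: ?*Slot,
};

/// Hypothetical pool; must not be moved after `init` since slots point at each other.
const SlotPool = struct {
    slots: [4]Slot = undefined,
    free_head: ?*Slot = null,

    /// Thread every slot onto the free list; no external bookkeeping needed.
    fn init(self: *SlotPool) void {
        self.free_head = null;
        var i: usize = 0;
        while (i < self.slots.len) : (i += 1) {
            const slot = &self.slots[i];
            slot.* = .{ .next_free = self.free_head };
            self.free_head = slot;
        }
    }

    /// O(1): reuse the memory of the first free slot for `item`.
    fn create(self: *SlotPool, item: Item) ?*Slot {
        const slot = self.free_head orelse return null;
        self.free_head = slot.next_free;
        slot.* = .{ .item = item };
        return slot;
    }

    /// O(1): overwrite the item with a free-list link.
    fn destroy(self: *SlotPool, slot: *Slot) void {
        slot.* = .{ .next_free = self.free_head };
        self.free_head = slot;
    }
};

test "a destroyed slot is the next one handed out" {
    var pool = SlotPool{};
    pool.init();
    const a = pool.create(.{ .id = 1, .weight = 2.5 }).?;
    const b = pool.create(.{ .id = 2, .weight = 0.5 }).?;
    pool.destroy(a);
    const c = pool.create(.{ .id = 3, .weight = 1.0 }).?;
    try std.testing.expect(c == a);
    try std.testing.expect(c.item.id == 3 and b.item.id == 2);
}
```

The free list costs no extra memory here because the link reuses the bytes of the item it replaces.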