Why this Zig code is slower than Go

jiacai2050 · October 8, 2023, 12:50am

I came across this repo yesterday, and was surprised that Go top the benchmark, Zig’s version is here:

related_post_gen/zig/main.zig at main · jinyus/related_post_gen · GitHub

I try different method to optimize it, but it still slower on my machine (30ms vs 20ms), is there anyone have any ideas how to improve the Zig code, I expect its speed should be(at least) same level with Go’s.

AndrewCodeDev · October 8, 2023, 1:04am

var allocator = std.heap.c_allocator;

One thing to trying different allocators. If you know the size of your problem, you can get performance differences by pre-allocating space. This can call malloc - here’s a note from the code:

/// Supports the full Allocator interface, including alignment, and exploiting
/// `malloc_usable_size` if available. For an allocator that directly calls
/// `malloc`/`free`, see `raw_c_allocator`.

It’s running in a for-loop here. Since it’s an empty list, appending to it will trigger an allocation:

var temp = ArrPosts.init(allocator);
try temp.append(i);

I’d try messing around with that and see what you can get.

AndrewCodeDev · October 8, 2023, 1:24am

…also, following up on that thought… there’s some other things going on here too that smell funny to me.

So for instance:

// Convert indexes back to Post pointers
var top_posts = try std.ArrayList(*Post).initCapacity(allocator, 5);

This is done to store results from a top 5 list (note that this is no corresponding deinit statement).

Not too long after, this happens:

 .related = try top_posts.toOwnedSlice()

So let’s understand why this is a problem.

        pub fn toOwnedSlice(self: *Self) Allocator.Error!Slice {
            const allocator = self.allocator;

            const old_memory = self.allocatedSlice();
            if (allocator.resize(old_memory, self.items.len)) {
                const result = self.items;
                self.* = init(allocator);
                return result;
            }

            const new_memory = try allocator.alignedAlloc(T, alignment, self.items.len);
            @memcpy(new_memory, self.items);
            @memset(self.items, undefined);
            self.clearAndFree();
            return new_memory;
        }

You can see here that it first checks if we need to resize. Thankfully since the capacity was initialized with the list, the appends will not resize. That said… this can be done with a static buffer - the 5 value is fixed, so why are we doing this?

I would want to see a better example in general before trying to benchmark. This is not code that will give you a reliable test of performance.

Zig requires that you carefully think about your problem and your representation of that problem. If you just throw everything on the heap then you’re going to get bad performance in almost any language you come across.

jiacai2050 · October 8, 2023, 1:31am

Ooops, I forget to post my forked version:

related_post_gen/zig/main.zig at feat-zig · jiacai2050/related_post_gen · GitHub

I already change allocator to Arena, and declare top_posts as array, but the result is still slower than Go’s.

AndrewCodeDev · October 8, 2023, 1:32am

Can you point out in your code where you are using the arena allocator like an arena? IE, you are clearing memory manually to reuse the same buffer? It looks like you’re just using it as like a stack allocator that can grow. Which is a linked list under the hood, so I don’t believe this will help.

pub fn reset(self: *ArenaAllocator, mode: ResetMode) bool {

Anyhow, you may want to take more granular measurements here. Another little issue is passing by value in your for-loops because you’ll memcpy on every iteration. If you can time the individual parts of your code, maybe we can come up with some better solutions and learn something

jiacai2050 · October 8, 2023, 12:54pm

From my point of view, Arena means we can free all resource at once, not about reuse.

In my forked version, all container are init with given capacity, so allocation is not an issue.

Pass by value is preferred when struct is small, if you use heap-based struct, access will incur one more redirect, which hurts perf.

 
-        for (post.tags) |tag| {
-            for (map.get(tag).?.items) |i_t| {
-                tagged_post_count[i_t] += 1;
-            }
-        }
+        // for (post.tags) |tag| {
+        //     for (map.getPtr(tag).?.items) |i_t| {
+        //         tagged_post_count[i_t] += 1;
+        //     }
+        // }

If I comment this loop, the cost drop from 30ms to 17ms, so I guess HashMap maybe a bottleneck compared to Go’s.

LucasSantos91 · October 8, 2023, 3:02pm

I don’t get this benchmark. Do you have to faithfully recreate the steps given? There are so many unnecessary steps.
I wouldn’t allocate at all.
Memory map the input file.
Count the number of [, this will give you the number of entries.
The benchmark doesn’t specify how the output should look like. A file like this:

{{1,2,3,4,5,},{2,3,4,5,6}}

Would be valid JSON (if I remember correctly). The index is implicit. The first entry represents post number 0, and so on. Each number represents the index of the related post, in order.
Since you know the number of entries, you know the number of entries in your output file. You also know how many digits each number can take (if there 1000 entries, each index can only have, at most, 4 digits). So now you know the exact size of the output file.
Memory map an output file with the size you computed.
Now you don’t need an allocation. Store your results directly in your output file.
For each input:

Create an array of indices-count pair, something like var entry = [5]Pair{ .{.index = 0, .count = 0}, ... } ;
Look for the next [. This will be in the next input entry.
Compare the tags. Use the fact that tags end with ". Tags that match will have those in the same place. If you do this naively, the compiler will probably autovectorize for you. If it doesn’t, it would probably be easy to manually vectorize using Zig’s @vector. Because these tags are so small, I wouldn’t bother hashing them. An SSE2 vector can hold 16 bytes worth of data, so most of these strings can be compared with a single instruction.
Now that you know how many tags match, see if this result is better than any other entry in your array of pairs.
Once done with this input, store the result in the memory mapped file.

AndrewCodeDev · October 8, 2023, 5:41pm

Freeing all at once is indeed a benefit of arenas. If you read the implementation of the reset function, you’ll see that you get multiple modes.

One of the modes is “retain_capacity” - it queries the capacity of the arena and, if the final allocated node is the right size, it simply wraps up and returns. Here’s where it does that:

self.state.buffer_list.first = first_node;
// perfect, no need to invoke the child_allocator
    if (first_node.data == total_size)
        return true;

I only bring this up because I agree with @LucasSantos91, it seems like we’re not using the right tools for this job and the problem representation itself needs to be reconsidered.

IntegratedQuantum · October 8, 2023, 9:56pm

First of all some general optimization advice:

Profile it first! Once you know the lines of code that are slow, you and everyone else will have an easier time finding the root causes. Like here in this thread no one knows where the actual bottlenecks are, so of course you’ll get mostly general advice, like to try different allocators and different algorithms.

Alright, now take the following information with a grain of salt: I could only test the following things on a larger dataset. The given dataset was giving me runtimes somewhere between 20 and 70 ms. It’s impossible to do any kind of testing with that data.

My profiler showed me that the relevant lines are:

            for (map.get(tag).?.items) |i_t| {
                tagged_post_count[i_t] += 1;
            }

and

            if (count > min_tags) {

Now the obvious solution would be to find a better algorithm(preferably one that isn’t O(n²)), but we can do some things without that:

In this particular case there seem to be 5000 Posts in the dataset. Given that each Post is huge, that will not fit into your L1 cache. And since the calculations in that slow loop are trivial, waiting for memory is likely the main bottleneck.
So reducing the total memory will reduce the number of L1 cache-misses, improving performance.
A few possibilities here would be:
- string deduplication (I don’t know about Go, but java’s garbage collectors does this for example)
- replacing the tags with a single u32 for internal processing. Now this is some extra upfront work, but since the hot path is O(n²) it shouldn’t be a big deal.
- Transfering the Posts into a MultiArrayList could remove rarely used values like the Post.id and Post.title from the cache.
- Using smaller types can also help. For example you could probably use a u16 or even u8 for the tagged_post_count array. No need to fill the cache with the extra zeroes of a usize.
Now the second problem is a branch. Given that this branch is executed 5000 times, and there is only a small amount of tags, this branch will almost always fail. We know that, but the compiler doesn’t. Edit: It seems the is no easy solution to that in Zig yet(see comment below). The only way I see would be to manually do vector comparisons. ~~Zig has a builtin @setCold which can be used inside the branch to tell the compiler that it’s rarely executed.~~

squeek502 · October 8, 2023, 10:10pm

@setCold only works at the function level. The relevant accepted proposals are replace @setCold() with @cold() · Issue #5177 · ziglang/zig · GitHub and @expect() hint to optimizer · Issue #489 · ziglang/zig · GitHub.

IntegratedQuantum · October 8, 2023, 10:35pm

Oh, my bad. I actually measured a 10% performance improvement though . I just double checked with godbolt and the critical sections produce the same assembly. Must have been code alignment or some other weird effect.
Thanks for pointing it out.

AndrewCodeDev · October 9, 2023, 3:47am

Either way, great contribution @IntegratedQuantum, I think this is all really sound advice in general.

jnordwick · October 19, 2023, 10:16am

that will not fit into your L1 cache.

It is kind of an L1 cache problem but of a specific type I think (I don’t understand the problem). Working set is too big so you can’t keep it in there regardless maybe?

The problem is a streaming issue and too many jumps around memory. The lines you point to has maps. Those sucks. Hash maps depending on impl can go from kind of bad to really trash. The fewer access the better.

You can to reading linear and handling it all so keeping the memory prefetchers running full bore and functional units hot. Any stall on a hash will kill that.

string deduplication (I don’t know about Go, but java’s garbage collectors does this for example)

Are you sure about that? They never used to (from inspecting the assembly). It would performance killer for most workloads.

replacing the tags with a single u32

or just make sure tags are <4 (or <8 chars) and cast then to ints then use them.

even u8

u8s have a loop carried dependency on the full register for some reguster names. Generally avoid unless you can see in your head well enough that won’t be a problem. They are about 3 times slower the u16 (I think) and u16 is 1.5 slower than u32 IIRC (check Agner Fog). So this needs to be balanced against how it is used and cache usage.

@setCold

In general don’t use fhese. Even profiling will give you incorrect results compared to runtime. Only the compiler knows the penalty for the taking the less used branch if it decides to relo that to far away to keep the straight line happy. It could be very substantial (eg, an icache miss plus page fault) and kill any performance gain. The middle ground teh compiler will likely chose will keep those penalties low. Anwyays, the example given the branch prediction / target prediction should handle that close to perfectly.

I’ve have more often seem people correctly annotate their code with likely/unlikely and slow the code down because of the huge penalties applies for the less likey branches taken. That’s why gcc allows you to give a percet likely to help.

BUT that should be handled by PGO for most workloads, not you guessing. The compiler will then have all the date to figure out how to layout a function. Putting those takes at the function level makes no sense at all. Inlining will make sure of that.

IntegratedQuantum · October 19, 2023, 3:11pm

Note that in this particular problem the hash map is relatively small, but contains rather large array.
So I doubt that this is significant at all. The loop body here is executed thousands of times, while the hashmap.get is executed only a few times.

Yeah I am sure. It was added in java 8: JEP 192: String Deduplication in G1 It seems to be optional though.
I don’t think it would kill performance. It would be about as expensive as giving each tag a unique integer id, while satisfying the rules of this particular challenge, which specifically asks to “Represent tags as strings”

Good idea. Though I don’t think it’s allowed here.

That’s the first time I heard about that. Interesting.
Do you have any source code that backs up those numbers?
So far I only made good experiences using u8 or u16. I think compiler vectorization and cache miss reduction also play a big role here. So it would be great to see an example where this doesn’t apply.

desttinghim · October 29, 2023, 6:01am

Am I missing something? The result I’m seeing in the readme is showing that zig is slightly faster. The formatting is a little wonky but zig is at 23.00 ms and go is at 24.59 ms for the smallest benchmark.

PT600 · October 30, 2023, 9:24am

Port fxhash from rust.
rust fxhash

Sze · October 30, 2023, 3:28pm

I think the benchmark was updated/changed/rerun showing new results.

iameatingbits · November 2, 2023, 7:34pm

is it really better than xxh?