After hearing about the new `Io` interface, which is proposed to unify reading, writing, and async operations, I was curious about the performance implications of Zig’s pattern of using dependency injection for allocators and, soon, io.
To make the comparison concrete: what is the performance difference between a) specifying a global io / allocator configuration and implementation and b) creating local variables containing that information and passing them around through functions?
Hey, welcome to Ziggit!
I don’t have any exact numbers for you, but the Vulkan API (whose main goal is maximum performance) doesn’t use any global state and makes you pass around local state all the time. If there is any overhead it seems to be low enough to be acceptable for high performance rendering.
I think what it mainly comes down to is not performance but rather how you want to design your application.
Thinking about it more, local config might even improve performance. You can’t really control where your global config will be stored (unless you mess with the linker) but you can influence the spatial locality of your local config in relation to the data that needs it, so it might even save you a few cache misses (especially if you use your config infrequently).
Performance is not the only attribute to look at. Global state makes it more difficult to test and also to have “more than one instance” of whatever you are building. Passing state around, whether it is directly or via some sort of dependency injection, makes everything easier / better.
Witness Zig’s allocators versus `malloc`/`free` – it is a PITA / borderline impossible to have more than one memory allocation strategy in your C code; whichever one you pick applies to all the code (including third-party libraries).
Since the original question was only about performance:
Of course passing more data to functions has a cost: the extra argument occupies an additional register. Under the System V x86_64 calling convention, the first six integer arguments are passed in rdi, rsi, rdx, rcx, r8, and r9, and a function that needs more scratch registers than the caller-saved ones must push the callee-saved registers it overwrites to the stack and pop them again at the end of the function.
In practice you will most likely need an extra push and pop in each (non-inlined) function that gets passed your singleton.
This probably won’t cause any actual performance problems, especially given that e.g. allocations are generally rather expensive to begin with.
So the main reason why Zig or other APIs will pass things around is because they are not always singletons as @gonzo already mentioned.
I was also wondering about passing singletons around versus just making them global. Not sure yet.
…a function using more than 5 registers…
Would you say that if we put – let’s say – five parameters in a struct and pass that struct by reference to a function, so the call needs only one argument register instead of spilling to the stack, that would speed things up?
Or would the pointer dereferences you then have to do inside the function slow it down again?
It depends. If you do this in every function in the call stack then you are no better off than before, since you now do the stack allocations manually; all you have done is merge the pop instructions into a single `add rsp, 40`. And if the called function uses all of those parameters, you also end up needing one more register inside it, and per the calling convention the called function is still responsible for saving any callee-saved registers it overwrites to the stack.
But if most of the data in the struct isn’t used in most functions along the call stack, and you pass the same pointer through the entire callstack, then yes it’s probably worth doing that.
But again, keep in mind that most functions in the hot path will be inlined, and inlined functions don’t need to adhere to the calling convention.
Ok thanks for the replies, this makes a lot more sense now.
I do see the ergonomic benefit of being able to swap out allocators and io, so I’m excited to see how this gets further implemented into the standard library.
I doubt that you’d see any difference at all in practice. Although in our minds we imagine things being done in a certain order – namely that the seventh argument gets written to the stack after the six argument registers have been exhausted – the compiler is under no obligation to schedule things that way. Since it’s just a pointer sitting around, there’s no dependency to resolve; the write-to-stack operation is ready to go. The compiler can choose to do that first and then work on the other arguments, so the write would happen in parallel.