Is banning implicit memcopies feasible?

I’ve noticed a curious thing about the structure of the code in TigerBeetle — everything in its right place!

That is, whenever an object is created, it can always be created in its final place. And code that tends to move objects around can often be refactored into code that doesn’t.

In other words, it feels like compiler-introduced copies (for (xs) |x| vs for (xs) |*x|) are almost always a code smell? That is, sometimes they are required to make the code work, but often it’s better to refactor the code so that it doesn’t memcpy data needlessly.
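To make the distinction concrete, here’s a minimal sketch (the names are made up, and I’m using current syntax, where the pointer capture iterates over &accounts):

```zig
const std = @import("std");

const Account = struct { balance: u64 };

pub fn main() void {
    var accounts = [_]Account{ .{ .balance = 1 }, .{ .balance = 2 } };
    var total: u64 = 0;

    // `|x|` captures each element by value: semantically, the compiler
    // copies the element into `x` on every iteration. Fine for reading,
    // but any mutation would only touch the copy.
    for (accounts) |x| {
        total += x.balance;
    }

    // `|*x|` captures a pointer to the element in place: no copy is
    // introduced, and mutations land in the array itself.
    for (&accounts) |*x| {
        x.balance += 100;
    }

    std.debug.print("total = {}, balances = {} {}\n", .{
        total, accounts[0].balance, accounts[1].balance,
    });
}
```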

So I am wondering — would it be reasonable (not necessarily in Zig as it is today) to just outright remove all memcpys from the language, and to require the user to explicitly mark all the places where a copy is desired? That is, to double down on PRO (parameter reference optimization), and require that the user themselves deals with aliasing?

4 Likes

It looks like we’re going the other way, given that PRO is going to be mostly removed.

I think this could make the language non-ergonomic, and it risks worsening performance. Values do lead to better optimizations, and they are so much easier to reason about. Most of the time it’s fine to create a short-lived copy that is restricted in scope, and the compiler can easily optimize this. I think it would be better to improve the algorithms for eliding copies.

3 Likes

This is quite an interesting idea that I am intrigued by. I think the first thing I would explore here would be making aggregate types pinned by default (note that Proposal: Pinned Structs · Issue #7769 · ziglang/zig · GitHub has been an accepted issue for a long time now); however, you could mark an aggregate type as not pinned. I suspect that is too aggressive and the default would need to be flipped around the other way.

For TigerBeetle’s purposes specifically, I think the default doesn’t matter too much since you would get what you want by pinning all aggregate types.

Oh, and the accepted proposal regarding PRO only helps/simplifies this use case further. If the aggregate type is pinned, you must pass by pointer. If not, then it may be passed by value, in which case it is always copied (of course, optimizations are allowed to elide copies if it can be proved that the difference cannot be observed).

The main problem I see with this is the ability to avoid a copy when returning an aggregate, which presents familiar aliasing problems if you can access the result location pointer before the function returns. There are some possibilities to address it. For instance, we could make the simple rule that the pattern return expr; forwards the result location to expr, but that the result location is otherwise inaccessible. This would prevent aliasing while allowing a function’s return type to be pinned.
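To illustrate the kind of aliasing I mean, here is a rough sketch. It assumes, purely for illustration, that the result location of the assignment is forwarded all the way into the callee so that the return expression is constructed directly inside global; the names are made up:

```zig
const std = @import("std");

const Pair = struct { a: u32, b: u32 };

var global: Pair = .{ .a = 1, .b = 2 };

// If the result location of `global = swapped()` is forwarded into this
// function, the return expression can be constructed directly inside
// `global`. Writing `.a = global.b` first then clobbers `global.a` before
// it is read to initialize `.b` -- the familiar aliasing problem. Copying
// into a temporary (or restricting access to the result location) avoids it.
fn swapped() Pair {
    return .{ .a = global.b, .b = global.a };
}

pub fn main() void {
    global = swapped();
    // With a copy: a=2, b=1. With full result location forwarding and
    // in-order field writes, this could instead observe a=2, b=2.
    std.debug.print("a={} b={}\n", .{ global.a, global.b });
}
```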

7 Likes

It’s interesting that you should mention this, because I’ve been kicking around with some local folks how feasible it would be to do a “stackless” Zig, i.e. function arguments and in-function variables are all allocated as static variables, until a recursive call somehow happens, or recursive functions must take an allocator, or … That would leave return addresses as the only thing on the “stack”.
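As a rough, hand-written sketch of the lowering I have in mind (the names are made up; nothing like this exists as a compiler feature today):

```zig
const std = @import("std");

// Each non-recursive function gets a statically allocated "frame" holding
// its arguments and locals, instead of a stack frame.
const parse_frame = struct {
    var input: []const u8 = undefined;
    var cursor: usize = undefined;
};

fn parseWord(input: []const u8) usize {
    parse_frame.input = input;
    parse_frame.cursor = 0;
    while (parse_frame.cursor < parse_frame.input.len) {
        if (parse_frame.input[parse_frame.cursor] == ' ') break;
        parse_frame.cursor += 1;
    }
    return parse_frame.cursor;
}

test "statically framed function" {
    try std.testing.expectEqual(@as(usize, 5), parseWord("hello world"));
}
```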

“The stack” has been around so long that people have forgotten that one of its functions is that of “time multiplexing” memory locations between functions. Back in the mists of time, fully allocating the variables of every single function as static variables would use too much memory.

I think TigerBeetle is showing that this is probably not true anymore. This especially interacts well with Zig not compiling functions that aren’t used. And I think it dovetails with the way WASM does things.

It would probably also help in debugging since you don’t have to grovel in the stack for variables–they’re either in registers or in static memory or optimized away. And core dumps could examine the last known state of every single function rather than just the last couple function calls on the stack.

1 Like

On the contrary, you’ve described the BASIC model of programming, to a T.

It’s rather hard to reason about; stacks were developed to get away from it.

It does not. Unlike any non-exotic physical CPU, WASM has a hard-coded stack, something compilers in CPU-land create out of conventional use of memory and registers.

Data stacks also serve to localize data-in-use, which gives better cache coherence, very important for the more powerful CPUs.

The sort of architecture you’re mulling over here is used in embedded contexts, where (when C is used) the ABI dictates exact memory addresses which get filled for the calling convention, some paltry number of words are used as a return address stack, and what’s left needs to be split between global and function-local state.

But it scales poorly to bigger programs.

2 Likes

static != global. Static variables can have local access scope but global storage. Think C’s “static” keyword but inside functions.
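The closest Zig analogue I know of is a container-level var inside a function-local struct, something like this sketch:

```zig
const std = @import("std");

// `count` has static (program-lifetime) storage, but its access scope is
// limited to this function -- much like C's function-local `static`.
fn nextId() u32 {
    const S = struct {
        var count: u32 = 0;
    };
    S.count += 1;
    return S.count;
}

pub fn main() void {
    std.debug.print("{} {} {}\n", .{ nextId(), nextId(), nextId() }); // 1 2 3
}
```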

I don’t accept this a priori without evidence (and I don’t expect you to accept my statements a priori, either, I’m just conjecturing here …).

A couple of structs can easily blow out a cache line. Every single incarnation of std.BoundedArray blows a cache line (generally multiple). A stack makes a specific area of memory extremely hot on read/write and cache-intensive, and it can blow out the associativity by writing over and over to the same place rather than amortizing accesses across the full associativity. Large stacks also make context switches expensive.
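To put a rough number on that (assuming the std.BoundedArray currently in the standard library; the exact size depends on the element type and padding):

```zig
const std = @import("std");

pub fn main() void {
    // 64 u64 elements = 512 bytes of buffer plus a usize length field,
    // i.e. roughly nine 64-byte cache lines for a single instance.
    const T = std.BoundedArray(u64, 64);
    std.debug.print("@sizeOf({s}) = {} bytes (~{} cache lines)\n", .{
        @typeName(T), @sizeOf(T), (@sizeOf(T) + 63) / 64,
    });
}
```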

Even the concept of “stack” as a data structure doesn’t map all that well to modern CPUs because we tend to carry around a lot of pointers into things to try to keep them off the “limited” stack and incur the pointer chasing overhead.

Look at the gains from AoS-to-SoA transformations, even when a compiler does them automatically. The gains seem to be quite a bit larger than anyone expected before trying them out.

Accessing things in “continuous linear order” seems to be more important than “simple locality” nowadays. And stacks very much do not do “continuous linear order” thanks to lots of “push/pop” operations.

As someone who has suffered through those 8-bit systems, I don’t think we know this at all. We’ve never really tried this kind of stuff with systems that have large amounts of memory and CPU.

We normally assume that stacks and recursion are simply worth the cost and we have never really revisited that.

This solves one problem with classic BASIC, which is that which variable belongs to what subroutine is a matter of convention. It still has the problem that you’ve abolished reentrancy and have created a moral hazard of ‘leaving a little something for later’, which is an invitation to a world of pain.

Conjecture is good! Experiments are even better.

What sort of evidence would you accept?

I don’t think the compactness of the active region of the stack qualifies as a conjecture; it’s part of the definition of the data structure.

Passing by pointer is pretty critical if you have a large struct on the stack; yes, copies are expensive. We want the region of interest to be cache-hot, however.

My conjecture is that the already existing (probably unsolvable) problem of non-locality of instruction data, caused by function calls, would be multiplied by creating a parallel non-locality of the data used internally by those functions. I don’t actually see a way around that.

Where possible, compilers mitigate the associativity problems you refer to by assigning local mutable state to registers. Function-local mutable state would obligate any named variable to be written back to store after every function call, where the existing regimen for register assignment can just overwrite it.

I must admit that I don’t understand what [the passage quoted here] means, entirely. Perhaps if you expanded that with more detail.

My perspective is that there’s always heap data around, and instructions to be filled in, and those are plenty to keep the rest of the cache busy. The more use can be made of one line/region, the better. Propagating writes only stalls CPUs if some other core is trying to read from that part of the store, which process isolation and data stacks are designed to prevent.

I’m not seeing what you’re seeing here. Push and pop are just incrementing and decrementing the stack pointer. CPU optimizations are designed with data stacks in mind, due to their broadly-universal use in systems targeting those CPUs.

If you’re pointing out that a branch free forward read of contiguous memory is the optimal pattern for a CPU, then of course I agree. But we can’t always have that. And I don’t think that hopping around to cold regions of memory in pseudorandom order is going to help achieve it, which is the picture I get from a large number of isolated pockets used to hold function-local data state.

I thought they were fun actually, but I agree that data stacks are a part of modern dogma, and it’s always worth questioning dogma and seeing where it leads.

In this case, I’m pessimistic. But I wouldn’t mind being wrong about that, at all.

About recursion: we’re accustomed to thinking of that in terms of self-calls, or shallow cycles at most. But generalizing it to reentrancy is pretty important: simply the ability to call a function more than once within the same, well, rather than “call stack” let’s say “tree-shaped program composed of nested subroutine calls”. I don’t think we actually want to lose that, just from having absorbed the experience of the many occasions when libraries and the like were written in a way that makes them not safe to re-enter, and the problems which so frequently arise from that.

I guess I’m not seeing what we’re getting here, but I readily grant that this may be because I’m not getting the picture you’re attempting to transmit. It bears some resemblance to graph-structured stacks, for example, which have shortcomings but have proved their worth in algorithms which benefit from them. But many smaller stacks doesn’t entirely match no stacks at all.

1 Like