Assigning ArrayList makes a copy I think

guacs · February 10, 2024, 2:26pm

I have the following code for a key value store.

    pub fn put(self: *Self, kvp: KeyValuePair) !void {
        var buffer = self.buffer; // changing this to `const buffer = &self.buffer` works

        // key_len + value_len + len('put') + space after 'put' + space after key + space after value
        const full_size = kvp.key.len + kvp.value.len + 3 + 1 + 1 + 1;
        try buffer.ensureUnusedCapacity(full_size);

        std.log.debug("DEBUG: adding {} bytes to buffer\n", .{full_size});

        buffer.appendSliceAssumeCapacity("put");
        buffer.appendAssumeCapacity(' ');
        buffer.appendSliceAssumeCapacity(kvp.key);
        buffer.appendAssumeCapacity(' ');
        buffer.appendSliceAssumeCapacity(kvp.value);
        buffer.appendAssumeCapacity(' ');

        std.debug.assert(buffer.items.len > 0);
    }

    pub fn flush(self: *Self) !void {
        var file = self.file;
        const buffer = self.buffer.items;

        std.log.debug("DEBUG: flushing {} bytes\n", .{buffer.len});

        _ = try file.write(buffer);
        try file.sync();

        self.buffer.clearRetainingCapacity();
    }

When flush is called, no bytes are actually written to the file and I get a memory leak detected error when the program exits. The reason for this is the var buffer = self.buffer. My assumption is that this is making a copy of the buffer. If that is correct, why is that happening? What is actually happening with the line var buffer = self.buffer?

Btw, self.buffer is an ArrayList(u8).

Sze · February 10, 2024, 2:45pm

Hello @guacs, welcome to the forum!

Within put you are making a copy of self.buffer, which means that all your operations on buffer, update fields in buffer and then buffer falls out of scope, which means we forget the state of buffer and can’t restore it. self.buffer’s fields remain unchanged in their previous state, but those fields may point to memory that could be invalid, because the operations on buffer may have deallocated the original memory self.buffer still points to through its fields.

It is best to avoid copying complex things like std.ArrayList, if you really need a copy use its .clone method instead.

So either do what you did in your comment const buffer = &self.buffer or just use self.buffer everywhere.

Also take a look at: Pointers to Temporary Memory
And if you want to look under the hood at: https://ziglang.org/documentation/master/std/src/std/array_list.zig.html#L14
Basically those fields in your case .items and .capacity are only changed in the copy and not in the original.

The leak is that you don’t call deinit on your copy. (which you don’t want to create in the first place, in this example).
Also if you create a copy through assigning (instead of via clone), if you would add deinit in the method, you would get a double free, once for buffer and another time for self.buffer.

Basically assigning an ArrayList is only okay, if you forget the other one.

guacs · February 11, 2024, 2:39am

@sze thank you!

I didn’t realize that assigning creates a copy. So, all assignments in Zig creates a copy? Or how does that work exactly?

ktz_alias · February 11, 2024, 3:04am

Additional information.

Assigning to function parameters will be pass-by-reference except for primitive type and C-ABI.

https://ziglang.org/documentation/master/#Pass-by-value-Parameters

AndrewCodeDev · February 11, 2024, 3:13am

Zig does not have operator overloading, so it’s important to understand that = has a straight forward meaning. You can essentially think of it as “copy the bytes from a into b”. It’s also important to understand that assignment does not follow pointers. You’ll copy the value of the pointer (it’s address) but not the value that’s being pointed to.

Here’s an example…

const A = struct {
    // slice contains a pointer and a length
    data: []const u8, 
};

// later...

const a = A{
    .data = "Hello, World!"
};

// we are copying the pointer and the length into b
const b = a;

Here, a and b will be pointing to the same underlying memory. Now consider ArrayList which has 3 members…

a slice to the current used data
a capacity for the total data
an allocator interface which has a pointer to the allocator and the vtable

All of those pointers get copied over just like in that simple example above - including the direct integers.

I was thinking about making a Doc on this to explain more in detail, but I hope this helps.

guacs · February 11, 2024, 3:21am

@AndrewCodeDev first of all, thanks for the Invalid temporary from init function section in Pointers to Temporary Memory. That was something I hadn’t realized.

So, just to clarify my understanding when I do var buffer = self.buffer from my initial example, the following is happening:

Copy the slice (i.e. the pointer to the underlying memory and the length), the capacity and the allocator interface to wherever buffer is located in memory.
Now when I call buffer.ensureUnusedCapacity, the new allocated memory slice is set to the items of buffer, but NOT the items of self.buffer hence why I faced issues. This means that if I had not increased the capacity or appended to the list and that did not require reallocation, then I would not have faced any issues.

Is this correct?

Sze · February 11, 2024, 3:31am

I think maybe one way to think about Zig is that more things can be treated as mutable direct memory/values, then in languages like Java or Python.

In Zig when you have a variable of some struct type you have that struct memory on the stack and when you assign it to another variable it gets copied to that variable.

In Java or Python the language would just force you to create that struct on the heap instead and your variable would just be a pointer (that is hidden from you) that points to the actual struct that gets automatically allocated on the heap.
Now if such a pointer is copied you only get a copy of a pointer, and thus you now have 2 variables pointing to the same object on the heap.

But Zig doesn’t put things on the heap without you telling it do that, so if you want 2 variables that both point to the same thing, in a similar way to Python or Java, in Zig you have to explicitly use a pointer to the struct and then copy that pointer.

At first that may seem like a drawback, but actually it has its benefits:

you can copy structs by assignment when it is useful
you can keep many things on the stack and explicitly copy them to the locations where they are needed
the pointers are explicit, instead of being hidden from you (which potentially helps in avoiding unnecessary indirection / pointer chasing)
the compiler may be able to generate code that keeps things in registers, instead of dealing with boxed values (those pointers that are hidden from you)

Sometimes this “values being copied” is called “value semantics”, while the “pointers being copied” is called “reference semantics”.

I couldn’t find one place where it is explicitly defined, what Zig actually does and with Zigs concepts like Result Location Semantics (which aren’t documented yet) and comptime values (that get interned at compile time) it gets more complicated.

Additionally there are places (self from method calls, parameters, payload captures) where zig decides whether it will copy the value or create a reference and use that instead. And it seems like that might still change where when and how those decisions are made, if that is required to end up with something that fits Zig’s goals.

One thing I would recommend is to get a feel for the status quo, by looking at how are things done in the standard library, through that you can at least get a feel for which practices are practically standard and what usage patterns, are maybe not that well supported.

AndrewCodeDev · February 11, 2024, 3:33am

Hey, happy to help - and just for the record, we’ve had a few people help with our docs so far and edit my original post… so thanks goes out to everyone that has contributed.

I’m a pictures kinda guy - let’s draw a picture.

            + ------DATA-----+
            |                |
     array_list.items        |
                         var buffer

Since they have the same pointer, they point to the same place.

Now… what can happen if we change the capacity (aka, the allocation) of the original ArrayList?

// let's copy the slice
var buffer = array_list.items

// let's modify that original list...
array_list.ensureUnusedCapacity(array_list.capacity * 100);

// here's what can happen:

   Modify ArrayList Allocation ---> potentially frees ----> OLD DATA
          |                                                    ^
          v                                                    |           
   NEW LARGER DATA                                        var buffer
          ^
          |
   array_list.items

In this diagram, we can see that var buffer can still be pointing to data that was abandoned by the ArrayList - which could have been a freeing/invalidating operation.

Now, in your case, the issue is like what @Sze said about deinit, but I wanted to make the general case that we’re sharing underlying memory by pointing at it in two places. Modifications to one thing can effect another.

Also note that ensureUnusedCapacity has a comment above it for this exact reason:

/// Modify the array so that it can hold at least `additional_count` **more** items.
/// Invalidates element pointers if additional memory is needed.

Edit - I’m going to add one more example like @Sze was saying to give a consistent picture about deinit.

Here we go…

// x is an ArrayList

var y = x;

// becomes...

               +-----DATA-----+
               |              |               
               x              y


x.deinit() // frees data

y.deinit() // tries to free the same data

Sze · February 11, 2024, 3:46am

I would say technically you are correct, but you also wouldn’t have done any meaningful work (in the context of your example).
All your example did was mutate things, without those mutations it becomes an operation that does nothing.

However there may be other examples where it makes sense to copy a struct, for example if some code just needs a bunch of data it then can work on independently, without being affected by mutations of the “original”.

AndrewCodeDev · February 11, 2024, 4:40am

Here’s the doc based on this thread.

dee0xeed · February 11, 2024, 8:56am

I’d like to add a couple of words about C# and D in connection with this.
In both these langs structs are implicitly “copy values” (C,Zig), but classes are implicitly “reference values” (Python, Java), if I remember right.

guacs · February 12, 2024, 1:50pm

Thank you so much! This just made it click for me.

Also, result location semantics is something new to me, and that a really cool optimization.

I’ve been doing this more and more now, and it’s been fantastic!