How to reallocate without losing data?

zetyty · January 6, 2025, 9:28pm

I can’t “retrieve” the content of a slice after reallocation using a temporary backup variable.

Non-working example

In this example I use the Zig tokenizer to count the number of “for” keywords in a string.

But since I don’t know that number in advance at runtime, I use an allocator to store the “for” tokens and check whether the buffer is full (i.e. the allocation size has been reached). If it is, I use a temporary variable to “save” the elements contained in the allocated memory. Then I reallocate with a bigger allocation size and try to retrieve the first elements from the temporary variable.

However I got the following error:

$ zig test test.zig
Source: You are the one for me, for me, for me, formidable'
Segmentation fault at address 0x107fdb000
/Users/[..]/src/test.zig:46:31: 0x107d78e46 in test.test count for keywords (test)
                    for (temp,0..) |tmp,j| {
                              ^
???:?:?: 0x107e6fab7 in ??? (???)
Unwind information for `???:0x107e6fab7` was not available, trace may be incomplete

error: the following test command crashed:
/Users/[..]/.zig-cache/o/c674d5dee9354dbee1a6fbc3bf9690f2/test --seed=0x1e70405c

Full code

const std = @import("std");
                                   
pub const My_token = struct {
    loc: Loc,

    pub const Loc = struct {
        start: usize,
        end: usize,
    };
};

test "test count for keywords" {

    const source:[:0]const u8 = "You are the one for me, for me, for me, formidable";
    std.debug.print("Source: {s}'\n", .{source});

    var alloc_size:usize = 2;
    var for_tokens = try std.testing.allocator.alloc(My_token, alloc_size);
    defer std.testing.allocator.free(for_tokens);
    var tok = std.zig.Tokenizer.init(source);
    var tag_name:[]const u8 = undefined;
    var i:usize = 0;
    var count_for_keywords:u8 = 0;
    var token:@TypeOf(tok.next()) = undefined;
    var temp:[]My_token = undefined;
        while (true) : (i += 1) {
            token = tok.next();
            tag_name = @tagName(token.tag);
            if (std.mem.eql(u8,tag_name,"keyword_for")) {

                for_tokens[count_for_keywords].loc.start = token.loc.start;
                for_tokens[count_for_keywords].loc.end = token.loc.end;
                count_for_keywords += 1;

                if (count_for_keywords == alloc_size) {
                    temp = for_tokens[0..count_for_keywords];
                    alloc_size *= 2;
                    for_tokens = try std.testing.allocator.realloc(for_tokens, alloc_size);
                    for (temp,0..) |tmp,j| {
                        for_tokens[j].loc.start = tmp.loc.start;
                        for_tokens[j].loc.end = tmp.loc.end;
                    }
                }
            }

            if (std.mem.eql(u8, tag_name, "eof")) {
                break;
            } 
        }
   
    const expected:u8 = 3;
    const actual = count_for_keywords;
    try std.testing.expectEqual(expected, actual);
}

To me, it looks like the variable temp lost its content after the realloc, because for_tokens is then empty. But temp was assigned before the reallocation, so I don’t understand…

Questions

How can I fix the above code to remove the error?
Is there a “better” way to do such a thing, i.e. restore the content of a reallocated variable?

IntegratedQuantum · January 6, 2025, 9:40pm

realloc is a function that either resizes the buffer in place or allocates a new buffer and copies over all the data from the old buffer.
You don’t need temp since it’s all been copied over into the buffer by the realloc function already.
The following changes should make it work as expected:


--                  temp = for_tokens[0..count_for_keywords];
                    alloc_size *= 2;
                    for_tokens = try std.testing.allocator.realloc(for_tokens, alloc_size);
--                  for (temp,0..) |tmp,j| {
--                      for_tokens[j].loc.start = tmp.loc.start;
--                      for_tokens[j].loc.end = tmp.loc.end;
--                  }

Furthermore, I’d recommend using a std.ArrayList(My_token) for for_tokens instead of implementing the resize and count behavior manually. I’m sure you’ll find it more convenient to use.
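
To illustrate the realloc behavior described above, here is a minimal sketch rather than a definitive implementation. It also hints at why the copy-back loop crashed: a slice such as temp is only a pointer and a length, so it aliases the old buffer instead of saving a copy, and realloc may free that buffer when it moves the data. The helper name growKeepingContents is made up for this example and is not part of std.

```zig
const std = @import("std");

// Hypothetical helper (not part of std): doubles the length of `buf`.
// `realloc` either grows the allocation in place or moves it; in both cases
// the first `buf.len` elements keep their values.
fn growKeepingContents(allocator: std.mem.Allocator, buf: []u32) ![]u32 {
    return allocator.realloc(buf, buf.len * 2);
}

test "realloc preserves the existing elements" {
    const allocator = std.testing.allocator;

    var buf = try allocator.alloc(u32, 2);
    // `buf` is reassigned below; the deferred call reads `buf` at scope exit,
    // so the final allocation is the one that gets freed.
    defer allocator.free(buf);
    buf[0] = 10;
    buf[1] = 20;

    // Any slice taken from the old `buf` before this call (like `temp` in the
    // question) may point at freed memory afterwards and must not be read.
    buf = try growKeepingContents(allocator, buf);

    try std.testing.expectEqual(@as(usize, 4), buf.len);
    try std.testing.expectEqual(@as(u32, 10), buf[0]);
    try std.testing.expectEqual(@as(u32, 20), buf[1]);
    // buf[2] and buf[3] are undefined until they are written.
}
```

Run with zig test; the testing allocator also reports a leak if the final buffer is not freed.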

zetyty · January 7, 2025, 8:54pm

Thank you very much for your help and your comment about ArrayList!
I didn’t know about it and I find it very easy to use.

I’m sharing the code refactored to use ArrayList in case it helps someone, and in case I’m not using ArrayList “efficiently”…

const std = @import("std");
                                   
pub const My_token = struct {
    loc: Loc,

    pub const Loc = struct {
        start: usize,
        end: usize,
    };
};

test "test count for keywords" {

    const source:[:0]const u8 = "You are the one for me, for me, for me, formidable";
    std.debug.print("Source: {s}'\n", .{source});

    var for_tokens = std.ArrayList(My_token).init(std.testing.allocator); 
    defer for_tokens.deinit();
    var tok = std.zig.Tokenizer.init(source);
    var tag_name:[]const u8 = undefined;
    var i:usize = 0;
    var token:@TypeOf(tok.next()) = undefined;
        while (true) : (i += 1) {
            token = tok.next();
            tag_name = @tagName(token.tag);
            if (std.mem.eql(u8,tag_name,"keyword_for")) {
                
                try for_tokens.append(.{.loc = .{.start = token.loc.start, 
                                                 .end   = token.loc.end}});
            }

            if (std.mem.eql(u8, tag_name, "eof")) {
                break;
            } 
        }
    const expected:u8 = 3;
    const actual = for_tokens.items.len;
    try std.testing.expectEqual(expected, actual);
}

Thanks a lot!

Travis · January 7, 2025, 9:28pm

Why are you using mem.eql() to check token tags? You could do that with if (token.tag == .eof) and such. Since this is just a test, perf doesn’t matter much here. But comparing the tag is going to be more performant and I find it easier to write and read.

I would usually write a switch:

const std = @import("std");

pub const My_token = struct {
    loc: Loc,

    pub const Loc = struct {
        start: usize,
        end: usize,
    };
};

test "test count for keywords" {
    const source = "You are the one for me, for me, for me, formidable";
    std.debug.print("Source: {s}'\n", .{source});

    var for_tokens = std.ArrayList(My_token).init(std.testing.allocator);
    defer for_tokens.deinit();
    var tok = std.zig.Tokenizer.init(source);
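    while (true) {
        // Keep the whole token: the switch needs its tag, the append needs its loc.
        const token = tok.next();
        switch (token.tag) {
            .keyword_for => try for_tokens.append(.{
                .loc = .{ .start = token.loc.start, .end = token.loc.end },
            }),
            .eof => break,
            // Every other tag is ignored; the switch must be exhaustive.
            else => {},
        }
    }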
    try std.testing.expectEqual(3, for_tokens.items.len);
}

zetyty · January 11, 2025, 8:07pm

Is there a significant difference in terms of performance between the two approaches (i.e. ArrayList or a slice with manual realloc and count)?

IntegratedQuantum · January 11, 2025, 8:34pm

For all practical purposes, there will be no measurable performance difference because the compiler inlines all of the small functions from the ArrayList(Unmanaged).

Of course there is no such thing as a zero-cost abstraction; if you implement it yourself, you always have the potential to make it just a tiny bit faster.
Here there is a difference in the initial size and the resize behavior, which could make things a tiny bit faster in theory if you were to micro-optimize these parameters.
But these factors rarely matter in practice, and unless you need to micro-optimize the program, you won’t see a difference. There are often better things to optimize anyway (like choosing a faster allocator, for example).
So you shouldn’t ever need to optimize your ArrayList.
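
If the initial size ever did matter, it can be adjusted without giving up ArrayList. The following is only a sketch, assuming the allocator-storing ArrayList API used elsewhere in this thread (newer Zig versions may differ); the capacity of 16 is an arbitrary value chosen for illustration.

```zig
const std = @import("std");

test "pre-sizing an ArrayList" {
    // Reserve room for 16 elements up front (arbitrary illustrative value).
    var list = try std.ArrayList(usize).initCapacity(std.testing.allocator, 16);
    defer list.deinit();

    for (0..16) |i| {
        // Cannot allocate or fail: the capacity was reserved above.
        list.appendAssumeCapacity(i);
    }
    try std.testing.expectEqual(@as(usize, 16), list.items.len);
    try std.testing.expect(list.capacity >= 16);

    // Further appends still work; the list grows on demand as usual.
    try list.append(16);
    try std.testing.expectEqual(@as(usize, 17), list.items.len);
}
```

This only changes when allocations happen, not what the list contains, which is why it rarely shows up in measurements.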