Runtime-mutating and passing around '[]const u8' character strings

FObersteiner · February 9, 2024, 3:11pm

Here’s a question that has been bugging me for a while now. The broader context is strings, as []const u8, being handled at runtime. In a comptime-known part of a program, things seem clear to me: the compiler can reserve memory for the needed bytes and put the characters there, like you would expect the compiler to do for, say, a u64. But how is this done for runtime variables? A string of characters isn’t a fixed number of bytes, so how is this implemented? (Maybe this is something for the ‘docs’ category?)

The specific problem I encountered with this is that I defined a struct which has a []const u8 field that changes at runtime. More specifically: the name and abbreviation fields of zdt.Timezone. That sometimes worked and sometimes didn’t. I’m not happy with this, but up to now I haven’t been able to set up a minimal reproducible example. Sometimes it worked in Debug mode but not with ReleaseSafe optimization. Sometimes it worked on Linux but not on Windows, etc. I’ve experienced this with Zig 0.11 stable as well as the nightly builds (tested with dev.2665+919a3bae1 today). There are a couple of commented lines here, if somebody wants to test it. Comparing the output of the ‘any’ directive to the ‘s’ directive, it even seems to me that the bytes are there; they just come out as non-printable garbage sometimes…

Any ideas what I’m doing wrong here? Passing around a fixed number of null-terminated bytes works fine; it’s just that I have to slice them (std.mem.sliceTo(data[0..], 0)) to get a ‘clean’ string representation.

IntegratedQuantum · February 9, 2024, 3:38pm

Generally you are the one to decide where the string is stored. You can either store it in a buffer on the stack (be careful with lifetimes here!), or you need to allocate the string using some allocator (don’t forget to free it afterwards!).

Judging by the behavior you are getting I would guess that you are triggering some form of undefined behavior. One possible problem might be Pointers to Temporary Memory.
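
To make the two storage options concrete, here is a minimal sketch. The function names formatOffset and formatOffsetAlloc are made up for illustration, and std.heap.page_allocator is only a stand-in for whatever allocator the program really uses. In both variants the caller controls how long the bytes live; what has to be avoided is returning a slice of a buffer declared inside the function that builds the string.

```zig
const std = @import("std");

// Option 1: the caller owns the buffer. The returned slice points into `buf`
// and is only valid for as long as `buf` itself is alive.
// (`formatOffset` is a hypothetical helper, not part of any library.)
fn formatOffset(buf: []u8, hours: u8) ![]const u8 {
    return try std.fmt.bufPrint(buf, "UTC+{d}", .{hours});
}

// Option 2: the bytes live on the heap. The caller owns the result and
// must free it with the same allocator.
fn formatOffsetAlloc(allocator: std.mem.Allocator, hours: u8) ![]u8 {
    return try std.fmt.allocPrint(allocator, "UTC+{d}", .{hours});
}

pub fn main() !void {
    // Stack buffer declared in the scope that also uses the slice.
    var buf: [16]u8 = undefined;
    const on_stack = try formatOffset(&buf, 2);
    std.debug.print("{s}\n", .{on_stack});

    // Heap allocation with an explicit free.
    const allocator = std.heap.page_allocator;
    const on_heap = try formatOffsetAlloc(allocator, 2);
    defer allocator.free(on_heap);
    std.debug.print("{s}\n", .{on_heap});
}
```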

dee0xeed · February 9, 2024, 3:58pm

I tried just now to make a mutable string and this is what I came up with:

const std = @import("std");
const log = std.debug.print;

const KindaMutableString = struct {
    str: []const u8 = undefined,
    buf: [32]u8 = undefined,

    fn set(dst: *KindaMutableString, src: []const u8) !void {
        dst.str = try std.fmt.bufPrint(&dst.buf, "{s}", .{src});
    }
};

pub fn main() !void {
    var s = KindaMutableString{};
    try s.set("string-1");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});
    try s.set("string-string-2");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});
    try s.set("-0");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});
}

The idea was that we have a buffer and a slice of that buffer.
Actually I am not sure if this is really the Zig way to do this, but maybe it’ll be helpful.

dude_the_builder · February 9, 2024, 4:13pm

Didn’t check everywhere, but this is one instance where you’re storing a pointer to temporary memory. In my experience, split and tokenize iterators are notorious for this type of bug: you just copy what next returns, which usually points to temporary memory on the stack (maybe a buffer with bytes read from the network or a file). allocator.dupe is your friend here (but not with a temporary FixedBufferAllocator):

.name = try allocator.dupe(u8, slice),
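
A slightly fuller sketch of that pattern, under some assumptions: Zone, parseZone and error.EmptyLine are hypothetical names (not from zdt), the input line is invented, and page_allocator stands in for a longer-lived allocator. The token returned by the iterator is just a view into the read buffer, so it gets copied before it is stored.

```zig
const std = @import("std");

// Hypothetical struct for illustration; not the real zdt.Timezone.
const Zone = struct {
    name: []const u8,
};

// `line` may live in a short-lived read buffer, so the token is duplicated
// into memory owned by `allocator` before it is stored in the struct.
fn parseZone(allocator: std.mem.Allocator, line: []const u8) !Zone {
    var it = std.mem.tokenizeScalar(u8, line, ' ');
    const first = it.next() orelse return error.EmptyLine;
    return .{ .name = try allocator.dupe(u8, first) };
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    var zone: Zone = undefined;
    {
        // Temporary buffer, as it might be filled from a file or socket.
        var buf: [32]u8 = undefined;
        const line = try std.fmt.bufPrint(&buf, "{s} {s}", .{ "Europe/Berlin", "CET" });
        zone = try parseZone(allocator, line);
        @memset(&buf, 0xAA); // simulate the buffer being reused
    }
    defer allocator.free(zone.name);
    // The name is still intact because it no longer points into `buf`.
    std.debug.print("{s}\n", .{zone.name});
}
```

Without the dupe, zone.name would keep pointing into buf, and its contents would change as soon as the buffer is reused or goes out of scope.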

LucasSantos91 · February 9, 2024, 5:01pm

The slice points to the buffer, which is a part of the struct. That’s called a self-referential pointer, and it causes all kinds of problems. It’s better to keep an index to where the string ends. For a string whose size is bounded, it’s usually easier to just use std.BoundedArray(u8, N) with a suitable capacity N.

dee0xeed · February 10, 2024, 8:42am

Well, in my example the ptr part of the slice always points to the beginning of buf, so I cannot see any problem here. Could you please be more specific: what kind of problems do you mean?

len part of the slice is exactly such an index.

One way or the other, if we want some mutable string, we have to have some buffer (it may be of fixed length, or it may be growable, or you can reallocate it each time to fit the exact length of the new content). I’ve just placed the (fixed-length) buffer and the actual length (in the form of the slice len) into a single entity.

FObersteiner · February 10, 2024, 9:29am

Thank you guys for all the input. It feels like I have acquired a footgun with a sensitive trigger here.

@IntegratedQuantum I’ve seen that post; it felt closely related, but I couldn’t make a clear connection…

@dee0xeed that looks like a plan. In my case, the strings in question aren’t particularly long, and the maximum length can be determined upfront.

@dude_the_builder good catch, that one came in just recently. Before that, I was experiencing the issue described in the question only for the abbreviation of the tz. It’s the same problem, I think. However, the allocator I’m using in the function you looked at is only used to store the content of the TZif file (the tz rules essentially), not the names.

@LucasSantos91 could you elaborate a bit on this? I’m having trouble seeing where this self-reference is an issue; assuming I use a buffer to store the bytes of the name of the struct’s instance, that name describes just this instance. So there shouldn’t be a conflict, no?

slonik-az · February 10, 2024, 10:07am

I am not @LucasSantos91; maybe they meant other issues. One problem I see with this self-referential struct is copying. If the object is copied, one needs to be careful to set the str slice in the copy so that it points to the new buffer. Otherwise str will point to the old buf in the source object. Storing an index or length would not have this problem. Rust, for example, refuses to compile self-referential structs and requires unsafe to deal with them. Self-referential structs are useful but, like anything containing long-lived pointers, should be treated carefully.
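
A small sketch of the copying problem, using a stripped-down struct of the same shape as KindaMutableString (the name SelfRef is made up here). After the copy, b has its own buf, but b.str still points into a.buf, so a later change to a shows up through b.

```zig
const std = @import("std");

// Same idea as KindaMutableString: a buffer plus a slice into that buffer.
const SelfRef = struct {
    buf: [32]u8 = undefined,
    str: []const u8 = "",

    fn set(self: *SelfRef, src: []const u8) !void {
        self.str = try std.fmt.bufPrint(&self.buf, "{s}", .{src});
    }
};

pub fn main() !void {
    var a = SelfRef{};
    try a.set("first");

    // Copies buf and str, but the copied str still points into a.buf.
    const b = a;

    // Overwrite a's buffer with a string of the same length.
    try a.set("other");

    // b was never modified, yet its string now reads "other".
    std.debug.print("b.str = {s}\n", .{b.str});
}
```

Returning such a struct by value from an init function is a copy as well, so the same caveat applies there. With a buffer plus a length (or index), the slice is rebuilt from the instance at hand each time and cannot refer to another instance.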

dee0xeed · February 10, 2024, 10:58am

Stupid question - how would you just print such a construction (buffer + end index) then?

dee0xeed · February 10, 2024, 11:30am

Ok. Here is another variant:

const std = @import("std");
const log = std.debug.print;
const Allocator = std.mem.Allocator;

const ReallyMutableString = struct {

    str: []u8 = undefined,

    fn set(self: *@This(), from: []const u8, a: Allocator) !void {
        self.str = try a.realloc(self.str, from.len);
        @memcpy(self.str, from);
    }
};

pub fn main() !void {

    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const a = gpa.allocator();

    var s = ReallyMutableString{};
    try s.set("string-1", a);
    log("{s} ({} bytes)\n", .{s.str, s.str.len});

    try s.set("string-string-2", a);
    log("{s} ({} bytes)\n", .{s.str, s.str.len});
}

Mutable slice (no const after []) and realloc every time.
Yes, this is a bit of overhead, but no problems with self-referentiality.

Sze · February 10, 2024, 1:45pm

Sorry @dee0xeed, my brain malfunctioned and I thought we were talking about in-place resize.

realloc only works sometimes, so it can’t be your only strategy. You could use realloc with a dedicated FixedBufferAllocator, but at that point you could just use a BoundedArray(u8, N) instead, for example via its fromSlice method.

Also if your realloc succeeds, you leak memory.
You are missing defer _ = gpa.deinit() which is why the gpa never gets deinitialized, keeping it from complaining about the memory leaks (in debug mode).
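
For the BoundedArray route, a minimal sketch, assuming a Zig version that still ships std.BoundedArray (it is there in 0.11 and the 0.12 dev builds; I can't promise it for later releases). The alias Name and the capacity of 32 are arbitrary choices for the example.

```zig
const std = @import("std");

// Assumption: std.BoundedArray is available in the Zig version in use.
// `Name` and the capacity 32 are illustrative choices.
const Name = std.BoundedArray(u8, 32);

pub fn main() !void {
    // fromSlice copies the bytes into the array's own storage and
    // returns error.Overflow if they do not fit.
    var name = try Name.fromSlice("string-1");
    std.debug.print("{s} ({d} bytes)\n", .{ name.constSlice(), name.len });

    // Overwriting from the beginning: build a fresh value and assign it.
    name = try Name.fromSlice("-0");
    std.debug.print("{s} ({d} bytes)\n", .{ name.constSlice(), name.len });

    // A copy carries its own bytes and length, so it stays valid on its own.
    var copy = name;
    try copy.appendSlice("-copy");
    std.debug.print("{s} / {s}\n", .{ name.constSlice(), copy.constSlice() });
}
```

Because the type stores the bytes and a length rather than a pointer, copying it does not leave a slice pointing at the old instance.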

dee0xeed · February 10, 2024, 2:02pm

I know
OK, I added defer std.debug.print("leakage?.. {}\n", .{gpa.deinit()}); and got something interesting:

ReleaseSmall, ReleaseFast:
leakage?.. heap.general_purpose_allocator.Check.ok
ReleaseSafe:
error(gpa): memory address 0x7f8e25791000 leaked:
leakage?.. heap.general_purpose_allocator.Check.leak
Debug (default):
thread 41477 panic: Invalid free and abnormal termination.

Sze · February 10, 2024, 2:21pm

If you don’t specify options gpa picks defaults based on your -Doptimize=... optimization mode, resulting in no checks for ReleaseSmall and ReleaseFast.
If you actually specify:

var gpa = std.heap.GeneralPurposeAllocator(.{ .safety = true }){};

You always get the safety checks, but we are getting off topic here.

My point was that I don’t like using the gpa without calling its deinit method in an example, because it could lead people to think that your code doesn’t have any leaks when you are just not checking for them. So I think you should always use deinit. And if you intend to leak, then use an arena to make it obvious to the reader.
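
Roughly what I mean, as a sketch rather than a recommendation for any particular program (the strings are just placeholders): the first half checks the result of deinit, the second half uses an arena for memory that is deliberately released all at once.

```zig
const std = @import("std");

pub fn main() !void {
    // Safety is requested explicitly so the leak check runs in every
    // optimize mode, not only in Debug and ReleaseSafe.
    var gpa = std.heap.GeneralPurposeAllocator(.{ .safety = true }){};
    defer {
        // deinit reports whether anything was left allocated.
        if (gpa.deinit() == .leak) std.debug.print("leak detected\n", .{});
    }
    const a = gpa.allocator();

    const s = try a.dupe(u8, "string-1");
    defer a.free(s); // remove this line and deinit reports a leak
    std.debug.print("{s}\n", .{s});

    // If individual frees are intentionally skipped, an arena makes that
    // intent visible: everything is released together in arena.deinit().
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const t = try arena.allocator().dupe(u8, "string-string-2");
    std.debug.print("{s}\n", .{t});
}
```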

dee0xeed · February 10, 2024, 2:33pm

In that example the leak is obvious: s.str is never deallocated, and that’s no problem at all for a one-shot program.

OK, here is a version without leaks and without the crash in Debug mode:

const std = @import("std");
const log = std.debug.print;
const Allocator = std.mem.Allocator;

const ReallyMutableString = struct {

    a: Allocator,
    str: []u8 = undefined,

    fn init(a: Allocator) !ReallyMutableString {
        return .{
            .a = a,
            .str = try a.alloc(u8, 1),
        };
    }

    fn set(self: *@This(), from: []const u8) !void {
        self.str = try self.a.realloc(self.str, from.len);
        @memcpy(self.str, from);
    }

    fn fini(self: *@This()) void {
        self.a.free(self.str);
    }
};

pub fn main() !void {

    const GPA = std.heap.GeneralPurposeAllocator(.{});
    var a = GPA{};
    defer log("leakage?.. {}\n", .{a.deinit()});

    var s = try ReallyMutableString.init(a.allocator());
    try s.set("string-1");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});

    try s.set("string-string-2");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});

    try s.set("string-1");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});

    s.fini();
}

FObersteiner · February 10, 2024, 3:50pm

ok, so what about this option: storing the string’s data in a buffer within the struct, then have a function that returns a []const u8 pointer to get the “string representation”?

Example from the TZif parser:

pub const Timetype = struct {
    // some more fields...
    name_data: [6:0]u8,

    pub fn abbreviation(self: Timetype) []const u8 {
        return std.mem.sliceTo(self.name_data[0..], 0);
    }
    // some more methods...
};

I’ll have to do some tests again but I remember having some issues (invalid output) from this as well.
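
One possible explanation for the invalid output, offered as an assumption rather than a confirmed diagnosis: with self: Timetype the struct is taken by value, so the function may be working on a temporary copy, and the slice returned by sliceTo would then point into that copy after it is gone. A minimal sketch that takes self by pointer instead, reduced to the one field (the "CEST" test data is made up):

```zig
const std = @import("std");

// Reduced version of the struct above; only the name field is kept.
const Timetype = struct {
    name_data: [6:0]u8,

    // `self` is a pointer, so the returned slice refers to the caller's
    // instance and stays valid for as long as that instance does.
    pub fn abbreviation(self: *const Timetype) []const u8 {
        return std.mem.sliceTo(self.name_data[0..], 0);
    }
};

pub fn main() void {
    // "CEST" padded with zero bytes to fill the 6-byte array.
    const tt = Timetype{ .name_data = "CEST\x00\x00".* };
    std.debug.print("{s}\n", .{tt.abbreviation()});
}
```

The returned slice is still only valid while the Timetype instance it came from is alive and not moved, so it should not be stored beyond that.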

dee0xeed · February 10, 2024, 4:33pm

As far as I understand, this is somewhat similar to what BoundedArray (which has been mentioned twice already) is doing. But I did not understand how to overwrite the buffer (from the beginning) using its API. Should one use the Writer interface?

dee0xeed · February 10, 2024, 7:20pm

A variant without sentinel terminated array (I do not think it is really needed):

const std = @import("std");
const log = std.debug.print;

const ToyStr = struct {

    const CAP: usize = 8;

    buf: [CAP]u8 = undefined,
    len: usize = 0,

    fn set(self: *ToyStr, src: []const u8) void {
        const len: usize = if (src.len <= CAP) src.len else CAP;
        // log("len = {}\n", .{len});
        @memcpy(self.buf[0..len], src[0..len]);
        self.len = len;
    }

    fn get(self: *ToyStr) []u8 {
        return self.buf[0..self.len];
    }

};

pub fn main() !void {
    var ts = ToyStr{};

    ts.set("aaa");
    var s = ts.get();
    log("{s} ({} bytes)\n", .{s, s.len});

    ts.set("bbbbbbbbbbbb"); // len > ToyStr.CAP, will be truncated
    s = ts.get();
    log("{s} ({} bytes)\n", .{s, s.len});
}

slonik-az · February 11, 2024, 12:01pm

Good question, actually. Perhaps something like this:
std.debug.print("{s}", .{s.buf[0..s.strlen]}); assuming that struct s contains array buf and actual string length strlen.

dee0xeed · February 11, 2024, 12:09pm

No, it really was a stupid question (a sort of temporary mental cloudiness on my side); one just has to take a slice, no problem. See my last example, the get function.

Sze · February 11, 2024, 1:47pm

Alternatively, I think you could define a pub format function on your ToyStr type that basically uses your get function.