Runtime-mutating and passing around '[]const u8' character strings

Thank you guys for all the input. It feels like I have aquired a footgun with a sensitive trigger here :slight_smile:

@IntegratedQuantum I’ve seen that post, it felt like very related, but I couldn’t make the clear connection…

@dee0xeed that looks like a plan. In my case, the strings in question aren’t particularly long, and the maximum length can be determined upfront.

@dude_the_builder good catch, that one came in just recently, before, I was experiencing the issue described in the question only for the abbreviation of the tz. It’s the same problem I think. However, the allocator I’m using in the function you looked at is only used to store the content of the TZif file (the tz rules essentially), not the names.

@LucasSantos91 could you elaborate a bit on this? I’m having trouble to see where this self-reference is an issue; assuming I use a buffer to store the bytes of the name of the struct’s instance, that name describes just this instance. So there shouldn’t be a conflict, no?

I am not @LucasSantos91, may be they meant other issues :slight_smile: . One problem I see with this self-referential struct is copying. If the obj is copied, one needs to be careful to properly set the str slice in the copy to point to the new buffer. Otherwise the str will point to the old buf in the source object. Storing index or length would not have this problem. Rust, for example, refuses to compile self-referential structs and requires unsafe to deal with them. Self-ref structs are useful but, as anything containing long-living pointers, should be treated carefully.

5 Likes

Stupid question - how would you just print such a construction (buffer + end index) then?

Ok. Here is another variant:

const std = @import("std");
const log = std.debug.print;
const Allocator = std.mem.Allocator;

const ReallyMutableString = struct {

    str: []u8 = undefined,

    fn set(self: *@This(), from: []const u8, a: Allocator) !void {
        self.str = try a.realloc(self.str, from.len);
        @memcpy(self.str, from);
    }
};

pub fn main() !void {

    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const a = gpa.allocator();

    var s = ReallyMutableString{};
    try s.set("string-1", a);
    log("{s} ({} bytes)\n", .{s.str, s.str.len});

    try s.set("string-string-2", a);
    log("{s} ({} bytes)\n", .{s.str, s.str.len});
}

Mutable slice (no const after []) and realloc every time.
Yes, this is a bit of overhead, but no problems with self-referentiality.

Sorry @dee0xeed my brain misfunctioned and I thought we are talking about in place resize.

realloc only works sometimes so it can’t be your only strategy, you could use realloc with a dedicated FixedBufferAllocator, but at that point you could just use a BoundedArray(u8) instead, for example using its fromSlice method.


Also if your realloc succeeds, you leak memory.
You are missing defer _ = gpa.deinit() which is why the gpa never gets deinitialized, keeping it from complaining about the memory leaks (in debug mode).

I know :slight_smile:
Ok, I added defer std.debug.print("leakage?.. {}\n", .{gpaa.deinit()}); and got something interesting:

  • ReleaseSmall, RealeseFast:
    leakage?.. heap.general_purpose_allocator.Check.ok
  • ReleaseSafe:
    error(gpa): memory address 0x7f8e25791000 leaked:
    leakage?.. heap.general_purpose_allocator.Check.leak
  • ReleaseDebug (default):
    thread 41477 panic: Invalid free and abnormal terminaton.

If you don’t specify options gpa picks defaults based on your -Doptimize=... optimization mode, resulting in no checks for ReleaseSmall and ReleaseFast.
If you actually specify:

var gpa = std.heap.GeneralPurposeAllocator(.{ .safety = true }){};

You always get the safety checks, but we are getting of topic here.

My point was, that I don’t like using the gpa without calling its deinit method in an example, because it could lead people to think, that your code doesn’t have any leaks, when you are just not checking for them. So I think you should always use deinit. And if you intend to leak, then use an arena to make it obvious to the reader.

1 Like

In that example the leak is obvious - s.str never de-allocated and that’s no problem at all for one-shot program.

Ok,

here is a version without leaks and without crash in ReleaseDebug mode
const std = @import("std");
const log = std.debug.print;
const Allocator = std.mem.Allocator;

const ReallyMutableString = struct {

    a: Allocator,
    str: []u8 = undefined,

    fn init(a: Allocator) !ReallyMutableString {
        return .{
            .a = a,
            .str= try a.alloc(u8, 1),
        };
    }

    fn set(self: *@This(), from: []const u8) !void {
        self.str = try self.a.realloc(self.str, from.len);
        @memcpy(self.str, from);
    }

    fn fini(self: *@This()) void {
        self.a.free(self.str);
    }
};

pub fn main() !void {

    const GPA = std.heap.GeneralPurposeAllocator(.{});
    var a = GPA{};
    defer log("leakage?.. {}\n", .{a.deinit()});

    var s = try ReallyMutableString.init(a.allocator());
    try s.set("string-1");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});

    try s.set("string-string-2");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});

    try s.set("string-1");
    log("{s} ({} bytes)\n", .{s.str, s.str.len});

    s.fini();
}

ok, so what about this option: storing the string’s data in a buffer within the struct, then have a function that returns a []const u8 pointer to get the “string representation”?

Example from the TZif parser:

pub const Timetype = struct {
    // some more fields...
    name_data: [6:0]u8,

    pub fn abbreviation(self: Timetype) []const u8 {
        return std.mem.sliceTo(self.name_data[0..], 0);
    }
    // some more methods...
}

I’ll have to do some tests again but I remember having some issues (invalid output) from this as well.

A far as I could understand this is somewhat similar to what BoundedArray (which has been mentioned 2 times already) is doing. But I did not understand how to overwrite buffer (from the beginning) using it’s API. Should one use Writer interface?

A variant without sentinel terminated array (I do not think it is really needed):

const std = @import("std");
const log = std.debug.print;

const ToyStr = struct {

    const CAP: usize = 8;

    buf: [CAP]u8 = undefined,
    len: usize = 0,

    fn set(self: *ToyStr, src: []const u8) void {
        const len: usize = if (src.len <= CAP) src.len else CAP;
        // log("len = {}\n", .{len});
        @memcpy(self.buf[0..len], src[0..len]);
        self.len = len;
    }

    fn get(self: *ToyStr) []u8 {
        return self.buf[0..self.len];
    }

};

pub fn main() !void {
    var ts = ToyStr{};

    ts.set("aaa");
    var s = ts.get();
    log("{s} ({} bytes)\n", .{s, s.len});

    ts.set("bbbbbbbbbbbb"); // len > ToyStr.CAP, will be truncated
    s = ts.get();
    log("{s} ({} bytes)\n", .{s, s.len});
}
2 Likes

Good question, actually. Perhaps something like this:
std.debug.print("{s}", .{s.buf[0..s.strlen]}); assuming that struct s contains array buf and actual string length strlen.

:slight_smile: No, it was really stupid question (that was a sort of temporary mental cloudiness from my side), one just have to to take a slice, no problem - see my last example, get function.

I think alternatively you could define a pub format function on your ToyStr type, that basically uses your get function.

Your get function is a very useful pattern for these use-cases. Short-lived temporary slices attached to arrays are idiomatic. But it could become problematic if a slice outlives its parent or points to the wrong side of the copy.

I may be missing some larger context here but why not simply store the bytes in an array list and upon update clear while retaining capacity? If the update is smaller than the original, no new allocs are performed. If it’s larger, you expand.

If on the other hand you want to store multiple strings, you could use a single array list of bytes and return start index and length for any newly added string to avoid use-after-move issues.

1 Like

finally found the time to try this, with some interesting results:

  • your code works as-is with zig 0.11
  • with zig 0.12-dev, I need to use fn get(self: *ToyStr) []const u8 { ...
    • that works as far as I can tell (debug, safe and fast mode tested)
  • with zig 0.12-dev, if I replace the pointer to self with self only, fn get(self: ToyStr) []u8 { ..., this does not work - why ?

I checked it with 0.12.0-dev.1769+bf5ab5451 (which I currently have on my machine, a bit outdated), works as with 0.11.0. It follows, something has been changed henceforward.

I don’t think it’s a good idea.
Yes, it stops working correctly, I got this output:

$ ./toy-str 
 (3 bytes)
bbbb (8 bytes)

Lengths are correct, but not the buffer content.
But I can not wise up quickly why it happens like this.

1 Like

Inside function everything is correct.
You put a copy on stack, made a slice of that copy,
returned that slice, but stack is already changed
and that slice became invalid.

Something like this.
So, pass an instance by reference, not by value.

1 Like

ah of course! I want to point to some data of that instance, not some copy…! why didn’t I spot this before, makes perfect sense ^^

will do so. Thanks for coming back to this!