German Strings in Zig

So I was reading about the German String optimization (the one that is impossible in Rust, btw). And I think I came up with something nice. What do you guys think? What would you add?


pub const String = packed struct {
    len: u32,
    payload: packed union {
        content: @Vector(12, u8),
        heap: packed struct { prefix: @Vector(4, u8), ptr: [*]u8 },
    },

    pub fn format(
        self: String,
        writer: *std.Io.Writer,
    ) std.Io.Writer.Error!void {
        if (self.len <= 12) {
            const content: [12]u8 = self.payload.content;
            try writer.print("{s}", .{content[0..self.len]});
        } else try writer.print("{s}", .{self.payload.heap.ptr[0..self.len]});
    }

    pub fn init(
        alloc: std.mem.Allocator,
        s: []const u8,
    ) std.mem.Allocator.Error!String {
        const len: u32 = @truncate(s.len);
        if (len <= 12) {
            var content: [12]u8 = @splat(0);
            @memcpy(content[0..len], s);

            return .{ .len = len, .payload = .{ .content = content } };
        } else {
            var prefix: [4]u8 = @splat(0);
            @memcpy(&prefix, s[0..4]);

            return .{ .len = len, .payload = .{
                .heap = .{
                    .prefix = prefix,
                    .ptr = (try alloc.dupe(u8, s)).ptr,
                },
            } };
        }
    }

    pub fn deinit(
        self: *String,
        alloc: std.mem.Allocator,
    ) void {
        defer self.len = 0;
        if (self.len > 12) alloc.free(self.payload.heap.ptr[0..self.len]);
    }

    pub fn eql(
        lhs: String,
        rhs: String,
    ) bool {
        if (lhs.len != rhs.len) return false;
        if (lhs.len <= 12)
            return @reduce(
                .And,
                lhs.payload.content == rhs.payload.content,
            );

        if (@reduce(
            .Or,
            lhs.payload.heap.prefix != rhs.payload.heap.prefix,
        ))
            return false;

        const len = lhs.len;
        return std.mem.eql(
            u8,
            lhs.payload.heap.ptr[0..len],
            rhs.payload.heap.ptr[0..len],
        );
    }

    pub fn copy_to_bytes(
        self: String,
        alloc: std.mem.Allocator,
    ) std.mem.Allocator.Error![]const u8 {
        return std.fmt.allocPrint(alloc, "{f}", .{self});
    }
};

edit: fixed a bug in the quality function !! not false results, just more work than necessary.

14 Likes

Cool. Just scanning the code it looks plausible, at least.

Here’s the reference for the curious.

3 Likes

The only thing I would change is that it returns an error (or a panic) if the given length is too large for 32bit.

For the really curious, this goes quite a bit deeper into it: Why German Strings are Everywhere | CedarDB

4 Likes

This seems to assume that a pointer is 8 bytes. You could use 12 - @sizeOf([*]u8) instead of hardcoded 4 for the prefix length.

3 Likes

Shouldn’t the arguments of type String be * const?

No, because this is a handle type, it only contains values that can be copied, small value types and sometimes a pointer.

1 Like

Hear me out now…

writer.print("{ß}", ...);
11 Likes

What about generalizing it some more? I am not sure about how useful / powerful it will still be compared to on 64 bit, but here is my suggestion for supporting 32 bit archs (mostly for wasm32-freestanding targets).

Going off of @n0s4 idea of calculating the prefix size based on the pointer size and the fact that this struct takes up 2 ptrs worth of space on the stack I have this

pub const String = packed struct {
    // a strings length is 2 usizes long, so a length of string is half of a pointer
    // on a 64 bit arch its half of a 8 bytes (64 bits) so its 4 bytes or a u32
    const ptrSize = @sizeOf(usize);
    const LenType = @Type(.{ .int = .{ .signedness = .unsigned, .bits = @divExact(@bitSizeOf(usize), 2) } });
    // on a 64 bit arch, this will result in 12 (bytes)
    const restSize = (2 * ptrSize) - @sizeOf(LenType);
    const prefixLength = restSize - ptrSize;

    pub const maxLen = (@as(usize, 1) << @bitSizeOf(LenType)) - 1;

    len: LenType,
    payload: packed union {
        content: @Vector(restSize, u8),
        heap: packed struct { prefix: @Vector(prefixLength, u8), ptr: [*]u8 },
    },

    pub fn format(
        self: String,
        writer: *std.Io.Writer,
    ) std.Io.Writer.Error!void {
        if (self.len <= restSize) {
            const content: [restSize]u8 = self.payload.content;
            try writer.print("{s}", .{content[0..self.len]});
        } else try writer.print("{s}", .{self.payload.heap.ptr[0..self.len]});
    }

    pub fn init(
        alloc: std.mem.Allocator,
        s: []const u8,
    ) std.mem.Allocator.Error!String {
        std.debug.assert(s.len <= maxLen);
        const len: LenType = @truncate(s.len);
        if (len <= restSize) {
            var content: [restSize]u8 = @splat(0);
            @memcpy(content[0..len], s);

            return .{ .len = len, .payload = .{ .content = content } };
        } else {
            var prefix: [prefixLength]u8 = @splat(0);
            @memcpy(&prefix, s[0..prefixLength]);

            return .{ .len = len, .payload = .{
                .heap = .{
                    .prefix = prefix,
                    .ptr = (try alloc.dupe(u8, s)).ptr,
                },
            } };
        }
    }

    pub fn deinit(
        self: *String,
        alloc: std.mem.Allocator,
    ) void {
        defer self.len = 0;
        if (self.len > restSize) alloc.free(self.payload.heap.ptr[0..self.len]);
    }

    pub fn eql(
        lhs: String,
        rhs: String,
    ) bool {
        if (lhs.len != rhs.len) return false;
        if (lhs.len <= restSize)
            return std.simd.countTrues(
                lhs.payload.content == rhs.payload.content,
            ) == restSize;

        if (std.simd.countTrues(
            lhs.payload.heap.prefix != rhs.payload.heap.prefix,
        ) == prefixLength)
            return false;

        const len = lhs.len;
        return std.mem.eql(
            u8,
            lhs.payload.heap.ptr[0..len],
            rhs.payload.heap.ptr[0..len],
        );
    }

    pub fn copy_to_bytes(
        self: String,
        alloc: std.mem.Allocator,
    ) std.mem.Allocator.Error![]const u8 {
        return std.fmt.allocPrint(alloc, "{f}", .{self});
    }
};

This does add a few limitations, namely that when compiling to a 32 bit arch your length is a u16, so you can only have strings that are 65536 bytes long, and for short strings you only have 6 bytes to work with.
Also the prefix is only 2 bytes long, so when comparing long(er) strings you will have to compare the actual value on the heap more often.

You can also throw in a

comptime {
    const ptrSize = @sizeOf(usize);
    const strSize = @sizeOf(String);
    if (strSize != (2 * ptrSize)) {
        @compileError("String does not fit in 2 usize");
    }
}

to ensure the math is mathin

You need to be able to get a slice even when it’s embedded, otherwise it’s not that useful.

The point of a payload union here, is to safely get the embedded slice, but your implementation requires a pointer cast on a vector (which I don’t think is supposed to be allowed in packed types either)
If you implement that anyway, you will encounter the weirdness of pointers to packed fields, which cause you to account for field offsets yourself, making it even less safe.

But you can do it with normal or extern types and an array:

// I first did a packed implementation that was `GermanSlice`
// so i figured this one can be dutch :)
const DutchSlice = struct {
    comptime {
        std.debug.assert(@sizeOf(@This()) == @sizeOf(usize) * 2);
    }
    const ptr_size = @sizeOf(usize);
    const HalfArch = std.meta.Int(.unsigned, @divExact(@bitSizeOf(usize), 2));

    pub const embed_max_len = @sizeOf(HalfArch) + ptr_size;
    pub const max_len = std.math.maxInt(HalfArch);

    const Payload = union {
        embedded: [embed_max_len]u8,
        external: [*]u8 align(@alignOf(HalfArch)), // align to remove padding
    };

    len: HalfArch,
    /// if len <= embed_max_len `embedded` otherwise `external`
    payload: Payload,

    fn init(s: []u8) @This() {
        std.debug.assert(s.len <= max_len);
        var ds: @This() = .{
            .len = @intCast(s.len),
            .payload = undefined,
        };
        if (s.len > embed_max_len)
            ds.payload = .{ .external = s.ptr }
        else
            @memcpy(ds.payload.embedded[0..s.len], s);
        return ds;
    }

    fn initAlloc(allocator: std.mem.Allocator, s: []const u8) error{OutOfMemor}!@This() {
        std.debug.assert(s.len <= max_len);
        var ds: @This() = .{
            .len = @intCast(s.len),
            .payload = undefined,
        };
        if (s.len > embed_max_len)
            ds.payload = .{ .external = try allocator.dupe(s) }
        else
            @memcpy(ds.payload.embedded[0..s.len], s);
        return ds;
    }

    fn deinit(ds: @This(), allocator: std.mem.Allocator) void {
        if (ds.len > embed_max_len) return allocator.free(ds.payload.external[0..ds.len]);
    }

    fn slice(ds: *@This()) []u8 {
        if (ds.len > embed_max_len) return ds.payload.external[0..ds.len];
        return ds.payload.embedded[0..ds.len];
    }
};

It’s a bit more clunky, but if you want a slice (to the stored data) without allocing. and keeping it as a vector for faster comparisons, then would this work? no

pub fn slice(self: String) struct { [*]const u8, LenType } {
    if (self.len > restSize)
        return .{ self.payload.heap.ptr, self.len };

    return .{ &@as([restSize]u8, @bitCast(self.payload.content)), self.len };
}

and then you call it like

const slice_ptr, const slice_len = str.slice();
const slice = slice_ptr[0..slice_len];

small tests seem to show this works, but I am not too familiar with vectors so I am not sure if this will break down at some point

edit: this is not stable, it works fine when I build and run the test for x86_64-linux but then doesn’t work when doing it for x86-linux, it fails on the short string. it also stops working when I do ReleaseFast

You could also

pub fn slice(self: String) []const u8 {
    if (self.len > restSize)
        return self.payload.heap.ptr[0..self.len];

    return @ptrCast(&self.payload.content);
}

It has the same behaviour as your version (still doesn’t work), but taking the address of a vector field from a packed type doesn’t have the pointer weirdness as other packed fields would have.

Regardless, my main point was about the (lack of) safety of doing this.

I did do a packed struct implementation first, which does work, I didn’t use vectors as I still don’t think they are intended to be allowed in packed types, I think the above behaviour is evident of that.:

const GermanSlice = packed struct {
    comptime {
        std.debug.assert(@sizeOf(@This()) == @sizeOf(usize) * 2);
    }

    const ptr_size = @sizeOf(usize);
    const HalfArch = std.meta.Int(.unsigned, @divExact(@bitSizeOf(usize), 2));
    pub const embed_max_len = @sizeOf(HalfArch) + ptr_size;
    pub const max_len = std.math.maxInt(HalfArch);
    const payload_offset = @sizeOf(HalfArch);

    // if len <= small_max_len the slice is embedded in the struct
    // otherwise it points to external memory
    len: HalfArch,
    prefix: HalfArch,
    ptr: [*]u8,

    fn init(s: []u8) @This() {
        std.debug.assert(s.len <= max_len);
        var gs: @This() = undefined;
        gs.len = @intCast(s.len);
        if (s.len > embed_max_len)
            gs.ptr = s.ptr
        else
            @memcpy(gs.slice(), s);
        return gs;
    }

    fn initAlloc(allocator: std.mem.Allocator, s: []const u8) error{OutOfMemor}!@This() {
        std.debug.assert(s.len <= max_len);
        var gs: @This() = undefined;
        gs.len = @intCast(s.len);
        if (s.len > embed_max_len)
            gs.ptr = try allocator.dupe(u8, s)
        else
            @memcpy(gs.slice(), s);
        return gs;
    }

    fn deinit(gs: @This(), allocator: std.mem.Allocator) void {
        if (gs.len > embed_max_len) allocator.free(gs.ptr[0..gs.len]);
    }

    fn slice(gs: *@This()) []u8 {
        if (gs.len > embed_max_len) return gs.ptr[0..gs.len];
        return std.mem.asBytes(gs)[payload_offset..][0..gs.len];
    }
};

It originally had a payload field, but &gs.payload would create a pointer with the same address as &gs, though it did have bit offset information that was comptime known and part of the pointer type, but asBytes didn’t account for that (I don’t think it’s possible to at all) so I had to deal with the payload offset manually, at which point I flattened the type as the payload type wasn’t providing any benefit.

Linking to the one I shared earlier, so people don’t grab the wrong one:

it is possible, you just can’t do a @ptrCast, asBytes does @ptrCast(@alignCast(ptr)) and it shifts the data around.
If you inline the function and just do a @bitCast it works as far as I tested

pub const String = StringType(usize);

pub fn StringType(comptime T: type) type {
    const t = @typeInfo(T);
    switch (t) {
        .int => |ti| {
            if (ti.signedness == .signed) @compileError("Type has to be signed");
            if ((ti.bits % 8) != 0) @compileError("Container has to be a full multiple of 8 bit byte");
            if (ti.bits < 8) @compileError("Container has to be at least 1 byte large");
        },
        else => @compileError("Invalid container type, must be unsigned int"),
    }

    // a strings length is 2 usizes long, so a length of string is half of a pointer
    // on a 64 bit arch its half of a 8 bytes (64 bits) so its 4 bytes or a u32
    const ptrSize = @sizeOf(T);
    const LenType = std.meta.Int(.unsigned, @divExact(@bitSizeOf(T), 2));
    // on a 64 bit arch, this will result in 12 (bytes)
    const restSize = (2 * ptrSize) - @sizeOf(LenType);
    const prefixLength = restSize - ptrSize;

    const BufferType = std.meta.Int(.unsigned, restSize * 8);
    const PrefixType = std.meta.Int(.unsigned, prefixLength * 8);

    return packed struct {
        const Self = @This();
        pub const max_length = (@as(usize, 1) << @bitSizeOf(LenType)) - 1;

        comptime {
            const strSize = @sizeOf(Self);
            if (strSize != (2 * ptrSize)) {
                @compileError("String does not fit in 2 of container");
            }
        }

        len: LenType,
        payload: packed union {
            content: BufferType,
            heap: packed struct { prefix: PrefixType, ptr: [*]u8 },
        },

        pub fn format(
            self: Self,
            writer: *std.Io.Writer,
        ) std.Io.Writer.Error!void {
            try writer.print("{s}", .{self.slice()});
        }

        pub fn init(
            alloc: std.mem.Allocator,
            s: []const u8,
        ) std.mem.Allocator.Error!Self {
            std.debug.assert(s.len <= max_length);
            const len: LenType = @truncate(s.len);
            if (len <= restSize) {
                var content: [restSize]u8 = @splat(0);
                @memcpy(content[0..len], s);

                return .{
                    .len = len,
                    .payload = .{
                        .content = std.mem.bytesToValue(BufferType, &content),
                    },
                };
            } else {
                var prefix: [prefixLength]u8 = @splat(0);
                @memcpy(&prefix, s[0..prefixLength]);

                return .{
                    .len = len,
                    .payload = .{
                        .heap = .{
                            .prefix = std.mem.bytesToValue(PrefixType, &prefix),
                            .ptr = (try alloc.dupe(u8, s)).ptr,
                        },
                    },
                };
            }
        }

        pub fn deinit(
            self: *Self,
            alloc: std.mem.Allocator,
        ) void {
            defer self.len = 0;
            if (self.len > restSize) alloc.free(self.payload.heap.ptr[0..self.len]);
        }

        pub fn eql(
            lhs: Self,
            rhs: Self,
        ) bool {
            if (lhs.len != rhs.len) return false;
            if (lhs.len <= restSize) {
                return lhs.payload.content == rhs.payload.content;
            }

            if (lhs.payload.heap.prefix != rhs.payload.heap.prefix)
                return false;

            const len = lhs.len;
            return std.mem.eql(
                u8,
                lhs.payload.heap.ptr[0..len],
                rhs.payload.heap.ptr[0..len],
            );
        }

        pub inline fn slice(self: Self) []const u8 {
            if (self.len > restSize)
                return self.payload.heap.ptr[0..self.len];

            return @as([restSize]u8, @bitCast(self.payload.content))[0..self.len];
        }
    };
}

it doesn’t work without the inline

thats because it makes a copy, even if you take a pointer to self.

alignCast does not change the address of the pointer, it casts the alignment TYPE information and inserts a safety check in case the alignment of the real runtime address doesn’t match the type alignment.
It is also unnecessary if the alignment you are casting to and from are the same.

After the testing I just did, a pointer to a packed field with a vector type, actually is a weird packed field pointer, the copy I didn’t notice in previous testing made the address different, but without it, it’s the same. So to make it work, you need to account for field offset manually.

1 Like

When I first heard the German string optimization term, I immediately thought of literal German strings (words) and how they can be optimized. My idea is that some words, not sure how are they called, which are a combination of two other words, can somehow be optimized xD

A better and also common name is ‘short string optimisation’ sometimes referred to as ‘SSO’.

compound words, I think, you could do this with string interning, where you hash and store each unique string and refer to them via an ID which can just be the pointer to the string. This works for all strings, not just compound words.

It’s not really specific to the German language so much as it is a quirk of its orthography.

Most Germanic languages do it in some way. English just happens to put spaces between the nouns.

got it, I reworked it to do some ptr math, this should be more stable

pub fn slice(self: *const Self) []const u8 {
    if (self.len > restSize) return self.payload.heap.ptr[0..self.len];

    const bitOffset = comptime @bitOffsetOf(Self, "payload");
    comptime std.debug.assert(bitOffset % 8 == 0);
    const byteOffset = comptime bitOffset / 8;

    const payload_ptr: [*]const u8 = @ptrFromInt(@intFromPtr(self) + byteOffset);
    return payload_ptr[0..restSize][0..self.len];
}

I’m curious, why is it impossible in Rust?

it is possible, it’s just really hard. see this article

but basically the rust ownership model gets in the way of a naive implementation, you have to use some special types and unsafe rust

the quote that it’s impossible in rust comes from this CedarDB article

they’ve since added a foot note of people who proved it wrong

2 Likes