Dynamic string done right

maysara-elshewehy · December 28, 2024, 5:30am

Hi everyone, I just finished one part of my project and since I’m working alone there’s no room for criticism or advice, so I hope you can give me some

In short, this is a data type that can store characters in a safe dynamic way with Unicode support.

Dynamic string type

I haven’t finished the whole project yet, but this part is done and I even wrote all the documentation (hopefully it doesn’t contain too many errors).

EDIT: I have created a Fixed string type too.

GigaGrunch · December 28, 2024, 9:03am

This looks pretty cool to me! I wasn’t able to try it yet, but how well does it handle multi-codepoint graphemes in functions like ubytes? 🚵🏻‍♀️ should just count as 1, right?

maysara-elshewehy · December 28, 2024, 9:40am

Thanks, bytes() counts .{ 'A', '🌍', '🌟' } as 9, but ubytes() counts it as 3 .

yataro · December 28, 2024, 11:25am

Why would you do this in string.init?

if (builtin.os.tag == .windows) { _ = std.os.windows.kernel32.SetConsoleOutputCP(65001);

There’s no way to specify allocator I want. anyerror is abused everywhere. What is the purpose of typedefs like pub const unsigned = usize;?
Why writing your own @memcpy as utils.copy? No use of amortized allocation is strange.

There’s a lot more, but I think that’s enough of me. No offense, but the code is “unconventional” (if I can say this) and hard to read, I can’t say it done right.

maysara-elshewehy · December 28, 2024, 11:53am

Hi, thanks for your time, as for the first part “why are you doing this in string.init?”, well I forgot to remove it, I removed it now, thanks for alerting me,

and about the anyerror everywhere, you are right too, I cleaned it up, thanks again,

as for utils.copy and types.<typeName>, these are internal functions/types that can be removed now but they don’t cause me a problem, thank you very much

IntegratedQuantum · December 28, 2024, 12:01pm

Yeah, a library should always explicitly state error sets instead of using anyerror.
Especially if the error set is just error{OutOfMemory}. anyerror just makes error handling needlessly difficult.

As for the allocator, hardcoding it to the page_allocator is a really bad idea, since you will always allocate at least 4096 bytes per string, which is really bad for small strings. The Zig way would be to take an Allocator in the init function.

Also a few note on m_buff: ?types.str.
First up I would prefer to see just ?[]u8 there, it’s simpler and easier to see what it is.
Secondly I would suggest to use just m_buff: []u8 = &.{} without an optional, initialized to length 0. This will make all the logic simpler since you don’t need that extra if everywhere.
And all allocators have a special case for zero-length slices, so you can free and realloc zero-length slices just fine, even if they were never allocated with that allocator.

GigaGrunch · December 28, 2024, 1:19pm

That does not really answer my question about multi-codepoint graphemes. I took my example straight from this article: The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!) @ tonsky.me
To be very specific, I wrote this test:

test {
    var str = string.init();
    defer str.deinit();
    try str.append("☹️");
    try std.testing.expectEqual(1, str.ubytes());
}

I expect this to return 1, but it returns 2.

maysara-elshewehy · December 28, 2024, 10:45pm

Hi, thank you for your valuable time and great advice, I really learned from you and I’m very happy about it.

I will work on this issue now, thank you very much

maysara-elshewehy · December 28, 2024, 10:50pm

Hi, it seems you have dug deep into the code, which is really great, thank you for your time and valuable advice.

For anyerror, this issue has been resolved (I will update the repository soon)

Also the m_buff has been made non-optional, thank you for your suggestion.

For the m_alloc, I will change it to use a custom Allocator when the structure is created by the user itself and not the library.

Thank you very much, you have helped me a lot.

pierrelgol · December 29, 2024, 6:04pm

Your code is very C++ like, it’s quite funny to see Zig looking like C++. I think a general advice would be to go read some Zig code out there, and try to see how people do things, trying to remain within what’s usually done. Also if you could reduce the uses of anytype that would be great, documentation is good but it’d would be a lot better, if you had functions like.

pub fn insertScalarAt(string : *String, index : usize, ch : u8) Error!void;
or
pub fn insertOneAt(string : *String, index : usize, ch : u8) Error!void;
instead of having things like this
pub fn insertReal(_self: *Self, _it: anytype, _pos: types.len) Error!void

If you need some inspiration look at the containers provided by the standard library. They should give you an overall sense of what an average Zig user expect to see from a container. Of course you are free to code as you want those are just recommendations.

Validark · January 1, 2025, 5:57am

Why are so many of your functions tagged as inline? You don’t think they should show up in stack traces or what?

maysara-elshewehy · January 4, 2025, 9:35am

The links have been updated and for some reason I can’t edit them in the original post so here are the new links :

Also thanks everyone for your comments and advice I’ve updated everything and I hope you’ll check them out to check if I did it right or not, thank you.

pierrelgol · January 4, 2025, 10:51am

Hi, there is still the same issues with your code, It’s not very conventional obviously, you can write the code that you like and you don’t need to justify yourself, nothing wrong with that, but if you’d like to make more idiomatic code, code that fits what people expect from a Zig container, than it’s really not that hard, you just have to copy the ArrayList interface, and then you can expand it and specialize it to your liking. I’ve made a quick example.

pub const DynamicString = struct {
    items: []u8,
    capacity: usize,
    allocator: Allocator,

    pub fn init(allocator: Allocator) DynamicString {
        return .{
            .items = &[_]u8{},
            .capacity = 0,
            .allocator = allocator,
        };
    }

    pub fn initCapacity(allocator: Allocator, num: usize) Allocator.Error!DynamicString {
        var self = init(allocator);
        try self.ensureTotalCapacityPrecise(num);
        return self;
    }

    pub fn deinit(self: *DynamicString) void {
        self.allocator.free(self.allocatedSlice());
    }

    pub fn append(self: *DynamicString, item: u8) Allocator.Error!void {
        const new_item_ptr: *u8 = try self.addOne();
        new_item_ptr.* = item;
    }

    pub fn addOne(self: *DynamicString) Allocator.Error!*u8 {
        const new_len = self.items.len + 1;
        try self.ensureTotalCapacity(new_len);
        return self.addOneAssumeCapacity();
    }

    pub fn ensureTotalCapacity(self: *DynamicString, new_capacity: usize) Allocator.Error!void {
        if (self.capacity >= new_capacity) {
            return;
        }

        const better_capacity = growCapacity(self.capacity, new_capacity);
        return try self.ensureTotalCapacityPrecise(better_capacity);
    }

    pub fn addOneAssumeCapacity(self: *DynamicString) *u8 {
        std.debug.assert(self.items.len < self.capacity);

        self.items.len += 1;
        return &self.items[self.items.len - 1];
    }

    pub fn ensureTotalCapacityPrecise(self: *DynamicString, new_capacity: usize) Allocator.Error!void {
        if (self.capacity >= new_capacity) {
            return;
        }

        const old_memory = self.allocatedSlice();
        if (self.allocator.resize(old_memory, new_capacity)) {
            self.capacity = new_capacity;
        } else {
            const new_memory = try self.allocator.alloc(u8, new_capacity);
            @memcpy(new_memory[0..self.items.len], self.items);
            self.allocator.free(old_memory);
            self.items.ptr = new_memory.ptr;
            self.capacity = new_memory.len;
        }
    }

    fn growCapacity(current: usize, minimum: usize) usize {
        var new = current;
        while (true) {
            new +|= new / 2 + 8;
            if (new >= minimum)
                return new;
        }
    }

    pub fn allocatedSlice(self: *DynamicString) []u8 {
        return self.items.ptr[0..self.capacity];
    }

    pub fn view(self: *DynamicString) []const u8 {
        return self.items;
    }
};

pub const DynamicStringUnmanaged = struct {
    items: []u8,
    capacity: usize,

    pub fn init() DynamicStringUnmanaged {
        return .{
            .items = &[_]u8{},
            .capacity = 0,
        };
    }

    pub fn initCapacity(allocator: Allocator, num: usize) Allocator.Error!DynamicStringUnmanaged {
        var self = init();
        try self.ensureTotalCapacityPrecise(allocator, num);
        return self;
    }

    pub fn deinit(self: *DynamicStringUnmanaged, allocator: Allocator) void {
        allocator.free(self.allocatedSlice());
    }

    pub fn append(self: *DynamicStringUnmanaged, allocator: Allocator, item: u8) Allocator.Error!void {
        const new_item_ptr: *u8 = try self.addOne(allocator);
        new_item_ptr.* = item;
    }

    pub fn addOne(self: *DynamicStringUnmanaged, allocator: Allocator) Allocator.Error!*u8 {
        const new_len = self.items.len + 1;
        try self.ensureTotalCapacity(allocator, new_len);
        return self.addOneAssumeCapacity();
    }

    fn growCapacity(current: usize, minimum: usize) usize {
        var new = current;
        while (true) {
            new +|= new / 2 + 8;
            if (new >= minimum)
                return new;
        }
    }

    pub fn ensureTotalCapacity(self: *DynamicStringUnmanaged, allocator: Allocator, new_capacity: usize) Allocator.Error!void {
        if (self.capacity >= new_capacity) {
            return;
        }

        const better_capacity = growCapacity(self.capacity, new_capacity);
        return try self.ensureTotalCapacityPrecise(allocator, better_capacity);
    }

    pub fn addOneAssumeCapacity(self: *DynamicStringUnmanaged) *u8 {
        std.debug.assert(self.items.len < self.capacity);

        self.items.len += 1;
        return &self.items[self.items.len - 1];
    }

    pub fn ensureTotalCapacityPrecise(self: *DynamicStringUnmanaged, allocator: Allocator, new_capacity: usize) Allocator.Error!void {
        if (self.capacity >= new_capacity) {
            return;
        }

        const old_memory = self.allocatedSlice();
        if (allocator.resize(old_memory, new_capacity)) {
            self.capacity = new_capacity;
        } else {
            const new_memory = try allocator.alloc(u8, new_capacity);
            @memcpy(new_memory[0..self.items.len], self.items);
            allocator.free(old_memory);
            self.items.ptr = new_memory.ptr;
            self.capacity = new_memory.len;
        }
    }

    pub fn allocatedSlice(self: *DynamicStringUnmanaged) []u8 {
        return self.items.ptr[0..self.capacity];
    }

    pub fn view(self: *DynamicStringUnmanaged) []const u8 {
        return self.items;
    }
};

and a main to test.

const std = @import("std");
const DynamicString = @import("root.zig").DynamicString;
const DynamicStringUnmanaged = @import("root.zig").DynamicStringUnmanaged;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer {
        if (gpa.detectLeaks()) {
            @panic("leak detected");
        }
        _ = gpa.deinit();
    }

    const allocator = gpa.allocator();

    {
        var str = try DynamicString.initCapacity(allocator, 8);
        defer str.deinit();

        const my_text = "Hello, World!";

        for (my_text) |item| {
            try str.append(item);
        }

        std.debug.print("{s}", .{str.view()});
    }

    {
        var str = try DynamicStringUnmanaged.initCapacity(allocator, 8);
        defer str.deinit(allocator);

        const my_text = "Hello, World!";

        for (my_text) |item| {
            try str.append(allocator, item);
        }

        std.debug.print("{s}", .{str.view()});
    }
}

Obviously my code isn’t perfect either but at least it’s quite conventional, and I don’t think it would shock anybody. But to give you more explanation about why this code is more idiomatic.

My code unlike yours doesn’t commit to one type of allocator, it allows the user to provide whatever is more convenient, it offers one version that needs to be passed an explicit allocator, and one that stores it during the call to init. The code doesn’t use a custom byte type, just a u8, it doesn’t offer len/size function like in C++ because those properties can just be accessed directly. The naming convention is also a 1:1 mapping to the one in ArrayList.

Anyway which you the best

maysara-elshewehy · January 4, 2025, 11:15am

Respecting the traditional Zig style in designing libraries (such as the ArrayList interface) is certainly understandable and helps maintain consistency with the standard library. However, there’s no obligation to follow this pattern if my implementation offers additional features or improvements without introducing any extra overhead or downsides.

In my case, I designed String with a different concept to meet specific needs, such as flexibility in allocator customization and usage, as well as providing built-in memory management capabilities. The goal of this design is to offer a user-friendly and performance-oriented library suitable for various scenarios, without compromising on quality or efficiency.

I believe Zig empowers developers to innovate and create solutions tailored to their needs, as long as the implementation adheres to principles of efficiency, clarity, and simplicity—which my String library fully embraces.

pierrelgol · January 4, 2025, 11:43am

That’s true you are free to do whatever you want, I just believe that it would probably help your project in the long run to adhere to conventions, because it’s also a way to make it “easy” to the end user. It can even help brings more contributors, because one of the side effect of using an unconventional style is that end user who want to use your library, will have to “adapt” it to the rest of the code-base. I think a reasonable middle ground is to implement the ArrayList interface, and then add all the utilities and functions you wanted in the first, place, which in turns offers a high degree of flexibility.

swenninger · January 4, 2025, 11:44am

I would personally shy away from big claims such as “done right” and “highly efficient”, or “SuperZig”, when there are absolutely no benchmarks or comparisons of your implementation with other libraries/implementations.

You have clearly put a lot of effort into the presentation of your work though, so kudos to that!

Happy Zig coding!

IntegratedQuantum · January 4, 2025, 1:13pm

I think it is good that you are trying new things and trying to innovate.
However there are some conventions that don’t cost you anything, but make it so much easier for the average Zig user to get into your library.
One such example would be the init*() and deinit() to create/destroy objects.
If an interface doesn’t follow this, then it can really break the flow of programming.
And honestly, calling them make*() and free() instead doesn’t gain you anything anyways.

That’s why I’d also suggest you to take a look at the naming convention established by the ArrayList and other standard library data structures, and choose your own function names accordingly.

maysara-elshewehy · January 5, 2025, 10:13am

Hello everyone,

First, I’d like to apologize for the delay in responding. I also want to thank you all for your input and support. I’ve read every comment multiple times, and I’m truly grateful to each of you.

I’ve implemented all the changes and fixes you mentioned while maintaining my own style. Feel free to check them out again if you have the time—I’d be very happy to hear your thoughts.

(A small note: everything I’m doing here is purely for learning purposes. I learn best when working on something unfamiliar, and in my pursuit of perfection, I’m sure I’ll receive plenty of criticism, advice, and comments. These will ultimately guide me in the right direction. Thank you all once again!)

Travis · January 5, 2025, 10:29pm

I second this especially since you’ve stated

The most glaring inefficiency i see here is that each string has an optional GPA field - with size 376 in release small builds or 384 in debug builds.

I don’t see where you’ve followed any of the suggestions above. To me this comes across as yet another unfounded claim along with “done right” and “highly efficient”. It almost seems like ‘over promise and under deliver’ is your style.

ForeverZer0 · January 6, 2025, 7:30am

At a glance, some issues that immediately popped out:

Each String is 424/432 bytes (release/debug), not even including its actual data. This is quite heavy for what is essentially supposed to be an analog of a language primitive. Contrast this to the 16 byte type this is intending to replace.
The nullable ?std.heap.GeneralPurposeAllocator field in combination with a std.mem.Allocator field:
1. This should just be the std.mem.Allocator interface. A “typical” library would split this into String and StringUnmanaged types.
2. Falling back and using an internal allocator unknown to the user is an extremely bad practice that you should never do. If something requires an allocator, then require an allocator be given to it. This point cannot be stressed enough. One of the core principles of the language is no hidden memory allocations.
Use of anyerror when there is a known error set (i.e. std.mem.Allocatot.Error). Try to always use explicit error sets where possible. The use of anyerror can “color” functions to a degree when consumed by others, making it more difficult to use their own errors sets in any function that uses yours.
Naming conventions are terrible, there is no way to say this kindly. If you are going to break away from naming conventions used in Zig, at least choose something that is palatable. Underscore-prefixed public members for arguments is just plain disgusting. As petty as it may sound, I would not use a codebase for this issue alone. I know others have remarked about this, my recommendation would be to not dismiss this as just a minor stylistic disagreement: the naming conventions in Zig convey information to its users. While they aren’t compiler enforced, they are more than just what people think looks pretty.

Other than this, there is not much to remark on in terms of “positive”, as I don’t understand the use-case. It doesn’t seem to offer any additional functionality for interacting with strings, just a hefty wrapper over a []u8.

pub const String = @import("std").ArrayList(u8);

This alone provides same basic init/deinit boilerplate, in addition to quite a bit of actual useful functionality.