Need advice and Feedback for building a String Library

pierrelgol · April 11, 2024, 9:08am

Hi, I’m currently developing a String library for Zig, and I’d like some feedback and suggestions from more senior members of this community.

These are my goals :

Building a modular library.
Simple to understand and reliable.
Composable.
First-class interop with C strings.

So far my plan has been to do the following.

Build a layer of functions that work with C strings. On top of that layer, build a ZStringUnmanaged variant, which doesn’t store a pointer to an allocator. Then build Zstring by simply calling all of ZStringUnmanaged’s functions with the stored allocator. I took inspiration from the standard library.
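As a rough illustration of that layering, here is a minimal sketch. The type names come from the post, but the fields and the `append`/`deinit` methods are assumptions made for the example, not the author's actual code.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

/// Hypothetical unmanaged layer: the caller passes the allocator to every
/// function that may allocate or free.
pub const ZStringUnmanaged = struct {
    bytes: []u8 = &[_]u8{},

    /// Appends `extra`. `extra` must not alias `self.bytes`, because the
    /// buffer may move during reallocation.
    pub fn append(self: *ZStringUnmanaged, allocator: Allocator, extra: []const u8) Allocator.Error!void {
        const old_len = self.bytes.len;
        // Reallocating a zero-length slice behaves like a fresh allocation.
        self.bytes = try allocator.realloc(self.bytes, old_len + extra.len);
        @memcpy(self.bytes[old_len..], extra);
    }

    pub fn deinit(self: *ZStringUnmanaged, allocator: Allocator) void {
        allocator.free(self.bytes);
        self.* = .{};
    }
};

/// Hypothetical managed layer: stores the allocator and forwards every call.
pub const Zstring = struct {
    allocator: Allocator,
    unmanaged: ZStringUnmanaged = .{},

    pub fn init(allocator: Allocator) Zstring {
        return .{ .allocator = allocator };
    }

    pub fn append(self: *Zstring, extra: []const u8) Allocator.Error!void {
        return self.unmanaged.append(self.allocator, extra);
    }

    pub fn deinit(self: *Zstring) void {
        self.unmanaged.deinit(self.allocator);
    }
};
```

This sketch reallocates to the exact size on every append to stay short; a real implementation would probably track a separate capacity and grow geometrically.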

Now I was wondering whether you think this is a good and reasonable approach, and if not, what you would suggest? Also, when it comes to string manipulation I want to be exhaustive, so what kind of functions and what sort of behavior would you expect to find? Right now the Zstring can be configured to be either static, so if there is no room left you can’t append new values, or automatic, meaning it will grow on its own like a vector. I’m not sure whether these should instead be two separate types, like the difference between a String and a StringBuilder?
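One way to picture the "static" mode as its own type is a fixed-capacity buffer that reports an error when it is full. The name `StaticString` and its methods are hypothetical, chosen only for this sketch.

```zig
const std = @import("std");

/// Hypothetical fixed-capacity string: never allocates, never grows.
pub fn StaticString(comptime capacity: usize) type {
    return struct {
        buf: [capacity]u8 = undefined,
        len: usize = 0,

        const Self = @This();

        /// Fails with error.NoSpaceLeft instead of growing.
        pub fn append(self: *Self, extra: []const u8) error{NoSpaceLeft}!void {
            if (extra.len > capacity - self.len) return error.NoSpaceLeft;
            @memcpy(self.buf[self.len..][0..extra.len], extra);
            self.len += extra.len;
        }

        /// Returns the bytes written so far.
        pub fn slice(self: *const Self) []const u8 {
            return self.buf[0..self.len];
        }
    };
}
```

Keeping this separate from a growable type means the fixed variant needs no allocator at all, and the "buffer full" failure shows up in the error set rather than in a runtime configuration flag.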

AndrewCodeDev · April 11, 2024, 9:54am

I’ll throw in a quick plug here for our own @dude_the_builder’s work: GitHub - jecolon/zigstr: Zigstr is a UTF-8 string type for Zig programs.

The way I’d start is by wrapping standard library functions for convenience - stuff like indexOfScalar is SIMD’d so you’ll get good default search performance, etc. There’s a lot of work already done for you there and you could quickly get some work done and move on to less explored territory from there.

You could provide some functions like compare that would return an Order - there’s already a function like that in the standard library in mem:

/// Compares two slices of numbers lexicographically. O(n).
pub fn order(comptime T: type, lhs: []const T, rhs: []const T) math.Order {
    const n = @min(lhs.len, rhs.len);
    for (lhs[0..n], rhs[0..n]) |lhs_elem, rhs_elem| {
        switch (math.order(lhs_elem, rhs_elem)) {
            .eq => continue,
            .lt => return .lt,
            .gt => return .gt,
        }
    }
    return math.order(lhs.len, rhs.len);
}

Your biggest challenge imo is making it play nicely with other structures. I personally wouldn’t use a string that I couldn’t use in a hash map without having to write a lot of boilerplate. Thus, you may want to include some class level functions for interoperability.

Just some thoughts.

pierrelgol · April 11, 2024, 10:25am

Thanks, those are some nice suggestions; I hadn’t thought about that. When you talk about hash maps, do you mean providing a default context to pass to a hash map, or providing some standard functions with a few variations, like different eql and hash wrappers? In the future, I’d like to go further and implement more “advanced” things like a rope data structure using those strings, plus a few ideas I have with comptime.

AndrewCodeDev · April 11, 2024, 10:43am

Yes, default context could be a nice touch - for instance, the auto hash may not work for this because it converts the data structure to bytes and then hashes those. That sounds nice, but it would hash your pointers instead of the string data. However you choose to go about that is really a matter of taste, but I would like to see some effort in that direction for any library that I’d personally adopt. There are a lot of stand-alone data structures that are very cool, but if they’re a lot of work to integrate into a project, then I probably won’t reach for them very often.
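A default context along those lines could look like the sketch below. `Str`, `StrContext`, and `StrHashMap` are hypothetical names for the example; the `std.HashMap`, `std.hash.Wyhash.hash`, and `std.mem.eql` calls are from the standard library.

```zig
const std = @import("std");

/// Hypothetical string type wrapping a byte slice.
pub const Str = struct {
    bytes: []const u8,
};

/// Context that hashes and compares the string contents, not the pointer.
pub const StrContext = struct {
    pub fn hash(_: StrContext, key: Str) u64 {
        return std.hash.Wyhash.hash(0, key.bytes);
    }

    pub fn eql(_: StrContext, a: Str, b: Str) bool {
        return std.mem.eql(u8, a.bytes, b.bytes);
    }
};

/// Convenience alias so users do not have to spell out the context.
pub fn StrHashMap(comptime V: type) type {
    return std.HashMap(Str, V, StrContext, std.hash_map.default_max_load_percentage);
}
```

With this in place, two `Str` values that point at different memory but hold the same bytes resolve to the same map entry, which is the behavior a user would expect from a string key.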

It really depends (as most things do)… if I wanted to base my code around something in particular, I wouldn’t mind writing the boilerplate. If it’s something that I’d reach for when I just need a string, for instance, then it needs to be very easy to use in multiple contexts.

dude_the_builder · April 11, 2024, 10:57am

Yeah, Zigstr implements a string in a more object oriented style, much like the string objects you find in other languages like Ruby, Python, Java, etc. After developing the library, I realized it’s not exactly a perfect fit for Zig / C style projects. I realized that an approach focusing on functions that operate on strings would probably be much more useful, so I’m glad to see that you are considering that approach. Also, the C strings interop design decision up-front would be a really good idea since I’ve seen that even after 50 years of C, there’s still a need for good, user-friendly string libraries!

pierrelgol · April 11, 2024, 3:46pm

This makes a lot of sense. The reason I want to build this library is that there are a lot of projects I want to build for my school community in the future, and since Zig is changing so much, I want to build my own stuff so that I don’t depend as much on the standard library. But like I said, I also want first-class support for C strings.

The only thing I can’t really decide is whether I should build the functions around a custom type or just write a lot of functions around [*:0]u8 and []u8.

On one hand, I do see the benefit of having one distinct type, but on the other hand, like you said, for the sake of usability it might be better to drop that idea and simply support slices and C strings. Especially if Zigstr already covers the custom-type approach?

pierrelgol · April 11, 2024, 3:50pm

That’s exactly what I’m feeling right now: I should stick to a more C-style string manipulation, one where you simply pass a []u8 or a [*:0]u8 and go to town. Fortunately, I’m just exploring and playing with it for now, trying to use it as I build it to see whether anything in the API is quirky or awkward. Anyway, thanks for taking the time to respond; it’s really helpful to be able to test my ideas against more skilled individuals.

pierrelgol · April 11, 2024, 3:55pm

I just have one last question, about the type of C string I should use. Which do you think is the safer and better choice: declaring C strings as ?[:0]u8, or as [:0]u8 and returning errors instead of null? I haven’t made up my mind about it. On one hand, nullability is a common theme when you interact with C; since C has far less safety around null, you are at risk of getting null strings. On the other hand, using a non-optional type might make it easier to transition from C-style error handling to the Zig way of returning errors?
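The two options can also be combined: accept an optional pointer only at the C boundary and convert it to an error once, so the rest of the library works with non-optional slices. The sketch below assumes a hypothetical helper `fromC` and error name `NullString`; `std.mem.span` is the standard library call that finds the terminator.

```zig
const std = @import("std");

pub const CStrError = error{NullString};

/// Converts a possibly-null C string into a sentinel-terminated Zig slice.
/// A null pointer becomes an error instead of propagating as an optional.
pub fn fromC(ptr: ?[*:0]const u8) CStrError![:0]const u8 {
    const non_null = ptr orelse return error.NullString;
    // std.mem.span scans for the 0 terminator and returns a slice with a known length.
    return std.mem.span(non_null);
}
```

Callers then write `const s = try fromC(c_ptr);` and never have to unwrap an optional further down the call chain.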

dude_the_builder · April 11, 2024, 7:03pm

In Zigstr, I use std.ArrayList when I need a mutable string and []const u8 when no mutations are needed. I hide them behind a copy-on-write type of struct I made called CowList. This data structure is heavily inspired by Rust’s Cow where you only allocate when necessary, which is typical with strings. This works all fine and dandy, but it’s not as transparent as working directly with [:0]u8 and []const u8 etc. The other advantage of working directly with the native Zig types is that you get all the functionality that the standard library has for working with these for free.

nyc · April 11, 2024, 7:12pm

Rust still has COW strings? They ditched SSO a while ago (unless they re-added it), so I didn’t think they would keep the COW around either. COW is so workload dependent. With Zig it seems keeping track of ownership and allocations would be a massive pain. LLVM even removed it from their C++ strings a long time ago, I thought (and C++ has copy/move constructors and shared_ptr to help manage that too). I have to look at that code to see how you structured it to ease the pain.

edit: read it, very simple, nice, except: “Hella World!”. Use of the word Hella. Straight to jail.

pierrelgol · April 11, 2024, 7:24pm

Right now I feel like starting all over again, with Zig types and C types in mind. Like Andrew mentioned, it should be convenient or it shouldn’t exist at all; if it’s painful to use, or if you have to understand a new type, it might be too much overhead. Thanks for your suggestions, they are all very helpful in guiding me. CowList is something I had never heard of, and I’m glad I’ve learned about it thanks to you. A small question, if I may: is CowList really just a sort of shared interface that hides the fact that some strings might be immutable and others not? If I read your code correctly, at least that’s the intent I’m getting.

dude_the_builder · April 11, 2024, 8:57pm

Yes, I found the need to implement such a data structure because it felt inefficient to always copy and allocate a string when initializing a new Zigstr instance even though the use case might not require any allocations at all. For example, if you’re just iterating over the code points or grapheme clusters, or searching with indexOf or the like, you can work with the []const u8 passed in without having to copy and allocate. Only when you actually perform a mutating action such as appending to the string, do you really need to allocate some space and copy. So that’s what the CowList is for.
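The copy-on-write idea can be sketched in a few lines. This is not Zigstr's actual CowList; `CowBytes` and its methods are hypothetical and only illustrate borrowing until the first mutation.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

/// Hypothetical copy-on-write byte string (not Zigstr's CowList).
pub const CowBytes = struct {
    bytes: []const u8,
    owned: bool = false,

    /// Wraps a caller-owned slice without copying or allocating.
    pub fn borrow(bytes: []const u8) CowBytes {
        return .{ .bytes = bytes };
    }

    /// The first mutation is the point where a copy becomes necessary.
    /// For brevity, this sketch allocates a fresh buffer on every append.
    pub fn append(self: *CowBytes, allocator: Allocator, extra: []const u8) Allocator.Error!void {
        const joined = try allocator.alloc(u8, self.bytes.len + extra.len);
        @memcpy(joined[0..self.bytes.len], self.bytes);
        @memcpy(joined[self.bytes.len..], extra);
        if (self.owned) allocator.free(self.bytes);
        self.bytes = joined;
        self.owned = true;
    }

    /// Frees the buffer only if this value allocated it.
    pub fn deinit(self: *CowBytes, allocator: Allocator) void {
        if (self.owned) allocator.free(self.bytes);
        self.* = .{ .bytes = "" };
    }
};
```

Read-only operations such as searching work directly on the borrowed slice, so a value that is never mutated never touches the allocator.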

dude_the_builder · April 11, 2024, 9:00pm

LOL. I had forgotten about that.

pierrelgol · April 11, 2024, 9:13pm

It’s a very smart solution. Thanks for all the inspiration, advice, and experience; the Ziggit community is really awesome.

AndrewCodeDev · April 11, 2024, 10:38pm

@pierrelgol, what might be handy is a string wrapper to start with. Maybe it would dispatch to the correct functions depending on if it’s null-terminated or not. It would be nice to be able to write a.cmp(b) == .lt or something similar. I’m thinking like a utility wrapper to start, ya know?
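A utility wrapper in that spirit might look roughly like this. The `Str` type and its `from` and `cmp` methods are hypothetical; the dispatch on null-terminated pointers uses `std.mem.span`, and the comparison reuses `std.mem.order`.

```zig
const std = @import("std");

/// Hypothetical utility wrapper over a byte slice.
pub const Str = struct {
    bytes: []const u8,

    /// Accepts slices, string literals, and null-terminated many-item pointers.
    pub fn from(value: anytype) Str {
        const bytes: []const u8 = switch (@TypeOf(value)) {
            // Null-terminated pointers carry no length, so scan for the terminator.
            [*:0]const u8, [*:0]u8 => std.mem.span(value),
            // Slices and pointers to arrays coerce directly.
            else => value,
        };
        return .{ .bytes = bytes };
    }

    /// Lexicographic comparison of the underlying bytes.
    pub fn cmp(self: Str, other: Str) std.math.Order {
        return std.mem.order(u8, self.bytes, other.bytes);
    }
};
```

Since the switch is resolved at compile time per call site, `Str.from(c_ptr).cmp(Str.from("banana")) == .lt` compiles down to the same `std.mem.order` call regardless of which string form was passed in.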

I might actually use that - could be interesting. I’ll message ya and maybe we can kick around some ideas.