I think Zig made the right choice here, for a different reason: I don’t believe that a minimalist UTF-8 encoded string, like Rust has, carries its weight.
You get two guarantees: the string is valid UTF-8, and slicing it will still result in well-formed UTF-8. The mechanism here is crude: it panics if you slice anywhere that would create invalid UTF-8, and if you ‘side-load’ an invalid string, this creates undefined behavior (meaning: the type system will not actually prevent this from happening, you have to use the API properly).
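To make the panic behavior concrete, a small self-contained Rust example (my own illustration, not from anything above):

```rust
fn main() {
    let s = "héllo"; // 'é' is encoded as two bytes: 0xC3 0xA9
    // Slicing on a codepoint boundary works:
    assert_eq!(&s[0..1], "h");
    // Slicing through the middle of 'é' panics at runtime —
    // the invariant is enforced dynamically, not by the type system:
    let sliced = std::panic::catch_unwind(|| s[0..2].to_string());
    assert!(sliced.is_err());
}
```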
This is both too strict and not strict enough. Creating a `String` with `.from_utf8()` validates the entire string, so there’s an escape hatch in `.from_utf8_unchecked()`, but again, now you have undefined behavior if that assumption is wrong.
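Both sides of that trade-off in a short Rust sketch:

```rust
fn main() {
    // String::from_utf8 validates the whole buffer and returns a Result:
    assert!(String::from_utf8(vec![0x68, 0x69]).is_ok());   // "hi"
    assert!(String::from_utf8(vec![0xFF, 0xFE]).is_err());  // invalid UTF-8

    // from_utf8_unchecked skips validation. Feeding it invalid bytes
    // is undefined behavior; the `unsafe` block is a promise the
    // compiler cannot check.
    let s = unsafe { String::from_utf8_unchecked(vec![0x68, 0x69]) };
    assert_eq!(s, "hi");
}
```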
Meanwhile, there’s a need to handle, in particular, WTF-8, and ‘probably-valid’ strings, and so on. Rust does have mechanisms for all of that, but at the expense of a proliferation of string types.
This type of string has no opinion about two critical properties of a Unicode string: normalization, and grapheme clusters. There are tools to work with that, but you won’t get a panic if you cut a grapheme cluster in half, and if you lexicographically compare two differently-normalized strings, your answer will be incorrect.
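The normalization point is easy to demonstrate: U+00E9 and U+0065 U+0301 are canonically equivalent spellings of ‘é’, but byte-wise comparison treats them as unequal and even sorts them apart (a Rust illustration; the same holds for raw bytes in Zig):

```rust
fn main() {
    let nfc = "\u{e9}";   // precomposed 'é' (NFC form)
    let nfd = "e\u{301}"; // 'e' + combining acute accent (NFD form)
    // Canonically equivalent to a reader, but comparison is byte-wise:
    assert_ne!(nfc, nfd);
    assert!(nfd < nfc); // 0x65 sorts before the lead byte 0xC3
    assert_eq!(nfc.len(), 2); // two bytes
    assert_eq!(nfd.len(), 3); // three bytes
}
```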
Zig’s `unicode` library is a few functions short imho: in particular, it should have `thisIndex`, `previousIndex`, and `nextIndex` (names borrowed from Julia and adapted to Zig convention), which collectively make it simple to ensure that a slice doesn’t truncate a codepoint. While I’m at it, I find the interface for `utf8Decode` annoyingly strict, and would like something which doesn’t make me slice out the exact character I want in order to get a codepoint. But that’s a pretty minor quibble.
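To sketch what those three helpers would do, here is a hedged version in Rust over raw bytes (snake_case names and signatures are my own; the actual Julia functions and any eventual Zig API may differ). The trick is that UTF-8 continuation bytes are self-identifying (`0b10xx_xxxx`), so boundaries can be found locally:

```rust
/// True if `b` is a UTF-8 continuation byte (0b10xx_xxxx).
fn is_continuation(b: u8) -> bool {
    b & 0xC0 == 0x80
}

/// Index of the first byte of the codepoint containing index `i`.
fn this_index(bytes: &[u8], mut i: usize) -> usize {
    while i > 0 && is_continuation(bytes[i]) {
        i -= 1;
    }
    i
}

/// Index of the first byte of the previous codepoint, if any.
fn previous_index(bytes: &[u8], i: usize) -> Option<usize> {
    let start = this_index(bytes, i);
    if start == 0 { None } else { Some(this_index(bytes, start - 1)) }
}

/// Index of the first byte of the next codepoint, if any.
fn next_index(bytes: &[u8], i: usize) -> Option<usize> {
    let mut j = this_index(bytes, i) + 1;
    while j < bytes.len() && is_continuation(bytes[j]) {
        j += 1;
    }
    if j < bytes.len() { Some(j) } else { None }
}

fn main() {
    let s = "héllo".as_bytes(); // 'h' = 1 byte, 'é' = 2 bytes
    assert_eq!(this_index(s, 2), 1);           // byte 2 is inside 'é'
    assert_eq!(previous_index(s, 2), Some(0)); // back to 'h'
    assert_eq!(next_index(s, 1), Some(3));     // past 'é' to 'l'
}
```

With these, “round down to `this_index` before slicing” is all it takes to avoid truncating a codepoint.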
Zig more generally would benefit from distinct types, so that we could make a `utf8_slice` type, for instance. That would make it easier to impose a validation barrier on functions which do expect correct encoding. Wrapping what you need in a struct is, eh, an adequate substitute.
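Rust’s newtype pattern is one form of that validation barrier; a hypothetical `Utf8Slice` wrapper (my own names, not a real API in either language):

```rust
// Hypothetical newtype imposing a validation barrier: the only way to
// construct one is through `new`, which validates, so downstream
// functions taking Utf8Slice can trust the encoding.
struct Utf8Slice<'a>(&'a [u8]);

impl<'a> Utf8Slice<'a> {
    fn new(bytes: &'a [u8]) -> Option<Utf8Slice<'a>> {
        // Validate once at the boundary; reject invalid UTF-8.
        std::str::from_utf8(bytes).ok().map(|_| Utf8Slice(bytes))
    }

    fn as_bytes(&self) -> &'a [u8] {
        self.0
    }
}

fn main() {
    assert!(Utf8Slice::new(b"hello").is_some());
    assert!(Utf8Slice::new(&[0xFF]).is_none());
    assert_eq!(Utf8Slice::new(b"hello").unwrap().as_bytes(), b"hello");
}
```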
I’ve written a whole lot of low-level code which handles UTF-8 as bytes, and what’s interesting is that Unicode has a whole section on what to do when encountering invalid encoding. It isn’t actually any faster to deal with strings which are maybe-UTF-8 than ones which are definitely-UTF-8, you just need to decide what the policy is for invalid byte sequences.
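Rust’s lossy conversion is one such policy: invalid sequences become U+FFFD, the substitution the Unicode standard recommends.

```rust
fn main() {
    let bytes = [0x66, 0x6F, 0xFF, 0x6F]; // "fo", one invalid byte, "o"
    // Policy: replace invalid sequences with U+FFFD REPLACEMENT CHARACTER.
    let s = String::from_utf8_lossy(&bytes);
    assert_eq!(s, "fo\u{FFFD}o");
}
```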
So Zig treating strings as an abstract byte sequence, with UTF-8 as a privileged encoding, is IMHO the correct choice. This lets the library/package ecosystem explore various ways to deal with the full complexity of Unicode, with `std` sticking to the basics (as I said, I think it’s missing three basics, but not more than that).
I agree with you that the other stable equilibrium is a string type which genuinely puts all the work in to have a full suite of Unicode behaviors, and it does make sense for Swift to provide that. Raku does it as well in a sort of quirky way. But I’ve never been convinced of the value of the many halfway measures between slice-o’-bytes and index-by-grapheme-cluster.