Why no builtin string type?

I have an alternative answer!

There are two different things you could call a string type.

One is string as an abstract sequence of Unicode codepoints. This type doesn’t really tell you how it is represented internally, but it provides a rich set of Unicode-aware operations. So, you could do normalization, iteration over grapheme clusters, lowercase/uppercase, etc. Maybe even right-to-left!

This type of String is really hard for the reasons that @squeek502 mentioned. One other aspect of the hardness is that the semantics of the operations boil down to “whatever the Unicode standard says”, and the Unicode standard is not immutable! New versions are released once in a while. So, to handle this properly, you’ll also need to bundle data from the current Unicode standard, and have some means to update that data for existing apps. It’s pretty clear why this shouldn’t be in the stdlib for most programming languages, unless the language is explicitly about building UIs (that’s why Swift has exactly this type!).


But there’s a second possible string type: a UTF-8 encoded byte buffer. This type guarantees UTF-8 validity and directly exposes the underlying representation, but is otherwise agnostic about Unicode. Its comparison operations are based on the underlying bytes. It might even have some conveniences like to_ascii_uppercase or split_ascii_whitespace, which have fixed semantics independent of the version of the Unicode standard (there are parts of the standard, like ASCII, that are explicitly designated as constant across versions).
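
For concreteness, here is a minimal sketch of what such a type could look like as a library type (Utf8String, init, and order are hypothetical names; std.unicode.utf8ValidateSlice and std.mem.order are real stdlib functions):

```zig
const std = @import("std");

/// Hypothetical library type: a byte slice that is guaranteed to hold
/// well-formed UTF-8, but is otherwise Unicode-agnostic.
const Utf8String = struct {
    bytes: []const u8,

    /// O(n) validation, done once at construction; afterwards the
    /// invariant is carried by the type.
    pub fn init(bytes: []const u8) error{InvalidUtf8}!Utf8String {
        if (!std.unicode.utf8ValidateSlice(bytes)) return error.InvalidUtf8;
        return .{ .bytes = bytes };
    }

    /// Comparison is defined on the underlying bytes, not on any
    /// Unicode collation order.
    pub fn order(a: Utf8String, b: Utf8String) std.math.Order {
        return std.mem.order(u8, a.bytes, b.bytes);
    }
};
```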

This is still a useful type, for two reasons:

  • First, unlike []const u8, it knows how to print itself as text.
  • Second, validating UTF-8 is costly (linear), so you usually want to preserve “this was validated to be UTF-8” in types. This is non-trivial: extracting a substring requires only O(1) validation to make sure you don’t cut a code point encoded as multiple bytes in half (see the sketch after this list)!
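
Here’s a sketch of that boundary check (both helpers are hypothetical; the trick is that in UTF-8 every continuation byte has the form 0b10xxxxxx, so it is enough to look at the two edges of the slice, which is also what Rust’s str::is_char_boundary does):

```zig
const std = @import("std");

/// True if `i` does not point into the middle of a multi-byte code
/// point in the (already validated) UTF-8 text `s`.
fn isCharBoundary(s: []const u8, i: usize) bool {
    if (i == 0 or i == s.len) return true;
    if (i > s.len) return false;
    return (s[i] & 0xC0) != 0x80; // continuation bytes are 0b10xxxxxx
}

/// Slicing text already known to be valid UTF-8 needs only these two
/// O(1) boundary checks, not a full O(n) re-validation.
fn substring(s: []const u8, start: usize, end: usize) ?[]const u8 {
    if (start > end) return null;
    if (!isCharBoundary(s, start) or !isCharBoundary(s, end)) return null;
    return s[start..end];
}

test "substring boundary checks" {
    const s = "héllo"; // 'é' is encoded as two bytes, 0xC3 0xA9
    try std.testing.expect(substring(s, 1, 3) != null); // "é", ok
    try std.testing.expect(substring(s, 2, 3) == null); // splits 'é'
}
```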

This type is reasonable to have in a low-level programming language. That’s what Rust does: its string comparison is byte-based, its string slicing is byte-offset based. This is not incorrect! Not all operations are human-directed! If you, e.g., grep logs for a regex, you want to do byte-wise matching, and do any Unicode normalization before that.
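
That byte-wise behavior is exactly what Zig’s stdlib already provides on plain slices; for example (std.mem.indexOf and std.mem.order are real functions, the log line is made up):

```zig
const std = @import("std");

test "machine-directed string operations are byte-wise" {
    const log = "2024-01-01 ERROR: disk full";
    // Substring search over raw bytes: no normalization, no locale.
    try std.testing.expect(std.mem.indexOf(u8, log, "ERROR") != null);
    // Ordering is plain lexicographic byte order.
    try std.testing.expect(std.mem.order(u8, "abc", "abd") == .lt);
}
```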


So why wouldn’t Zig have this type, like Rust?

My answer would be that Zig doesn’t have one thing that Rust has: Rust, like C++, follows the mantra that built-in types and user-defined types should be interchangeable. There are no extra syntactic affordances for built-ins; generally, everything you can do with a built-in, you can do with a custom type (reborrowing is a notable exception).

In Zig, in contrast, built-in types are special. Slices, tuples, and arrays have [] syntax, but you can’t simulate that in a user-defined type. So Zig APIs lean more heavily on exposing the internal representation directly. E.g., see how std.ArrayList exposes the .items field, which you can use if you need slice syntax (example below). Conversely, the cost of custom types is higher, as they are less convenient.
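
For example (assuming the managed ArrayList API of Zig 0.14 and earlier; newer versions pass the allocator to each call instead):

```zig
const std = @import("std");

test "ArrayList exposes its representation" {
    var list = std.ArrayList(u8).init(std.testing.allocator);
    defer list.deinit();
    try list.appendSlice("hello");

    // .items is a plain []u8, so ordinary slice syntax applies to it
    // directly; the wrapper itself adds no indexing sugar.
    try std.testing.expectEqual(@as(u8, 'h'), list.items[0]);
    try std.testing.expectEqualStrings("ell", list.items[1..4]);
}
```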

So, unlike Rust, Zig can’t implement a Utf8String as a library type which nonetheless feels native. It has to either:

  • Make it a library type with a worse interface than a slice
  • Make it a compiler built-in

But built-ins are costly; you generally have to have a very good reason to add one, and []const u8 works fine.

A couple of other considerations:

  • A string type leads to a combinatorial explosion: you need a string slice, a heap-allocated string, a heap-allocated growable string, a stack-allocated string, a stack-allocated growable string.
  • Zig is a very representation-centered language. Representation-wise, []const u8 and a hypothetical const str are the same; they only differ in invariants (UTF-8 validity). But Zig is not too big on enforcing semantic invariants in types.