Why no builtin string type?

Hi, quick question. Why doesn’t Zig have a builtin string type with UTF-8 encoding?

I know there is zig-string, but strings are one of the most basic types in a modern language.

Is it something that can be added later, or does the Zig team want to keep the std smaller?

2 Likes

Because strings are complicated.

That zig-string library you linked, for example, would mishandle comparison:

var myString = String.init(allocator);
defer myString.deinit();

try myString.concat("Ç");
assert(myString.cmp("Ç"));

This assertion would fail, even though the strings appear to be identical. That’s because the first uses Normalization Form D: C (U+0043) + ◌̧ (U+0327), while the second uses Normalization Form C: Ç (U+00C7). To actually compare UTF-8 strings in ways a human might expect, decisions about normalization need to be made.
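To make the byte-level difference concrete, here is a minimal sketch using only std.mem.eql and \u{...} escapes (it does not use zig-string, and the escapes are my spelling of the two forms described above):

```zig
const std = @import("std");

test "byte-wise equality does not see canonical equivalence" {
    // NFC: one precomposed codepoint, U+00C7 (UTF-8 bytes C3 87).
    const nfc = "\u{00C7}";
    // NFD: 'C' (U+0043) followed by combining cedilla U+0327 (bytes 43 CC A7).
    const nfd = "C\u{0327}";

    // Both render as "Ç", but the byte sequences differ.
    try std.testing.expect(!std.mem.eql(u8, nfc, nfd));
    try std.testing.expectEqual(@as(usize, 2), nfc.len);
    try std.testing.expectEqual(@as(usize, 3), nfd.len);
}
```

Any comparison that should treat these as equal has to normalize both sides first, which is where the Unicode data comes in.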

The above is just one example. This series of articles by @dude_the_builder details the complications of Unicode well:

(note that ziglyph has now been superseded by zg)

So, for Zig to have a ‘proper’ UTF-8 String implementation, it would need to embed the Unicode data and handle all the complications that come with Unicode. My understanding is that’s not something that Zig-the-language or Zig-the-standard-library wants to take on if it doesn’t have to (especially since the Unicode data is a moving target).

Additionally, a UTF-8 String type is unable to handle arbitrary data, meaning the String type could not be used for a lot of the things Zig cares about: file paths, environment variables, etc. See Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc by squeek502 · Pull Request #19005 · ziglang/zig · GitHub for more details on that sort of thing.

21 Likes

library String

a String library that takes into account UTF8 and unicode

1 Like

I have an alternative answer!

There might be two different things which you could call a string type.

One is a string as an abstract sequence of Unicode codepoints. This type doesn’t really tell you how it is represented internally, but provides a rich set of Unicode-aware operations. So, you could do normalization, iteration over grapheme clusters, lowercase/uppercase, etc. Maybe even right-to-left!

This type of String is really hard for the reasons that @squeek502 mentioned. One other aspect of hardness here is that the semantics of the operations boil down to “what the Unicode standard says”, and the Unicode standard is not immutable! New versions are released once in a while. So, to handle this properly, you’ll also need to bundle data from the current Unicode standard, and have some means to update that data for existing apps. It’s pretty clear why this shouldn’t be in the stdlib for most programming languages, unless the language is explicitly about building UIs (that’s why Swift has exactly this type!).


But there’s a second string possible: a UTF-8 encoded byte buffer. So, this type guarantees UTF-8 validity, directly exposes the underlying representation, but is otherwise agnostic about Unicode. Its comparison operations are based on the underlying bytes. It might even have some conveniences like to_ascii_uppercase or split_ascii_whitespace which have fixed semantics, independent of the version of the Unicode standard (there are parts of the standard, like ASCII, that are explicitly designated as constant across standard versions).
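As a rough illustration of a “fixed semantics” convenience, here is a sketch using std.ascii.upperString from Zig’s standard library; the input string is just an example I picked. Only the bytes 'a'..'z' are changed, so multi-byte UTF-8 sequences pass through untouched and no Unicode tables are needed:

```zig
const std = @import("std");

test "ASCII-only uppercasing leaves non-ASCII bytes untouched" {
    // "straße": U+00DF (ß) is encoded as the two bytes C3 9F.
    const input = "stra\u{00DF}e";
    var buf: [input.len]u8 = undefined;

    // upperString writes the converted bytes into buf and returns the used slice.
    const upper = std.ascii.upperString(&buf, input);

    // ß is not ASCII, so it is left alone (a Unicode-aware uppercase would give "SS").
    try std.testing.expectEqualStrings("STRA\u{00DF}E", upper);
}
```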

This is still a useful type, for two reasons:

  • First, unlike []const u8, it knows how to print itself as text.
  • Second, checking UTF-8 validity is costly (linear), so you usually want to preserve “this was validated to be UTF-8” in types. This is non-trivial: extracting a substring requires an O(1) check to make sure you don’t cut a code point encoded as multiple bytes in half!
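The O(1) substring check can be sketched like this. isCharBoundary and sliceChecked are hypothetical names (not part of Zig’s std), and the sketch assumes the input was already validated as UTF-8:

```zig
const std = @import("std");

/// Hypothetical helper: true if byte offset `i` starts a codepoint or is `s.len`.
/// Assumes `s` is already valid UTF-8.
fn isCharBoundary(s: []const u8, i: usize) bool {
    if (i >= s.len) return i == s.len;
    // Continuation bytes have the bit pattern 0b10xxxxxx.
    return (s[i] & 0xC0) != 0x80;
}

/// Hypothetical checked slice: returns null instead of splitting a codepoint.
fn sliceChecked(s: []const u8, start: usize, end: usize) ?[]const u8 {
    if (start > end) return null;
    if (!isCharBoundary(s, start) or !isCharBoundary(s, end)) return null;
    return s[start..end];
}

test "slicing only at codepoint boundaries" {
    const s = "a\u{00C7}b"; // bytes: 61 | C3 87 | 62

    try std.testing.expect(sliceChecked(s, 0, 1) != null);
    // Offset 2 is in the middle of Ç, so this slice is refused.
    try std.testing.expect(sliceChecked(s, 1, 2) == null);
    try std.testing.expectEqualStrings("\u{00C7}", sliceChecked(s, 1, 3).?);
}
```

Only the two endpoints are inspected, so the cost does not depend on the length of the string.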

This type is reasonable to have in a low-level programming language. That’s what Rust does: its string cmp is byte-based, its string slicing is byte-offset based. This is not incorrect! Not all operations are human-directed! If you, e.g., grep logs for a regex, you want to do byte-wise matching, and do any Unicode normalization before that.


So why wouldn’t Zig have this type, like Rust?

My answer would be that Zig doesn’t have one thing that Rust has: Rust, like C++, follows the mantra that built-in types and user-defined types should be interchangeable. There are no extra syntactic affordances for built-ins; generally, everything you can do with a built-in, you can do with a custom type (reborrowing is a notable exception).

In Zig, in contrast, built-in types are special. Slices, tuples and arrays have [] syntax, but you can’t simulate that in a user-defined type. So Zig APIs lean more heavily on exposing internal representation directly. E.g., see how std.ArrayList exposes the .items field, which you can use if you need slice syntax. Conversely, the cost of custom types is higher, as they are less convenient.

So, unlike Rust, Zig can’t implement a Utf8String as a library type which nonetheless feels native. It has to either:

  • Make it a library type with a worse interface than a slice
  • Make it a compiler built-in

But built-ins are costly, you generally have to have very good reason to have one, and []const u8 works fine.

A couple of other considerations:

  • String type leads to combinatorial explosion: you need a string slice, a heap allocated string, a heap-allocated growable string, a stack-allocated string, a stack-allocated growable string.
  • Zig is a very representation-centered language. Representation-wise, []const u8 and a hypothetical const str are the same; they only differ in invariants (UTF-8 validity). But Zig is not too big on enforcing semantic invariants in types.

14 Likes

I think Zig made the right choice here, for a different reason: I don’t believe that a minimalist UTF-8 encoded string, like Rust has, carries its weight.

You get two guarantees: the string is valid UTF-8, and cutting it will still result in well-formed UTF-8. The mechanism here is crude: it panics if you slice anywhere which will create invalid UTF-8, and if you ‘side-load’ an invalid string, this creates undefined behavior (meaning: the type system will not actually prevent this from happening, you have to use the API properly).

This is both too strict, and not strict enough. Creating a String with .from_utf8() validates the entire string, so there’s an escape hatch with .from_utf8_unchecked, but again, now you have undefined behavior if that assumption is wrong.

Meanwhile, there’s a need to handle, in particular, WTF-8, and ‘probably-valid’ strings, and so on. Rust does have mechanisms for all of that, but at the expense of a proliferation of string types.

This type of string has no opinion about two critical properties of a Unicode string: normalization, and grapheme clusters. There are tools to work with that, but you won’t get a panic if you cut a grapheme cluster in half, and if you lexicographically compare two differently-normalized strings, your answer will be incorrect.

Zig’s unicode library is a few functions short imho: in particular it should have thisIndex, previousIndex, and nextIndex (names borrowed from Julia and adapted to Zig convention), which collectively make it simple to ensure that a slice doesn’t truncate a codepoint. While I’m at it, I find the interface for utf8Decode annoyingly strict, and would like something which doesn’t make me slice out the exact character I want in order to get a codepoint. But that’s a pretty minor quibble.
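To show roughly what I mean, here is a sketch of those three functions plus a more forgiving decode. None of thisIndex, previousIndex, nextIndex or decodeAt exist in std.unicode; the names, signatures and the TruncatedSequence error are my own assumptions. The index helpers assume valid UTF-8, and decodeAt leans on the existing std.unicode.utf8ByteSequenceLength and std.unicode.utf8Decode:

```zig
const std = @import("std");

fn isContinuation(b: u8) bool {
    return (b & 0xC0) == 0x80;
}

/// Hypothetical: start offset of the codepoint that contains byte `i`.
fn thisIndex(s: []const u8, i: usize) usize {
    std.debug.assert(i < s.len);
    var j = i;
    while (j > 0 and isContinuation(s[j])) j -= 1;
    return j;
}

/// Hypothetical: start offset of the following codepoint, or s.len at the end.
fn nextIndex(s: []const u8, i: usize) usize {
    std.debug.assert(i < s.len);
    var j = i + 1;
    while (j < s.len and isContinuation(s[j])) j += 1;
    return j;
}

/// Hypothetical: start offset of the preceding codepoint, or null if there is none.
fn previousIndex(s: []const u8, i: usize) ?usize {
    const start = thisIndex(s, i);
    if (start == 0) return null;
    return thisIndex(s, start - 1);
}

/// Hypothetical: decode the codepoint starting at `i` without pre-slicing it.
fn decodeAt(s: []const u8, i: usize) !u21 {
    const len = try std.unicode.utf8ByteSequenceLength(s[i]);
    if (i + len > s.len) return error.TruncatedSequence;
    return std.unicode.utf8Decode(s[i .. i + len]);
}

test "index helpers on a mixed-width string" {
    const s = "a\u{00C7}\u{20AC}"; // bytes: 61 | C3 87 | E2 82 AC

    try std.testing.expectEqual(@as(usize, 1), thisIndex(s, 2));
    try std.testing.expectEqual(@as(usize, 3), nextIndex(s, 1));
    try std.testing.expectEqual(@as(usize, 6), nextIndex(s, 4));
    try std.testing.expectEqual(@as(?usize, 1), previousIndex(s, 4));
    try std.testing.expectEqual(@as(?usize, null), previousIndex(s, 0));
    try std.testing.expectEqual(@as(u21, 0x20AC), try decodeAt(s, 3));
}
```

With these, trimming a slice so it doesn’t truncate a codepoint is just s[thisIndex(s, a)..nextIndex(s, b)].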

Zig more generally would benefit from distinct types, so that we could make a type utf8_slice, for instance. That would make it easier to impose a validation barrier on functions which do expect correct encoding. Wrapping what you need in a struct is eh, an adequate substitute.
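The struct-wrapping substitute might look something like this sketch. Utf8Slice and countCodepoints are hypothetical names; the only std call is std.unicode.utf8ValidateSlice:

```zig
const std = @import("std");

/// Hypothetical wrapper: a byte slice that was checked to be valid UTF-8
/// when it went through init().
const Utf8Slice = struct {
    bytes: []const u8,

    pub fn init(bytes: []const u8) error{InvalidUtf8}!Utf8Slice {
        if (!std.unicode.utf8ValidateSlice(bytes)) return error.InvalidUtf8;
        return Utf8Slice{ .bytes = bytes };
    }
};

/// A function that relies on well-formed input asks for the wrapper,
/// so callers have to pass the validation barrier first.
fn countCodepoints(s: Utf8Slice) usize {
    var n: usize = 0;
    for (s.bytes) |b| {
        // In valid UTF-8, every non-continuation byte starts one codepoint.
        if ((b & 0xC0) != 0x80) n += 1;
    }
    return n;
}

test "validation barrier" {
    try std.testing.expectError(error.InvalidUtf8, Utf8Slice.init("\xff"));

    const ok = try Utf8Slice.init("a\u{00C7}");
    try std.testing.expectEqual(@as(usize, 2), countCodepoints(ok));
}
```

Nothing stops someone from writing Utf8Slice{ .bytes = garbage } directly, so this is a convention rather than a guarantee, and you have to reach through .bytes for slice syntax.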

I’ve written a whole lot of low-level code which handles UTF-8 as bytes, and what’s interesting is that Unicode has a whole section on what to do when encountering invalid encoding. It isn’t actually any faster to deal with strings which are maybe-UTF-8 than ones which are definitely-UTF-8, you just need to decide what the policy is for invalid byte sequences.
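Here is a sketch of one such policy: decode as you go, and substitute U+FFFD for each byte that doesn’t start a well-formed sequence. nextLossy is a hypothetical helper, and replacing one byte at a time is just one possible choice, not necessarily the practice the Unicode standard recommends. It uses std.unicode.utf8ByteSequenceLength and std.unicode.utf8Decode:

```zig
const std = @import("std");

/// Hypothetical lossy decoder: returns the next codepoint starting at `i.*`
/// and advances `i`. Policy (an assumption of this sketch): every byte that
/// does not begin a well-formed sequence yields one U+FFFD.
fn nextLossy(s: []const u8, i: *usize) ?u21 {
    if (i.* >= s.len) return null;
    const start = i.*;

    const len = std.unicode.utf8ByteSequenceLength(s[start]) catch {
        // Not a valid lead byte (stray continuation byte, 0xFF, ...).
        i.* += 1;
        return 0xFFFD;
    };

    if (start + len <= s.len) {
        if (std.unicode.utf8Decode(s[start .. start + len])) |cp| {
            i.* += len;
            return cp;
        } else |_| {}
    }

    // Truncated, overlong, or otherwise malformed: skip one byte.
    i.* += 1;
    return 0xFFFD;
}

test "invalid bytes become U+FFFD" {
    const s = "a\xffb";
    var i: usize = 0;
    try std.testing.expectEqual(@as(?u21, 'a'), nextLossy(s, &i));
    try std.testing.expectEqual(@as(?u21, 0xFFFD), nextLossy(s, &i));
    try std.testing.expectEqual(@as(?u21, 'b'), nextLossy(s, &i));
    try std.testing.expectEqual(@as(?u21, null), nextLossy(s, &i));
}
```

The valid path does the same work either way; the only extra cost is deciding what happens on the error branch.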

So Zig treating strings as an abstract byte sequence, with UTF-8 as a privileged encoding, is IMHO the correct choice. This lets the library/package ecosystem explore various ways to deal with the full complexity of Unicode, with std sticking to the basics (as I said, I think it’s missing three basics, but not more than that).

I agree with you that the other stable equilibrium is a string type which genuinely puts all the work in to have a full suite of Unicode behaviors, and it does make sense for Swift to provide that. Raku does it as well in a sort of quirky way. But I’ve never been convinced of the value of the many halfway measures between slice-o’-bytes and index-by-grapheme-cluster.

5 Likes

I find this issue relevant: Improved handling of strings and unicode · Issue #234 · ziglang/zig · GitHub

2 Likes

The ability to flag a []const u8 (or []const u16) as being a string is something sorely lacking. I really hope Tags · Issue #1099 · ziglang/zig · GitHub will be implemented in 0.14.0.

1 Like

I’d say that #5132 and/or #5195 is what’s called for. I don’t want to get into the weeds on the tags proposal, but the ability for the type system to handle distinct builtin and primitive types, the way it can handle two structs with identical layout and field names as different types, would be pretty useful.

I used to think that not having a distinct “String” type was a limitation of Zig. After some hands-on experience trying to implement Unicode algorithms and library functionality (Ziglyph, Zigstr, and zg), I’m convinced that Strings are too high-level for such a low-level programming language to provide as a builtin / native type. If it were to be included, I see it in the standard library just like ArrayList, bit sets, and the other data structures.

10 Likes