A feature, not a bug. A string on a modern system is a sequence of bytes which are supposedly UTF-8, and that is how Zig sees them.
This is probably the all-time widest gap between “thing language designers think users need to do” and “thing users need to do”. Languages have tied themselves into pretzels to provide this, but user code almost never needs it.
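To make the "bunch of bytes" point concrete, here is a minimal sketch. The name `greeting` is only for illustration, and it assumes the source file is saved as UTF-8, which Zig requires anyway:

```zig
const std = @import("std");

// A string is just a slice of bytes. 'é' (U+00E9) takes two bytes in UTF-8.
const greeting: []const u8 = "héllo";

pub fn main() void {
    // .len counts bytes (code units), not "characters".
    std.debug.print("len = {d}\n", .{greeting.len}); // len = 6
    // Indexing yields a byte: here, the first byte of the encoded 'é'.
    std.debug.print("greeting[1] = 0x{x}\n", .{greeting[1]}); // greeting[1] = 0xc3
}
```

Five "characters", six bytes, and nothing in the language pretends otherwise.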
To echo what @matklad said, strings are composed of substrings, where the boundaries are whatever you need them to be. “Characters” (Unicode scalar values) are very seldom interesting substrings in a string.
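As a rough sketch of what "the boundaries are whatever you need them to be" looks like in practice, here is a hypothetical helper (`nthField` is my name, not something in std) that pulls out a comma-separated field without decoding anything. It is safe on UTF-8 because bytes below 0x80 never appear inside a multi-byte sequence:

```zig
const std = @import("std");

/// Hypothetical helper: returns the n-th comma-separated field, or null
/// if there are fewer than n + 1 fields. Works purely on bytes.
pub fn nthField(line: []const u8, n: usize) ?[]const u8 {
    var it = std.mem.splitScalar(u8, line, ',');
    var i: usize = 0;
    while (it.next()) |field| : (i += 1) {
        if (i == n) return field;
    }
    return null;
}

pub fn main() void {
    const line = "naïve,日本語,zig";
    if (nthField(line, 1)) |field| {
        // The field is a substring of the original bytes; no copy, no decoding.
        std.debug.print("{s} ({d} bytes)\n", .{ field, field.len });
    }
}
```

The interesting substrings here are fields, not scalar values, and the byte-level split finds them correctly.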
You’ll want to use a Writer or ArrayList(u8) for that, depending on specifics.
Mostly a hobby these days but it’s been a job as well.
The truth is that Unicode has a surprising amount of detail. I don’t agree at all with @andrewrk’s list of things you need Unicode handling for: there are a great many more domains than just those, basically anything that is “text”. If your program deals with text, you’ll need to embrace Unicode.
This was linked in one of @squeek502’s quoted blocks, but I wanted to draw your attention to zg, which covers a lot of the basics of working with Unicode.
The key here is that the baseline abstraction of UTF-8 is the code unit, which is just u8, and it’s not actually possible to reduce the essential complexity of Unicode text handling by trying to build a baseline above that.
Even just validating that you have UTF-8 is more opinionated than it looks. Rust guarantees that its str and String types are always validly encoded, and I think that’s a mistake. The alternative is to deal with malformed input at the point where it is actually encountered. I happen to think that is the better choice for most software, and mandatory pre-validation precludes it.
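Here is a small sketch of the "deal with it when it happens" approach, under my own assumptions: `countScalars` is a hypothetical helper, and the only std calls used are `std.unicode.Utf8View.init`, `std.unicode.utf8ValidateSlice`, and `std.mem.indexOf`. The bytes travel around unvalidated, and the error surfaces only at the one place that actually decodes:

```zig
const std = @import("std");

/// Hypothetical helper: counts Unicode scalar values. Validation happens
/// here, at the point of decoding, and nowhere earlier.
pub fn countScalars(bytes: []const u8) !usize {
    // Utf8View.init returns error.InvalidUtf8 for malformed input.
    const view = try std.unicode.Utf8View.init(bytes);
    var it = view.iterator();
    var n: usize = 0;
    while (it.nextCodepoint()) |_| n += 1;
    return n;
}

pub fn main() void {
    const bad = "ok\xffok"; // 0xFF can never appear in well-formed UTF-8

    // Byte-level operations neither know nor care about the stray byte.
    std.debug.print("found: {}\n", .{std.mem.indexOf(u8, bad, "ok") != null});

    // Only code that needs scalar values has to decide what to do about it.
    if (countScalars(bad)) |n| {
        std.debug.print("{d} scalars\n", .{n});
    } else |err| {
        std.debug.print("decode failed: {s}\n", .{@errorName(err)});
    }
}
```

Whether to reject, replace, or skip the bad byte is a decision for that call site, which is exactly the choice that up-front validation takes away.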
There’s room for better ergonomics in std.unicode, and maybe another feature or two. But the “encoded bytes” model which Zig uses for string data is something I fully support.