The Incredible Unicode Mess

ericlang · November 9, 2024, 1:57pm

Up until now I almost never had anything to do with the complicated Unicode specs.
In Delphi and C# and Rust things run pretty smoothly (until you do some tricky Code Golf puzzles).

The first ‘problem’ in Zig is that you cannot see the difference between a string and any other bunch of bytes. Well ok, that is kind of solvable using good parameter / variable names.
Second ‘problem’ is we cannot index chars inside a string.
Third ‘problem’ is we cannot easily append stuff to a string.

There are probably more ‘problems’ like this.

I would like to know how people deal with this kind of stuff. I guess not everyone is diving into the unicode world. That’s a lifetime job.
What would be a really good String type?
Or does nobody experience problems?

I had a short look at zigstring on github, but it did not really catch me.

squeek502 · November 9, 2024, 2:21pm

Some relevant links to posts I’ve made:

(also see the other responses in that thread)

Quoting from the first link:

This series of articles by @dude_the_builder details the complication of Unicode well:

Unicode Basics in Zig - Zig NEWS

Ziglyph Unicode Wrangling - Zig NEWS

Unicode String Operations - Zig NEWS

(note that ziglyph has now been superseded by zg)

pierrelgol · November 9, 2024, 5:45pm

I’m not dealing with unicode strings, but I do appreciate that this is not a native type, especially since Zig doesn’t support operator overloading: with a built-in string, if you don’t like that type, you are kind of stuck with a type that has some sort of operator privilege while you might want another type to suit your needs. I’m currently developing a small String container on the side, but it’s more at the experimentation level than a std-grade string type. What I find is that it’s actually nice to have everything be a byte; as for indexing multiple bytes, you can probably use some @Vector and load them in a smarter way. There is a talk that I love, Plain Text by Dylan Beattie, where he dives into the details of text representation in programming and all of its quirks. From previous conversations here on ziggit, I think most people agreed that a standard, baked-in native unicode string type is a very hard problem, and one that doesn’t seem to align well with Zig’s zen.

matklad · November 9, 2024, 7:28pm

Not a string expert enough to give a full answer, but I can address this point!

In general, “characters” are a poor mental model. Strings do not consist of characters. Strings consist of other strings!

That is, if a given API needs to subdivide a string into components, it should return substrings, not individual chars.

Even if something looks like it should be a single “character”, it almost certainly should be modeled as a string internally. E.g., you could think that pressing a key on a keyboard types a single “character”, but, due to how input methods work, a “key press” might result in, e.g., an emoji, which is several unicode code points!

I think the only use-case where you should be concerned with individual characters is when converting from one encoding into the other. For every other use-case, a string is the atomic element, and there isn’t anything smaller than a string.

ericlang · November 9, 2024, 8:06pm

I don’t think that ‘characters’ are a poor mental model! Then an array would be poor as well.
Rather, I think Unicode is a poor design that we got stuck with.
In the old simple days a character was one byte. Nowadays if we used a u32 we would have a lot of indexable unique characters in our hands.
And if we want to save memory we could maybe say to the stream of characters how big they are: 1, 2 or 4.
At least: that is how my simple mind works.

Anyway: the world is as it is and I will have a look at all the links.

Edit: I agree on the simple stream of bytes it now is. That is the best Zig could choose I believe!

gonzo · November 9, 2024, 8:11pm

Not exactly. What you see on the screen as a single entity (an emoji, for example), could be in fact several Unicode characters composed into that emoji; each of those characters could have more than one byte. So even if you could index into a string of emojis by Unicode character, you would not be guaranteed to iterate over each emoji.

I think this is what @matklad was trying to say.

ericlang · November 9, 2024, 8:14pm

O my god really…

dude_the_builder · November 9, 2024, 8:30pm

These compositions of code points are called Grapheme Clusters, and in most cases, a grapheme cluster is actually what a human would call a character. There is no limit to the number of code points that can compose a single GC. For example, the dark skinned woman astronaut is composed of the code points for woman, dark skin tone, and rocket, united with a special invisible code point called Zero Width Joiner (ZWJ). These types of combinations are not just for emoji; writing systems like Chinese, Japanese, Hindi, and Korean can combine code points to form the final glyph that you see printed. And yes, to make things just a little more interesting, the actual encoding used (UTF-8, 16, 32) will determine how many and which bytes make up each code point.
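To make the astronaut example concrete, here is a minimal sketch that compares the byte count with the code point count of that one grapheme cluster. The names `astronaut` and `countCodepoints` are made up for illustration; only `std.unicode.utf8CountCodepoints` comes from the standard library. Note that the standard library counts code points, not grapheme clusters; for clusters you would need something like zg.

```zig
const std = @import("std");

// Woman + dark skin tone + Zero Width Joiner + rocket (hypothetical name).
const astronaut = "\u{1F469}\u{1F3FF}\u{200D}\u{1F680}";

/// Counts Unicode code points in a UTF-8 slice (not grapheme clusters).
/// Returns an error if the slice is not valid UTF-8.
fn countCodepoints(s: []const u8) !usize {
    return std.unicode.utf8CountCodepoints(s);
}

pub fn main() !void {
    // One "character" on screen: 15 bytes, 4 code points.
    std.debug.print("bytes={d} codepoints={d}\n", .{
        astronaut.len,
        try countCodepoints(astronaut),
    });
}
```

So indexing by byte or by code point would both land inside what a reader sees as one character.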

pierrelgol · November 9, 2024, 8:31pm

Maybe the best of both worlds is to keep the mental model of arrays of bytes, and use an iterator as the interface to operate over each unicode element, or code point (I’m no expert on the appropriate scale at play). The iterator gives you a handy and consistent abstraction over unicode manipulation, regardless of whether a language has a native string type.

andrewrk · November 9, 2024, 8:38pm

Unless you’re implementing a user interface framework (e.g. competing with Qt, Gtk, a web browser, or a terminal) or doing natural language processing, you should almost certainly treat all user input strings as opaque encoded bytes, and never try to understand what is inside of them other than looking for delimiters.
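A small sketch of the "opaque bytes plus delimiters" approach, assuming a made-up `key=value` line format and a hypothetical helper `valueOf`. It works on raw bytes because an ASCII byte such as `=` never occurs inside a multi-byte UTF-8 sequence, so the search cannot split a code point.

```zig
const std = @import("std");

/// Returns everything after the first '=' in `line`, or null if there is none.
/// The value is never decoded; it stays an opaque slice of the input.
fn valueOf(line: []const u8) ?[]const u8 {
    const eq = std.mem.indexOfScalar(u8, line, '=') orelse return null;
    return line[eq + 1 ..];
}

pub fn main() void {
    const line = "name=žvrčati";
    if (valueOf(line)) |v| std.debug.print("{s}\n", .{v});
}
```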

If you’re making a hand-rolled video game user interface, you should consider what languages you need to support, and then either use a solution that accomplishes that (such as harfbuzz), or use a simpler thing knowing that you’re limiting what languages your game will support.

To reiterate, it’s very likely a bug and/or poorly designed software if you are iterating over codepoints and doing something with them.

ericlang · November 9, 2024, 9:31pm

Ok, if we want dark skinned female redhaired greeneyed astronauts, 4294967295 characters is not enough.
I will have a look at ‘zg’ as well and what’s in the std library.
@andrew: I am implementing a simple 2d game in English only

dude_the_builder · November 9, 2024, 10:06pm

In that case you can actually do as in the “simple times” and just treat each byte in the []const u8 as an ASCII character. UTF-8 is a superset of ASCII, where the first 128 code points are the same single bytes as in ASCII, so string literals and UTF-8 input that you know is just English can be treated as ASCII. You can iterate over it with for, you can index into it with [], use the many utils in std.mem, etc.
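As a rough illustration of the ASCII-only case, the sketch below loops over the bytes with `for` and indexes with `[]`. The helper `upperAscii` is a hypothetical name; `std.ascii.toUpper` only changes the bytes `a` to `z`, so any byte at or above 0x80 passes through untouched.

```zig
const std = @import("std");

/// Uppercases ASCII letters in place; all other bytes are left as they are.
fn upperAscii(buf: []u8) void {
    for (buf) |*c| c.* = std.ascii.toUpper(c.*);
}

pub fn main() void {
    var word = "zebra".*; // mutable copy of the literal
    upperAscii(&word);
    const s: []const u8 = &word;
    // Plain byte indexing is fine as long as the text is known to be ASCII.
    std.debug.print("{s} first={c} len={d}\n", .{ s, s[0], s.len });
}
```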

pachde · November 9, 2024, 10:17pm

In your case, then, you don’t need any special Unicode awareness/support. You can index at will. If you want to append, use one of a number of string builder patterns (e.g. arraylist).
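One way appending can look, as a sketch only: it uses `std.mem.concat` rather than an ArrayList, because the ArrayList API differs between Zig releases. The helper name `appendSuffix` is hypothetical, and the caller owns the returned memory.

```zig
const std = @import("std");

/// Returns a newly allocated slice holding `base` followed by `suffix`.
/// The caller must free the result with the same allocator.
fn appendSuffix(allocator: std.mem.Allocator, base: []const u8, suffix: []const u8) ![]u8 {
    return std.mem.concat(allocator, u8, &[_][]const u8{ base, suffix });
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const word = try appendSuffix(allocator, "zig", "zag");
    defer allocator.free(word);
    std.debug.print("{s}\n", .{word});
}
```

For many small appends in a loop, a growable buffer avoids reallocating on every call.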

pachde · November 9, 2024, 10:21pm

English in general imports quite a few foreign words, including their accents. And in modern English, emojis are liberally used. If you want to support showing people’s names, then you need more support even if your users are reading and writing only English.

You can’t really blame Unicode for this. It’s hard to blame a committee for not creating a perfectly formed standard at version 1. Blame the amazing diversity and complexity of human writing systems.

mutech · November 9, 2024, 10:54pm

I love Zig’s approach to strings, even though I spent decades working with decent implementations of string libraries (decent = works without problems in my use cases).

The reason why Zig’s unicode implementation (void) works so well is that you almost never work on unicode strings.

In user interfaces, you almost always work with either labels or text fields and somebody else took care of unicode, or you are working with what users understand a “character” to be. A letter, smiley, a flag or whatever looks like a character and that is almost never something you as a programmer would want to consider “a character”.

When you do schema validation the first time, you almost certainly don’t understand what “character length” means, and it’s certainly not something you can print as a validation error to a user (length of a DB column usually being bytes vs. length of a text on the screen): tell the user it’s “too long” and they call the hotline because it isn’t too long according to their counting.

But if you actually have to work with unicode, you usually have to get it right. And then most of the implementations that work so well in day-to-day programming fall apart (look for unicode in javascript). It’s a challenge to understand what a space is in unicode and then to understand whether that understanding matches the requirements in a software project that processes text. Is a zero width space a word separator in your context? Or is it an escape character? Or something nobody uses anyway until this crazy guy does and writes a stupid bug report…

As for indexing: You can always convert UTF-8 into []u32. Or [][]u32 if your code thinks in graphemes. But whether your code is then more performant than code that just scans UTF-8 should be subject to profiling. I couldn’t guess. You can always use u16 and take care of code units instead of code points, like in JS. Is this actually better? Maybe.
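A sketch of that conversion, under a few assumptions: it uses `u21` instead of `u32` because that is the type `nextCodepoint` returns in `std.unicode`, the helper name `decode` is hypothetical, and it counts the code points first so it can allocate exactly once.

```zig
const std = @import("std");

/// Decodes UTF-8 into an indexable slice of code points.
/// Returns an error on invalid UTF-8; the caller frees the result.
fn decode(allocator: std.mem.Allocator, s: []const u8) ![]u21 {
    const n = try std.unicode.utf8CountCodepoints(s);
    const out = try allocator.alloc(u21, n);
    errdefer allocator.free(out);

    var it = (try std.unicode.Utf8View.init(s)).iterator();
    var i: usize = 0;
    while (it.nextCodepoint()) |cp| : (i += 1) out[i] = cp;
    return out;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const cps = try decode(allocator, "žvrčati");
    defer allocator.free(cps);
    // Code points can now be indexed directly, e.g. cps[3] is 'č' (269).
    std.debug.print("len={d} cps[3]={d}\n", .{ cps.len, cps[3] });
}
```

This indexes code points only; a grapheme cluster can still span several entries.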

I was surprised to see that Zig has no strings. But this is probably one of the wisest decisions made in decades.

mnemnion · November 9, 2024, 11:03pm

Feature, not a bug. A string on a modern system is a bunch of bytes which are supposedly UTF-8, which is how Zig sees them.

This is probably the all-time difference between “thing language designers think users need to do” and “thing users need to do”. Languages have tied themselves into pretzels to provide this, but user code doesn’t need it almost ever.

To echo what @matklad said, strings are composed of substrings, where the boundaries are whatever you need them to be. “Characters” (Unicode scalar values) are very seldom interesting substrings in a string.

You’ll want to use a Writer or ArrayList(u8) for that, depending on specifics.

Mostly a hobby these days but it’s been a job as well.

The truth is that Unicode has a surprising amount of detail. I don’t agree at all with @andrewrk’s list of things you need Unicode handling for, there are a great deal more domains than just those, basically: “text”. If your program deals with text, you’ll need to embrace Unicode.

This was linked in one of @squeek502’s quoted blocks, but I wanted to draw your attention to zg, which covers a lot of the basics of working with Unicode.

The key here is that the baseline abstraction of UTF-8 is the codeunit, which is just u8, and it’s not actually possible to reduce the essential complexity of Unicode text handling by trying to build a baseline above that.

Even just validating that you have UTF-8 is more opinionated than it looks. Rust ensures that any of its various string types are properly encoded, and I think that’s a mistake. An alternative is to just deal with any malformation one encounters when that happens, and I happen to think this is the better choice for most software, one which mandatory pre-validation precludes.
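For comparison, validation in Zig is opt-in: nothing checks a `[]const u8` unless you ask. A minimal sketch, where `isUtf8` is a hypothetical wrapper around `std.unicode.utf8ValidateSlice`:

```zig
const std = @import("std");

/// True if `bytes` is well-formed UTF-8. Nothing calls this implicitly;
/// a program decides if and when it wants the check.
fn isUtf8(bytes: []const u8) bool {
    return std.unicode.utf8ValidateSlice(bytes);
}

pub fn main() void {
    const good = "žvrčati";
    const bad = "\xC4"; // lead byte of a two-byte sequence, continuation missing
    std.debug.print("good={} bad={}\n", .{ isUtf8(good), isUtf8(bad) });
}
```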

There’s room for better ergonomics in std.unicode, and maybe another feature or two. But the “encoded bytes” model which Zig uses for string data is something I fully support.

ericlang · November 26, 2024, 9:27pm

Some Slovenian words I am trying…

acefaličen
adergaški
žvordati
žvrčati

[a]=97 / [c]=99 / [e]=101 / [f]=102 / [a]=97 / [l]=108 / [i]=105 / [─ì]=269 / [e]=101 / [n]=110 / 
[a]=97 / [d]=100 / [e]=101 / [r]=114 / [g]=103 / [a]=97 / [┼í]=353 / [k]=107 / [i]=105 / 
[┼¥]=382 / [v]=118 / [o]=111 / [r]=114 / [d]=100 / [a]=97 / [t]=116 / [i]=105 / 
[┼¥]=382 / [v]=118 / [r]=114 / [─ì]=269 / [a]=97 / [t]=116 / [i]=105 /

Obviously I am doing it completely wrong.

  // Read file line by line
  var it = std.mem.splitAny(u8, file_buffer, &.{13, 10});
  while (it.next()) |word|
  {
      if (word.len == 0) continue;
      const view: std.unicode.Utf8View = try std.unicode.Utf8View.init(word);
      var uni = view.iterator();
      while (uni.nextCodepoint()) |u|
      {
          std.debug.print("[{u}]={} / ", .{u ,u});
      }
      std.debug.print("\n", .{});
  }

I also can’t make anything from a binary look…
slotest.txt (47 Bytes)

Basically it is for my scrabble program supporting non-standard cases.
Each character (I hope normal characters fit inside a u21) must be mapped.

rockorager · November 26, 2024, 9:36pm

Does your terminal have a font that can display those characters? I copied your code and it works for me:

(Also, {} prints decimal, so š, which is U+0161, prints as 353.)
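If you would rather see the usual U+XXXX notation, a hex format specifier does it. A small sketch, with `codepointLabel` as a hypothetical helper built on `std.fmt.bufPrint`:

```zig
const std = @import("std");

/// Formats a code point as "U+XXXX" (uppercase hex, zero-padded to 4 digits)
/// into `buf` and returns the written slice.
fn codepointLabel(buf: []u8, cp: u21) ![]u8 {
    return std.fmt.bufPrint(buf, "U+{X:0>4}", .{cp});
}

pub fn main() !void {
    var buf: [16]u8 = undefined;
    const label = try codepointLabel(&buf, 353); // 353 decimal is 0x161
    std.debug.print("{s}\n", .{label});
}
```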

ericlang · November 26, 2024, 9:39pm

oh really…
It is the terminal in vscode.
And also the windows console (when i run the exe) shows the same.

rockorager · November 26, 2024, 9:45pm

That is strange, both of those show the characters just fine for me with your code. You might try a different font and see if that fixes it.

Your code is correct for showing codepoints, though. This is just a font issue.