Possible diagnostics column misalignment with UTF-8 multibyte characters

Hi,

I might have found an issue related to diagnostics column reporting when the source code contains UTF-8 multibyte characters.

It seems that the internal byte offset is correct, but the visual column indicator (the caret ^) becomes misaligned after non-ASCII characters.

Minimal example:

const x = "áéíóú"

The compiler comes out with:

test.zig:1:23: error: expected ';' after declaration
const x = "áéíóú"
                      ^

Observed behavior:
The error is reported with a column that appears shifted to the right compared to the actual visual position in the source line.

Hypothesis:
It looks like the column calculation might be based on byte offsets rather than UTF-8 codepoints, which causes misalignment when multibyte characters are present.

Environment:

- Zig version: 0.16.0

Questions:

  • Is this a known limitation or an intentional design decision?
  • Should diagnostics be byte-based, or is visual alignment expected to account for UTF-8?
  • If this is considered a bug, I’d be interested in working on a fix.

If needed, I can provide more detailed examples or test cases.

Thanks.


Try asking on https://zsf.zulipchat.com/. The Zig core team does hang out on Ziggit, so they may respond here, but this is more of a community space than a place for Zig compiler development.


I didn’t know, thanks for the heads up.

There’s a very slim chance that this will be addressed. And by slim I mean zero. The dev team has been adamant about keeping Unicode support out of the standard library, because a proper implementation is both huge and a pain to do correctly.


Visual alignment of Unicode characters requires full Unicode support, but the Zig standard library will not take on Unicode dependencies, so we should expect diagnostics to remain byte-based.


I wonder if it’s possible to ask the terminal how it rendered the provided text. Zig already detects whether output goes to a tty or a file, so it could theoretically also ask for the cursor position after rendering a given number of bytes.


It would be trivial to count Unicode code points instead of bytes, because the source code is UTF-8 encoded.

This isn’t absolutely perfect, but a very good compromise.

If you count bytes only, you might as well demand that Zig source be ASCII.

Most languages based on the Latin, Greek, or Cyrillic scripts need letters outside of ASCII; English is the exception.

And these normal letters are shown one unit wide in a UTF-8 terminal.

Maybe it would help to print an additional note after the marker when the bytes up to the marker are not all ASCII, e.g. “the marker position might not be accurate because Zig counts bytes and this line contains symbols wider than one byte.”

Zig requires ASCII for identifiers, and that’s fine (if you are as old as me you know everything else is only asking for trouble sooner or later, just like spaces in file names).

But it should be possible to use normal words in string literals and still get a reasonable marker on errors.

I don’t know about Asian and Arabic languages and such, but I guess that in many cases for these languages the rule of thumb “use one blank per code point” is still much better than “one blank per byte”.

I’m not talking about funny emojis - if you use those in source code, it’s your problem.


Counting codepoints is problematic even for the example given in the OP: “é” can be encoded as a single codepoint or as “e” plus a combining acute accent.


Yes, that’s true, it’s not perfect, but even in this case it’s closer to the target position than counting bytes.

I’m German, and in my experience at least 99% of the input data I’ve seen from quite different sources uses a single code point for umlauts.

What’s the experience in other languages?

OT: At work, I also stumbled over an incorrectly used diaeresis instead of a combining diaeresis in a file name generated on a not-a-pear device.

I think aro (the C compiler used by Zig) keeps track of width that way for error reporting. Gets you most of the way.

So the trade-off seems clear:

  • byte-based columns are simple but visually inaccurate

  • full Unicode handling is too complex for Zig’s goals

  • codepoint counting isn’t perfect, but covers most real-world cases

Would a “best-effort” approach be acceptable? For example:

  • count UTF-8 leading bytes as 1 column

  • ignore continuation bytes

  • no grapheme clusters or display width rules

This would improve alignment for common cases like Latin-based languages without introducing Unicode dependencies.

Also, since aro already tracks width for diagnostics, is there prior art there worth reusing?


Relevant: Grapheme Clusters and Terminal Emulators – Mitchell Hashimoto

So would you suggest using this:

Three, you can query the cursor position after outputting a series of text using CSI 6 n. The terminal will then report where the cursor is and you can use this to calculate the width of your text.

… after outputting the line’s bytes up to the error position (and before the rest of it), then outputting the rest, and then using the terminal’s answer to get the number of blanks needed?

This would require that Zig knows that output goes to a terminal (not to a file) and that the terminal supports this control sequence.

Would this work on MS Windows? I’m not sure.

Edit: Sorry, my first sentence sounds complicated, I hope it’s understandable.

According to Console Virtual Terminal Sequences - Windows Console | Microsoft Learn, this “query the cursor position” sequence should work on Windows, too.

I’m pretty sure Zig already does some of this, at least for the fancy progress tree thing we see during builds.
