I may have found an issue with diagnostic column reporting when the source code contains multi-byte UTF-8 characters.
It seems that the internal byte offset is correct, but the visual column indicator (the caret ^) becomes misaligned after non-ASCII characters.
Minimal example:
const x = "áéíóú"
Compiling this comes out with:
test.zig:1:23: error: expected ';' after declaration
const x = "áéíóú"
^
Observed behavior:
The error is reported at a column that appears shifted to the right of the actual visual position in the source line: "áéíóú" is five codepoints but ten UTF-8 bytes, so the reported column 23 is a byte count, while the missing ';' visually sits at column 18. The caret ends up five cells too far to the right.
Hypothesis:
It looks like the column calculation might be based on byte offsets rather than UTF-8 codepoints, which causes misalignment when multibyte characters are present.
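For illustration (my own snippet, not compiler code), a small test showing the mismatch for the example string; std.unicode.utf8CountCodepoints is the standard library helper that does the counting:

const std = @import("std");

test "byte length vs codepoint count" {
    const s = "áéíóú"; // five codepoints, each encoded as two UTF-8 bytes
    try std.testing.expectEqual(@as(usize, 10), s.len); // .len counts bytes
    try std.testing.expectEqual(@as(usize, 5), try std.unicode.utf8CountCodepoints(s));
}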
Environment:
- Zig version: 0.16.0
Questions:
Is this a known limitation or an intentional design decision?
Should diagnostics be byte-based, or is visual alignment expected to account for UTF-8?
If this is considered a bug, I’d be interested in working on a fix.
If needed, I can provide more detailed examples or test cases.
Try asking on https://zsf.zulipchat.com/. Some of the Zig core team hang out on Ziggit, so they may respond here, but this is more of a community space, not a venue for Zig compiler development.
There’s a very slim chance that this will be addressed. And by slim I mean zero. The dev team has been adamant about keeping Unicode support out of the standard library, because a proper implementation is both huge and a pain to do correctly.
Visual alignment of Unicode text requires full Unicode support (width tables and grapheme clustering), but the Zig standard library will not take on Unicode data dependencies, so we should expect diagnostics to stay byte-based.
I wonder if it’s possible to ask the terminal how it rendered the text. Zig already detects whether output goes to a TTY or a file, so in theory it could also ask the terminal for the cursor position after rendering a number of bytes.
It would be trivial to count Unicode code points instead of bytes, because the source code is UTF-8 encoded.
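A sketch of what that could look like (a hypothetical helper, not actual compiler code), walking the line with the byte-sequence lengths the standard library already exposes:

const std = @import("std");

// Hypothetical helper: convert a byte offset within a line to a
// 1-based column counted in codepoints. An invalid lead byte falls
// back to one column per byte.
fn codepointColumn(line: []const u8, byte_offset: usize) usize {
    var col: usize = 1;
    var i: usize = 0;
    while (i < byte_offset and i < line.len) {
        const n = std.unicode.utf8ByteSequenceLength(line[i]) catch 1;
        i += n;
        col += 1;
    }
    return col;
}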
This isn’t absolutely perfect, but it’s a very good compromise.
If you count bytes only, you might as well demand that Zig source be ASCII.
Most languages based on Latin, Greek, or Cyrillic scripts need letters outside of ASCII; English is the exception.
And these ordinary letters are rendered one cell wide in a UTF-8 terminal.
Maybe it would help to output an additional note after the marker when the bytes up to the marker are not all ASCII, e.g. "the marker position might not be accurate because Zig counts bytes and this line contains symbols that might not be one byte wide".
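A possible shape for that check (just a sketch with made-up names; line and byte_col would come from the diagnostic):

const std = @import("std");

// Sketch: print the note only when a non-ASCII byte appears before the caret.
fn maybePrintEncodingNote(line: []const u8, byte_col: usize) void {
    const prefix = line[0..@min(byte_col, line.len)];
    const all_ascii = for (prefix) |b| {
        if (b >= 0x80) break false;
    } else true;
    if (!all_ascii) {
        std.debug.print("note: the marker is byte-based and may be misaligned " ++
            "because this line contains multi-byte characters\n", .{});
    }
}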
Zig requires ASCII for identifiers, and that’s fine (if you are as old as me you know everything else is only asking for trouble sooner or later, just like spaces in file names).
But it should be possible to use normal words in string literals and still get a reasonable marker on errors.
I don’t know about Asian and Arabic scripts and such, but I guess that for these languages the rule of thumb “one blank per code point” is in many cases still much better than “one blank per byte”, even though East Asian wide characters typically occupy two terminal cells.
I’m not talking about funny emojis - if you use those in source code, it’s your problem.
Three: you can query the cursor position after outputting a run of text using CSI 6 n. The terminal will then report where the cursor is, and you can use that to calculate the rendered width of your text.
… so the compiler could query the cursor position after outputting the line’s bytes up to the error position (and before the rest), then output the rest of the line, and then use the terminal’s answer to get the number of blanks needed?
This would require that Zig knows that output goes to a terminal (not to a file) and that the terminal supports this control sequence.
Would this work on MS Windows? I’m not sure.
Edit: Sorry, my first sentence sounds complicated; I hope it’s understandable.
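For what it’s worth, a rough POSIX-only sketch of that CSI 6 n round trip (my own, untested; it omits the tcsetattr dance needed to read the reply without waiting for Enter, and real code would first check std.posix.isatty on both ends):

const std = @import("std");

// Ask the terminal where the cursor is. The reply arrives on stdin
// as ESC [ row ; col R. Reading it reliably requires non-canonical
// mode (tcsetattr), which this sketch leaves out.
fn queryCursorColumn() !usize {
    _ = try std.posix.write(std.posix.STDOUT_FILENO, "\x1b[6n");

    var buf: [32]u8 = undefined;
    const n = try std.posix.read(std.posix.STDIN_FILENO, &buf);
    const reply = buf[0..n];

    const semi = std.mem.lastIndexOfScalar(u8, reply, ';') orelse return error.BadReply;
    const end = std.mem.indexOfScalar(u8, reply, 'R') orelse return error.BadReply;
    if (end <= semi + 1) return error.BadReply;
    return std.fmt.parseInt(usize, reply[semi + 1 .. end], 10);
}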