Possible diagnostics column misalignment with UTF-8 multibyte characters

Hi,

I might have found an issue related to diagnostics column reporting when the source code contains UTF-8 multibyte characters.

It seems that the internal byte offset is correct, but the visual column indicator (the caret ^) becomes misaligned after non-ASCII characters.

Minimal example:

const x = "áéíóú"

The compiler comes out with:

test.zig:1:23: error: expected ';' after declaration
const x = "áéíóú"
                      ^

Observed behavior:
The error is reported with a column that appears shifted to the right compared to the actual visual position in the source line.

Hypothesis:
It looks like the column calculation might be based on byte offsets rather than UTF-8 codepoints, which causes misalignment when multibyte characters are present.

Environment:

- Zig version: 0.16.0

Questions:

  • Is this a known limitation or an intentional design decision?
  • Should diagnostics be byte-based, or is visual alignment expected to account for UTF-8?
  • If this is considered a bug, I’d be interested in working on a fix.

If needed, I can provide more detailed examples or test cases.

Thanks.


Try asking on https://zsf.zulipchat.com/. The Zig core team does hang out on Ziggit, so they may respond here, but this is more of a community space than a place for Zig compiler development.


I didn’t know, thanks for the heads up.

There’s a very slim chance that this will be addressed. And by slim I mean zero. The dev team has been adamant about keeping Unicode support out of the standard library, because a proper implementation is both huge and a pain to do correctly.


Visual alignment of Unicode characters requires full Unicode support, but the Zig standard library will not take on Unicode dependencies, so we should expect diagnostics to remain byte-based.


I wonder if it’s possible to ask the terminal how it rendered the provided text. Zig already detects whether output goes to a tty or a file, so it could theoretically also ask for the cursor position after rendering a given number of bytes.


It would be trivial to count Unicode code points instead of bytes, because the source code is UTF-8 encoded.

This isn’t absolutely perfect, but a very good compromise.

If you count bytes only, you might as well demand that Zig source be ASCII.

Most languages based on the Latin, Greek, or Cyrillic scripts need letters outside of ASCII; English is the exception.

And these normal letters are shown one unit wide in a UTF-8 terminal.

Maybe it would help to print an additional note after the marker when the bytes up to the marker are not all ASCII, e.g. “the marker position might not be accurate because Zig counts bytes and this line contains symbols wider than one byte.”

Zig requires ASCII for identifiers, and that’s fine (if you are as old as me you know everything else is only asking for trouble sooner or later, just like spaces in file names).

But it should be possible to use normal words in string literals and still get a reasonable marker on errors.

I don’t know about Asian and Arabic languages and such, but I guess that in many cases for these languages the rule of thumb “use one blank per code point” is still much better than “one blank per byte”.

I’m not talking about funny emojis - if you use those in source code, it’s your problem.


Counting codepoints is problematic even for the example given in the OP: “é” can be encoded as a single codepoint or as “e” plus a combining acute accent.


Yes, that’s true, it’s not perfect, but even in this case it’s closer to the target position than counting bytes.

I’m German, and in my experience at least 99% of the input data I’ve seen from quite different sources uses a single code point for umlauts.

What’s the experience in other languages?

OT: At work, I also stumbled over an incorrectly used diaeresis instead of a combining diaeresis in a file name generated on a not-a-pear device.

I think aro (the C compiler used by Zig) keeps track of width that way for error reporting. Gets you most of the way.

So the trade-off seems clear:

  • byte-based columns are simple but visually inaccurate

  • full Unicode handling is too complex for Zig’s goals

  • codepoint counting isn’t perfect, but covers most real-world cases

Would a “best-effort” approach be acceptable? For example:

  • count UTF-8 leading bytes as 1 column

  • ignore continuation bytes

  • no grapheme clusters or display width rules

This would improve alignment for common cases like Latin-based languages without introducing Unicode dependencies.

Also, since aro already tracks width for diagnostics, is there prior art there worth reusing?


Relevant: Grapheme Clusters and Terminal Emulators – Mitchell Hashimoto

So would you suggest using this:

Three, you can query the cursor position after outputting a series of text using CSI 6 n. The terminal will then report where the cursor is and you can use this to calculate the width of your text.

… after outputting the line’s bytes up to the error position (and before the rest of it), then outputting the rest, and then using the terminal’s answer to get the number of blanks needed?

This would require that Zig knows that output goes to a terminal (not to a file) and that the terminal supports this control sequence.

Would this work on MS Windows? I’m not sure.

Edit: Sorry, my first sentence sounds complicated, I hope it’s understandable.

According to Console Virtual Terminal Sequences - Windows Console | Microsoft Learn, this “query the cursor position” sequence should work on Windows, too.

I’m pretty sure Zig already does some of this, at least for the fancy progress tree thing we see during builds.
