Why do Zig and other languages not accept arbitrary byte sequences in their string literals?

I’m probably asking a somewhat silly question, but since I cannot answer it clearly myself and didn’t find the answer on the web, I’m asking here :slight_smile:

I’m reading Zig’s tokenizer (zig/lib/std/zig/tokenizer.zig at master · ziglang/zig · GitHub) and wonder why Zig (and now I wonder about other languages as well) does not accept arbitrary byte sequences in its string literals (say, between "..." on a single line). For example, why does Zig ban the U+0085 (NEL), U+2028 (LS), and U+2029 (PS) characters? Why not allow just anything, as long as that anything_byte_rubbish does not contain the closing "?

Also, Zig does not accept the NUL (0) character, but in that case it makes sense, at least because the source code is expected to be a string terminated with 0.

I guess if you had arbitrary bytes directly, you couldn’t have special syntax for your literals: if anything were allowed in a literal, how would you distinguish between somebody wanting to use an escape like "\u{1F354}" and somebody else using arbitrary bytes that happen to form the same sequence?
Essentially, string literals aren’t raw bytes, but they result in specific bytes, which may include raw bytes.
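That ambiguity can be made concrete with a small sketch (assuming a recent Zig; U+1F354 is the hamburger emoji): the code point escape and explicit \x bytes compile to exactly the same byte sequence, so at the byte level there would be nothing left to distinguish them.

```zig
const std = @import("std");

test "escape and raw bytes yield identical content" {
    // The code point escape is encoded as UTF-8 by the compiler...
    const escaped = "\u{1F354}";
    // ...and these four \x bytes spell out the same UTF-8 sequence directly.
    const raw_bytes = "\xF0\x9F\x8D\x94";
    try std.testing.expectEqualSlices(u8, escaped, raw_bytes);
    try std.testing.expectEqual(@as(usize, 4), escaped.len);
}
```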

String Literals and Unicode Code Point Literals

Because Zig source code is UTF-8 encoded, any non-ASCII bytes appearing within a string literal in source code carry their UTF-8 meaning into the content of the string in the Zig program; the bytes are not modified by the compiler. It is possible to embed non-UTF-8 bytes into a string literal using \xNN notation.

Indexing into a string containing non-ASCII bytes returns individual bytes, whether valid UTF-8 or not.

By requiring you to use "\xFF" to encode arbitrary bytes, the string literal syntax retains the option to add other notations for other ways of specifying things.
That said, maybe some other literal syntax that always means raw bytes could be nice.

But then, I don’t really know where I would use it, maybe using @embedFile is more appropriate.
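To illustrate the \xFF point, here is a minimal sketch (assuming current std.unicode APIs): the escape embeds a byte that is not valid UTF-8, and indexing hands that byte back unchanged.

```zig
const std = @import("std");

test "\\xFF embeds a non-UTF-8 byte" {
    const s = "a\xFFb";
    try std.testing.expectEqual(@as(usize, 3), s.len);
    // Indexing returns raw bytes, valid UTF-8 or not.
    try std.testing.expectEqual(@as(u8, 0xFF), s[1]);
    // The string as a whole is not valid UTF-8.
    try std.testing.expect(!std.unicode.utf8ValidateSlice(s));
}
```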

I asked MS Copilot about it (I put my comments alongside):

If the language allowed arbitrary byte sequences in string literals, it would need to handle cases where those sequences do not form valid characters in the chosen encoding. [Em…ok, so what?]

This could lead to several issues: [These are probably the consequence of the previous statement]

  1. Interoperability: Text data is often shared between different systems, each of which may use a different text encoding. If a string contains an invalid sequence of bytes, it may not be interpreted correctly by these systems. [Does it mean it may break some app, say an editor, if you try to open the file?]

  2. Error detection: Invalid sequences can be the result of an error, such as a faulty network transmission or disk read. By disallowing such sequences, the language can help catch these errors early. [Does it mean it is better to invalidate bad sequences from the outset to be sure the file wasn’t corrupted?]

  3. Security: Certain byte sequences can have special meanings in some contexts, potentially leading to security vulnerabilities if they are not handled correctly. [Does it mean that some bad sequences (relative to some encoding) can corrupt the system? Like crash it?! (Perhaps, again, some application here might not just crash, but do something harmful if you open badly encoded text)]

  4. Consistency: Allowing arbitrary byte sequences could lead to inconsistencies in how strings are handled within the language. For example, some string operations may assume that all strings are valid text, and behave unexpectedly if this is not the case. [This probably means that if our lang assumed strings to be a sequence of codepoints/runes/graphemes, then iterating over a string or concatenating two strings would be a problem.]

Do you think all of my guesses are correct?

Also I tried to corrupt the file with a wrong sequence:

corrupt.zig:

const std = @import("std");

pub fn main() !void {
    const fd = try std.fs.cwd().openFile("test.zig", .{ .mode = .write_only });
    defer fd.close();
    try fd.seekTo(11); // offset 11 is the "0" placeholder in test.zig
    _ = try fd.write("\xff");
}

test.zig:

const x = "0"; // 0 is a placeholder to accommodate corrupted byte
const std = @import("std");
pub fn main() !void {
    for (0..x.len) |i| std.debug.print("{x}", .{x[i]});
}

Running test after applying corrupt.zig:

test.zig:1:12: error: expected ';' after declaration
const x = "�";
           ^
test.zig:1:13: note: invalid byte: '"'
const x = "�";

Interestingly enough, after opening it in VS Code, pressing ^S, and re-running, the result is:

efbfbd

Which is the “REPLACEMENT CHARACTER”, i.e. the � we see above. It seems there is some rule for how apps should handle invalid UTF-8 sequences by replacing them with this character (I have seen this character quite a few times before in other places).
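For what it’s worth, efbfbd is indeed the UTF-8 encoding of U+FFFD, which editors commonly substitute for invalid sequences. A quick check with std.unicode (a sketch, assuming current std APIs):

```zig
const std = @import("std");

test "efbfbd is the UTF-8 encoding of U+FFFD" {
    var buf: [4]u8 = undefined;
    // Encode the replacement character and compare against the raw bytes.
    const n = try std.unicode.utf8Encode(0xFFFD, &buf);
    try std.testing.expectEqualSlices(u8, "\xEF\xBF\xBD", buf[0..n]);
}
```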

I think I would just ignore everything until I found the pattern \u{, then require a valid ASCII sequence of length n, then }, and afterwards freedom again (of course, until the closing mark, say ", is reached).

I somewhat understand that, but only vaguely.

Again, if these \xNN patterns are processed as I described before, other arbitrary byte sequences shouldn’t be a problem.

Do you have an example where it would make sense to have that?

I dislike reading and interacting with the oftentimes generic garbage that is output by many LLMs; it feels draining and tiring. I would much rather interact with things that were produced with careful consideration and pondering. (It makes me feel like a teacher who has to deal with students giving clever-sounding non-answers and feeling smug about it.)

It is annoying to have to read everything with the thought “Is this text trying to BS me into believing some hallucination the LLM produced?”. Until LLMs add some more levels of general adversarial rounds of arguing and incorporate their own confidence in the truthfulness of the answer, I won’t be convinced that these tools are actually that great to use.

In this particular case the answer is just completely generic and not at all nuanced to the specific case of how Zig handles string literals, so it just doesn’t seem relevant at all. If I ask how a monster truck works, I am not interested in your general safety advice for operating a go-kart.


I think you need to distinguish between the string literal syntax and what it represents within the compiled program.

Essentially, the literal syntax is a coded stream of data that is valid UTF-8. Some of the codes within that stream get transformed into raw bytes, others just stay as they are, and others are treated as code points. When this coded stream is processed and compiled into the program, it results in a sequence of raw bytes, and you can do with those what you want.
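A small example of that decoding, as I understand it: transformed escapes, raw \x bytes, and plain characters all end up as ordinary bytes in the compiled result.

```zig
const std = @import("std");

test "literal syntax is decoded into raw bytes at compile time" {
    // \n is transformed (to 0x0A), \x41 is a raw byte ('A'),
    // and plain characters pass through unchanged.
    const s = "a\n\x41";
    try std.testing.expectEqualSlices(u8, &[_]u8{ 'a', 0x0A, 'A' }, s);
}
```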

3 Likes

If you expect arbitrary bytes, then you don’t have room for anything else: you either have raw bytes or you don’t.

Think of it in terms of the available alphabet: with raw bytes you have 0–255, but then you don’t have a value that means “start of an escape sequence” like \, because that would require 257 values, which doesn’t fit in a byte. You can’t use the ASCII \ because that is already in use as just some byte with a value.
Basically, you can’t treat bytes as both text and binary at the same time; they are either binary or text in some encoding.

You can do clever things like run length encoding and switch between different escape sequences, but that means you no longer have a simple raw binary string, instead you have different coded streams you can switch in and out of with multi byte escape sequences.
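The cost of reserving \ as the escape introducer shows up directly in the source: producing the escape character itself (or the delimiter) takes two characters of input for one byte of output. A minimal check:

```zig
const std = @import("std");

test "escapes cost their literal form" {
    // \ introduces escapes, so a literal backslash takes two source
    // characters to produce one byte of string content.
    try std.testing.expectEqual(@as(usize, 1), "\\".len);
    try std.testing.expectEqual(@as(u8, '\\'), "\\"[0]);
    // Likewise the delimiter " must be escaped inside a literal.
    try std.testing.expectEqual(@as(usize, 1), "\"".len);
}
```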

You could for example have something like

`0123456789ABCDEF` 

for a hex-encoded string literal.

Or you could have

<binaryBlob:1024>

as a marker that tells the compiler that the following 1024 bytes are just raw bytes.
But if you had special things like that, you would have to create editors that understand them, because otherwise they would be a pain to deal with.

1 Like

It’s very helpful to use git blame to answer these sorts of questions. In this case, this brings up

4 Likes

Tbh, I don’t tend to think of it this way. Essentially, I disagree with this statement:

I guess if you had arbitrary bytes directly, you couldn’t have special syntax for your literals

I could think of a string literal as a stream of arbitrary bytes as long as it is confined between two specific bit patterns (in our case ", or 0010 0010 to be precise). The need for \ to be interpreted differently (rather than as the arbitrary raw bit pattern 0101 1100) indeed forces you to use \\ for \ and \" for the aforementioned quote, but that just excludes a few bit patterns here and there, leaving the entire set of others on board.

I think technically it is possible to go this way, but the reason people don’t seems to be the issues listed by GPT above. First, it is indeed compatibility: if we allow some sort of rubbish between otherwise properly encoded "s, some editors might crash trying to open the file, or other issues may appear at the glyph-rendering stage. Also, as it turned out, some editors like VS Code force a proper encoding across the entire file and replace “outliers” with special bytes, which would essentially make working with such raw-bytes-string-literal files painful (or even impossible). But I might be wrong.
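For the curious, the scheme described above (arbitrary bytes between quotes, with only \\ and \" as escapes) can be sketched as a tiny scanner. This is hypothetical code for the proposal, not how Zig’s tokenizer actually works, and scanRawLiteral is a made-up name:

```zig
const std = @import("std");

/// Scan a hypothetical "raw" literal: any byte is accepted between the
/// quotes, with \ escaping the next byte (so \\ and \" work as expected).
/// Returns the literal's length including both quotes, or null if unterminated.
fn scanRawLiteral(src: []const u8) ?usize {
    if (src.len == 0 or src[0] != '"') return null;
    var i: usize = 1;
    while (i < src.len) : (i += 1) {
        switch (src[i]) {
            '\\' => i += 1, // skip the escaped byte, whatever it is
            '"' => return i + 1,
            else => {}, // any other byte, valid UTF-8 or not, is accepted
        }
    }
    return null; // ran out of input before the closing quote
}

test "scans arbitrary bytes up to the closing quote" {
    try std.testing.expectEqual(@as(?usize, 5), scanRawLiteral("\"a\xFFb\"x"));
    try std.testing.expectEqual(@as(?usize, null), scanRawLiteral("\"unterminated"));
}
```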

Don’t worry, I taught CS students myself during GPT times and I understand what you mean. However, I didn’t post it to be smug, but to verify whether what was given makes sense (and the question wasn’t originally aligned exclusively to Zig). Besides, I wouldn’t have posted it if it were complete garbage. Actually, while agreeing with It is annoying to have to read ... hallucination the LLM produced, I should admit that the answer produced above certainly makes sense and gives some hints that I myself might not have immediately guessed. On top of that, in the absence of any answer, it is better than nothing :smiling_face_with_tear:. (However, I admit again, sometimes it is better not to have any answer, as you may end up fighting with what you already had time to misunderstand.)

I didn’t mean you; I am saying some LLM answers come across like all-knowing little oracles who are really proud of themselves, and then you spend time comparing their answer to something somebody knowledgeable in the area could have produced, and it lacks context and precision. Sure, better LLMs could improve upon that; I am just saying the quality often is still lacking. They sometimes provide very good output, just not necessarily consistently.

After reading half of it, this is exactly what I was looking for! Thank you! How (or actually, which line) did you blame to find this issue?

Update: Ok, I found it: [self-hosted] source must be valid utf8. see #663. The reference to the issue and the utf8 keyword are just cut off when you use GitHub for blaming.

Then you can’t use 0010 0010 within the stream, because it ends the stream. There is always something special, some rule, etc. that makes it a coded sequence, meaning there is some code (a bunch of rules for how things are decoded). When you have such rules, I don’t call that raw byte input; it is just another kind of encoding.

1 Like

The NEL one. This also required manually hopping through a single file rename: opening the repo at the commit before the one that changed the file, manually finding tokenizer.zig, and blaming again.

As much as I love local smart editors, I must admit that I haven’t found anything even remotely close in terms of usability to (modern) GitHub’s web blame. I use the t and b shortcuts all the time!

1 Like

Sounds like you want byte literals. It would be nice to write const binary_data: []const u8 = b"@include(\"data.bin\")" for some things, but no dice.

You could use an asm file, but I’m not sure how you’d get that back into Zig.

Not sure if I understood your path with the rename and all… Could you please explain in more detail?

I just magically picked the right line that somehow was relevant to UTF-8 validation (Blaming zig/lib/std/zig/tokenizer.zig at 1e5075f81296cccd469a0829259231cb34337a02 · ziglang/zig · GitHub), and there was a link to the right issue.

Why would you need that if Zig has @embedFile?

Well, I might have just messed up: trying to reproduce, I pretty much get the right commit immediately, so I just made it more complicated than it needs to be, sorry! :slight_smile:

I didn’t know that existed. But from the description it is still a string literal that is returned, with all the restrictions (and it appends a 0 byte). If @embedFile just returned the file contents, that would be great! You could possibly even export a symbol for an icon or something that way.

I think you want it returned as a comptime array too, so you could name it in the object file. You don’t want a slice.

I think you are misreading the description: it is binary data, as an array with a statically known size, which is all you need to treat it as a blob of bytes. The extra zero at the end can just be ignored; it isn’t part of the length, it’s just a sentinel value.
It is both sentinel-terminated and explicitly sized, and you can use either, depending on what makes sense for your data.
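A string literal has the same type shape as @embedFile’s result (a pointer to a sentinel-terminated, statically sized array), so one can stand in for a sketch of how both the explicit length and the sentinel work:

```zig
const std = @import("std");

test "sentinel-terminated array, like @embedFile's result" {
    // *const [N:0]u8: N is the size, with a 0 sentinel past the end.
    const data: *const [3:0]u8 = "abc";
    try std.testing.expectEqual(@as(usize, 3), data.len); // sentinel not counted
    try std.testing.expectEqual(@as(u8, 0), data[3]); // but readable past the end
    const slice: []const u8 = data; // coerces to a plain byte slice
    try std.testing.expectEqualSlices(u8, "abc", slice);
}
```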

1 Like

As @Sze states, it’s embedded as binary data. I’ve used it quite a bit to embed deflate-compressed files and then decompress them at runtime when needed. And it’s great that you can @embedFile a module provided via your build.zig. That really allows you to do some awesome build-time generation stuff to be embedded at compile time.

There’s an example in the build system guide. In this case it treats the contents as a string, but it can be just any bytes.

2 Likes