Why do Zig and other languages not accept arbitrary byte sequences in their string literals?

I’m probably asking a somewhat silly question, but since I cannot answer it clearly myself and didn’t find the answer on the web, I’m asking here :slight_smile:

I’m reading Zig’s tokenizer (zig/lib/std/zig/tokenizer.zig at master · ziglang/zig · GitHub) and wonder why Zig (and now I wonder about other languages as well) does not accept arbitrary byte sequences in its string literals (say, between "..." on a single line). For example, why does Zig ban the U+0085 (NEL), U+2028 (LS), and U+2029 (PS) characters? Why not allow just anything, as long as that anything_byte_rubbish does not contain the closing "?

Also, Zig does not accept the NUL (0) character, but in that case it makes sense, at least because the source code is expected to be a string terminated with 0.

I guess if you had arbitrary bytes directly, you couldn’t have special syntax for your literals: if anything were allowed in a literal, how would you distinguish between somebody wanting to use an escape like "\u{1F354}" and somebody else using arbitrary bytes that happen to form the same sequence?
Essentially, string literals aren’t raw bytes, but they result in specific bytes, which may include raw bytes.
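That ambiguity can be made concrete with a small sketch (assuming a recent Zig; U+1F354 is the hamburger emoji): the code point escape and explicit \x bytes compile to exactly the same byte sequence, so at the byte level there would be nothing left to distinguish them.

```zig
const std = @import("std");

test "escape and raw bytes yield identical content" {
    // The code point escape is encoded as UTF-8 by the compiler...
    const escaped = "\u{1F354}";
    // ...and these four \x bytes spell out the same UTF-8 sequence directly.
    const raw_bytes = "\xF0\x9F\x8D\x94";
    try std.testing.expectEqualSlices(u8, escaped, raw_bytes);
    try std.testing.expectEqual(@as(usize, 4), escaped.len);
}
```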

String Literals and Unicode Code Point Literals

Because Zig source code is UTF-8 encoded, any non-ASCII bytes appearing within a string literal in source code carry their UTF-8 meaning into the content of the string in the Zig program; the bytes are not modified by the compiler. It is possible to embed non-UTF-8 bytes into a string literal using \xNN notation.

Indexing into a string containing non-ASCII bytes returns individual bytes, whether valid UTF-8 or not.

By requiring you to use "\xFF" to encode arbitrary bytes, the string literal syntax retains the option to add other notations for other ways of specifying things.
That said, maybe some other literal syntax that always means raw bytes could be nice.

But then, I don’t really know where I would use it, maybe using @embedFile is more appropriate.
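To illustrate the \xFF point, here is a minimal sketch (assuming current std.unicode APIs): the escape embeds a byte that is not valid UTF-8, and indexing hands that byte back unchanged.

```zig
const std = @import("std");

test "\\xFF embeds a non-UTF-8 byte" {
    const s = "a\xFFb";
    try std.testing.expectEqual(@as(usize, 3), s.len);
    // Indexing returns raw bytes, valid UTF-8 or not.
    try std.testing.expectEqual(@as(u8, 0xFF), s[1]);
    // The string as a whole is not valid UTF-8.
    try std.testing.expect(!std.unicode.utf8ValidateSlice(s));
}
```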

I asked MS Copilot about it (I put my comments alongside):

If the language allowed arbitrary byte sequences in string literals, it would need to handle cases where those sequences do not form valid characters in the chosen encoding. [Em…ok, so what?]

This could lead to several issues: [These are probably the consequence of the previous statement]

  1. Interoperability: Text data is often shared between different systems, each of which may use a different text encoding. If a string contains an invalid sequence of bytes, it may not be interpreted correctly by these systems. [Does it mean it may break some app, say an editor, if you try to open the file?]

  2. Error detection: Invalid sequences can be the result of an error, such as a faulty network transmission or disk read. By disallowing such sequences, the language can help catch these errors early. [Does it mean it is better to invalidate bad sequences from the outset to be sure the file wasn’t corrupted?]

  3. Security: Certain byte sequences can have special meanings in some contexts, potentially leading to security vulnerabilities if they are not handled correctly. [Does it mean that some bad sequences (relative to some encoding) can corrupt the system? Like crash it?! (Perhaps, again, some application here might not just crash, but do something harmful if you open badly encoded text)]

  4. Consistency: Allowing arbitrary byte sequences could lead to inconsistencies in how strings are handled within the language. For example, some string operations may assume that all strings are valid text, and behave unexpectedly if this is not the case. [This probably means that if our lang assumed strings to be a sequence of codepoints/runes/graphemes, then iterating over a string or concatenating two strings would be a problem.]

Do you think all of my guesses are correct?

Also I tried to corrupt the file with a wrong sequence:

corrupt.zig:

const std = @import("std");

pub fn main() !void {
    const fd = try std.fs.cwd().openFile("test.zig", .{ .mode = .write_only });
    defer fd.close();
    try fd.seekTo(11); // offset 11 is the "0" placeholder in test.zig
    _ = try fd.write("\xff");
}

test.zig:

const x = "0"; // 0 is a placeholder to accommodate corrupted byte
const std = @import("std");
pub fn main() !void {
    for (0..x.len) |i| std.debug.print("{x}", .{x[i]});
}

Running test after applying corrupt.zig:

test.zig:1:12: error: expected ';' after declaration
const x = "�";
           ^
test.zig:1:13: note: invalid byte: '"'
const x = "�";

Interestingly enough, after opening it in VS Code, pressing ^S, and re-running, the result is:

efbfbd

Which is the “REPLACEMENT CHARACTER”, i.e. the � we see above. It seems there is some rule for how apps should handle invalid UTF-8 sequences by replacing them with this character (I have seen this character quite a few times before in other places).
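For what it’s worth, efbfbd is indeed the UTF-8 encoding of U+FFFD, which editors commonly substitute for invalid sequences. A quick check with std.unicode (a sketch, assuming current std APIs):

```zig
const std = @import("std");

test "efbfbd is the UTF-8 encoding of U+FFFD" {
    var buf: [4]u8 = undefined;
    // Encode the replacement character and compare against the raw bytes.
    const n = try std.unicode.utf8Encode(0xFFFD, &buf);
    try std.testing.expectEqualSlices(u8, "\xEF\xBF\xBD", buf[0..n]);
}
```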

I think I would just ignore everything until I found the pattern \u{, then require a valid ASCII sequence of length n, then }, and afterwards freedom again (of course, until the closing mark, say ", is reached).

I somewhat understand that, but only vaguely.

Again, if these \xNN patterns are processed as I described before, other arbitrary byte sequences shouldn’t be a problem.

Do you have an example where it would make sense to have that?

I dislike reading and interacting with the oftentimes generic garbage that is output by many LLMs; it feels draining and tiring. I would much rather interact with things that were produced with careful consideration and pondering. (It makes me feel like a teacher who has to deal with students giving clever-sounding non-answers and feeling smug about it.)

It is annoying to have to read everything with the thought “Is this text trying to BS me into believing some hallucination the LLM produced?”. Until LLMs add some more levels of general adversarial rounds of arguing and incorporate their own confidence in the truthfulness of the answer, I won’t be convinced that these tools are actually that great to use.

In this particular case the answer is just completely generic and not at all nuanced to the specific case of how Zig handles string literals, so it just doesn’t seem relevant at all. If I ask how a monster truck works, I am not interested in your general safety advice for operating a go-kart.


I think you need to distinguish between the string literal syntax and what it represents within the compiled program.

Essentially, the literal syntax is a coded stream of data that is valid UTF-8. Some of the codes within that stream get transformed into raw bytes, others just stay as they are, and others are treated as code points. When this coded stream is processed and compiled into the program, it results in a sequence of raw bytes, and you can do with those what you want.
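A small example of that decoding, as I understand it: transformed escapes, raw \x bytes, and plain characters all end up as ordinary bytes in the compiled result.

```zig
const std = @import("std");

test "literal syntax is decoded into raw bytes at compile time" {
    // \n is transformed (to 0x0A), \x41 is a raw byte ('A'),
    // and plain characters pass through unchanged.
    const s = "a\n\x41";
    try std.testing.expectEqualSlices(u8, &[_]u8{ 'a', 0x0A, 'A' }, s);
}
```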

3 Likes

If you expect arbitrary bytes, then you don’t have room for anything else: you either have raw bytes or you don’t.

Think of it in terms of the available alphabet: with raw bytes you have 0–255, but then you don’t have a value that means “start of an escape sequence” like \, because that would require 257 values, which doesn’t fit in a byte. You can’t use the ASCII \ because that is already in use as just some byte with a value.
Basically, you can’t treat bytes as both text and binary at the same time; they are either binary or text in some encoding.

You can do clever things like run length encoding and switch between different escape sequences, but that means you no longer have a simple raw binary string, instead you have different coded streams you can switch in and out of with multi byte escape sequences.
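The cost of reserving \ as the escape introducer shows up directly in the source: producing the escape character itself (or the delimiter) takes two characters of input for one byte of output. A minimal check:

```zig
const std = @import("std");

test "escapes cost their literal form" {
    // \ introduces escapes, so a literal backslash takes two source
    // characters to produce one byte of string content.
    try std.testing.expectEqual(@as(usize, 1), "\\".len);
    try std.testing.expectEqual(@as(u8, '\\'), "\\"[0]);
    // Likewise the delimiter " must be escaped inside a literal.
    try std.testing.expectEqual(@as(usize, 1), "\"".len);
}
```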

You could for example have something like

`0123456789ABCDEF` 

for a hex-encoded string literal.

Or you could have

<binaryBlob:1024>

as a marker that tells the compiler that the following 1024 bytes are just raw bytes.
But if you had special things like that, you would have to create editors that understand them, because otherwise they would be a pain to deal with.

1 Like

It’s very helpful to use git blame to answer these sorts of questions. In this case, this brings up

4 Likes

Tbh, I don’t tend to think of it this way. Essentially, I disagree with this statement:

I guess if you had arbitrary bytes directly, you couldn’t have special syntax for your literals

I could think of a string literal as a stream of arbitrary bytes as long as it is confined between two specific bit patterns (in our case ", or 0010 0010 to be precise). The need for \ to be interpreted differently (rather than as the arbitrary raw bit pattern 0101 1100) indeed forces you to use \\ for \ and \" for the aforementioned quote, but that just excludes a few bit patterns here and there, leaving the entire set of others on board.

I think technically it is possible to go this way, but the reason people don’t seems to be the issues listed by GPT above. First, it is indeed compatibility: if we allow some sort of rubbish between otherwise properly encoded "s, some editors might crash trying to open the file, or other issues may appear at the glyph-rendering stage. Also, as it turned out, some editors like VS Code force a proper encoding across the entire file and replace “outliers” with special bytes, which would essentially make working with such raw-bytes-string-literal files painful (or even impossible). But I might be wrong.
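For the curious, the scheme described above (arbitrary bytes between quotes, with only \\ and \" as escapes) can be sketched as a tiny scanner. This is hypothetical code for the proposal, not how Zig’s tokenizer actually works, and scanRawLiteral is a made-up name:

```zig
const std = @import("std");

/// Scan a hypothetical "raw" literal: any byte is accepted between the
/// quotes, with \ escaping the next byte (so \\ and \" work as expected).
/// Returns the literal's length including both quotes, or null if unterminated.
fn scanRawLiteral(src: []const u8) ?usize {
    if (src.len == 0 or src[0] != '"') return null;
    var i: usize = 1;
    while (i < src.len) : (i += 1) {
        switch (src[i]) {
            '\\' => i += 1, // skip the escaped byte, whatever it is
            '"' => return i + 1,
            else => {}, // any other byte, valid UTF-8 or not, is accepted
        }
    }
    return null; // ran out of input before the closing quote
}

test "scans arbitrary bytes up to the closing quote" {
    try std.testing.expectEqual(@as(?usize, 5), scanRawLiteral("\"a\xFFb\"x"));
    try std.testing.expectEqual(@as(?usize, null), scanRawLiteral("\"unterminated"));
}
```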

Don’t worry, I taught CS students myself during GPT times and I understand what you mean. However, I didn’t post it to be smug, but to verify whether what was given makes sense (and the question wasn’t originally aligned exclusively to Zig). Besides, I wouldn’t have posted it if it were complete garbage. Actually, while agreeing with It is annoying to have to read ... hallucination the LLM produced, I should admit that the answer produced above certainly makes sense and gives some hints that I myself might not have immediately guessed. On top of that, in the absence of any answer, it is better than nothing :smiling_face_with_tear:. (However, I admit again, sometimes it is better not to have any answer, as you may end up fighting with what you already had time to misunderstand.)

I didn’t mean you; I am saying some LLM answers come across like all-knowing little oracles who are really proud of themselves, and then you spend time comparing their answer to something somebody knowledgeable in the area could have produced, and it lacks context and precision. Sure, better LLMs could improve upon that; I am just saying the quality often is still lacking. They sometimes provide very good output, just not necessarily consistently.

After reading half of it, this is exactly what I was looking for! Thank you! How (or actually, which line) did you blame to find this issue?

Update: Ok, I found it: [self-hosted] source must be valid utf8. see #663. The reference to the issue and the utf8 keyword are just cut off when you use GitHub for blaming.

Then you can’t use 0010 0010 within the stream, because it ends the stream. There is always something special, some rule, etc. that makes it a coded sequence, meaning there is some code (a bunch of rules for how things are decoded). When you have such rules, I don’t call that raw byte input; it is just another kind of encoding.

1 Like

The NEL one. This also required manually hopping through a single file rename: opening the repo at the commit before the one that changed the file, manually finding tokenizer.zig, and blaming again.

As much as I love local smart editors, I must admit that I haven’t found anything even remotely close in terms of usability to (modern) GitHub’s web blame. I use the t and b shortcuts all the time!

1 Like

Sounds like you want byte literals. It would be nice to write const binary_data: []const u8 = b"@include(\"data.bin\")" for some things, but no dice.

You could use an asm file, but I’m not sure how you’d get that back into Zig.

Not sure if I understood your path with the rename and all… Could you please explain in more detail?

I just magically picked the right line that somehow was relevant to UTF-8 validation (Blaming zig/lib/std/zig/tokenizer.zig at 1e5075f81296cccd469a0829259231cb34337a02 · ziglang/zig · GitHub), and there was a link to the right issue.

Why would you need that if Zig has @embedFile?

Well, I might have just messed up: trying to reproduce, I pretty much get the right commit immediately, so I just made it more complicated than it needs to be, sorry! :slight_smile:

I didn’t know that existed. But from the description it is still a string literal that is returned, with all the restrictions (and it appends a 0 byte). If @embedFile just returned the file contents, that would be great! You could possibly even export a symbol for an icon or something that way.

I think you want it returned as a comptime array too, so you could name it in the object file. You don’t want a slice.

I think you are misreading the description: it is binary data, as an array with a statically known size, which is all you need to treat it as a blob of bytes. The extra zero at the end can just be ignored; it isn’t part of the length, it’s just a sentinel value.
It is both sentinel-terminated and explicitly sized, and you can use either, depending on what makes sense for your data.
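A string literal has the same type shape as @embedFile’s result (a pointer to a sentinel-terminated, statically sized array), so one can stand in for a sketch of how both the explicit length and the sentinel work:

```zig
const std = @import("std");

test "sentinel-terminated array, like @embedFile's result" {
    // *const [N:0]u8: N is the size, with a 0 sentinel past the end.
    const data: *const [3:0]u8 = "abc";
    try std.testing.expectEqual(@as(usize, 3), data.len); // sentinel not counted
    try std.testing.expectEqual(@as(u8, 0), data[3]); // but readable past the end
    const slice: []const u8 = data; // coerces to a plain byte slice
    try std.testing.expectEqualSlices(u8, "abc", slice);
}
```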

1 Like

As @Sze states, it’s embedded as binary data. I’ve used it quite a bit to embed deflate-compressed files and then decompress them at runtime when needed. And it’s great that you can @embedFile a module provided via your build.zig. That really allows you to do some awesome build-time generation stuff to be embedded at compile time.

There’s an example in the build system guide. In this case it treats the contents as a string, but it can be just any bytes.

2 Likes