Zig’s lack of a string type & invalid values

squeek502 · August 1, 2024, 10:06pm

I don’t think this is the correct way to look at it. Instead, I’d say string literals in Zig are arbitrary sequences of bytes, and Zig makes it convenient to create UTF-8 encoded string literals.
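To make that concrete, here is a minimal sketch (assuming a reasonably recent Zig; the test name and constants are only illustrative). It shows that a literal holding non-UTF-8 bytes has exactly the same kind of type as one holding text, and only validation tells them apart:

```zig
const std = @import("std");

test "string literals are just bytes" {
    // A literal containing ordinary UTF-8 text.
    const text = "héllo";
    // A literal containing bytes that are not valid UTF-8.
    const raw = "\xff\xfe\x01";

    // Both are pointers to null-terminated, fixed-size byte arrays.
    try std.testing.expect(@TypeOf(raw) == *const [3:0]u8);
    try std.testing.expect(text.len == 6); // 'é' occupies two bytes in UTF-8

    // Only an explicit check distinguishes the two.
    try std.testing.expect(std.unicode.utf8ValidateSlice(text));
    try std.testing.expect(!std.unicode.utf8ValidateSlice(raw));
}
```

Nothing in the type system records that `text` happens to be UTF-8; the source file being UTF-8 is what makes the common case convenient.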

For example, take @embedFile:

@embedFile(comptime path: []const u8) *const [N:0]u8
This function returns a compile time constant pointer to null-terminated, fixed-size array with length equal to the byte count of the file given by path. The contents of the array are the contents of the file. This is equivalent to a string literal with the file contents.

I think an interesting question to consider for a theoretical string type is: should it be used for file paths? I think this question gets at the fundamental difficulty of a string type, even if it’s just a []const u8 wrapper.

Many people think of paths as strings in the “printable” sense, but that is not the case: on POSIX systems they are arbitrary byte sequences and on Windows they are arbitrary u16 sequences (see here for details on how Zig handles this). This means that a string is fundamentally the incorrect type to use for paths.

This is a trap that Odin seems to fall into, meaning that its APIs either can’t handle all paths, or any user actually treating the fullpath/name from a File_Info as a string (i.e. using for in on it to iterate “runes”) is inadvertently introducing incorrect behavior.
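As a rough illustration of the same trap in Zig terms, the sketch below tries to iterate a path as UTF-8 code points. `countCodepoints` is a hypothetical helper written for this post, not something from the standard library; it only relies on `std.unicode.Utf8View`, which refuses to construct a view over invalid UTF-8:

```zig
const std = @import("std");

/// Hypothetical helper: returns the number of code points in `path`,
/// or null if `path` is not valid UTF-8 and so cannot be iterated as text.
fn countCodepoints(path: []const u8) ?usize {
    const view = std.unicode.Utf8View.init(path) catch return null;
    var it = view.iterator();
    var n: usize = 0;
    while (it.nextCodepoint()) |_| n += 1;
    return n;
}

test "not every path can be iterated as UTF-8" {
    // "café.txt": 9 bytes, 8 code points.
    try std.testing.expectEqual(@as(?usize, 8), countCodepoints("caf\xc3\xa9.txt"));
    // A perfectly legal POSIX file name that is not valid UTF-8.
    try std.testing.expect(countCodepoints("\xff\xff\xff\xff") == null);
}
```

Any code that treats a path as text has to decide what to do in the `null` case, which is the part that “path is a string” APIs tend to paper over.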

So, let’s say you had a struct like:

struct {
    path: []const u8,
    foo: u32,
}

you’d “want” path to be automatically formatted as a string by std.fmt, but there is no canonical/portable way to format an arbitrary path as valid UTF-8. Invalid UTF-8 sequences can be converted into � using a variety of algorithms, but the user cannot ever use that output to reconstruct the actual path.
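Here is a small sketch of why the lossy output can’t be used to get the path back. `displayLossy` is a hypothetical helper with a deliberately crude replacement rule (every non-ASCII byte of an invalid path becomes U+FFFD); it is not the “maximal subparts” algorithm and not what any std function does, but the information loss is the same in kind:

```zig
const std = @import("std");

/// Hypothetical, deliberately crude lossy conversion for display purposes.
/// Valid UTF-8 is returned unchanged; otherwise every non-ASCII byte is
/// replaced by U+FFFD. Assumes `buf.len >= 3 * path.len`.
fn displayLossy(buf: []u8, path: []const u8) []const u8 {
    if (std.unicode.utf8ValidateSlice(path)) return path;
    const replacement = "\xEF\xBF\xBD"; // U+FFFD encoded as UTF-8
    var len: usize = 0;
    for (path) |byte| {
        if (byte < 0x80) {
            buf[len] = byte;
            len += 1;
        } else {
            for (replacement) |r| {
                buf[len] = r;
                len += 1;
            }
        }
    }
    return buf[0..len];
}

test "two different paths, one displayed string" {
    var buf_a: [16]u8 = undefined;
    var buf_b: [16]u8 = undefined;
    const a = displayLossy(&buf_a, "a\xffb");
    const b = displayLossy(&buf_b, "a\xfeb");
    // The paths differ, but what the user sees is identical.
    try std.testing.expectEqualStrings(a, b);
    try std.testing.expectEqualStrings("a\xEF\xBF\xBDb", a);
}
```

Once two distinct paths print the same way, nothing the user copies out of that output can reliably open the file again.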

For Zig, in #19005 I added std.fs.path.fmtAsUtf8Lossy and std.fs.path.fmtWtf16LeAsUtf8Lossy. Here’s the fmtAsUtf8Lossy doc comment:

/// Format a path encoded as bytes for display as UTF-8.
/// Returns a Formatter for the given path. The path will be converted to valid UTF-8
/// during formatting. This is a lossy conversion if the path contains any ill-formed UTF-8.
/// Ill-formed UTF-8 byte sequences are replaced by the replacement character (U+FFFD)
/// according to "U+FFFD Substitution of Maximal Subparts" from Chapter 3 of
/// the Unicode standard, and as specified by https://encoding.spec.whatwg.org/#utf-8-decoder

However, that may not be the way you want to print paths depending on your use case. For example, ls on Linux prints them shell-escaped:

$ touch `echo 'FF FF FF FF' | xxd -r -p`
$ ls
''$'\377\377\377\377'
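For comparison, an escaping approach keeps the output reversible. The sketch below is a hypothetical `escapeOctal` helper that writes printable ASCII as-is and everything else as a three-digit octal escape. It only imitates the escapes inside the quotes above; it does not attempt to reproduce ls’s actual quoting rules:

```zig
const std = @import("std");

/// Hypothetical helper: copies printable ASCII through unchanged and writes
/// every other byte (and the backslash itself) as `\ooo` in octal.
/// Assumes `buf.len >= 4 * path.len`.
fn escapeOctal(buf: []u8, path: []const u8) []const u8 {
    var len: usize = 0;
    for (path) |byte| {
        if (byte >= 0x20 and byte < 0x7f and byte != '\\') {
            buf[len] = byte;
            len += 1;
        } else {
            buf[len] = '\\';
            buf[len + 1] = '0' + (byte >> 6);
            buf[len + 2] = '0' + ((byte >> 3) & 7);
            buf[len + 3] = '0' + (byte & 7);
            len += 4;
        }
    }
    return buf[0..len];
}

test "escaped output distinguishes different paths" {
    var buf_a: [32]u8 = undefined;
    var buf_b: [32]u8 = undefined;
    try std.testing.expectEqualStrings(
        "\\377\\377\\377\\377",
        escapeOctal(&buf_a, "\xff\xff\xff\xff"),
    );
    // Unlike a U+FFFD replacement, 0xFF and 0xFE stay distinguishable.
    const a = escapeOctal(&buf_a, "a\xffb");
    const b = escapeOctal(&buf_b, "a\xfeb");
    try std.testing.expect(!std.mem.eql(u8, a, b));
}
```

The trade-off is readability: valid non-ASCII names would also come out as escapes with this particular rule, which is why which formatting you want really does depend on the use case.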

I’m mostly just rambling at this point, but the point I’m trying to get at is something like: a wrapper around []const u8 without guarantees about UTF-8 encoding seems like it’d inherit all the same complications: you think you can print a string, but you can’t, really; you think you can iterate over a string as UTF-8, but you can’t, really.