Zig’s lack of a string type & invalid values

lunacookies · July 30, 2024, 3:19pm

Hi everyone! I was reading a comment on here giving a rationale for why Zig doesn’t have a string type built into the language. I got carried away writing a response and ended up writing a post on my blog instead. I’d love to hear your thoughts!

mnemnion · July 30, 2024, 10:31pm

It’s a thoughtful post! Regarding the String Question, I have little to add to what I said in that thread, other than to point out @dude_the_builder’s very nice string library.

Enums: Rust’s enum is more like Zig’s tagged union type combined with its enum type. In Rust you can have an enum like you describe, closer to C enums in spirit, or your enums can have a payload type. In Zig these are different things. I’ll stick to the Zig side of the equation for simplicity.

There are good reasons for enums to have invalid values, one of which is exhaustive switching. The compiler will ensure that you cover every possible value of an enum when you switch on it, and if a new value is added, and you didn’t use an else branch, you’ll be compelled to go back and add the new value to each switch statement. That’s a valuable property!

But enums don’t have to be exhaustive, you can just say

const Birbs = enum(u8) {
   robin,
   oriole,
   crow,
   _,
};

And now every possible octet is a potential Birb. You still get exhaustive switching over the named values: if I add .eagle, then switch statements which don’t use else have to add it. The difference is that there’s a distinct switch prong, _, which covers every unnamed value the integer type can carry.
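A minimal sketch of what that looks like in a switch, assuming a reasonably recent compiler (0.13 or so). The describe function and the value 200 are made up for illustration:

```zig
const std = @import("std");

// Non-exhaustive enum: the trailing `_` makes every u8 a valid Birbs value.
const Birbs = enum(u8) {
    robin,
    oriole,
    crow,
    _,
};

// Hypothetical helper, only here to show the switch.
fn describe(b: Birbs) []const u8 {
    // Every named tag has its own prong and `_` covers the unnamed values.
    // Adding `.eagle` later makes this switch a compile error until it gets
    // a prong; an `else` prong would quietly absorb the new tag instead.
    return switch (b) {
        .robin => "robin",
        .oriole => "oriole",
        .crow => "crow",
        _ => "some unnamed birb",
    };
}

pub fn main() void {
    const mystery: Birbs = @enumFromInt(200);
    std.debug.print("{s}, {s}\n", .{ describe(.crow), describe(mystery) });
}
```

Swapping `_` for `else` still compiles, but then new named tags are no longer flagged.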

But in many cases there’s no advantage to this. If I have the classic .spade, .heart, .club, .diamond enum, well I might want to add .joker later, but eventually all the cards are in the pack, and if some integer value greater than 4 gets in there somehow, it’s a bug.

The reason I appreciate and prefer Zig’s slice-o-bytes approach to strings is that there are too many invariants, and some of them are incompatible with each other. Strings get really complex!

Sometimes, frequently even, the invariant a string has to uphold is “this string is a validated email address from the database” as opposed to “this string is user input which is supposedly an email but we haven’t checked yet”. Which is why I think that distinct types of some sort would be a strong addition to the language. No one’s worked out a really tight proposal there, though, and making a struct which holds a slice as a member is adequate and might be all we actually need.
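Roughly what the struct-holding-a-slice version could look like. This is only a sketch: the type names, sendWelcome, and the toy “contains an @ somewhere in the middle” check are all invented for illustration, and a real validator would be much stricter.

```zig
const std = @import("std");

// Hypothetical wrapper for input that nobody has looked at yet.
const UncheckedEmail = struct { bytes: []const u8 };

// Hypothetical wrapper that can only be obtained by going through parse().
const ValidatedEmail = struct {
    bytes: []const u8,

    // Toy check: there must be an '@' with something on both sides.
    fn parse(input: UncheckedEmail) ?ValidatedEmail {
        const at = std.mem.indexOfScalar(u8, input.bytes, '@') orelse return null;
        if (at == 0 or at + 1 == input.bytes.len) return null;
        return ValidatedEmail{ .bytes = input.bytes };
    }
};

fn sendWelcome(to: ValidatedEmail) void {
    std.debug.print("welcome, {s}\n", .{to.bytes});
}

pub fn main() void {
    const raw = UncheckedEmail{ .bytes = "sarah@example.com" };
    // sendWelcome(raw); // does not compile: UncheckedEmail is not ValidatedEmail
    if (ValidatedEmail.parse(raw)) |ok| sendWelcome(ok);
}
```

Both structs are the same two words in memory; the only thing that differs is the name the compiler checks.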

I actually agree with you that Zig should make zeroed memory available through the allocator interface. I don’t think that the API should privilege that value, but as your linked post points out, modern hosted environments generally have cheap zero pages, and it’s an important optimization for some kinds of code.

Don’t agree at all that zeroInit should be a default, though. For one thing, pre-zeroed memory only applies to the heap, and only sometimes: the stack will very seldom already be zeroed out, and if the allocator is returning memory which was recently freed, that won’t be zeroed either. If we write something like a buffer which is declared inside of a while loop, then those semantics require the buffer to be zeroed out for each pass on the while loop, which is no bueno performance-wise.
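To make the loop case concrete, here is a small sketch. The function name, the 4 KiB size, and the message are arbitrary choices for illustration:

```zig
const std = @import("std");

// Hypothetical helper: formats a short message into a scratch buffer on each pass.
fn formatPasses(passes: usize) usize {
    var total: usize = 0;
    var pass: usize = 0;
    while (pass < passes) : (pass += 1) {
        // Status quo: release builds write nothing here, the bytes are simply
        // undefined until filled (Debug builds fill them with 0xaa).
        var scratch: [4096]u8 = undefined;
        // Zero-init as the default would mean a 4 KiB memset on every pass:
        // var scratch = [_]u8{0} ** 4096;
        const msg = std.fmt.bufPrint(&scratch, "pass {d}", .{pass}) catch unreachable;
        total += msg.len;
    }
    return total;
}

pub fn main() void {
    std.debug.print("{d} bytes formatted\n", .{formatPasses(3)});
}
```

Only the first few bytes of the buffer are ever read, so zeroing all of it each time around would be pure overhead in this case.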

I don’t think you’ve made a good case that optionals are bad for performance. A ?*Something type has to be unwrapped before you use it, yes, but that’s the same thing as checking if a pointer is NULL in any of the billion dollar mistake languages, it turns into the same machine code. The difference is that you never have to check a *Something before you use it. If it was initialized as undefined and you try and use it, that’s safety-checked illegal behavior, but that’s a fairly shallow bug in most cases. Defensive NULL checks are all over responsible C code, because you kind of have to, since the type system won’t do it for you.

Zero can be a useful value, yes, but it’s just a value, I’ve never seen the point in trying to make it special. That’s definitely not worth giving up null-safety for.

squeek502 · July 30, 2024, 10:48pm

By the time I got to “A practical motivation”, I was ready to be convinced, but I feel like that section was way too short/vague for me to understand what practical benefits a string type could have.

I’m not super familiar with Odin, so I’m only going off its documentation, but as far as I can tell:

string is a very small wrapper around what Zig would call []const u8: “the odin string type is just a rawptr + len.”
string is UTF-8 by convention, but there’s no enforcement (?)
for in on a string iterates codepoints/runes: “When iterating a string, the characters will be runes and not bytes. for in assumes the string is encoded as UTF-8.” (unsure what happens if the assumption is wrong), but indexing operates on bytes

The benefits I can see:

Some convenience, e.g. for in instead of std.unicode.Utf8Iterator for codepoint iteration
string being a distinct type, meaning taking a string over a []u8 communicates some amount of information (but AFAICT it does not communicate “this is/must be UTF-8”; more “this probably is/probably should be UTF-8”)

Without more detail, those don’t seem super compelling to me, since they sidestep the hard parts of string handling (for example, “String values are comparable and ordered, lexically byte-wise.”). I feel like the distinct type part could be compelling, though, and that’s what I was hoping would be explored.

lunacookies · July 31, 2024, 12:05am

Thank you!

I agree completely, this is super useful. Maybe I should’ve expanded on this in the post, but you don’t need invalid values to have exhaustive switching. For example, unless you opt out with #partial switch, Odin’s compiler forces you to have a case clause for each enum variant, even in the presence of a catch-all case clause.

I agree here, too: I think trying to perfectly uphold invariants like “this string is always valid UTF-8” or “this string has been normalized” in the type system is doomed to fail in languages like C, Zig & Odin, so you’d be better off not trying to. As you mention, distinct types can still be helpful as guardrails you can easily step over when necessary, without making invariant violations outright UB. This is exactly what Odin’s string is: a distinct wrapper around []u8 that imposes no extra invariants.

One thing you might not have considered is that having distinct types in the language proper means that they can inherit all the operations you can perform on the base type. Go has distinct types and uses them to create clear, ergonomic, representation-oriented APIs like this:

package time

type Duration int64

const (
	Nanosecond  Duration = 1
	Microsecond          = 1000 * Nanosecond
	Millisecond          = 1000 * Microsecond
	Second               = 1000 * Millisecond
	Minute               = 60 * Second
	Hour                 = 60 * Minute
)

package main

import (
	"fmt"
	"time"
)

func main() {
	// it just works, no operator overloading necessary!
	d1 := 5 * time.Second
	d2 := time.Second / 4
	d3 := 250 * time.Millisecond
	sum := d1 + d2 + d3
	fmt.Println(sum) // => 5.5s
}

This is a good point, I’ll admit: zero initialization is the “natural” solution only when working with manually-created virtual memory mappings. In my experience, though, zero initialization is what you want such a high percentage of the time that I’m happy to impose this (straightforward, hardware-oriented) requirement in situations where it doesn’t arise naturally.

As a side note, one approach I like is to make my allocators zero out allocations the moment they’re freed. This way all vacant space is always zeroed regardless of whether it’s been reused or not. This helps reduce the impact of use-after-frees and is a nice code simplification to boot! Apple’s malloc implementation does this.

This is true, which is why languages like Odin let you opt out of zero initialization if you need to. I can’t help but mention, though, that a proposal to zero initialize stack variables in C++ found that there was a performance impact of less than 0.5%, so I think this doesn’t really matter in practice.

What I was getting at here is that zero initialization can be a performance improvement because you don’t need to explicitly go and write values into newly-allocated memory, it’s just ready to use as-is. Optionals mean you can’t use zero initialization without causing UB, so you miss out on this potential optimization. I will admit, though, that this is vague and would need to be qualified by some actual measurements to be a solid argument (which I’d argue also goes for your point about not having optionals leading to useless null checks!).

lunacookies · July 31, 2024, 1:02am

This is correct, and is the crucial point here.

To be honest there isn’t an earth-shattering benefit here or anything. Adding string as a distinct alias to []u8 achieves one thing: it communicates to the programmer and compiler that hey, this slice of bytes should be treated as a UTF-8 string, not a sequence of integers! It does not make the type system force you to uphold invariants in the way that Rust’s str does.

The readability benefit to the programmer that comes from the extra source-level information is debatable, so I won’t cover it here. The key thing to focus on here is that the compiler now knows you’re dealing with a string, which means metaprogramming-ish things know the difference between strings and slices of 8-bit integers.

In Zig, these sorts of things have no way of telling the two apart. As a result, std.fmt.format always interprets []const u8 as an integer sequence when using the default struct formatting, while std.json always interprets []const u8 as a string:

const std = @import("std");

const Person = struct {
    name: []const u8,
    age: u32,
};

const SomeNumbers = struct {
    eight_bit_numbers: []const u8,
};

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    const allocator = arena.allocator();

    {
        const person = Person{ .name = "Sarah", .age = 92 };
        const json = try std.json.stringifyAlloc(allocator, person, .{});
        std.debug.print("{}\n", .{person});
        std.debug.print("{s}\n", .{json});
    }

    {
        const numbers = SomeNumbers{ .eight_bit_numbers = &[_]u8{ 1, 2, 3, 4 } };
        const json = try std.json.stringifyAlloc(allocator, numbers, .{});
        std.debug.print("{}\n", .{numbers});
        std.debug.print("{s}\n", .{json});
    }
}

$ zig run main.zig
main.Person{ .name = { 83, 97, 114, 97, 104 }, .age = 92 }
{"name":"Sarah","age":92}
main.SomeNumbers{ .eight_bit_numbers = { 1, 2, 3, 4 } }
{"eight_bit_numbers":"\u0001\u0002\u0003\u0004"}

Ouch. The equivalent Odin program does what you’d expect:

package main

import "core:encoding/json"
import "core:fmt"

Person :: struct {
	name: string,
	age: int,
}

Some_Numbers :: struct {
	eight_bit_numbers: []u8,
}

main :: proc() {
	{
		person := Person{name = "Sarah", age = 92}
		json, _ := json.marshal(person)
		fmt.println(person)
		fmt.println(string(json))
	}

	{
		numbers := Some_Numbers{{1, 2, 3, 4}}
		json, _ := json.marshal(numbers)
		fmt.println(numbers)
		fmt.println(string(json))
	}
}

$ odin run main.odin -file
Person{name = "Sarah", age = 92}
{"name":"Sarah","age":92}
Some_Numbers{eight_bit_numbers = [1, 2, 3, 4]}
{"eight_bit_numbers":[1,2,3,4]}

Edit: I’ve added the above examples to the post.

chrboesch · July 31, 2024, 7:42pm

I also find the Zig version very clear:

const std = @import("std");
const print = std.debug.print;
const duration = std.fmt.fmtDuration;

const time = struct {
    const Nanosecond: i64 = 1;
    const Microsecond: i64 = 1000 * time.Nanosecond;
    const Millisecond: i64 = 1000 * time.Microsecond;
    const Second: i64 = 1000 * time.Millisecond;
    const Minute: i64 = 60 * time.Second;
    const Hour: i64 = 60 * time.Minute;
};

pub fn main() !void {
    const d1 = 5 * time.Second;
    const d2 = time.Second / 4;
    const d3 = 250 * time.Millisecond;
    const sum = d1 + d2 + d3;

    print("d1 = {d}\n", .{duration(d1)});
    print("d2 = {d}\n", .{duration(d2)});
    print("d3 = {d}\n", .{duration(d3)});
    print("sum = {d}\n", .{duration(sum)});
}

d1 = 5s
d2 = 250ms
d3 = 250ms
sum = 5.5s

kristoff · July 31, 2024, 8:36pm

Just to point out one thing about formatting strings in Zig: I agree that the current behavior of std.fmt is problematic and that having a distinct type for strings would be one way of making it better, but some here might remember that before the most recent rewrite, fmt did not have issues with strings as it does now.

This is just to point out that having a string type is not the only way to fix fmt’s behavior and that maybe some of the dissatisfaction that people are feeling is tied to that more than anything else.

Lastly, in my opinion creating a dedicated type with suggested, but neither clearly defined nor enforced, invariants is just a way of giving people misleading verbiage to produce the software equivalent of comedy (i.e. bugs).

As for “full-fledged” string types, I think this blog post sums it up: The string type is broken - Musing Mortoray

So in conclusion IMO the best one can do is pretty much Zig’s status quo (modulo the behavior of fmt that could be better).

lunacookies · August 1, 2024, 11:50am

The one thing I can say is that that code doesn’t actually create a new type for durations, meaning it doesn’t protect against accidentally passing a duration where an i64 is expected, or vice versa. I’ll admit that I didn’t even consider implementing something like this in a language without distinct types because of this. Mentally I was comparing it against something with a Duration struct that has add, sub, etc. methods. You’re totally right, point taken.

dude_the_builder · August 1, 2024, 1:47pm

A programming language that decides to add a string type to its type system is opting in to putting on a straitjacket. It’s opting in to imposing opinions such as:

UTF-8 is the best encoding.
Indexing should be at the byte level.
The length is in bytes not code points or grapheme clusters.
Iterating should be at the code point level.

These are all opinionated choices by a language designer, not any established rules defined by Unicode, ASCII, or any other scheme.

As I see it, Zig is saying: “Nope, not going there.” You just have a slice of bytes, and it’s you, as the programmer, who gets to make decisions as to how to treat those bytes. For example, std.unicode gives you tools to validate those bytes as UTF-8, encode and decode, convert between UTF-8 and UTF-16, and iterate over the bytes as code points using different “views” into the raw bytes. This is a wise approach for a low-level language because it doesn’t impose anything on the programmer. It gives you the tools to work with those bytes as you see fit, but at the language level, just as in memory, they’re still just bytes.

Here’s a little test for your language of choice’s handling of strings. See what printing, iterating, and getting the length for this string produces:

"\u{1F469}\u{1F3FF}\u{200D}\u{1F680}" // 👩🏿‍🚀

In Zig, you only have the slice of bytes, so there’s no way to get confused or misled as to what you’re iterating over or what length means.
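For reference, a minimal sketch of running that test in Zig with std.unicode. The constant name and the countCodepoints helper are made up for illustration. The literal is 15 bytes of UTF-8 (4 + 4 + 3 + 4) encoding 4 code points, which render as a single grapheme cluster; as far as I know, counting grapheme clusters needs a third-party library rather than std.

```zig
const std = @import("std");

const astronaut = "\u{1F469}\u{1F3FF}\u{200D}\u{1F680}";

// Hypothetical helper: counts code points by walking a UTF-8 view.
// Fails with an error if the bytes are not valid UTF-8.
fn countCodepoints(bytes: []const u8) !usize {
    var it = (try std.unicode.Utf8View.init(bytes)).iterator();
    var n: usize = 0;
    while (it.nextCodepoint()) |_| n += 1;
    return n;
}

pub fn main() !void {
    // .len on the literal is always a byte count.
    std.debug.print("len in bytes: {d}\n", .{astronaut.len});
    // Code points only show up when you explicitly ask for that view.
    std.debug.print("code points:  {d}\n", .{try countCodepoints(astronaut)});
}
```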

lunacookies · August 1, 2024, 3:51pm

I feel like we’re talking past one another. To me a lot of the arguments being presented don’t really fit with what I’m proposing.

Isn’t that the case with existing Zig code?

I imagine that a very high percentage of []const u8 uses across all Zig code are for strings, so I don’t think it’s a stretch to say that people assume them to be strings in the absence of surrounding context. []const u8 is Zig’s de-facto string type, one which doesn’t enforce the (I think, common) invariant that its contents are UTF-8 encoded.

With the exception of the last point, all of these apply to Zig. With the exception of \xNN escape sequences, string literals in Zig are UTF-8 encoded, indexed byte-by-byte, and measure their length in bytes. Several factors combine to make this happen:

the language mandates that source files are encoded as UTF-8
the contents of string literals aren’t modified by the compiler (except for escape sequence processing)
string literals have type *const [n:0]u8 which coerces to []const u8

What I’m proposing is to add a string type which is a distinct wrapper around []const u8 (I realize that this ignores the fixed length and null-termination of string literals as they are now, go with me here!). The semantics of string are identical to []const u8. It exists purely as “documentation” for the compiler (see: std.json and std.fmt) and for programmers.

UTF-8, byte-indexed strings are already special-cased by the language. Adding a string type like the one I’m suggesting admittedly increases the amount of special-casing, but only a bit.

Also: before anyone gets the wrong idea I should clarify that this isn’t an actual serious proposal I’m making to change Zig. I just like discussing language design and am curious about others’ points of view :‌D

Sze · August 1, 2024, 6:17pm

But that’s just it: []const u8 is arbitrary bytes. You can’t just say “no, let’s assume it’s UTF-8 encoded”. Personally, I am fine with leaving it up to the programmer to choose what those bytes contain and then deal with it; it is an application-specific choice.

Having Zig give you a UTF8bytes type for string literals doesn’t seem like it gives you much, additionally you would now need a byteliteral syntax which then gives you the original []const u8.

If Zig adds some possibility to have distinct types, then I think it would be good to have a distinct type that is a valid UTF-8 encoded string in the standard library, but personally I don’t really want that type to be a special builtin type.

If we had a type like that, I would want it to be implemented with language features, so that other libraries can create similar types, which then for example add specific iterators that operate on grapheme-clusters etc.

Just adding one hard-coded distinct type seems more like a hack to me, giving up on finding a good solution that is expressible in user space and instead giving the user a half baked, builtin solution. Personally I think it wouldn’t give enough benefits and isn’t a satisfying solution to me.

You already can do something like this, if you mostly care about the default printing in nested types:

const std = @import("std");

const String = struct {
    str: []const u8,

    pub fn init(str: []const u8) String {
        return .{ .str = str };
    }

    pub fn format(
        self: String,
        comptime _: []const u8,
        _: std.fmt.FormatOptions,
        writer: anytype,
    ) !void {
        try writer.print("{s}", .{self.str});
    }
};

pub fn main() !void {
    const str = "Hello World";
    const str2 = String.init(str);

    std.debug.print("{any}\n", .{str});
    std.debug.print("{any}\n", .{str2});
}

I think because String has only one field, it should be pretty similar to passing around a []const u8 directly, in terms of performance and you could for example add UTF-8 validation of the content in init in debug and safe mode.
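The validation idea could look something like this. It is a sketch only: it assumes std.debug.runtime_safety is the right switch for “Debug and ReleaseSafe”, and panicking is just one possible policy (returning an error would be another).

```zig
const std = @import("std");

const String = struct {
    str: []const u8,

    pub fn init(str: []const u8) String {
        // Only pay for the check in safety-checked build modes
        // (Debug and ReleaseSafe); in other modes it compiles away.
        if (std.debug.runtime_safety and !std.unicode.utf8ValidateSlice(str)) {
            @panic("String.init: contents are not valid UTF-8");
        }
        return .{ .str = str };
    }
};

pub fn main() void {
    const ok = String.init("Hello World");
    std.debug.print("{s}\n", .{ok.str});
    // String.init("\xff\xfe") would panic in Debug and ReleaseSafe builds.
}
```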

What I would find potentially interesting, would be if we could create custom string literal types, something like this:

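pub fn u(comptime literal: []const u8) String {
    // verify that it's UTF-8 at comptime, else @compileError
    if (comptime !std.unicode.utf8ValidateSlice(literal)) @compileError("literal is not valid UTF-8");
    return String.init(literal);
}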
pub fn main() !void {
    const str = u"Hello World"; // would result in an instance of String.init("Hello World")
    std.debug.print("{any}\n", .{str});

    _ = scriptlang.Script
        \\puts "Hello World!"
        ;

    _ = std.json.JSONC
        \\ {
        \\    key: "Value",  // json with comments
        \\ }
        ;
}

But this is a half thought through idea, so it is quite possible that a bunch of things would have to be different about it.
Something like that in combination with distinct types, could allow for quite a bit of library implemented verification and ensuring of invariants.

squeek502 · August 1, 2024, 10:06pm

I don’t think this is the correct way to look at it. Instead, I’d say string literals in Zig are arbitrary sequences of bytes, and Zig makes it convenient to create UTF-8 encoded string literals.

For example, take @embedFile:

@embedFile(comptime path: []const u8) *const [N:0]u8
This function returns a compile time constant pointer to null-terminated, fixed-size array with length equal to the byte count of the file given by path. The contents of the array are the contents of the file. This is equivalent to a string literal with the file contents.

I think an interesting question to consider for a theoretical string type is: should it be used for file paths? I think this question gets at the fundamental difficulty of a string type, even if it’s just a []const u8 wrapper.

Many people think of paths as strings in the “printable” sense, but that is not the case—on POSIX systems they are arbitrary byte sequences and on Windows they are arbitrary u16 sequences (see here for details on how Zig handles this). This means that string is fundamentally the incorrect type to use for paths.

This is a trap that Odin seems to fall into, meaning that its APIs either can’t handle all paths, or any user actually treating the fullpath/name from a File_Info as a string (i.e. using for in on it to iterate “runes”) is inadvertently introducing incorrect behavior.

So, let’s say you had a struct like:

struct {
    path: []const u8,
    foo: u32,
}

you’d “want” path to be automatically formatted as a string by std.fmt, but there is no canonical/portable way to format an arbitrary path as valid UTF-8 (i.e. invalid UTF-8 sequences can be converted into � using a variety of algorithms, but the user cannot ever use that output to reconstruct the actual path)

For Zig, in #19005 I added std.fs.path.fmtAsUtf8Lossy and std.fs.path.fmtWtf16LeAsUtf8Lossy. Here’s the fmtAsUtf8Lossy doc comment:

/// Format a path encoded as bytes for display as UTF-8.
/// Returns a Formatter for the given path. The path will be converted to valid UTF-8
/// during formatting. This is a lossy conversion if the path contains any ill-formed UTF-8.
/// Ill-formed UTF-8 byte sequences are replaced by the replacement character (U+FFFD)
/// according to "U+FFFD Substitution of Maximal Subparts" from Chapter 3 of
/// the Unicode standard, and as specified by https://encoding.spec.whatwg.org/#utf-8-decoder

However, that may not be the way you want to print paths depending on your use case. For example, ls on Linux prints them shell-escaped:

$ touch `echo 'FF FF FF FF' | xxd -r -p`
$ ls
''$'\377\377\377\377'

I’m mostly just rambling at this point, but the point I’m trying to get at is something like: a wrapper around []const u8 without guarantees about UTF-8 encoding seems like it’d inherit all the same complications: you think you can print a string, but you can’t, really; you think you can iterate over a string as UTF-8, but you can’t, really.
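As a small illustration, here is what std.unicode makes of the file name from the touch example above. A sketch only; the constant name is made up:

```zig
const std = @import("std");

// The file name created by the `touch` example: four 0xFF bytes.
const odd_name = "\xff\xff\xff\xff";

pub fn main() void {
    // A perfectly legal POSIX file name, but not valid UTF-8.
    std.debug.print("valid UTF-8: {}\n", .{std.unicode.utf8ValidateSlice(odd_name)});

    // Code point iteration therefore has to be allowed to fail.
    if (std.unicode.Utf8View.init(odd_name)) |_| {
        std.debug.print("iterable as code points\n", .{});
    } else |err| {
        std.debug.print("not iterable: {s}\n", .{@errorName(err)});
    }
}
```

A wrapper type that promises nothing about its contents would leave every caller with exactly the same failure case to handle.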

lunacookies · August 2, 2024, 3:33am

I think I actually agree with this. Making the language simpler by pushing features into user code is often a good idea. And the point about it not being super beneficial is fair, too.

This is actually such a great idea, I’m kinda kicking myself for not thinking of it myself! :‌D Even if you don’t have the special string literal syntax from your post, writing u("Hello World") isn’t much more annoying than "Hello World". I wonder if something like this has a place in the standard library? For something like this to be effective (i.e. make std.fmt and std.json treat strings and u8 slices differently) it has to be adopted throughout a codebase.

mnemnion · August 2, 2024, 4:16pm

I’ve definitely considered it! Distinct numeric types would be the best reason to have distinct nominal types in Zig, but it’s also the part which raises multiple questions and, so far, no good answers.

So something like

const MyI32: type = @distinct(i32);

So far so good, right? And it inherits all the operators. What happens when you operate on it with a normal i32? If MyI32 is a Meter, then I want to be able to say length + 5 and get a Meter back, annoying otherwise. But that logic doesn’t always hold, and that’s the first of the decisions to be made.

And it gets worse. If it’s a Celsius, then temp + 10 is valid, but temp1 + temp2 is meaningless! We want to prevent that if we can. If it’s a BitMask, most arithmetic shouldn’t work, just shifts and bitwise operations. If it’s an Instant, then I want to be able to add a Duration to it, but not add two Instants, although subtracting one from another is ok, and what should it return? Why, a Duration of course!
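For comparison, the Instant and Duration rules can at least be spelled out today with single-field structs and named methods. This is a sketch with invented names, not a proposal; what it cannot give you is the operators themselves.

```zig
const std = @import("std");

const Duration = struct {
    ns: i64,

    // Duration + Duration -> Duration
    fn add(a: Duration, b: Duration) Duration {
        return .{ .ns = a.ns + b.ns };
    }
};

const Instant = struct {
    ns: i64, // nanoseconds since some arbitrary epoch

    // Instant + Duration -> Instant
    fn after(self: Instant, d: Duration) Instant {
        return .{ .ns = self.ns + d.ns };
    }

    // Instant - Instant -> Duration
    fn since(self: Instant, earlier: Instant) Duration {
        return .{ .ns = self.ns - earlier.ns };
    }

    // Deliberately no way to add two Instants.
};

pub fn main() void {
    const start = Instant{ .ns = 1_000 };
    const end = start.after(.{ .ns = 250 });
    const elapsed = end.since(start);
    std.debug.print("{d}\n", .{elapsed.add(elapsed).ns});
}
```

The set of valid operations is just the set of methods that exist, which is precisely the part a @distinct builtin would have to decide some other way.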

Note that this isn’t about operator overloading, it’s about coercion and validity of applying operators. I’d like to see a proposal here which really suits the language, but so far, no dice.

It would be easier to add distinct slices and arrays, I think, but the advantage there is less compelling, because a struct-of-one-slice works just fine: we do have to access the slice through a field, but in return we can have decls which operate on the slice directly, and I think that’s a good tradeoff. Zig structs don’t impose overhead, so if a slice is the only field, you’ll get the same object code as you would using a slice directly.

This isn’t true of optional pointers in Zig:

An optional pointer is guaranteed to be the same size as a pointer. The null of the optional is guaranteed to be address 0.

So you could zero-init a struct just by making all pointers optional, and the behavior would be defined. But now you’re telling the compiler that anywhere that struct goes, the pointers might not be there, so you have to constantly unwrap them. Not a good fit for the language. There are certainly data structures where pointers are genuinely optional, but there are many more where a valid pointer is expected. Those should be initialized when they’re created, and Zig initialization syntax ensures that you do that, while letting you throw an undefined at the field if you absolutely have to.
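A small sketch of both halves of that, using a made-up linked list Node. It relies on my understanding that std.mem.zeroes maps optional fields to null and refuses plain non-optional pointers at compile time:

```zig
const std = @import("std");

const Node = struct {
    value: u32,
    next: ?*Node, // "no next node" is a legal, type-checked state
};

comptime {
    // The optional adds no extra tag: null is stored as address 0.
    std.debug.assert(@sizeOf(?*Node) == @sizeOf(*Node));
}

// Hypothetical helper: every step has to unwrap before following the pointer.
fn sumList(head: *Node) u32 {
    var sum: u32 = 0;
    var cursor: ?*Node = head;
    while (cursor) |node| : (cursor = node.next) sum += node.value;
    return sum;
}

pub fn main() void {
    // All-zero initialization is well defined here because `next` is optional.
    var tail = std.mem.zeroes(Node);
    tail.value = 2;
    var head = Node{ .value = 1, .next = &tail };
    std.debug.print("sum = {d}\n", .{sumList(&head)});
}
```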

I strongly prefer optional pointers over nullable ones, but that’s for correctness and security reasons, not actually performance. It would be difficult to measure any performance impact from null-checking in Zig vs. C, because it’s a fast operation and real code is dominated by other concerns.

But I can make a first-principles argument here, which I believe to be sound, that the Zig approach is always either the same, or better for performance, than C’s. It goes like this: it’s illegal to dereference a NULL pointer, so both Zig optional pointers, and any C pointer which is not already known to be non-null, must be checked before dereferencing, and this uses identical machine code for our purposes.

The difference is that Zig provides a pointer which is statically known to point at something: these do not have to be checked before they get used. C code can only have pointers which are dynamically known to point at something: therefore, there will always be at least as many NULL checks in C code as in corresponding Zig code, but there may also be more.

Although you can pessimize the Zig code by making every pointer optional, which doesn’t make a lot of sense but would allow for zeroed initialization. So that’s a good reason not to use the zero-init pattern.

The actual difference comes from the design patterns it encourages. In C, a pointer is always at risk of being NULL, so you try to design the program so you don’t have to constantly check that, but the compiler won’t help you. In Zig, you don’t have an option (heh) about unwrapping an optional before you use it, so you try to design a program to unwrap it once, and then pass it around as a known-good value. I’m pretty confident that in ReleaseFast/ReleaseSmall builds, yolo_pointer.? just unconditionally tries to access the pointer, so in the event that you’re confident you do have a legal pointer in there already, you can just do that (I’ve used this pattern before).
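The unwrap-once pattern, as a minimal sketch. Config and both function names are invented for illustration:

```zig
const std = @import("std");

const Config = struct { retries: u32 };

// Inner code takes a plain pointer: there is nothing to check here,
// and the type system guarantees callers cannot pass null.
fn attempts(cfg: *const Config) u32 {
    return cfg.retries + 1;
}

// The one boundary that may or may not have a Config unwraps exactly once.
fn attemptsOrDefault(maybe_cfg: ?*const Config) u32 {
    const cfg = maybe_cfg orelse return 1;
    return attempts(cfg);
}

pub fn main() void {
    const cfg = Config{ .retries = 3 };
    std.debug.print("{d} {d}\n", .{ attemptsOrDefault(&cfg), attemptsOrDefault(null) });
}
```

Everything downstream of attemptsOrDefault works with *const Config, so the null check lives in one place instead of being repeated defensively.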

We agree that allocators should be able to provide fast-path zeros for performance reasons, but I don’t think that the same logic holds at all for pervasive zero-init, I think that should be opt-in rather than opt-out. Which is in fact the current policy. I was going to write some stuff about the impact that a default like that would have on some of the performance oriented allocators, like memory pools, but this is getting long enough already.

AndrewCodeDev · August 3, 2024, 5:24am

Quick aside here… my favorite example of this was a financial industry application that I saw (written in C) where individual screens were assigned numeric macros automatically (it was all generated). Now, I thought to myself “there’s no way they’re using these as anything but identifiers” but I was wrong!

I found every integer operation in the book aside from multiplication and division. Want to check if you’re within a radio button group? They were using <, <=, >, >=. Want to activate another button on the screen? You bet they were adding values to buttons assuming the next identifier.

I was stunned and it was horrific to rewrite anything involving those screens. It was a complete rewrite each time. People can and do abuse semantics in even mission critical applications.