[]enum(u8){_} as string type

I read some of the other string proposals and find myself agreeing that there are too many design decisions involved for Zig to have a standard String implementation (like Rust has, for instance) that works in all use cases.

However, I find myself writing a lot of comptime code (msgpack RPC, Zig's formatting, etc.) that just needs to distinguish between a slice of bytes and text.

I think a single typedef like this would go a long way:

const utf8 = enum(u8) { _ };

although we should consider whether we need a separate type for ASCII. (This would make string literals really complicated, though, unless we allow implicit casts between these types.)

The problems:

  • string literals would need to change
    • this would probably be a breaking change for all programs
    • the compiler would be aware of this typedef?
      • or it gets its own place in std.builtin.Type?
        • this maybe would allow implicitly casting ascii => utf8 => u8
  • std.ascii and std.unicode would need to use the new type(s)
  • you can’t just use strings in places that expect bytes
    (though this proposal has the same memory layout, so you can just cast)
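The last point can be sketched directly. Since the element types have identical layout, the conversion is a plain `@ptrCast` (assuming a recent Zig, where `@ptrCast` handles slices of same-sized elements):

```zig
const std = @import("std");

// Hypothetical text type from this proposal.
const utf8 = enum(u8) { _ };

test "round-trip between []const u8 and []const utf8" {
    const bytes: []const u8 = "hello";
    // Same memory layout, so a pointer cast is enough:
    const text: []const utf8 = @ptrCast(bytes);
    const back: []const u8 = @ptrCast(text);
    try std.testing.expectEqualStrings("hello", back);
}
```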

This would allow:

  • not needing {s} in format strings
    (I always seem to forget the s modifier)
  • std.fmt to make text readable even if it is nested somewhere in the value
    (this always annoys me, but not enough to make writing a custom formatter worthwhile)
  • any other comptime code to have this information
    • for example, here are the places where I wanted this:
      • msgpack rpc
      • js interop (string vs. ArrayBuffer)
      • gui debug inspector
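To illustrate the last point, here is a minimal sketch of how comptime serialization code could branch on the hypothetical type; the names `utf8` and `tagFor` are made up for this example:

```zig
const std = @import("std");

// Hypothetical text type from the proposal.
const utf8 = enum(u8) { _ };

// Sketch: a serializer deciding "string" vs. "binary" from the type alone.
fn tagFor(comptime T: type) []const u8 {
    if (T == []const utf8 or T == []utf8) return "str";
    if (T == []const u8 or T == []u8) return "bin";
    return "other";
}

test tagFor {
    try std.testing.expectEqualStrings("str", tagFor([]const utf8));
    try std.testing.expectEqualStrings("bin", tagFor([]const u8));
}
```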

Since the compiler already enforces UTF-8 string literals, I think it's fine if the compiler is aware of a utf8 type.

I do think a better solution would be terser syntax for converting between enums and their backing ints, or even coercion. But that has its own problems.

Instead of removing {s}, I think it should be repurposed to mean "slice of T": calling a dedicated format function on the child type that formats a slice of itself. This is already the case with other format specifiers, and it would not require special handling for slices of utf8 in the formatter logic.

An alternative is a wrapper struct. This wouldn't require any new features in the formatting logic; it'd just move the string formatting to the struct's format function. There is already a Utf8View struct.
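For illustration, a minimal sketch of such a wrapper (using the pre-0.15 `format` method signature; it changed with the new std.Io writer, so adjust for your Zig version):

```zig
const std = @import("std");

// Sketch: a thin wrapper whose `format` method prints the bytes as
// text, so the formatter needs no special knowledge of strings.
const String = struct {
    bytes: []const u8,

    pub fn format(
        self: String,
        comptime _: []const u8,
        _: std.fmt.FormatOptions,
        writer: anytype,
    ) !void {
        try writer.writeAll(self.bytes);
    }
};

test String {
    var buf: [32]u8 = undefined;
    const out = try std.fmt.bufPrint(&buf, "{}", .{String{ .bytes = "hi" }});
    try std.testing.expectEqualStrings("hi", out);
}
```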

2 Likes

If similar underlying designs need to be changed, I think the following issue should be decided first:

allow integer types to be any range · Issue #3806 · ziglang/zig

It would allow restricting whether '\0' in a string is legitimate.

1 Like

This is basically what I do to inform my comptime code to treat it as a String, but I don't really use it elsewhere, because if it is not in the std / the compiler, it is unclear where you should use it and where not (not when interacting with std.unicode, or should I wrap that? What about exposing this type in the API?). So I end up using it as little as possible.

I just think that stringly typed things are enough of a problem in other languages, and fixing them involves enough boilerplate in Zig (for good reason!), that we should at least not make everything []u8-ly typed.

Just to be clear: my current proposal would not verify that the contents are valid UTF-8 at all.
This is basically just an annotation for the programmer / comptime code.
(I think this should be left to the programmer, like it currently is.)

I agree that it is helpful to distinguish between text and bytes by type, but if there is a foreseeably better solution in the future, perhaps making a single breaking change later would be better than splitting it into multiple breaking changes. At the moment, Zig is busy with I/O-related work, so attention has not been focused on these issues.

Currently, as a convention for one's own code, I agree with @vulpesx that std.unicode.Utf8View is a compromise worth considering.

Strings aren’t a type, and it’s a mistake to treat them like one.

Strings are a type category, like structs. But unlike structs, what distinguishes them is not their struct-ure but their content, format, and provenance. Encoding is merely the base layer of format, and it is almost never the last.

Utf8View, for instance, makes the most common assumption about a string: it should be valid UTF-8.

I happen to disagree that this should be the most common assumption. For the majority of cases where code wants to treat a byte sequence as UTF-8, it is preferable to use Substitution of Maximal Subparts to repair any malformations of the string by replacing them with U+FFFD, the Unicode Replacement Character.

Usually, the reason you want a string to be UTF-8 is so you can do something with the codepoints. The 'lossy' strategy returns a sensible codepoint when it encounters a malformation, and that meets the purpose without having to handle an error which you might not care about.
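A sketch of such lossy decoding, built on std.unicode: wherever a sequence fails to decode, emit U+FFFD and resynchronize. Note this skips one byte per error; Substitution of Maximal Subparts proper skips the longest valid prefix instead, so this is an approximation.

```zig
const std = @import("std");

// Decode `bytes` as UTF-8 into `out`, replacing malformations
// with U+FFFD. Returns the number of codepoints written.
// `out` must have room for at most bytes.len codepoints.
fn decodeLossy(bytes: []const u8, out: []u21) usize {
    var n: usize = 0;
    var i: usize = 0;
    while (i < bytes.len) {
        if (std.unicode.utf8ByteSequenceLength(bytes[i])) |len| {
            if (i + len <= bytes.len) {
                if (std.unicode.utf8Decode(bytes[i..][0..len])) |cp| {
                    out[n] = cp;
                    n += 1;
                    i += len;
                    continue;
                } else |_| {}
            }
        } else |_| {}
        out[n] = 0xFFFD; // U+FFFD REPLACEMENT CHARACTER
        n += 1;
        i += 1; // resynchronize on the next byte
    }
    return n;
}

test decodeLossy {
    var out: [8]u21 = undefined;
    // 'a', an invalid byte, then 'b':
    const n = decodeLossy("a\xFFb", &out);
    try std.testing.expectEqual(@as(usize, 3), n);
    try std.testing.expectEqual(@as(u21, 0xFFFD), out[1]);
}
```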

But more importantly, being well-formed Unicode is almost never sufficient. Is it normalized? Which form is it normalized into? Is it an email address? Did it come from the database? Did it come from the network? Is it supposed to be JSON? Has it been validated as JSON?

Having a type which says “these bytes are A String” doesn’t help with any of that. Literally, it doesn’t help! Ok, so you have a megabyte of “off the network should be JSON”, do you care if it’s UTF-8? No! You care if it’s JSON! The parser will validate that, why run a UTF-8 validator first?

There’s no such thing as a (properly implemented!) lexer which doesn’t validate UTF-8, assuming its input format is decreed to be UTF-8. It has to look at multibyte sequences, so it has to decode them, so it has to handle errors in decoding. There is no advantage at all in doing so in advance, it’s just opening up a security hole, where if someone manages to slip an overlong encoding past pre-validation they can pwn your computer. It will never be faster to pre-validate and then make the dangerous assumption that the bytes are valid, rather than confirming it everywhere that condition needs to hold. Even if it were faster, it’s never safe. So don’t do it.

In fact a lexer should probably use U+FFFD lossy codepoint decoding as well. That way there’s one malformed-error case to handle, not a bunch, and it’s handled as just another codepoint.

Also, Utf8View has a cousin, Wtf8View. The fact that this exists, and we need it to exist, is another good reason not to impose coding on strings.

So a string is: a bunch of bytes which might be in the form your program needs them to be in, and you'll need to check that they are. We spell that []u8. Wrapper types which track provenance into and through the program? Pretty smart idea. Should they be in the standard library? Mostly, no.

12 Likes
/// This struct's layout is identical to `[]const u8`,
/// and they can be used interchangeably:
const Buffer = struct {
    buf: []const u8 align(1),
};

const StringPtr = *[]const u8;
/// BufferPtr can serve as a type marker when 
/// your serialization/deserialization code needs 
/// to distinguish between strings and buffers. 
/// Because the layout is identical, you can freely 
/// convert pointers to a Buffer to pointers to a 
/// string and vice versa.
const BufferPtr = *Buffer;

No, you don’t need a separate class for strings.

If you need to associate more information with the type `Buffer`, put that extra info into const variables with known names. They are trivially retrievable at comptime through metadata lookups, and you can vary the logic, or categorize your structs. You can even build a tree of categories.
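For instance, a sketch of that lookup; the marker name `content_kind` and the helper `kindOf` are made up for this example:

```zig
const std = @import("std");

// Tag wrapper types with a const of a known name instead of a
// new language feature; comptime consumers look it up with @hasDecl.
const Buffer = struct {
    pub const content_kind = "binary";
    buf: []const u8,
};

const Text = struct {
    pub const content_kind = "text";
    buf: []const u8,
};

fn kindOf(comptime T: type) []const u8 {
    return if (@hasDecl(T, "content_kind")) T.content_kind else "unknown";
}

test kindOf {
    try std.testing.expectEqualStrings("text", kindOf(Text));
    try std.testing.expectEqualStrings("binary", kindOf(Buffer));
}
```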

Why align(1)?

It won’t remove padding if that was your goal, there is no padding to remove.

If anything, it negates the ability to reliably ptr cast to a “string”, casting from a “string” will still be reliable.

I would also reverse the types, and remove the extra pointers.

const String = struct { raw: []const u8 };
// just use []const u8 for raw buffers

But String on its own is meaningless, there are so many ways to do strings, as @mnemnion detailed.

std.unicode.Utf8View would be a better type, but it’s still not that useful most of the time, again, as @mnemnion explained.

4 Likes

+1 about std.unicode.Utf8View !

a `[]enum(u8){_}` could be sliced in the middle of a codepoint, and that wouldn't be valid UTF-8; and for interop with JS you would need UTF-16 anyway
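The mid-codepoint slicing hazard is easy to demonstrate with std.unicode:

```zig
const std = @import("std");

test "slicing mid-codepoint breaks UTF-8 validity" {
    const s = "h\xC3\xA9llo"; // "héllo"; 'é' is the two-byte sequence C3 A9
    try std.testing.expect(std.unicode.utf8ValidateSlice(s));
    // Cutting between the two bytes of 'é' leaves a dangling lead byte:
    try std.testing.expect(!std.unicode.utf8ValidateSlice(s[0..2]));
}
```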

We’re very heavy on JS interop. In fact, 20+ Kloc heavy and keep growing, all for JS. Our findings so far confirm that the best string type for interacting with JS remains 1-byte Latin-1 strings, which translate exactly to `[]const u8` in Zig, with UTF-8 JS strings as the second best. In many scenarios, collecting those into a UTF-8 JSON is the fastest way to deliver a payload comprising many numbers and strings to JS land. You know what we never use? UTF-16, because it requires translation when you interact with the underlying OS syscalls/API, and that drops exchange performance dramatically.

1-byte JS strings, which are, again, exactly `[]const u8` in Zig, let you use buffers as keys in JS objects. You can freely store any content in those, including ‘\0’, and use the entire JS string arsenal (joins, splits, regex) to work against `[]const u8` buffers in JS land. They are faster than UTF-16 in our tests, consume less memory, and are supported by the browsers too.

2 Likes

Out of curiosity, what makes Latin-1 superior in this case? It feels like a bad idea to be using it in 2026.

Performance-wise, it’s just an array passed directly from Zig to JS, so no memcpy, no memory allocation or conversion involved. You can push it into WASM, and your tokenizer that parses and feeds the tokens into an LLM works in exactly the same way as if you’re dealing with `[]const u8`. Plus it lets you put arbitrary binary data and structures as object keys and values, use all string JS functions, regexes, etc. Basically, they are Uint8Array, only all string.* functions work and the entire JS apparatus fully supports them. And their conversion into DOM strings is trivial, because they are just UTF-8 underneath and support all languages. The new JSON conversion that employs SIMD is really fast at handling the conversion into DOM and back.

I understand your skepticism, it also took me a while to get converted.

I’m still not clear what makes Latin-1 superior to UTF-8 here? Especially if you’re just passing it around as an opaque byte-string for the most part.

You can push it into WASM and your tokenizer that parses and feeds the tokens into LLM works in exactly the same way as if you’re dealing with []const u8

That’s true for UTF-8 too.

And their conversion into DOM strings are trivial, because they are just utf-8 underneath, and support all languages

Latin-1 very much does not support all languages, not even all European languages. French comes to mind as a typical example as it’s missing œ.

Yeah, you don’t get it. When I say latin-1, I don’t mean that it’s Latin-1 encoding; it’s UTF-8 encoding underneath. It’s shared as a latin-1 buffer under an Arc, see node_api_create_external_string_latin1

In other words, the API call is named latin-1 in the sense that it’s JavaScript’s way of saying `[]const u8`. The actual representation of the content is UTF-8, but it’s not surfaced in the DOM as such; that’s only for interoperability.

It’s more of a performance trick to avoid memcpy, allocation, conversion, etc. We share the object under an Arc (atomic reference counter), which is .retained and .released by both JavaScript (finalizer) and Zig, so for as long as JS Garbage Collector has a reference, it sticks around. Once JS no longer references, the finalizer invokes Zig’s .release(), and the object gets .destroy() if Zig doesn’t hold a strong reference to it as well. We also have a directory with weak references to reuse and pass the same artifact to multiple JS realms. But that’s all implementation details.

2 Likes

So either you are smuggling utf-8 through a latin-1 API, or JS incorrectly named its API with the wrong string encoding.

Man am I glad I don’t touch JS :stuck_out_tongue:

4 Likes

I get it now, thanks. There’s no API to create a non-copied UTF-8 string, because, being externally managed, it would be effectively a duplicate of the node_api_create_external_string_latin1.

So the underlying encoding doesn’t really matter here, the function is just a way of saying “take this opaque byte buffer and treat it as a string object”, if I’m understanding correctly.

How does it get passed to WASM? Usually to pass a JS string object to WASM, you’d need to go through TextEncoder to get a Uint8Array out of it. I can’t see how you’d avoid a step where the JS engine has to interpret that String as Latin-1, and convert to UTF-8.

Unless of course node named its API incorrectly as @vulpesx suggests.

1 Like

Well, Buffer.from(str, 'latin1') on Node.js. But yeah, in those cases I tend to resort to napi_create_external_buffer

Man, tell you what. TypeScript is a far better language than C++. There, I said it.

2 Likes

Or []u16 or []u32 (or maybe something else I don’t know about) depending on what the smallest unit of your “string” is.

Well, that’s kind of a low bar to clear. And TypeScript came quite a bit later and was done by somebody who already had experience from creating other (well-used) languages, even if the JavaScript base makes things more difficult.

So it would be weird if TypeScript would be worse.

1 Like