Please help me understand a fixed array of wide strings in Zig

pachde · January 22, 2025, 5:48am

Technically, ASCII is 7-bit (not u8). Any byte with the high bit set isn’t ASCII. In UTF-8, a byte with the high bit set is part of a multibyte sequence.
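A minimal sketch of that high-bit rule, in case it helps to see it run. The helper name isAscii is made up for this example and is not from the thread; it only tests the top bit of a single byte.

```zig
const std = @import("std");

// Hypothetical helper: a byte is ASCII only when its high bit is clear.
fn isAscii(byte: u8) bool {
    return (byte & 0x80) == 0;
}

pub fn main() void {
    // "aй" is three bytes: 'a' (ASCII) plus the two-byte UTF-8 encoding of 'й'.
    for ("aй") |byte| {
        std.debug.print("0x{x:0>2} ascii={}\n", .{ byte, isAscii(byte) });
    }
}
```

Both bytes of 'й' have the high bit set, so neither of them is ASCII on its own.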

vulpesx · January 22, 2025, 8:30am

I was keeping it simple, but it’s arbitrary data so it only means utf-8 when treating it as such :3

sohro_desu · January 22, 2025, 12:41pm

Okay, one last time, just to make sure I’m picking up correctly:

Anything inside the ‘’ (single quotes) is treated as a big enough integer that the character literal in quotes can fit in, without doing anything funny with special escapes. The size of the integer is thus determined by the compiler at compile time. (This is the idea of comptime, right?)
Anything inside the “” (double quotes) is treated as an array of arrays of bytes. So the multiple elements of the array inside an array can make up a character when dealing with UTF-8 encoded unicodes. But it’s just an array of arrays of bytes where you can put integers, booleans, pointers to functions, and so on.

I know, getting answers to these questions isn’t enough. I promise I keep learning more Zig from now on, so these trivial and rookie questions won’t arise ;')

sohro_desu · January 22, 2025, 12:46pm

Thank you, I tried ;') And of course feel free to use everything I wrote here! If you want me to fill in the gaps, just let me know where to start and I’ll be more than happy to help.

dee0xeed · January 22, 2025, 1:46pm

const std = @import("std");
pub fn main() void {
    const str = "АБВГД";
    std.debug.print("{}\n", .{@TypeOf(str)});
}

$ zig run str.zig 
*const [10:0]u8

It’s a pointer to a zero-terminated array of u8 with comptime-known length (10 bytes here: five Cyrillic letters at 2 bytes each).
When treated as a UTF-8 string, it is a sequence of encoded code points.
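To make the difference between bytes and code points concrete, here is a small sketch using the same literal. It assumes std.unicode.utf8CountCodepoints from the standard library, which returns an error if the bytes are not valid UTF-8.

```zig
const std = @import("std");

pub fn main() !void {
    const str = "АБВГД";
    // Length in bytes: five Cyrillic letters, two UTF-8 bytes each.
    std.debug.print("bytes: {d}\n", .{str.len});
    // Length in code points; this fails if the bytes are not valid UTF-8.
    const n = try std.unicode.utf8CountCodepoints(str);
    std.debug.print("code points: {d}\n", .{n});
    // The sentinel sits one past the end and is not counted in len.
    std.debug.print("sentinel: {d}\n", .{str[str.len]});
}
```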

Calder-Ty · January 22, 2025, 4:11pm

I have to apologize. I think I’ve made it more confusing.

Definitions

String Literal: literals that are enclosed with double quotes, e.g. "へぅえ？"
Unicode Literal: literals that are enclosed with single quotes, e.g. '啊'. This can be confused with “characters” in other languages (like C), but it’s best not to think of them that way, as characters are not a good abstraction over Unicode.

What is a string literal

As @dee0xeed points out, a string literal carries the type *const [N:0]u8, where N is the number of bytes in the string. In shorthand we can consider them to be arrays or slices of u8’s.

What is a unicode literal

As you have found out, a unicode literal is really just a comptime_int, meaning its exact representation is determined at comptime based on what size will be needed. According to the standard, a unicode literal can always fit in a u21.
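A quick way to check this yourself, as a sketch: the literal has type comptime_int, and it coerces to any integer type wide enough for its value. The variable names are only for illustration.

```zig
const std = @import("std");

pub fn main() void {
    // The literal itself has no fixed width.
    std.debug.print("{}\n", .{@TypeOf('啊')});
    // It coerces to any integer type that can hold the value (U+554A here).
    const as_u16: u16 = '啊';
    const as_u21: u21 = '啊';
    std.debug.print("U+{X:0>4} U+{X:0>4}\n", .{ as_u16, as_u21 });
    // const too_small: u8 = '啊'; // compile error: the value does not fit in u8
}
```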

Original Question

pub const cyrillic_abc = [_]u16{ 'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ё', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', 'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ч', 'Ш', 'Щ', 'Ъ', 'Ы', 'Ь', 'Э', 'Ю', 'Я', 'а', 'б', 'в', 'г', 'д', 'е', 'ё', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я' }

These values are all unicode literals, but since they only need 2 bytes to be represented, you can store them in an array of u16 (hence the type: [_]u16).

"йил"

This is a String Literal. Even though it is made up of “characters” that require more than 1 byte to represent, they can be represented as an array of bytes. Indeed, another way to write this is:

[_]u8{ 208, 185, 208, 184, 208, 187 }

You can verify this by running the following:

const std = @import("std");
pub fn main() !void {
    const text = [_]u8{208, 185, 208, 184, 208, 187 };
    std.debug.print("{s}\n", .{text});
    std.debug.print("{d}\n", .{text});
}

> zig build-exe test.zig && ./test
йил
{ 208, 185, 208, 184, 208, 187 }

So finally:

[_][]const u8{ "январ", "феврал", "март", "апрел", "май", "июн", "июл", "август", "сентябр", "октябр", "ноябр", "декабр", "йил" }

This is an array of slices (notice the two pairs of brackets: [_][]const u8). Put simply, you have an array of string literals (technically they are slices that point to the string literals).
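Here is a shortened sketch of the same shape, with only three of the words, to show that the outer array has one fixed length while every element carries its own byte length. The name words is just for this example.

```zig
const std = @import("std");

const words = [_][]const u8{ "март", "май", "йил" };

pub fn main() void {
    // The outer array has a comptime-known length; each element is a slice
    // with its own length in bytes (not in letters).
    for (words) |word| {
        std.debug.print("{s}: {d} bytes\n", .{ word, word.len });
    }
}
```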

sohro_desu · January 22, 2025, 4:19pm

This information confirms and adds to my understanding. I can’t thank each and every one of you enough for the support you have all shown me this week…

And no need to apologize for anything, because if anything, all of this has made me realize more things than the looks of them. I’ve immersed myself in this topic as much as I could, and I’m grateful for the knowledge I could get this way, with the help of everyone.

Tosti · January 22, 2025, 5:51pm

I’d like to note that there are implicit conversions. *const [N:0]u8 can be implicitly converted to [:0]const u8, which in turn can be implicitly converted to []const u8. This is why this statement compiles despite that @TypeOf("йил") == *const [6:0]u8

const s: []const u8 = "йил";

This is the same reason why this expression compiles as well

[_][]const u8{ "январ", "феврал", "март", "апрел", "май", "июн", "июл", "август", "сентябр", "октябр", "ноябр", "декабр", "йил" }

Even though each of the string literals is of type *const [N:0]u8 (note that each string literal has its own value of N), all of them can be implicitly converted to []const u8.

See type coercion rules for slices/arrays/pointers.

The same thing happens to integers. Unicode literals like 'й' have type comptime_int. According to the reference for comptime-known numbers, when a number is comptime-known (and comptime_ints are always comptime-known), it can be converted to uN or iN if the value is representable in this type. As @Calder-Ty stated, all unicode literals can be represented in u21, but some of them require fewer bits. That’s why, for example, ASCII 'a' can be stored in a u8.
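Both kinds of coercion can be seen side by side in a short sketch. The variable names are mine; the types in the comments are what @TypeOf reports for each step.

```zig
const std = @import("std");

pub fn main() void {
    const literal = "йил"; // *const [6:0]u8
    const terminated: [:0]const u8 = literal; // the sentinel stays in the type
    const slice: []const u8 = terminated; // the sentinel is dropped from the type
    std.debug.print("{} -> {} -> {}\n", .{ @TypeOf(literal), @TypeOf(terminated), @TypeOf(slice) });

    const a: u8 = 'a'; // 97 fits in a u8
    const j: u16 = 'й'; // U+0439 needs more than 8 bits
    std.debug.print("{d} {d}\n", .{ a, j });
}
```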

squeek502 · January 23, 2025, 12:27am

Unicode is just the name of the text encoding standard.

A more correct term for what you’re talking about would be ‘Unicode scalar value’, which is essentially any code point that is not a surrogate code point. Depending on what you mean by ‘character’, though, another term that is probably even more correct is grapheme, which would also describe a ‘character’ that is composed of multiple code points/Unicode scalar values. Some examples:

Ç can be encoded two different ways: Ç (U+00C7), or C (U+0043) + ◌̧ (U+0327). This same thing is true for basically any character with diacritics. See Unicode Normalization Forms for more info
Some ‘characters’ must be made up of multiple code points, e.g. the pirate flag emoji (🏴‍☠️), which is U+1F3F4 U+200D U+2620 (Black Flag, Zero Width Joiner, and Skull and Crossbones), plus an optional U+FE0F, which is an emoji presentation selector; the version with the emoji presentation selector is considered ‘fully qualified’, see TR51 and emoji-test.txt.
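The Ç case can be sketched in a few lines. This assumes std.unicode.utf8CountCodepoints and writes both forms with \u{...} escapes so the source file itself does not depend on which form an editor saves.

```zig
const std = @import("std");

pub fn main() !void {
    const precomposed = "\u{00C7}"; // Ç as a single code point
    const decomposed = "C\u{0327}"; // C followed by a combining cedilla

    for ([_][]const u8{ precomposed, decomposed }) |s| {
        const n = try std.unicode.utf8CountCodepoints(s);
        std.debug.print("{s}: {d} code point(s), {d} bytes\n", .{ s, n, s.len });
    }
    // A byte-wise comparison treats the two spellings as different strings.
    std.debug.print("same bytes: {}\n", .{std.mem.eql(u8, precomposed, decomposed)});
}
```

Both strings should render as the same grapheme, yet they differ in code point count and in bytes, which is why comparing them properly needs normalization.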

Just to clarify, ‘needing 2 bytes to be represented’ only refers to the code point integer value and is not related to how many UTF-8 bytes are needed to encode that code point. There are code points that can fit in a u16 but are encoded by 3 UTF-8 bytes, e.g. any code point between U+0800 and U+FFFF, so for example:

const c: u16 = '€';
std.debug.assert(c == 0x20AC);
const as_utf8 = "€";
std.debug.assert(as_utf8.len == 3);
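Going the other way, from the code point value to its UTF-8 bytes, can be sketched with std.unicode.utf8Encode. The 4-byte buffer is an assumption that covers the longest possible UTF-8 sequence.

```zig
const std = @import("std");

pub fn main() !void {
    const c: u21 = '€'; // U+20AC fits in 16 bits
    var buf: [4]u8 = undefined; // 4 bytes is enough for any code point
    const len = try std.unicode.utf8Encode(c, &buf);
    std.debug.print("U+{X:0>4} -> {d} bytes:", .{ c, len });
    for (buf[0..len]) |b| std.debug.print(" {x:0>2}", .{b});
    std.debug.print("\n", .{});
}
```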

jibal · January 23, 2025, 2:52pm

No, it’s an array of bytes. I think the confusion is because there was earlier discussion of [_][]const u8 { ... } – that is an array of arrays of bytes (thus two sets of [ ]) … i.e., an array of strings. So in

const pog = [_][]const u8{ "Ы?", "へぅえ？", "汉字" };

you have a sequence of initializers that are strings. Each string, e.g., "Ы?", is a []const u8, and [_] says there are several such things, with the length of the array inferred (that’s what _ means) from the number of initializers.

The source text in your file contains bytes that are represented on your screen as, e.g., "へぅえ？" because your windowing system understands UTF-8 and how to select elements from the font to represent the text properly. How many bytes are needed for that string is already determined by how many bytes are in the source file. My point here is that the UTF-8 encoding has already been done in the source file (or in any text stream that you read the characters from at runtime) and Zig just takes those bytes and puts them into memory as is–it knows nothing about unicode or UTF-8 itself. This can be confusing because our mental model determined from what we see on the screen is different from the UTF-8 model by which what we see is represented in memory, on disk, etc., so it takes some mental effort to keep these things straight.
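One way to convince yourself of this is to write the same bytes as hex escapes and compare, as in the sketch below. The escape values are the UTF-8 encoding of the two characters, and the variable names are only illustrative.

```zig
const std = @import("std");

pub fn main() void {
    const from_source = "汉字";
    // The same six bytes, written out as hex escapes.
    const from_escapes = "\xe6\xb1\x89\xe5\xad\x97";
    std.debug.print("{d} bytes, identical: {}\n", .{
        from_source.len,
        std.mem.eql(u8, from_source, from_escapes),
    });
}
```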

A bigger problem arises when you aren’t just moving arrays of bytes around but need to break UTF-8 strings up into characters in order to do operations like transliteration. That’s when you need to delve into std.unicode. You might be able to use the UTF-16 functions … I personally am not keen on UTF-16, which is a legacy encoding rooted in the days when people thought that 16 bits would be enough to encode every character: most codepoints take one UTF-16 code unit, but the ones beyond the basic plane require 2. Russian probably only occurs in the basic plane, but I haven’t worked with it.

An alternative is std.unicode.Utf8Iterator, which parses UTF-8 strings and lets you extract each codepoint into a u21 … which isn’t necessarily the same as a character, because characters are grapheme clusters that can consist of multiple codepoints, but it is probably good enough for your purposes because I don’t think Russian has characters that consist of multiple codepoints.

If I’m wrong about that then you have another problem, because the core Zig library doesn’t have the support for grapheme clusters, character classes, normalization etc. that is needed for fully accurate processing of Unicode … for that you would need to go to something like ICU (https://icu.unicode.org/). I imagine there are some Zig bindings for it around somewhere. But if Russian codepoints never need more than 16 bits then you can avoid such complications and just use UTF-16 … until some day you extend your code to deal with Asian scripts, emojis, etc.
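For the iterator route, a minimal sketch might look like this. It assumes std.unicode.Utf8View.init to validate the input and obtain the iterator; the sample string is taken from earlier in the thread.

```zig
const std = @import("std");

pub fn main() !void {
    // Utf8View.init validates the bytes and returns an error on invalid UTF-8.
    const view = try std.unicode.Utf8View.init("йил");
    var it = view.iterator();
    while (it.nextCodepoint()) |cp| {
        // cp is a u21; check whether it would also fit in a u16.
        std.debug.print("U+{X:0>4} fits in u16: {}\n", .{ cp, cp <= std.math.maxInt(u16) });
    }
}
```

This walks code points, not grapheme clusters, so a letter written with a combining mark would still come out as two separate values.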

sohro_desu · January 23, 2025, 4:27pm

After all is said and done, I’ve already started to dive deep into rewriting my transliterator written in C. And as you pointed out, I can now see that I’m going through something in Zig, whereas I took everything C offered for granted, and effortlessly dealt with multi-byte characters using the wchar header.

I’m not complaining, I’m enjoying it. I can go deep enough to look at the assembler output of my C or Zig code to see exactly what’s happening. My knowledge is super limited, but there are tons of good assembly guides out there. And I’m willing to learn whatever it takes to write good software.

Until recently, I’ve only coded in scripting languages like Python and pretty much ignored anything low-level. After discovering the beauty and sheer power of C, I’ve basically had a paradigm shift ;')

Right now I’m searching all the topics to find more info about debugging Zig slices to select enough bytes to print a character, or using the debugger’s memory examine tool and stuff. I may be doing it all wrong, and that may not be how debugging Zig code is supposed to be done, but I trust the process and keep going anyway :')
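For anyone with the same question, one possible sketch of "select enough bytes to print a character": look at the lead byte to find out how long the sequence is, then slice that many bytes. The helper firstCodepointBytes and its error names are made up for this example; it relies on std.unicode.utf8ByteSequenceLength and does not validate the continuation bytes.

```zig
const std = @import("std");

// Hypothetical helper: returns the bytes of the first UTF-8 encoded code
// point in `s`, using the lead byte to decide how many bytes to take.
// It does not check that the continuation bytes are well formed.
fn firstCodepointBytes(s: []const u8) ![]const u8 {
    if (s.len == 0) return error.EmptyInput;
    const n = try std.unicode.utf8ByteSequenceLength(s[0]);
    if (n > s.len) return error.TruncatedInput;
    return s[0..n];
}

pub fn main() !void {
    var rest: []const u8 = "йил";
    while (rest.len > 0) {
        const ch = try firstCodepointBytes(rest);
        // Print the character, then the raw bytes that make it up.
        std.debug.print("{s} =", .{ch});
        for (ch) |b| std.debug.print(" {x:0>2}", .{b});
        std.debug.print("\n", .{});
        rest = rest[ch.len..];
    }
}
```

Like the iterator, this works per code point, so it is only a rough stand-in for "one character" when combining marks are involved.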