Please help me understand a fixed array of wide strings in Zig

sohro_desu · January 19, 2025, 12:11pm

Hi there! I’m grateful that I can ask questions here.

A little about my background

I’m new to Zig, have 6+ months of experience with C, and I started rewriting my latinic-cyrillic transliterator utility in Zig. I’m currently going through resources like ziglings and zig.guide to learn the language. I’ve been advised to directly look at and start reading Zig’s stdlib, and I’m planning to do so when I get a little more comfortable with all the basic syntax.

I’m really new to Zig (5-6 days of knowledge), so pardon me if I’m asking stupid questions :')

The question

I dealt with wchar arrays in C when writing the project, as I needed to store Cyrillic wide characters such as “Январь” (January). Here in Zig, it looks like it’s much easier to declare a wide string with just u16, or to let the compiler infer it.

But one thing I didn’t understand about this snippet of code:

// these are from external latinizer.zig
pub const cyrillic_abc = [_]u16{ 'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ё', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', 'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ч', 'Ш', 'Щ', 'Ъ', 'Ы', 'Ь', 'Э', 'Ю', 'Я', 'а', 'б', 'в', 'г', 'д', 'е', 'ё', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я' };
pub const cyrillic_dates = [_][]const u8{ "январ", "феврал", "март", "апрел", "май", "июн", "июл", "август", "сентябр", "октябр", "ноябр", "декабр", "йил" }; // not Russian, but cyrillic Uzbek

// main.zig
const std = @import("std");
const transliterator = @import("transliterator.zig");

pub fn main() !void {
    std.debug.print("{u}\n", .{transliterator.latinizer.cyrillic_abc[0]});
    std.debug.print("{s}\n", .{transliterator.latinizer.cyrillic_dates[0]});
}

Output:

А
январ
--task finished--

[Process exited 0]

Can someone explain how a [ ]u8 or a [ ]u16 can contain an array of wide strings? Or why is it not [_][]const u16, but [_][]const u8, as the compiler says? If I’m relying heavily on syntactic sugar, can you please show me how to declare it with maximum verbosity?

Everything works, and I could continue writing the rest of the project, but I really want to understand what’s going on.

squeek502 · January 19, 2025, 12:49pm

“wide characters” are just one way to encode Unicode strings, and in particular wchar/u16 typically means using the UTF-16 encoding.

If you don’t have to use UTF-16, then I would recommend avoiding it and going with UTF-8 instead.

UTF-8 is what Zig source files are encoded as, so "январ" is encoded as UTF-8 (see this section of the language reference for more details), which means that the “code units” are bytes (u8), so the string can be stored as an array of bytes (technically, a Zig string literal is a constant pointer to an array of u8).
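To make the "pointer to an array of u8" point concrete, here is a minimal sketch (the name `month` is only for illustration) that inspects the type and length of a Cyrillic string literal. The five letters each take two bytes in UTF-8, so the length is 10, not 5.

```zig
const std = @import("std");

// "январ" is 5 Cyrillic letters; each one takes 2 bytes in UTF-8.
// The literal's type is *const [10:0]u8.
const month = "январ";

pub fn main() void {
    // Print the type name of the literal.
    std.debug.print("{s}\n", .{@typeName(@TypeOf(month))});
    // .len counts bytes (UTF-8 code units), not letters.
    std.debug.print("{d}\n", .{month.len});
}
```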

If you do need UTF-16, then there’s a handy function for turning a UTF-8 string literal into UTF-16 at comptime: utf8ToUtf16LeStringLiteral, which could be used like so if you wanted to keep cyrillic_abc as UTF-16 encoded:

pub const cyrillic_abc = std.unicode.utf8ToUtf16LeStringLiteral("АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя");

// Note that the type of cyrillic_abc will be *const [66:0]u16
// which can coerce to [:0]const u16 or []const u16

(there are also plenty of other UTF-8 ↔ UTF-16 conversion functions in std.unicode)

I actually wrote some transliteration stuff (including Cyrillic) a while back. I’ll dig it up and post it when I get a chance if that’d be helpful as a reference.

sohro_desu · January 19, 2025, 1:03pm

Oh, thank you so much for the quick reply!

But… I’ve already tried to create a u8 array of characters ;')

Here, look:

pub fn main() !void {
    const utf8_chars = [_]u8{ 'о', 'ш', 'и', 'б', 'к', 'а' }; // all cyrillic 
    std.debug.print("{u}\n", .{utf8_chars[0]});
}

But I get this:

run
└─ run zigxatolik
   └─ install
      └─ install zigxatolik
         └─ zig build-exe zigxatolik Debug native 1 errors
src/main.zig:5:31: error: type 'u8' cannot represent integer value '1086'
    const utf8_chars = [_]u8{ 'о', 'ш', 'и', 'б', 'к', 'а' };
                              ^~~~

If I change the u8 to u11 or u16, it starts working again. I have no clue what’s happening here…

And yes, I don’t have to use UTF-16 specifically. The less, the better.

vulpesx · January 19, 2025, 1:12pm

because a UTF-8 character can take 1 to 4 bytes depending on the character, you can’t encode a character that requires multiple bytes into a single byte

dee0xeed · January 19, 2025, 1:16pm

It’s kinda by chance, because all cyrillic letters are 2 bytes wide in UTF-8.

sohro_desu · January 19, 2025, 1:16pm

So, does that mean the [_] part of [_][]const u8 implicitly assigning u16 to every element of the array, for them to be able to store something 1086 or more? But the array itself can be u8 because of the number of elements it’s storing, right?

vulpesx · January 19, 2025, 1:18pm

no, [_] infers the array length. each item is still a slice of u8, which is UTF-8 encoded; wide chars are split into multiple elements of the UTF-8 slice

edit: to be more clear you have an array of slices, the slices are utf-8
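A small sketch of what "split into multiple elements" means in practice, assuming the standard library's `std.unicode.utf8CountCodepoints` and `std.unicode.Utf8View`. The helper name `countCodepoints` is made up for this example.

```zig
const std = @import("std");

/// Hypothetical helper: number of code points in a UTF-8 slice.
pub fn countCodepoints(s: []const u8) !usize {
    return std.unicode.utf8CountCodepoints(s);
}

pub fn main() !void {
    // An array of slices; each slice holds UTF-8 bytes.
    const dates = [_][]const u8{ "январ", "май" };
    for (dates) |word| {
        std.debug.print("{s}: {d} bytes, {d} code points\n", .{ word, word.len, try countCodepoints(word) });
        // Walk the slice one code point at a time instead of one byte at a time.
        var it = (try std.unicode.Utf8View.init(word)).iterator();
        while (it.nextCodepoint()) |cp| {
            std.debug.print("  U+{X:0>4}\n", .{cp});
        }
    }
}
```

Indexing `word[0]` gives only the first byte of the first letter, which is why iterating by code point is usually what you want for text like this.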

sohro_desu · January 19, 2025, 1:19pm

Oof, I should have known this better from my C learning journey and the project I’ve written… But it’s been such a while since then, I forgot many things

squeek502 · January 19, 2025, 1:19pm

That is an array of bytes, not characters. Note that when working with Unicode, “character” is not a very useful concept. Instead, you might want to familiarize yourself with the terms code point and grapheme

These articles are a bit old, but they do a good job of introducing Unicode concepts: Unicode Basics in Zig - Zig NEWS

ш is the Unicode code point U+0448, so its “raw” representation is the integer 0x448. This integer can be encoded in a number of ways:

Encoded as UTF-32, it is one u32 code unit with the value 0x00000448
Encoded as UTF-16, it is one u16 code unit with the value 0x0448
Encoded as UTF-8, it is two u8 code units with the values 0xD1 0x88

In Zig, code point literals like 'ш' resolve to their code point value, so 'ш' is the integer 0x448, which is too large to fit in a u8. If you used a string literal like "ш", the type would be *const [2:0]u8, since it takes two bytes to encode ш as UTF-8.
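The difference between the code point value and its UTF-8 encoding can be checked with `std.unicode.utf8Encode`. This is only a sketch; the wrapper `encodeCodepoint` is a made-up name, not part of the standard library.

```zig
const std = @import("std");

/// Hypothetical helper: encodes one code point as UTF-8 into `buf`
/// and returns the bytes that were written.
pub fn encodeCodepoint(cp: u21, buf: *[4]u8) ![]const u8 {
    const len = try std.unicode.utf8Encode(cp, buf);
    return buf[0..len];
}

pub fn main() !void {
    var buf: [4]u8 = undefined;
    // 'ш' is the integer 0x448; its UTF-8 form is two bytes.
    const bytes = try encodeCodepoint('ш', &buf);
    std.debug.print("U+{X:0>4} ->", .{@as(u21, 'ш')});
    for (bytes) |b| std.debug.print(" 0x{X:0>2}", .{b});
    std.debug.print("\n", .{});
}
```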

Calder-Ty · January 19, 2025, 1:22pm

No, the [_] tells the compiler to figure out the size of the array. The confusion here is with the treatment of “characters”. In Zig, the 'x' syntax is technically a Unicode code point literal, which has the type comptime_int. That means ASCII characters can be treated as u8, but non-ASCII characters have to be treated as integers big enough to fit the Unicode values they represent.

The reason why you can have:

pub const cyrillic_abc: []const u8 = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя";

be an array of u8’s is because zig source files must be utf-8. So the editor can interpret and replace the characters with their utf-8 values. But in the code, it is just a series of bytes.

vulpesx · January 19, 2025, 1:23pm

an array of u8’s does not have to be utf-8, string literals created with “” are utf-8. but you can insert arbitrary bytes with escapes
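As a rough illustration of the escape point, the sketch below puts bytes that are not valid UTF-8 into a string literal and checks both strings with `std.unicode.utf8ValidateSlice`. The constant names are invented for the example.

```zig
const std = @import("std");

const valid = "ш"; // two bytes: 0xD1 0x88
const invalid = "\xff\xfe"; // compiles fine, but these bytes are not valid UTF-8

pub fn main() void {
    // The compiler accepts both literals; only a runtime check tells them apart.
    std.debug.print("{}\n", .{std.unicode.utf8ValidateSlice(valid)});
    std.debug.print("{}\n", .{std.unicode.utf8ValidateSlice(invalid)});
}
```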

sohro_desu · January 19, 2025, 1:25pm

Thank you all, guys! I can’t believe I could get this much support

So, let me carefully read all these articles and reread the replies of you guys. I think I kinda got it.

I’ll post my understanding after I go through all that.

Calder-Ty · January 19, 2025, 1:31pm

Correct. I’m saying that the source file must be utf-8, which can cause confusion between what we see when we type strings, and what the compiler treats it as. For example, we see one unicode codepoint, but the compiler treats it as an array of 1,2,3 or 4 bytes.

dee0xeed · January 19, 2025, 1:37pm

You can do simply

const std = @import("std");
const log = std.debug.print;

pub fn main() void {
    const s = "привет"; // similar to C's char *s = "привет";
    log("{s}\n", .{s});
}

squeek502 · January 19, 2025, 1:43pm

Here it is; it was written in 2021 so it would need to be updated to work with any recent Zig version.

Other major caveats:

Looking back at it now, this code is not that good and I’ll probably end up rewriting it if I get around to using it (the plan was to use it as part of a music library management program)
ComptimeStringMap is what is now named StaticStringMap, but I’m abusing it here and some other lookup table method would probably be better (StaticStringMap is most optimal when the keys have a good amount of variance in their lengths)
ziglyph has been superseded by zg
I’m not dealing with normalization which is probably a mistake

The basic idea is fine, though–take in UTF-8 and output UTF-8, iterating over graphemes (and potentially doing word splitting, but I don’t remember now if that was absolutely necessary or not).
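For readers who want a feel for the "UTF-8 in, UTF-8 out" idea, here is a much simplified sketch. It is not the code linked above: it iterates over code points rather than graphemes, uses a fixed output buffer, and the `latinFor` table and `latinize` function are hypothetical names with a made-up four-letter mapping.

```zig
const std = @import("std");

/// Hypothetical mapping for a handful of Cyrillic code points only;
/// a real table depends on the target alphabet.
fn latinFor(cp: u21) ?[]const u8 {
    return switch (cp) {
        'м' => "m",
        'а' => "a",
        'й' => "y",
        'ш' => "sh",
        else => null,
    };
}

/// Transliterates `input` (UTF-8) into `out` (UTF-8) and returns the written part.
pub fn latinize(input: []const u8, out: []u8) ![]const u8 {
    var it = (try std.unicode.Utf8View.init(input)).iterator();
    var n: usize = 0;
    while (it.nextCodepoint()) |cp| {
        var tmp: [4]u8 = undefined;
        // Unknown code points pass through unchanged (re-encoded as UTF-8).
        const piece: []const u8 = latinFor(cp) orelse blk: {
            const len = try std.unicode.utf8Encode(cp, &tmp);
            break :blk tmp[0..len];
        };
        if (n + piece.len > out.len) return error.NoSpaceLeft;
        @memcpy(out[n..][0..piece.len], piece);
        n += piece.len;
    }
    return out[0..n];
}

pub fn main() !void {
    var buf: [64]u8 = undefined;
    std.debug.print("{s}\n", .{try latinize("май!", &buf)});
}
```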

sohro_desu · January 19, 2025, 2:00pm

A draft of the post of “My understanding in this topic”

What is Unicode?

In simple terms and to my understanding, it’s a character that humans can recognize that is encoded into a number that machines can understand. Unicode has ranges. The first range contains ASCII characters, which can be encoded (to be able to work with them in machines) with just 1 byte. 1 byte = 8 bits. So, it can be encoded using u8 in Zig. The rest of the ranges have Cyrillic characters somewhere, but they can’t fit into 1 byte when encoding, hence we use more than 1 byte.

Unfinished

I’m sorry, everyone, but I just got a work assignment with a strict deadline. I will finish this topic and give a full and clear answer to myself and to everyone browsing the Internet as soon as I finish my work.

Again, thank you so much, everyone. I’m super excited and deeply grateful that we have this community!

dee0xeed · January 19, 2025, 2:15pm

Though, that does not work for “array of strings”, you have to write [_][]const u8 {, as in your OP:

const std = @import("std");
const log = std.debug.print;

pub fn main() void {
    const days_of_week = [_][]const u8 {
        "пнд", "втр", "срд", "чтв", "птн", "сбт", "вск"
    };
    for (days_of_week, 0..) |dow, k| {
        log("day[{}] = {s}\n", .{k + 1, dow});
    }
}
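Since the original question asked for "maximum verbosity", here is one way the same kind of array could be spelled out step by step. This is a sketch with invented names (`short`, `explicit`, `verbose`), using two of the month names from the OP.

```zig
const std = @import("std");

// Sugar: the length is inferred, and each literal coerces to a slice.
const short = [_][]const u8{ "март", "май" };

// Same thing with the length and the declared type written out.
const explicit: [2][]const u8 = [2][]const u8{ "март", "май" };

// Fully spelled out: the UTF-8 bytes of each word as plain byte arrays...
const mart_bytes = [8]u8{ 0xD0, 0xBC, 0xD0, 0xB0, 0xD1, 0x80, 0xD1, 0x82 };
const may_bytes = [6]u8{ 0xD0, 0xBC, 0xD0, 0xB0, 0xD0, 0xB9 };
// ...and an array of slices pointing at them.
const verbose: [2][]const u8 = [2][]const u8{ &mart_bytes, &may_bytes };

pub fn main() void {
    for (verbose) |word| std.debug.print("{s}\n", .{word});
}
```

All three arrays hold the same bytes; the string literal form just lets the compiler copy the UTF-8 bytes out of the source file for you.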

sohro_desu · January 21, 2025, 11:58pm

To answer my question, I first need to understand how the characters are represented in machines.

How are the characters represented in machines?

At the end of the day, what computers really understand are 0s and 1s, to compute them together. No pun intended. So we gotta find a way to represent characters in 0s and 1s. That’s how we get the character encoding, specifically UTF-8 scheme in my case.

And UTF-8 happens to be compatible with ASCII (another character encoding scheme that existed long before UTF-8 was invented). What we mostly type, at least in English using Latin characters, can be encoded using the good old ASCII, which comprises 128 characters. ASCII characters can be encoded using only 1 byte = 8 bits.

For example, in ASCII: the small ‘a’ is encoded as 97 in decimal, which is 0110 0001 in the binary that machines understand.

And this 97, or 0110 0001, or 0x61 in hexadecimal, is a code point that indicates what character it represents in a character encoding scheme. It’s like key/value in a Python dictionary {‘a’: 97}. I believe this is a simplified explanation. In actuality, there’s more to it, which I won’t go into here.

But modern systems use UTF-8 to encode characters(?), or rather, unicodes. Unicodes are basically the same thing, but not limited to only the characters that we know from ASCII. Because with ASCII alone, we can’t type “日本語ってマジかんたんよぉ”, or “но не могу же я сказать то же самое о русском”, or our beloved emojis! But with Unicode, we can.

If ASCII has only 128 characters and symbols, Unicode has room for a whole 1,112,064 code points.

If you remember that one needs only 1 byte to encode any ASCII character, Unicode needs 1 or more bytes depending on what unicode we’re typing. And a Unicode can be encoded using the UTF-8 character encoding scheme, among many others.

For example, in UTF-8: to encode Cyrillic ‘ё’ (code point U+0451), we now need two bytes: 0xD1 and 0x91 in hex. So we can’t have ё without those two bytes, ёмаё.

Interestingly, we need 3 bytes to encode ‘漢’ (code point U+6F22): 0xE6, 0xBC, and 0xA2.
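Where those bytes come from can be worked out by hand: UTF-8 spreads the bits of the code point over a lead byte and one or more continuation bytes. The sketch below does this with plain bit operations for the two-byte and three-byte cases only. The function names are made up, and the three-byte version does not reject the surrogate range.

```zig
const std = @import("std");

/// Hypothetical helper: encodes a code point in U+0080..U+07FF as two UTF-8 bytes.
pub fn encodeTwoBytes(cp: u21) [2]u8 {
    std.debug.assert(cp >= 0x80 and cp <= 0x7FF);
    return .{
        0b1100_0000 | @as(u8, @intCast(cp >> 6)), // 110xxxxx: top 5 bits
        0b1000_0000 | @as(u8, @intCast(cp & 0x3F)), // 10xxxxxx: low 6 bits
    };
}

/// Hypothetical helper: encodes a code point in U+0800..U+FFFF as three UTF-8 bytes.
pub fn encodeThreeBytes(cp: u21) [3]u8 {
    std.debug.assert(cp >= 0x800 and cp <= 0xFFFF);
    return .{
        0b1110_0000 | @as(u8, @intCast(cp >> 12)), // 1110xxxx: top 4 bits
        0b1000_0000 | @as(u8, @intCast((cp >> 6) & 0x3F)), // 10xxxxxx: middle 6 bits
        0b1000_0000 | @as(u8, @intCast(cp & 0x3F)), // 10xxxxxx: low 6 bits
    };
}

pub fn main() void {
    const two = encodeTwoBytes(0x0451); // ё
    const three = encodeThreeBytes(0x6F22); // 漢
    for (two) |b| std.debug.print("0x{X:0>2} ", .{b});
    std.debug.print("\n", .{});
    for (three) |b| std.debug.print("0x{X:0>2} ", .{b});
    std.debug.print("\n", .{});
}
```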

Now, we know what the characters are, and how they are represented in machines. But what now? Now is the time to rephrase my question.

Why can’t I have an array of unicode character literals that are [ ]u8?

For example, why can I have this:

const pog = [_][]const u8{ "Ы?", "へぅえ？", "汉字" };

but not this?

const unpog = [_]u8{ 'ы', 'ぇ', '啊' };
// the compiler asks you to change the u8 to u16

The answer is…

In short: because I don’t know the first thing about Zig arrays yet.

Or in detail:
As @Calder-Ty said, a single unicode character literal gets treated as comptime_int or a big enough integer if we’re to assign a type to it. And in arrays, “strings” are processed by the compiler to be converted into a UTF-8 array of u8 elements(?)…

…which actually leaves me with more questions ;')

Questions like:

Is the ‘a’ always an integer?
Are character literals in an array also integers?
Does only “strings consisting of character literals in double quotes” in an array get processed by the compiler to become the array of u8 elements? Assuming that I already know that accessing one element of such an array is a mistake.

I’m sorry for the many questions. As I’m writing this up, I look at the time and see 4:53 AM. I started researching the topic right after I submitted my assignment at about 1 AM today… ;') Okay, I’ll try to dig more tomorrow with a clearer mind.

But seriously, thanks to all your help and guidance, I was able to research more accurately and the results were on point. Thank you guys!

Calder-Ty · January 22, 2025, 4:13am

This is a great write-up. I think with a few gaps filled it would be good to put in the Docs section.

Quickly some questions answered:

Is the ‘a’ always an integer?

Yes.

Are character literals in an array also integers?

Yes.

Does only “strings consisting of character literals in double quotes” in an array get processed by the compiler to become the array of u8 elements? Assuming that I already know that accessing one element of such an array is a mistake.

I think this question helps get at the crux of the problem. When Zig says it has no string type, it means it. From a type standpoint there is (almost*) no difference between these two values:

const string = "Hello World";
const array = [11]u8{ 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100 };

In zig there aren’t strings, just arrays of bytes that can be interpreted as utf-8.

This isn’t quite correct. Zig makes no guarantees about the encoding of data in string literals. You can put in invalid utf-8 bytes, using escapes. The only thing that must be utf-8 is the source file itself.

So getting back to the question at hand.

The reason why This:

const unpog = [_]u8{ 'ы', 'ぇ', '啊' };
// the compiler asks you to change the u8 to u16

Doesn’t work is because

unpog is not an array of u8’s, but an array of some integers bigger than u8.

const pog = [_][]const u8{ "Ы?", "へぅえ？", "汉字" };

This works because pog is an array of slices (loosely, an array of arrays). Each of those string literals can be treated like an array of bytes.

So in the first case we are dealing with an array of integers larger than a byte, whereas in the second case we are dealing with an array of slices of bytes.

* Technically, string literals are guaranteed to have a null byte appended to them, for ease of use with C APIs. But this difference has no real bearing on the question here.
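A short sketch that puts these points side by side, using the two values from the post plus an array of character literals with an element type wide enough to hold them (the name `code_points` and the choice of `u21` are assumptions for the example).

```zig
const std = @import("std");

const string = "Hello World"; // type: *const [11:0]u8
const array = [11]u8{ 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100 };

// Character literals are code point values, so the element type must be wide enough.
// u21 can hold any Unicode code point.
const code_points = [_]u21{ 'ы', 'ぇ', '啊' };

pub fn main() void {
    // Same bytes, whether written as a literal or as numbers.
    std.debug.print("{}\n", .{std.mem.eql(u8, string, &array)});
    // The sentinel sits one past the end of the literal.
    std.debug.print("{d}\n", .{string[string.len]});
    for (code_points) |cp| std.debug.print("{u} = U+{X:0>4}\n", .{ cp, cp });
}
```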

vulpesx · January 22, 2025, 4:19am

pretty much

yes, assuming the array element type can hold them

kind of, remember in UTF-8 a character can take up to 4 bytes, depending on the character, so it isn’t a 1 to 1 mapping of characters to bytes

sohro_desu:

For example, why can I have this:

const pog = [_][]const u8{ "Ы?", "へぅえ？", "汉字" };

but not this?

const unpog = [_]u8{ 'ы', 'ぇ', '啊' };
// the compiler asks you to change the u8 to u16

because string literals are UTF-8 (not necessarily, because of escapes), so they can be stored as []const u8
but a character literal is just the code point value, and which integer type can hold it depends on the character: u8 is enough for ASCII (remember ASCII is also valid UTF-8), u16 covers code points up to U+FFFF, and u21 covers any Unicode code point. the compiler is telling you it can’t represent those characters in a u8, but it can in a u16