Checking equality of null-terminated strings

n0s4 · October 23, 2024, 7:50pm

I’m working with C from Zig which involves null-terminated strings, and in more than one place I have to check the equality of the contents of these strings.

Of course, I can use std.mem.span to get a slice for use in std.mem.eql, but that involves iterating each string twice (which isn’t expensive, but it bugs me). I could also iterate each string manually, stopping at a null char, but that feels verbose for such a simple operation.

Could this qualify as something useful enough to be in std, seeing as std already provides many utilities for null-terminated strings? I’m thinking of something along the lines of ‘std.mem.eqlZ’.
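
For reference, the std.mem.span plus std.mem.eql approach described above might look roughly like this. The helper name eqlViaSpan is made up for illustration and is not part of std:

```zig
const std = @import("std");

/// Hypothetical helper (not in std): compares two C strings by
/// converting each one to a slice first.
fn eqlViaSpan(a: [*:0]const u8, b: [*:0]const u8) bool {
    // std.mem.span scans each string once to find its terminator.
    const sa = std.mem.span(a);
    const sb = std.mem.span(b);
    // std.mem.eql checks the lengths first, then compares the bytes,
    // which is the second pass over the data.
    return std.mem.eql(u8, sa, sb);
}

test "eqlViaSpan" {
    try std.testing.expect(eqlViaSpan("abc", "abc"));
    try std.testing.expect(!eqlViaSpan("abc", "abx"));
    try std.testing.expect(!eqlViaSpan("abc", "abcd"));
}
```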

dimdin · October 23, 2024, 9:10pm

It is easy to implement, but the result is too C-oriented.

const std = @import("std");

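/// Returns true if the two null-terminated strings have identical contents.
pub fn eqlZ(s1: [*:0]const u8, s2: [*:0]const u8) bool {
    var p1 = s1;
    var p2 = s2;
    // Advance while the bytes match and the end of s1 has not been reached.
    while (p1[0] != 0 and p1[0] == p2[0]) {
        p1 += 1;
        p2 += 1;
    }
    // Equal only if both pointers stopped at a terminator.
    return p1[0] == p2[0];
}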

test {
    try std.testing.expect(eqlZ("", ""));
    try std.testing.expect(eqlZ("a", "a"));
    try std.testing.expect(eqlZ("ab", "ab"));
    try std.testing.expect(eqlZ("abc", "abc"));
    try std.testing.expect(!eqlZ("a", ""));
    try std.testing.expect(!eqlZ("", "x"));
    try std.testing.expect(!eqlZ("a", "x"));
    try std.testing.expect(!eqlZ("a", "ax"));
    try std.testing.expect(!eqlZ("ab", "ax"));
    try std.testing.expect(!eqlZ("ab", "abx"));
    try std.testing.expect(!eqlZ("abc", "abx"));
    try std.testing.expect(!eqlZ("abc", "abcx"));
}

geon · October 24, 2024, 6:17am

I would add arguments for the size of the respective buffers; l1: usize and l2: usize.

Or if all strings originally come from the same buffer, the end pointer of that buffer.

Sze · October 24, 2024, 6:43am

The whole point was that the size is unknown before scanning the null-terminated string, so I think it is sensible to assume that the buffers and their size are also unknown.

What is your intent?
Seems like you want to do additional bounds checks, to make sure the data is correctly zero terminated?

But that seems a bit redundant: either it is or it isn’t. Personally, I think testing / asserts would be enough to make sure it is correctly terminated.

geon · October 24, 2024, 7:27am

> I think it is sensible to assume that the buffers and their size are also unknown.

No. You definitely should know how large the buffer is, even if you don’t know the length of the actual string.

You might not need a separate argument, though. Even a null-terminated string has a compile-time len property (see Documentation - The Zig Programming Language).

> What is your intent?

To avoid reading outside of the buffer if the string is not terminated. The type says it should be, but you have no real guarantee.

What happens if you access outside the buffer in Zig? I think you get a runtime check and a panic in debug mode, but nothing in release.

> I think testing / asserts would be enough

You can’t test data outside of your control, like files or network. You could fuzz-test, but that would reveal the exact issue I’m pointing at.
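
A bounded comparison along these lines could be sketched as follows. The name eqlZBounded, the single shared max_len parameter, and the ?bool return type are all assumptions made for illustration; nothing like this exists in std as far as this thread establishes:

```zig
const std = @import("std");

/// Hypothetical helper (not in std): compares two strings but never reads
/// more than `max_len` bytes from either one. Returns null when the strings
/// match for `max_len` bytes without a terminator being found.
fn eqlZBounded(a: [*]const u8, b: [*]const u8, max_len: usize) ?bool {
    var i: usize = 0;
    while (i < max_len) : (i += 1) {
        // A mismatch also covers the case where only one string has ended.
        if (a[i] != b[i]) return false;
        // Bytes match, so a 0 here means both strings ended together.
        if (a[i] == 0) return true;
    }
    return null;
}

test "eqlZBounded" {
    try std.testing.expectEqual(@as(?bool, true), eqlZBounded("abc", "abc", 16));
    try std.testing.expectEqual(@as(?bool, false), eqlZBounded("abc", "abx", 16));

    // No terminator within the bound: the result is "unknown", not a read
    // past the end of the array.
    const raw = [_]u8{ 'a', 'b', 'c' };
    try std.testing.expectEqual(@as(?bool, null), eqlZBounded(&raw, &raw, raw.len));
}
```

The caller still has to supply a bound that is valid for both buffers, which is exactly the information that is usually missing when a pointer comes back from C.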

Sze · October 24, 2024, 8:08am

When you work with C, you often just get pointers to null-terminated strings and don’t know anything more about the buffers that contain them; they may be allocated by some C code, or they may point to static read-only data on the C side.

You avoid that by carefully thinking through invariants. Yes, it is possible to make mistakes, but this isn’t Rust, where you try to formally prove things in the type system. Avoiding bugs is the programmer’s responsibility. You are right that all string literals are zero-terminated and have a length property, but that already means that for those it is enough to check either for the zero or for the length.
Checking for both is unfounded paranoia; it doesn’t add any additional security.

You can test the code that you write and the other stuff by definition can’t be tested.

We have a problem statement of “hey, let’s compare two zero-terminated strings” here, and you are basically saying “But what if one of the strings isn’t zero-terminated?” That is a different problem; this topic isn’t about a case where somebody didn’t zero-terminate their string.

n0s4 · October 24, 2024, 10:15am

I don’t understand what “too C oriented” means, too C oriented to be in std? If so, what makes the std functions like mem.orderZ, mem.joinZ, Allocator.dupeZ, fs.Dir.openFileZ, etc. not too C oriented?

jibal · October 24, 2024, 12:38pm

There is no buffer, there’s just a pointer. In C it’s char* and in Zig it’s [*:0]const u8. There are no known lengths to pass in … the length is determined by where the NUL (“sentinel” in Zig) is relative to the pointer. It’s simply not true that “even a null terminated string has a compile time len property”. Your link to the Zig documentation references const arrays, where of course the length is known at compile time. A more relevant reference is Documentation - The Zig Programming Language (ziglang.org), which says “The syntax [*:x]T describes a pointer that has a length determined by a sentinel value”, so the length must be calculated by scanning memory. This is very well known to all C programmers. In Zig one would only use [*:0]const u8 to interface with C … otherwise one uses slices, which contain both a pointer and a length, so there’s no need for a sentinel or the O(n) time to scan for it.
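
A small test can illustrate the distinction between an array with a known length and a sentinel-terminated pointer. It is only a sketch using std.mem.len and std.mem.span, both of which find the length by scanning for the sentinel:

```zig
const std = @import("std");

test "array length vs. sentinel-terminated pointer" {
    // A string literal is a pointer to an array (*const [5:0]u8), so its
    // length is part of the type.
    const literal = "hello";
    try std.testing.expectEqual(@as(usize, 5), literal.len);

    // Coercing to a many-item sentinel-terminated pointer drops the length;
    // this type has no len field at all.
    const ptr: [*:0]const u8 = literal;

    // The length has to be recovered by scanning for the 0 sentinel.
    try std.testing.expectEqual(@as(usize, 5), std.mem.len(ptr));
    try std.testing.expectEqual(@as(usize, 5), std.mem.span(ptr).len);
}
```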

dimdin · October 24, 2024, 1:09pm

The implementation is too C-oriented for my taste; it is essentially the strcmp implementation with [0] instead of * for dereferencing.
This has nothing to do with its usefulness or with whether the Zig std lib should have it.

jibal · October 24, 2024, 1:14pm

openFileZ exists because the POSIX API takes NUL-terminated paths so that function is necessary, and in fact the regular openFile function calls openFileZ on POSIX systems.

dupeZ is necessary in order to allocate the extra byte for the NUL when converting a []T into a [:0]T.

joinZ does something similar but from several []const u8s.

Note that those two generate NUL-terminated strings which can then be passed to C. They don’t operate on C strings within Zig, which is not the idiomatic way to do things.
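
As a rough illustration of how those two produce NUL-terminated strings from ordinary slices (the string values are invented for the example, and std.testing.allocator is used only so the sketch runs as a test):

```zig
const std = @import("std");

test "dupeZ and joinZ produce NUL-terminated copies" {
    const allocator = std.testing.allocator;

    // dupeZ copies a plain slice and appends the terminating 0 byte.
    const slice: []const u8 = "path";
    const copy = try allocator.dupeZ(u8, slice);
    defer allocator.free(copy);
    try std.testing.expectEqual(@as(usize, 4), copy.len);
    try std.testing.expectEqual(@as(u8, 0), copy[copy.len]);

    // joinZ concatenates several slices with a separator, also terminated.
    const joined = try std.mem.joinZ(allocator, "/", &[_][]const u8{ "usr", "lib" });
    defer allocator.free(joined);
    try std.testing.expectEqualStrings("usr/lib", joined);

    // The .ptr of a [:0]u8 can be handed to C as a [*:0]const u8.
    const c_ptr: [*:0]const u8 = joined.ptr;
    try std.testing.expectEqual(@as(usize, 7), std.mem.len(c_ptr));
}
```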

jibal · October 24, 2024, 1:26pm

If you use the same strings repeatedly then you can convert them into slices once, which reduces the number of scans. (In fact, if you’re comparing strings of different lengths then there will be fewer scans because eql will return false if the O(1) length comparison fails.)

The Zig library pretty much expects you to do so, so it doesn’t provide a bunch of functions that operate on C strings. If it’s really a bottleneck for your application then you can code up a comparison function as someone provided, but I suspect that it isn’t … as Donald Knuth said (somewhat hyperbolically), “premature optimization is the root of all evil”. This really does apply here because if you’re comparing one C string to a bunch of other C strings, the comparison will fail immediately if the lengths aren’t equal and if they are it will fail at the first unequal byte, so you will usually do less work by converting to a slice up front.
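
A sketch of that convert-once pattern, assuming the candidates are already slices. The function name indexOfCString and the sample strings are hypothetical:

```zig
const std = @import("std");

/// Hypothetical helper (not in std): finds a C string among slice candidates.
fn indexOfCString(needle: [*:0]const u8, candidates: []const []const u8) ?usize {
    // The only scan for the terminator happens here, once.
    const key = std.mem.span(needle);
    for (candidates, 0..) |candidate, i| {
        // std.mem.eql returns false right away when the lengths differ.
        if (std.mem.eql(u8, key, candidate)) return i;
    }
    return null;
}

test "indexOfCString" {
    const names = [_][]const u8{ "alpha", "beta", "gamma" };
    try std.testing.expectEqual(@as(?usize, 1), indexOfCString("beta", &names));
    try std.testing.expectEqual(@as(?usize, null), indexOfCString("delta", &names));
}
```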

n0s4 · October 24, 2024, 1:36pm

That makes sense. I can live without it.

jibal · October 24, 2024, 1:36pm

Cool! Glad I could help.

castholm · October 24, 2024, 4:41pm

The canonical way to compare two null-terminated C strings in Zig is to use std.mem.orderZ. To test for equality, simply do std.mem.orderZ(u8, retrieved_value, value) == .eq.

Depending on the specific needs of your program it might make more sense to use std.mem.span for reasons already mentioned in the thread (e.g. performance and safety), but for things like porting C code to Zig it’s usually fine to just use std.mem.orderZ as a stand-in for strcmp.
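
A minimal example of std.mem.orderZ used this way (the string values are placeholders):

```zig
const std = @import("std");

test "orderZ as a strcmp stand-in" {
    const a: [*:0]const u8 = "abc";
    const b: [*:0]const u8 = "abd";

    // Equality check, the equivalent of strcmp(a, a) == 0.
    try std.testing.expect(std.mem.orderZ(u8, a, a) == .eq);

    // orderZ also reports ordering, like the sign of strcmp's result.
    try std.testing.expect(std.mem.orderZ(u8, a, b) == .lt);
    try std.testing.expect(std.mem.orderZ(u8, b, a) == .gt);
}
```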