Understanding "sentinels" (slices, arrays, pointers)

chris · December 26, 2023, 11:31am

Dunno if the way I’m learning Zig is the right way, but I pick concepts from the reference documentation and try to grasp them.

When it comes to slice sentinels, I don’t get it.
I tend to consider that the only use case is dealing with 0-terminated C strings.

But beyond this use case, when should I use sentinel-terminated slices?

And a second question: what is a sentinel-terminated pointer?

When I read the doc

The syntax [*:x]T describes a pointer that has a length determined by a sentinel value. This provides protection against buffer overflow and overreads.

Does it mean that the compiler will generate code that prevents writes through the pointer beyond its original size?

mperillo · December 26, 2023, 12:36pm

No. Having a sentinel value means that a program in a low level programming language (like C) can easily check when to terminate a loop.

See the Wikipedia articles "Null-terminated string" and "Sentinel value".

An alternative (in low level languages) is to pass both the pointer and the length to a function.

castholm · December 26, 2023, 12:37pm

The purpose of a sentinel (such as the 0 in a C string) is to mark the end of the sequence of elements. You obtain the length/end of the sequence by counting each element in a loop until you encounter the sentinel.
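To make the "count until you hit the sentinel" idea concrete, here is a minimal sketch. The helper name `countUntilSentinel` is made up for illustration; `std.mem.len` is the standard library function that performs the same scan.

```zig
const std = @import("std");

// Hypothetical helper: walk a 0-terminated pointer until the sentinel is found.
// The pointer carries no length, so this is an O(n) scan.
fn countUntilSentinel(ptr: [*:0]const u8) usize {
    var n: usize = 0;
    while (ptr[n] != 0) : (n += 1) {}
    return n;
}

test "length is found by scanning for the sentinel" {
    // A string literal is a pointer to a 0-terminated array,
    // so it coerces to [*:0]const u8.
    const s: [*:0]const u8 = "hello";
    try std.testing.expectEqual(@as(usize, 5), countUntilSentinel(s));
    // std.mem.len does the same scan for sentinel-terminated pointers.
    try std.testing.expectEqual(@as(usize, 5), std.mem.len(s));
}
```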

Slices carry both a length and a pointer with them at all times, so since you already have a length and know where the sequence ends, sentinel-terminated slices are more or less unnecessary in pure Zig code. But they are useful if you have a slice that will eventually be passed to a C API that expects a sentinel-terminated pointer (such as a 0-terminated string).

With a []const u8 slice there’s no guarantee that the string will have a terminating 0 byte, so you would need to make a copy of it and append a final 0 byte (usually with Allocator.dupeZ) if you wanted to pass it to a C API. But with [:0]const u8 you already know for sure that there’s a terminating 0 byte, so there’s no need for the copy.
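A rough sketch of the two situations described above, using `std.testing.allocator` only because it is convenient in a test. The variable names are illustrative.

```zig
const std = @import("std");

test "dupeZ copies a slice and appends a 0 sentinel" {
    const allocator = std.testing.allocator;

    // A plain slice: nothing in the type guarantees a 0 byte after the last element.
    const plain: []const u8 = "hello";

    // dupeZ allocates room for the bytes plus the terminator and returns a [:0]u8.
    const z = try allocator.dupeZ(u8, plain);
    defer allocator.free(z);

    try std.testing.expectEqual(@as(usize, 5), z.len);
    // Indexing at z.len is allowed on a sentinel-terminated slice: it reads the sentinel.
    try std.testing.expectEqual(@as(u8, 0), z[z.len]);

    // With a [:0]const u8 the terminator is already guaranteed by the type,
    // so .ptr can be handed to a C-style API without any copy.
    const already: [:0]const u8 = "world";
    const c_ptr: [*:0]const u8 = already.ptr;
    try std.testing.expectEqual(@as(u8, 'w'), c_ptr[0]);
}
```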

The safety is enforced via compile errors. If you have a function with a parameter of type [*:0]const u8, passing a pointer of type [*]const u8 (note: no sentinel) to it is a compile error.
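A small sketch of what that looks like in practice. `takesCString` is a hypothetical stand-in for a C API, and the rejected call is left commented out so the file still compiles.

```zig
const std = @import("std");

// Hypothetical stand-in for a C function that expects a 0-terminated string.
fn takesCString(s: [*:0]const u8) usize {
    return std.mem.len(s);
}

test "only sentinel-carrying pointers are accepted" {
    const lit = "zig"; // type: *const [3:0]u8, coerces to [*:0]const u8
    try std.testing.expectEqual(@as(usize, 3), takesCString(lit));

    const z: [:0]const u8 = lit;
    try std.testing.expectEqual(@as(usize, 3), takesCString(z.ptr));

    // Uncommenting the next two lines should be rejected at compile time,
    // because [*]const u8 does not promise a sentinel:
    // const plain: [*]const u8 = lit;
    // _ = takesCString(plain);
}
```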

But, if you were to cast the [*]const u8 pointer to [*:0]const u8 using @ptrCast, the compiler won’t insert any code that checks that the pointer actually contains a 0 byte (you are more or less telling the compiler that “I know better”), so this could potentially be unsafe and result in a buffer overflow.

When using the slice syntax like buf[0..end :0], however, there is additional safety-checking code inserted which checks for the presence of a sentinel and protects against buffer overflows (in Debug and ReleaseSafe mode).
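Here is a minimal sketch of that slicing form. The function `sliceWithSentinel` is a hypothetical wrapper, used here so that `end` is only known at runtime.

```zig
const std = @import("std");

// Hypothetical wrapper: `end` is a runtime value, so in Debug and ReleaseSafe
// builds the compiler inserts a check that buf[end] really is 0.
fn sliceWithSentinel(buf: []const u8, end: usize) [:0]const u8 {
    return buf[0..end :0];
}

test "slicing with a sentinel yields a sentinel-terminated slice" {
    const data = [_]u8{ 'a', 'b', 'c', 0, 'x' };

    const s = sliceWithSentinel(&data, 3);
    try std.testing.expectEqual(@as(usize, 3), s.len);
    try std.testing.expectEqual(@as(u8, 0), s[s.len]);

    // sliceWithSentinel(&data, 2) would trip the safety check in safe build
    // modes, because data[2] is 'c' and not the promised sentinel.
}
```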

chris · December 26, 2023, 1:00pm

So it seems obvious that I need to make better choices when picking a Zig topic to learn about. This was a wrong shot.

Makes sense for C strings, and all other use cases are rare edge cases related to C library calls that might expect a terminating (sentinel) value.

mperillo · December 26, 2023, 2:48pm

For your first question, sentinel terminated slices are used in the std when parsing a Zig source file, as it simplifies the code:

github.com/ziglang/zig/blob/94c63f3/lib/std/zig/tokenizer.zig#L337

                .eof => "EOF",
                .builtin => "a builtin function",
                .number_literal => "a number literal",
                .doc_comment, .container_doc_comment => "a document comment",
                else => unreachable,
            };
        }
    };
};

pub const Tokenizer = struct {
    buffer: [:0]const u8,
    index: usize,
    pending_invalid_token: ?Token,

    /// For debugging purposes
    pub fn dump(self: *Tokenizer, token: *const Token) void {
        std.debug.print("{s} \"{s}\"\n", .{ @tagName(token.tag), self.buffer[token.loc.start..token.loc.end] });
    }

    pub fn init(buffer: [:0]const u8) Tokenizer {

github.com/ziglang/zig/blob/master/lib/std/zig/tokenizer.zig#L456

var state: State = .start;
var result = Token{
    .tag = .eof,
    .loc = .{
        .start = self.index,
        .end = undefined,
    },
};
var seen_escape_digits: usize = undefined;
var remaining_code_units: usize = undefined;
while (true) : (self.index += 1) {
    const c = self.buffer[self.index];
    switch (state) {
        .start => switch (c) {
            0 => {
                if (self.index != self.buffer.len) {
                    result.tag = .invalid;
                    result.loc.start = self.index;
                    self.index += 1;
                    result.loc.end = self.index;
                    return result;
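The excerpt above is cut off, so here is a much smaller sketch of the same idea, not the std tokenizer itself. `countNumbers` is a made-up function that counts runs of ASCII digits. Because the buffer is `[:0]const u8`, the loop can read `buffer[i]` without a separate `i < buffer.len` test and simply treat 0 as "end of input". Unlike the real tokenizer, this sketch does not distinguish an embedded 0 byte from the terminator.

```zig
const std = @import("std");

// Hypothetical mini-scanner: counts runs of ASCII digits in a 0-terminated buffer.
fn countNumbers(buffer: [:0]const u8) usize {
    var count: usize = 0;
    var in_number = false;
    var i: usize = 0;
    while (true) : (i += 1) {
        // Indexing at buffer.len is legal for a [:0] slice: it reads the sentinel.
        switch (buffer[i]) {
            0 => {
                // End of input (or an embedded 0, which this sketch treats the same way).
                if (in_number) count += 1;
                return count;
            },
            '0'...'9' => in_number = true,
            else => {
                if (in_number) count += 1;
                in_number = false;
            },
        }
    }
}

test "the sentinel doubles as the end-of-input marker" {
    try std.testing.expectEqual(@as(usize, 2), countNumbers("12 ab 345"));
    try std.testing.expectEqual(@as(usize, 0), countNumbers(""));
}
```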

mscott9437 · December 26, 2023, 3:14pm

I think this is a great topic for learning Zig, even at the early stages. It should be emphasized how C does things, versus how Zig does things, with some clear guidance on interop. Null-terminated pointers are critical when working with C and character arrays, so even though they've been de-emphasized in Zig, I think there is still a lot of value there for understanding how Zig works in general.

efjimm · December 27, 2023, 1:27pm

A null-terminated pointer is half the size of a slice (one word instead of two), so there are legitimate use cases for reducing memory usage.
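A quick way to check the size claim, assuming a typical target where a slice is stored as a pointer plus a `usize` length:

```zig
const std = @import("std");

test "a many-item pointer is one word, a slice is two" {
    // The sentinel-terminated pointer stores only an address.
    try std.testing.expect(@sizeOf([*:0]const u8) == @sizeOf(usize));
    // A slice stores an address and a length, with or without a sentinel in its type.
    try std.testing.expect(@sizeOf([]const u8) == 2 * @sizeOf(usize));
    try std.testing.expect(@sizeOf([:0]const u8) == 2 * @sizeOf(usize));
}
```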

slonik-az · December 28, 2023, 4:23pm

This argument, while true in theory, is very bad advice in practice. Fifty years of C have taught us that null-terminated arrays are a major cause of serious security bugs. Also, getting the length of a null-terminated array is O(n) at runtime versus O(1) for a slice. Last but not least, the memory savings from not storing a length field are rarely realized, because the allocator’s alignment will pad the null byte all the way up to 4 or 8 bytes, the same size as the length field.

TL;DR: Use null-terminated arrays (pointers, slices) only when you absolutely have to (interfacing with C). Use slices in all other cases.

efjimm · December 28, 2023, 9:14pm

I disagree that it is always bad. An O(n) length check is only a problem for very large arrays, and if you aren’t modifying the array after you create it, then it’s almost impossible to introduce security bugs. Additionally, Zig’s sentinel system and the various stdlib functions for working with sentinel-terminated slices make it far safer than in C.
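As a small illustration of the stdlib helpers mentioned here, this sketch uses `std.mem.span` and `std.mem.sliceTo`. The data is made up for the example.

```zig
const std = @import("std");

test "std.mem helpers for sentinel-terminated data" {
    const c_str: [*:0]const u8 = "hello";

    // span scans once for the sentinel and returns a slice that remembers the
    // length, so later length queries are O(1).
    const s = std.mem.span(c_str);
    try std.testing.expectEqual(@as(usize, 5), s.len);
    try std.testing.expectEqualStrings("hello", s);

    // sliceTo cuts a buffer at the first occurrence of the given value.
    const buf = [_]u8{ 'h', 'i', 0, 0, 0 };
    const t = std.mem.sliceTo(&buf, 0);
    try std.testing.expectEqualStrings("hi", t);
}
```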