What is the difference between tokenizeSequence and splitSequence?

I’m learning Zig by working through Advent of Code 2022, and I need to parse strings. After some googling, I found std.mem.splitSequence as the standard library solution. However, I noticed that splitSequence has some unexpected semantics when the string ends with the delimiter: instead of just ending the iteration, splitSequence yields one more empty element (which caused a bug in my case). After some more digging in the std.mem file, I also found std.mem.tokenizeSequence, and it had the delimiter-at-the-end semantics I was after.

I feel like I’m missing some subtle details that aren’t stated in the documentation. What is the intended use case for tokenizeSequence vs splitSequence? Are there any other important details I should know?


From the doc comments:

/// `splitSequence(u8, "abc||def||||ghi", "||")` will return slices
/// for "abc", "def", "", "ghi", null, in that order.
/// `tokenizeSequence(u8, "<>abc><def<><>ghi", "<>")` will return slices
/// for "abc><def", "ghi", null, in that order.

That is, split will return empty strings between consecutive delimiters, while tokenize will skip over consecutive delimiters.
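To make that concrete, here’s a minimal sketch (assuming Zig 0.11 or newer, where the Sequence/Scalar variants exist under these names) that prints what each iterator yields for the doc-comment inputs:

const std = @import("std");

pub fn main() void {
    // split yields an empty slice between consecutive delimiters.
    var split_it = std.mem.splitSequence(u8, "abc||def||||ghi", "||");
    while (split_it.next()) |slice| {
        std.debug.print("split: \"{s}\"\n", .{slice});
    }
    // prints: "abc", "def", "", "ghi"

    // tokenize treats any run of delimiters as a single separator
    // and ignores leading/trailing delimiters.
    var token_it = std.mem.tokenizeSequence(u8, "<>abc><def<><>ghi", "<>");
    while (token_it.next()) |token| {
        std.debug.print("token: \"{s}\"\n", .{token});
    }
    // prints: "abc><def", "ghi"
}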


One big difference between the tokenize and split functions is that split will produce items that are empty when two consecutive delimiters occur, whereas tokenize will skip over these occurrences. So if the delimiter is , and the text is "a,b,,c" , split will produce "a", "b", "", "c" whereas tokenize will produce "a", "b", "c".
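A quick test pins this down (a sketch using the Scalar variants, since the delimiter here is a single byte; the expected strings are just the ones from the description above):

const std = @import("std");

test "split keeps empty items, tokenize skips them" {
    var split_it = std.mem.splitScalar(u8, "a,b,,c", ',');
    try std.testing.expectEqualStrings("a", split_it.next().?);
    try std.testing.expectEqualStrings("b", split_it.next().?);
    try std.testing.expectEqualStrings("", split_it.next().?);
    try std.testing.expectEqualStrings("c", split_it.next().?);
    try std.testing.expect(split_it.next() == null);

    var token_it = std.mem.tokenizeScalar(u8, "a,b,,c", ',');
    try std.testing.expectEqualStrings("a", token_it.next().?);
    try std.testing.expectEqualStrings("b", token_it.next().?);
    try std.testing.expectEqualStrings("c", token_it.next().?);
    try std.testing.expect(token_it.next() == null);
}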

7 Likes

It seems that split is more suitable for parsing something like NMEA strings, where some fields may be absent, and tokenize is more suitable for parsing programming languages, for example.
Am I right about this?

I guess it all depends on the data you’re working with. Sometimes you have user-provided data that contains lots of empty fields and you want to ignore them all. Other times you can’t afford to ignore the empty fields because you have to maintain a count of how many fields per record or something like that. So it all depends.
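For the NMEA case specifically, split is the right choice precisely because empty fields are positional. A sketch with a made-up, simplified sentence (not a real GPGGA record):

const std = @import("std");

pub fn main() void {
    // Made-up, simplified NMEA-style sentence; two fields are empty.
    const sentence = "$GPGGA,123519,4807.038,N,,,1,08";

    var index: usize = 0;
    var it = std.mem.splitScalar(u8, sentence, ',');
    while (it.next()) |field| : (index += 1) {
        // Empty fields still occupy their slot, so positions stay aligned.
        std.debug.print("field {d}: \"{s}\"\n", .{ index, field });
    }
    // tokenizeScalar would collapse the empty fields and shift every
    // later field's position, breaking positional parsing.
}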


Something else I noticed about split… it will also return an empty string if the delimiter is at the end of the data, in addition to the ones between consecutive delimiters. So:

const std = @import("std");

pub fn main() void {
    const string = "okay-okay-";
    var it = std.mem.splitScalar(u8, string, '-');
    while (it.next()) |chunk| {
        std.debug.print("{s}:", .{chunk});
    }
}

will produce the output “okay:okay::”, with an extra ‘:’ showing that an empty string was printed, while if you change it to tokenizeScalar it will give you “okay:okay:”, without the extra ‘:’.

I noticed this when I was trying to split over ‘\n’ to iterate line-by-line.
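For line iteration the same trade-off applies, and note that tokenize also drops blank lines in the middle of the text, not just the trailing one, so choose accordingly. A sketch:

const std = @import("std");

pub fn main() void {
    const text = "line one\n\nline two\n";

    // splitScalar yields "line one", "", "line two", "": the last
    // empty string comes from the trailing '\n'.
    var split_it = std.mem.splitScalar(u8, text, '\n');
    while (split_it.next()) |line| {
        std.debug.print("split: \"{s}\"\n", .{line});
    }

    // tokenizeScalar yields only "line one", "line two": no trailing
    // empty, but the blank line in the middle is gone too.
    var token_it = std.mem.tokenizeScalar(u8, text, '\n');
    while (token_it.next()) |line| {
        std.debug.print("token: \"{s}\"\n", .{line});
    }
}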

CSVs are a perfect example of this.
