I’m learning Zig by working through Advent of Code 2022, and I need to parse strings. After some googling, I found `std.mem.splitSequence` as the standard library solution. However, I noticed that `splitSequence` has some unexpected semantics when handling a string that ends with the delimiter: instead of just ending the iteration, `splitSequence` yields one more empty element (which caused a bug in my case). After doing some more digging in the `std.mem` file, I also found `std.mem.tokenizeSequence`, and it had the delimiter-ending-string semantics I was after.

I feel like I’m missing some subtle details that aren’t stated in the documentation. What is the intended use case for `tokenizeSequence` vs `splitSequence`? Are there any other important details I should know?
From the doc comments:
```zig
/// `splitSequence(u8, "abc||def||||ghi", "||")` will return slices
/// for "abc", "def", "", "ghi", null, in that order.

/// `tokenizeSequence(u8, "<>abc><def<><>ghi", "<>")` will return slices
/// for "abc><def", "ghi", null, in that order.
```
That is, `split` will return empty strings between consecutive delimiters, while `tokenize` will skip over consecutive delimiters.
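A minimal runnable sketch of the two doc-comment examples above (assuming a Zig 0.11+ standard library, where both functions live in `std.mem` and return iterators whose `next()` yields `?[]const u8`):

```zig
const std = @import("std");

pub fn main() void {
    // splitSequence keeps the empty slices between (and after) delimiters,
    // so "||||" in the middle produces an empty string.
    var split_it = std.mem.splitSequence(u8, "abc||def||||ghi", "||");
    while (split_it.next()) |slice| {
        std.debug.print("split: \"{s}\"\n", .{slice});
    }

    // tokenizeSequence treats any run of delimiters as a single separator
    // and never yields an empty token.
    var tok_it = std.mem.tokenizeSequence(u8, "<>abc><def<><>ghi", "<>");
    while (tok_it.next()) |token| {
        std.debug.print("token: \"{s}\"\n", .{token});
    }
}
```

Note that the sequence variants also differ in how much of the delimiter must match: `tokenizeSequence` only skips complete occurrences of the delimiter string, which is why `"abc><def"` comes out as one token.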
One big difference between the `tokenize` and `split` functions is that `split` will produce items that are empty when two consecutive delimiters occur, whereas `tokenize` will skip over these occurrences. So if the delimiter is `,` and the text is `"a,b,,c"`, `split` will produce `"a"`, `"b"`, `""`, `"c"`, whereas `tokenize` will produce `"a"`, `"b"`, `"c"`.
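The `"a,b,,c"` case above can be sketched with the scalar variants (assuming Zig 0.11+, where `splitScalar` and `tokenizeScalar` take a single `u8` delimiter instead of a delimiter string):

```zig
const std = @import("std");

pub fn main() void {
    const text = "a,b,,c";

    // splitScalar yields an empty slice for the ",," in the middle.
    var split_it = std.mem.splitScalar(u8, text, ',');
    while (split_it.next()) |field| {
        std.debug.print("split: \"{s}\"\n", .{field});
    }

    // tokenizeScalar skips the consecutive delimiters entirely.
    var tok_it = std.mem.tokenizeScalar(u8, text, ',');
    while (tok_it.next()) |token| {
        std.debug.print("token: \"{s}\"\n", .{token});
    }
}
```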
It seems that `split` is more suitable for parsing something like NMEA strings, where some fields may be absent, and `tokenize` is more suitable for parsing programming languages, for example. Am I right about this?
I guess it all depends on the data you’re working with. Sometimes you have user-provided data that contains lots of empty fields and you want to ignore them all. Other times you can’t afford to ignore the empty fields because you have to maintain a count of how many fields per record or something like that. So it all depends.
Something else I noticed about split: it will also return an empty string if the delimiter is at the end of the data, in addition to the ones between delimiters. So
```zig
const string = "okay-okay-";
var it = std.mem.splitScalar(u8, string, '-');
while (it.next()) |chunk| {
    std.debug.print("{s}:", .{chunk});
}
```
will produce the output “okay:okay::”, with an extra ‘:’ showing that a blank string was printed, while if you change it to `tokenizeScalar` it will give you “okay:okay:” without the extra ‘:’.
I noticed this when I was trying to split on `'\n'` to iterate line-by-line.
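A sketch of that line-by-line case (assuming Zig 0.11+): with a trailing newline, `splitScalar` yields a final empty "line" while `tokenizeScalar` stops at the last real line.

```zig
const std = @import("std");

pub fn main() void {
    const data = "line one\nline two\n";

    // splitScalar reports a final empty line after the trailing '\n'.
    var lines = std.mem.splitScalar(u8, data, '\n');
    while (lines.next()) |line| {
        std.debug.print("split line: \"{s}\"\n", .{line});
    }

    // tokenizeScalar stops after "line two".
    var toks = std.mem.tokenizeScalar(u8, data, '\n');
    while (toks.next()) |line| {
        std.debug.print("token line: \"{s}\"\n", .{line});
    }
}
```

Which one you want depends on the format: for text files that conventionally end with a newline, the tokenize behavior is usually what you mean by "iterate over lines".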
CSVs are a perfect example of this.