The art of parsing

I really suck at writing parsing code. I always succeed in the end, but the resulting code mess is usually quite incredible.
So I would like to take the leap and write some understandable code.

Inspired by the chess FEN notation, I invented my own Scrabble-board FEN to describe a position.

const fen = "10tafia/9vegan1/12z2/6xi3me2/5tun2be3/4qi3aLway1/5luge1o1wot/6safetied1/7m2C4/5y1potheens/4vigs2i3o/3p1r4el2l/2chord1curet1d/3i2en3a1bo/6no1ilk1o1 aaeijr dnrrsu 341 339 0";

It has these parts:

  • the board (15 rows separated by '/', each row a sequence of numbers (empty squares) and letters)
  • rack 1 (the current rack of player 1)
  • rack 2 (the current rack of player 2)
  • score 1 (the total score of player 1)
  • score 2 (the total score of player 2)
  • zeromoves (the number of consecutive moves without a score)

Now Zig's standard library has some tokenizers, so the first part I can split:

    var tokens = std.mem.tokenizeScalar(u8, fen, ' ');

    const board_token = tokens.next() orelse return ParseError.NoBoard;
    const rack_1_token = tokens.next() orelse return ParseError.NoRack1;
    const rack_2_token = tokens.next() orelse return ParseError.NoRack2;
    const score_1_token = tokens.next() orelse return ParseError.NoScore1;
    const score_2_token = tokens.next() orelse return ParseError.NoScore2;
    const zero_moves_token = tokens.next() orelse return ParseError.NoZeroMoves;

Is that the way to go?
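Once the tokens are split, std.fmt.parseInt can handle the numeric ones. A minimal sketch (the single-case ParseError set here is illustrative, not the full set from the code above):

```zig
const std = @import("std");

const ParseError = error{BadScore};

/// Convert a score token like "341" into an integer,
/// mapping any format/overflow failure to one parse error.
fn parseScore(token: []const u8) ParseError!u32 {
    return std.fmt.parseInt(u32, token, 10) catch ParseError.BadScore;
}

test "score tokens convert to integers" {
    try std.testing.expectEqual(@as(u32, 341), try parseScore("341"));
    try std.testing.expectError(ParseError.BadScore, parseScore("34x"));
}
```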

After that I run into problems…

  • Should I parse row by row (again splitting on '/')? If so, how do I subsplit?
  • How do I get the numbers / letters sequentially from the 15 rows? (this is the really messy part)
  • The board and rack parts can contain Unicode characters (for example Hungarian / Slovenian letters). How do I handle these? Can I tokenize directly into Unicode code points?

I don’t expect a full parsing routine from the intelligent people here, just some hints on how to write clean and understandable parsing code. I believe std.mem must have useful routines for it…

Here is an extract of the digit mess I produced inside the board parsing:

                // digit
                if (u >= '0' and u <= '9')
                {
                    if (skip == 0 and u == '0') return ParseError.InvalidSkip;
                    skip = @truncate(u - '0');
                    if (next) |n|
                    {
                        if (n >= '0' and n <= '9')
                        {
                            if (skip > 1) return ParseError.InvalidSkip;
                            const plus: u8 = @truncate(n - '0');

                            skip = skip * 10 + plus;
                            i += 1;
                        }
                    }
                    if (skip + col > 15) return ParseError.InvalidSkip;
                    col += @intCast(skip);
                    skip = 0;
                }


If you use only ASCII characters as delimiters you don’t need to care about multi-byte UTF-8 characters in the text tokens - they’ll just be skipped over as not matching the delimiter (or digit).
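A quick check of that claim, as a sketch ('é' here stands in for any multi-byte letter):

```zig
const std = @import("std");

test "multi-byte UTF-8 passes through ASCII tokenization" {
    // 'é' is two bytes in UTF-8; splitting on ' ' never looks inside them,
    // because no UTF-8 continuation byte can equal an ASCII delimiter.
    var it = std.mem.tokenizeScalar(u8, "café 42", ' ');
    try std.testing.expectEqualStrings("café", it.next().?);
    try std.testing.expectEqualStrings("42", it.next().?);
    try std.testing.expect(it.next() == null);
}
```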

You might find it cleaner not to convert digits to numbers inline in your tokenizer. Instead, take the slice that spans the digits and pass it to a function for conversion. If that’s a perf bottleneck you can make it inline.
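A sketch of that idea, assuming 15-column rows as in the FEN above: collect the whole digit span first, then hand the slice to std.fmt.parseInt in one call. The rowWidth helper is made up for illustration, and it handles ASCII letters only; multi-byte letters would need a code-point iterator such as std.unicode.Utf8Iterator.

```zig
const std = @import("std");

/// Count how many board columns a FEN row covers:
/// each letter is one tile, each digit run is a skip count.
fn rowWidth(row: []const u8) !usize {
    var col: usize = 0;
    var i: usize = 0;
    while (i < row.len) {
        if (std.ascii.isDigit(row[i])) {
            // take the whole digit span, then convert it in one call
            const start = i;
            while (i < row.len and std.ascii.isDigit(row[i])) i += 1;
            col += try std.fmt.parseInt(usize, row[start..i], 10);
        } else {
            col += 1; // one tile (ASCII letters only in this sketch)
            i += 1;
        }
    }
    return col;
}

test "every row spans 15 columns" {
    var rows = std.mem.tokenizeScalar(u8, "10tafia/9vegan1/12z2", '/');
    while (rows.next()) |row|
        try std.testing.expectEqual(@as(usize, 15), try rowWidth(row));
}
```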

Of course, real Scrabble comes in language editions, where the set of possible characters is far smaller than Unicode. You might be able to take advantage of that to simplify your representation.


I’d suggest making a Token tagged union, which identifies what kind of token you’re dealing with. That gives a nice place to put things like converting a numeric string into a number.

This is generally the output of a Tokenizer, which holds the input and an offset into that input, and canonically provides next() and peek(). More complex parsers let you 'put back' a token, but it's better to avoid that unless you can't.

For what you’re working on you may not need peek either, and I doubt very much that you’d need a full AST implementation to hold the result. It looks like you can just tokenize and process into a data structure in a single pass.

Even so, 'reifying' your tokens will make the whole thing easier to reason about. IMHO, YMMV, etc.
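A sketch of that shape (the names are illustrative, not from any library): the tagged union gives the digit-to-number conversion a natural home, and peek() is just next() with the offset restored afterwards.

```zig
const std = @import("std");

// Each token carries its already-converted payload.
const Token = union(enum) {
    letter: u8,
    skip: u8, // run of empty squares, converted from the digit span
    row_end,
    eof,
};

const Tokenizer = struct {
    input: []const u8,
    offset: usize = 0,

    fn next(self: *Tokenizer) Token {
        if (self.offset >= self.input.len) return .eof;
        const c = self.input[self.offset];
        if (c == '/') {
            self.offset += 1;
            return .row_end;
        }
        if (std.ascii.isDigit(c)) {
            // take the whole digit span, convert once
            const start = self.offset;
            while (self.offset < self.input.len and std.ascii.isDigit(self.input[self.offset]))
                self.offset += 1;
            const n = std.fmt.parseInt(u8, self.input[start..self.offset], 10) catch unreachable;
            return .{ .skip = n };
        }
        self.offset += 1;
        return .{ .letter = c };
    }

    fn peek(self: *Tokenizer) Token {
        const saved = self.offset;
        defer self.offset = saved; // restore after next() has produced the token
        return self.next();
    }
};

test "tokens come out reified" {
    var t = Tokenizer{ .input = "10ta/3" };
    try std.testing.expectEqual(@as(u8, 10), t.next().skip);
    try std.testing.expectEqual(@as(u8, 't'), t.next().letter);
    try std.testing.expectEqual(@as(u8, 'a'), t.next().letter);
    try std.testing.expect(t.next() == .row_end);
    try std.testing.expectEqual(@as(u8, 3), t.peek().skip);
    try std.testing.expectEqual(@as(u8, 3), t.next().skip);
    try std.testing.expect(t.next() == .eof);
}
```

The `catch unreachable` is fine here only because the digit span is bounded by the 15-wide board; a stricter version would return an error instead.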


I’ve watched @Validark 's presentations on parsing and solved your example as a practice problem. I won’t claim it’s easily understandable, but here it goes anyway:

In your case, to split an alphanumeric sequence into letters and numbers you can:

  1. switch on the first byte
  2. create a bitstring of all the subsequent bytes of the same type
  3. count trailing ones to get the ending position of the first token, emit token, advance buffer cursor by token length and repeat

Here is the example tokenizer from the second presentation.

Quick example:

const std = @import("std");

pub fn main() void {
    var buf: [512]u8 = @splat(0);
    const V = @Vector(16, u8);
    const fen = "foobar123baz456/";
    @memcpy(buf[0..fen.len], fen);

    var cur: []const u8 = buf[0..];
    var vec: V = cur[0..16].*;
    // first byte letter so make letter bitstring
    var bitstring: u16 =
        (@as(u16, @bitCast(@as(V, @splat('a')) <= vec)) & @as(u16, @bitCast(vec <= @as(V, @splat('z')))));
    var len = @ctz(~bitstring);

    std.debug.print(
        \\{s} vector
        \\{b} bitstring in reverse
        \\len == {d}
        \\token{{ .type = .identifier .slice = {s} }}
        \\
        \\
    , .{
        cur[0..16],
        @bitReverse(bitstring),
        len,
        cur[0..len],
    });

    cur = cur[len..];
    vec = cur[0..16].*;
    // first byte number so make number bitstring
    bitstring =
        (@as(u16, @bitCast(@as(V, @splat('0')) <= vec)) & @as(u16, @bitCast(vec <= @as(V, @splat('9')))));
    len = @ctz(~bitstring);
    std.debug.print(
        \\{s}
        \\{b}
        \\len == {d}
        \\token{{ .type = .number .slice = {s} }}
        \\
    , .{
        cur[0..16],
        @bitReverse(bitstring),
        len,
        cur[0..len],
    });
}

Result:

foobar123baz456/ vector
1111110001110000 bitstring in reverse
len == 6
token{ .type = .identifier .slice = foobar }

123baz456/\0\0\0\0\0\0
1110001110000000
len == 3
token{ .type = .number .slice = 123 }

This is wasteful: you don’t need to recalculate the bitstring every time. Later in the presentation there’s a much harder to understand example (presentations/hyper-optimizing-zig-parser/#/my-tokenizer-code) that reuses the bitstrings.

As a challenge I rewrote it to fit your problem:

pub fn tokenize(gpa: std.mem.Allocator, source: []align(64) const u8) ![]Token {
    var tokens: std.ArrayListUnmanaged(Token) = .empty;
    errdefer tokens.deinit(gpa);

    var prev: []const u8 = source[0..];
    var cur = prev[0..];
    var op_type: Tag = switch (cur[0]) {
        'a'...'z', 'A'...'Z' => .identifier,
        '0'...'9' => .number,
        ' ', '\t', '\n' => .whitespace,
        else => return error.UnexpectedStart,
    };
    var bitstrings: [4]u64 = undefined;
    var selected_bitmap: *u64 = &bitstrings[@as(u2, @truncate(@intFromEnum(op_type)))];
    var bitmap_index = @intFromPtr(source.ptr);
    bitmap_index -%= 64;

    while (true) {
        var aligned_ptr = @intFromPtr(cur.ptr) / 64 * 64;
        while (true) {
            while (bitmap_index != aligned_ptr) {
                bitmap_index +%= 64;
                const V = @Vector(64, u8);
                const vec: V = @as([*]align(64) const u8, @ptrFromInt(bitmap_index))[0..64].*;

                bitstrings[0] =
                    (@as(usize, @bitCast(@as(V, @splat('a')) <= vec)) & @as(usize, @bitCast(vec <= @as(V, @splat('z'))))) |
                    (@as(usize, @bitCast(@as(V, @splat('A')) <= vec)) & @as(usize, @bitCast(vec <= @as(V, @splat('Z')))));

                bitstrings[1] =
                    (@as(usize, @bitCast(@as(V, @splat('0')) <= vec)) & @as(usize, @bitCast(vec <= @as(V, @splat('9')))));

                bitstrings[2] =
                    @as(usize, @bitCast(vec == @as(V, @splat('\t')))) |
                    @as(usize, @bitCast(vec == @as(V, @splat('\n')))) |
                    @as(usize, @bitCast(vec == @as(V, @splat(' '))));

                bitstrings[3] = ~(bitstrings[0] | bitstrings[1] | bitstrings[2]);
            }
            const cur_misalignment: std.math.Log2Int(u64) = @truncate(@intFromPtr(cur.ptr));
            const bitstring = selected_bitmap.* >> cur_misalignment;

            cur = cur[@ctz(~bitstring)..];

            // If we made it to the end of chunk(s), wrap around, grab more bits, and start again
            aligned_ptr = @intFromPtr(cur.ptr) / 64 * 64;
            if (bitmap_index == aligned_ptr) break;
        }

        if (op_type == .eof) break;

        const len: u8 = @intCast(@intFromPtr(cur.ptr) - @intFromPtr(prev.ptr));

        try tokens.append(gpa, .{ .len = len, .kind = op_type });
        prev = cur;

        op_type = sw: switch (cur[0]) {
            'a'...'z', 'A'...'Z', '_' => .identifier,
            '0'...'9' => .number,
            '/' => {
                try tokens.append(gpa, .{ .len = 1, .kind = .slash });
                cur = cur[1..];
                prev = cur;
                continue :sw cur[0];
            },
            ' ', '\t', '\n' => .whitespace,
            0 => .eof,
            else => .unknown,
        };

        selected_bitmap = &bitstrings[@as(u2, @truncate(@intFromEnum(op_type)))];
    }

    try tokens.append(gpa, .{ .len = 0, .kind = op_type });
    return tokens.toOwnedSlice(gpa);
}

const Tag = enum(u8) {
    identifier = 0,
    number = 1,
    whitespace = 2,
    unknown = 3,
    slash,
    eof,
};

const Token = struct {
    kind: Tag,
    len: u8,
};

pub fn main() !void {
    var debug_allocator: std.heap.DebugAllocator(.{}) = .init;
    const gpa = debug_allocator.allocator();
    defer _ = debug_allocator.deinit();
    const stdout = std.io.getStdOut().writer();

    const fen = "foo/123/foo123bar/123foo/foo123bar456//123foo456 player GarriГарри!KasparowКаспаров! 420 1337 96";
    const overaligned_size = std.mem.alignForward(u64, fen.len, 64);
    const buffer = try gpa.alignedAlloc(u8, .@"64", overaligned_size);
    defer gpa.free(buffer);
    // we rely on a nullbyte to know when to stop parsing
    if (builtin.mode == .Debug) @memset(buffer, 0);
    @memcpy(buffer[0..fen.len], fen[0..]);

    const tokens = try tokenize(gpa, buffer);
    defer gpa.free(tokens);

    var cur: []const u8 = buffer[0..];
    for (tokens) |token| {
        try stdout.print("{s}<", .{@tagName(token.kind)[0..1]});
        try stdout.print("{s}> ", .{cur[0..token.len]});
        if (token.kind == .slash or token.kind == .whitespace) try stdout.writeByte('\n');
        cur = cur[token.len..];
    }
    try stdout.writeByte('\n');
}
const std = @import("std");
const builtin = @import("builtin");

Result for input "foo/123/foo123bar/123foo/foo123bar456//123foo456 player GarriГарри!KasparowКаспаров! 420 1337 96"

i<foo> s</> 
n<123> s</> 
i<foo> n<123> i<bar> s</> 
n<123> i<foo> s</> 
i<foo> n<123> i<bar> n<456> s</> 
s</> 
n<123> i<foo> n<456> w< > 
i<player> w< > 
i<Garri> u<Гарри!> i<Kasparow> u<Каспаров!> w< > 
n<420> w< > 
n<1337> w< > 
n<96> e<> 

I think these tokens are what you want, printed here as the first letter of the token type (i = identifier, n = number) followed by their content in the source buffer.
I have no clue about Unicode (the u stands for unknown, by the way, not unicode).
Needs master to compile (0.15.0-dev.646+ef35c3d5f).


Aha aha. I watched that hyper-optimizing video too.
A fast converter will be a great addition to my program for fast position loading.
I will certainly study this. Thank you.
Currently reworking some move-generator optimizations, but I will be back here!