Simulating goto behavior

I’m trying to write an HTML parser, and one specific pattern comes up often:

    fn script_data_escape_start_state(self: *Self) !void {
        anything_else: {
            const codepoint = self.peek() orelse {
                break :anything_else;
            };

            if (codepoint == '-') {
                self.switch_to(.script_data_escape_start_dash);
                try self.emit_character('-');

                _ = self.consume();

                return;
            }
        }

        self.switch_to(.script_data);
    }

Basically, for a given state I need to handle a certain subset of characters one way and any other character another way, where “any other character” may also be EOF. So here I’m emulating a goto to the end of the anything_else block by breaking out of the block when self.peek() returns null.
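
To make the pattern concrete, here is a stripped-down sketch that compiles on its own. The Tokenizer struct, its fields, and the byte-based peek() are stand-ins made up for illustration (the real parser works on decoded codepoints and emits tokens), so treat it as a model of the control flow only.

```zig
const std = @import("std");

// Hypothetical stand-in for the real tokenizer: a byte slice plus a cursor.
const Tokenizer = struct {
    input: []const u8,
    pos: usize = 0,
    state: State = .script_data_escape_start,

    const State = enum {
        script_data,
        script_data_escape_start,
        script_data_escape_start_dash,
    };

    // Returns the next byte without consuming it, or null at EOF.
    fn peek(self: *const Tokenizer) ?u8 {
        if (self.pos >= self.input.len) return null;
        return self.input[self.pos];
    }

    fn step(self: *Tokenizer) void {
        anything_else: {
            // EOF jumps straight to the shared fallback below the block.
            const c = self.peek() orelse break :anything_else;
            // So does every byte that is not a hyphen.
            if (c != '-') break :anything_else;

            self.state = .script_data_escape_start_dash;
            self.pos += 1; // consume the hyphen
            return;
        }

        // Reached for EOF and for any other character alike.
        self.state = .script_data;
    }
};
```

Both the EOF case and the non-hyphen case leave the block through the same label, so the fallback is written only once.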

Is this the right way to model this?

Perhaps a helper function?

// The codepoint parameter type must match the payload type of peek()'s
// optional return value (u21 if peek() returns ?u21).
fn peek_codepoint(self: *Self, codepoint: u21) bool {
    if (self.peek()) |cp| {
        return cp == codepoint;
    }
    return false;
}

fn script_data_escape_start_state(self: *Self) !void {
    if (self.peek_codepoint('-')) {
        self.switch_to(.script_data_escape_start_dash);
        try self.emit_character('-');
        _ = self.consume();
        return;
    }

    self.switch_to(.script_data);
}

Great question. I’ve looked into this a bit myself so I’d like to hear more people’s thoughts on this, too.

Judging from the example you provided, I’d say the labeled block is unnecessary in your case. That said, I understand that examples can become arbitrarily more complex. Let’s take a look at your code first though.

anything_else: {
    const codepoint = self.peek() orelse {
        break :anything_else;
    };

This is the part that actually models the goto. This can be reduced to:

const codepoint = self.peek() orelse {
    return self.switch_to(.script_data);
};

And you can completely avoid the block and simplify your code. Onto the other stuff…

Let’s say I have a switch statement and I’m working over a series of return codes. Here a labeled block can be very handy for breaking out of nested statements:

// struct members

    const Response = union(enum) { 
        code: usize,
        b: void,
    };

    pub fn foo(_: Self) ?Response {
        return Response{ .code = 42 };        
    }

// switch statement with optional block capture

    if (self.foo()) |e| blk: {
        switch (e) {
            .code => |code| {
                if (code == 0) break :blk;
                // continue inspecting response code
            },
            .b => {
                // something else...
            },
            // No else prong: both union fields are handled above, and Zig
            // rejects an unreachable else prong on an exhaustive switch.
        }
    }

That’s starting to act more like a goto and is an elegant way to unpack things and break out if the need arises.
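
For anyone who wants to try this out, here is a small self-contained sketch of the same idea. The classify function and the strings it returns are made up for illustration. Response mirrors the union above.

```zig
const std = @import("std");

// Mirrors the union from the snippet above.
const Response = union(enum) {
    code: usize,
    b: void,
};

// Hypothetical helper: describes a response, using `break :blk` to jump
// out of the switch and the surrounding `if` body in one step.
fn classify(maybe: ?Response) []const u8 {
    if (maybe) |r| blk: {
        switch (r) {
            .code => |code| {
                // Code 0 needs no special handling: leave the whole block.
                if (code == 0) break :blk;
                return "nonzero code";
            },
            .b => return "b",
        }
    }

    // Reached when there is no response or when the block was left early.
    return "nothing to report";
}
```

The break skips the rest of the switch prong and the rest of the if body together, which is the part that would otherwise need a flag variable or duplicated code.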

In short, for replacing if statements or direct calls, I wouldn’t reach for a labeled block as a goto-style escape. For breaking out of nested code, totally. It would be good to have someone pitch some examples and then we could workshop them together to find solutions. @dee0xeed always has some good examples from C.


I also want to add that we’ve had discussions about this before: C goto vs Zig defer / errdefer / break

I strongly recommend you read that. If you feel like it doesn’t answer your question, we can certainly stay here and discuss - just wanted to point out that we have existing resources on this.

The reason I didn’t go for this is because I wanted it to be clear that EOF is handled the same way anything_else is (in this case, any codepoint that’s not a hyphen) whereas in some other cases EOF is specially handled. If I write it like you’d shown it’d be more difficult to understand whether this is a case of EOF being specially handled or if it’s just part of anything_else.

Also, in some cases the handling of anything_else is not simply switching to another state, but it may involve emitting multiple tokens etc. In such scenarios, I didn’t want to duplicate that logic for EOF and then again for the final anything_else scenario.

Here’s another example:

    fn rcdata_end_tag_name_state(self: *Self) !void {
        anything_else: {
            const codepoint = self.peek() orelse {
                break :anything_else;
            };

            if (codepoint == '/' and self.is_appropriate_end_tag()) {
                self.switch_to(.self_closing_start_tag);
            } else if (codepoint == '>' and self.is_appropriate_end_tag()) {
                self.switch_to(.data);
                try self.emit_current_token();
            } else if (is_ascii_alphabet(codepoint)) {
                const lowercase = ascii_lowercase(codepoint);
                try self.append_to_tag_name(lowercase);
                try self.tmp_buffer.append(codepoint);
            } else {
                break :anything_else;
            }

            _ = self.consume();

            return;
        }

        try self.emit_character('<');
        try self.emit_character('/');
        try self.emit_tmp_buffer();

        self.switch_to(.rcdata);
    }

I hadn’t considered this, but in this case I’d have to call self.peek() multiple times which means decoding the codepoint repeatedly. Here’s what peek looks like:

    fn peek(self: *Self) ?u21 {
        // self.iterator = instance of Utf8Iterator
        const bytes = self.iterator.peek(1);
        if (bytes.len == 0) {
            return null;
        }

        return std.unicode.utf8Decode(bytes) catch unreachable;
    }

I believe the code I see here is best tackled by an if with a capture. So it’d be:

if (self.peek()) |codepoint| {
    ....
    // your code, interacting with the codepoint, here
    return;
}
// the rest here
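
If some codepoints still need to fall through to the shared handling, the capture can be combined with a label on the if body. Below is a minimal sketch of that combination, assuming a simplified byte-based tokenizer. The Tokenizer struct, the handled label, and the fallbacks counter are invented for the example and are not part of the original parser.

```zig
const std = @import("std");

// Hypothetical, simplified tokenizer: raw bytes instead of decoded codepoints.
const Tokenizer = struct {
    input: []const u8,
    pos: usize = 0,
    fallbacks: usize = 0, // counts how often the shared fallback ran

    // Returns the next byte without consuming it, or null at EOF.
    fn peek(self: *const Tokenizer) ?u8 {
        return if (self.pos < self.input.len) self.input[self.pos] else null;
    }

    fn step(self: *Tokenizer) void {
        // peek() is called once; the capture keeps the value in scope.
        if (self.peek()) |c| handled: {
            switch (c) {
                '/', '>' => {}, // a real tokenizer would switch state here
                'a'...'z', 'A'...'Z' => {}, // ...or append to the tag name
                else => break :handled, // takes the same path as EOF
            }
            self.pos += 1; // consume
            return;
        }

        // Shared "anything else" handling: EOF or an unhandled byte.
        self.fallbacks += 1;
    }
};
```

EOF skips the if body entirely and an unhandled byte breaks out of it, so both end up in the same fallback without a second peek().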

You can cache the result in self.

current: ?u21,

fn peek(self: *Self) ?u21 {
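    // Return the cached codepoint if it has already been decoded.
    if (self.current) |c| {
        return c;
    }
    // Otherwise self.current is null here, so decode and cache below.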
    const bytes = self.iterator.peek(1);
    if (bytes.len == 0) {
        return null;
    }

    self.current = std.unicode.utf8Decode(bytes) catch unreachable;
    return self.current;
}

fn consume(self: *Self) {
    self.current = null;
    ...