Sub-optimal asm code for string-scanning state machine

here is a zig rendition of a string-scanning state machine used as part of the CoreMark benchmark suite:

    const StringBuf = [*]u8;

    const State = enum {
        START,
        INVALID,
        S1,
        S2,
        INT,
        FLOAT,
        EXPONENT,
        SCIENTIFIC,
    };

...

    fn nextState(pstr: *StringBuf, transcnt: [*]u32) State {
        var str = pstr.*;
        var state = State.START;
        while (str[0] != 0 and state != State.INVALID) : (str += 1) {
            asm volatile ("bkpt");
            const ch = str[0];
            if (ch == ',') {
                str += 1;
                break;
            }
            switch (state) {
                .START => {
                    if (isDigit(ch)) {
                        state = .INT;
                    } else if (ch == '+' or ch == '-') {
                        state = .S1;
                    } else if (ch == '.') {
                        state = .FLOAT;
                    } else {
                        state = .INVALID;
                        transcnt[ord(.INVALID)] += 1;
                    }
                    transcnt[ord(.START)] += 1;
                },
                .S1 => {
                    if (isDigit(ch)) {
                        state = .INT;
                        transcnt[ord(.S1)] += 1;
                    } else if (ch == '.') {
                        state = .FLOAT;
                        transcnt[ord(.S1)] += 1;
                    } else {
                        state = .INVALID;
                        transcnt[ord(.S1)] += 1;
                    }
                },
                .INT => {
                    if (ch == '.') {
                        state = .FLOAT;
                        transcnt[ord(.INT)] += 1;
                    } else if (!isDigit(ch)) {
                        state = .INVALID;
                        transcnt[ord(.INT)] += 1;
                    }
                },
                .FLOAT => {
                    if (ch == 'E' or ch == 'e') {
                        state = .S2;
                        transcnt[ord(.FLOAT)] += 1;
                    } else if (!isDigit(ch)) {
                        state = .INVALID;
                        transcnt[ord(.FLOAT)] += 1;
                    }
                },
                .S2 => {
                    if (ch == '+' or ch == '-') {
                        state = .EXPONENT;
                        transcnt[ord(.S2)] += 1;
                    } else {
                        state = .INVALID;
                        transcnt[ord(.S2)] += 1;
                    }
                },
                .EXPONENT => {
                    if (isDigit(ch)) {
                        state = .SCIENTIFIC;
                        transcnt[ord(.EXPONENT)] += 1;
                    } else {
                        state = .INVALID;
                        transcnt[ord(.EXPONENT)] += 1;
                    }
                },
                .SCIENTIFIC => {
                    if (!isDigit(ch)) {
                        state = .INVALID;
                        transcnt[ord(.INVALID)] += 1;
                    }
                },
                .INVALID => {},
            }
        }
        pstr.* = str;
        return state;
    }

in the official CoreMark suite, the buffer contains a variety of comma-separated numeric strings; nextState scans the next token, tracks state transitions, advances the pointer into the buffer, and returns the final state…

the buffer begins with the characters "5012, ...", so it will visit the states START, INT, INT, INT, INT before hitting the ',' separator, which forces the function to return…
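to make the expected behavior concrete, here's a reduced, self-contained c++ sketch of the same machine – plain pointers/arrays stand in for the generator's em_* types, and only the START/INVALID/INT/FLOAT states are kept – scanning that same prefix:

```cpp
#include <cstddef>
#include <cstdint>

// reduced sketch: only 4 of the 8 states, plain C++ types throughout
enum class State { START, INVALID, INT, FLOAT };

static bool isDigit(char ch) { return ch >= '0' && ch <= '9'; }
static std::size_t ord(State s) { return static_cast<std::size_t>(s); }

State nextState(const char** pStr, std::uint32_t* transCnt) {
    auto str = *pStr;
    auto state = State::START;
    for (; *str && state != State::INVALID; str++) {
        auto ch = *str;
        if (ch == ',') {  // token separator: consume it and stop
            str++;
            break;
        }
        switch (state) {
            case State::START:
                if (isDigit(ch)) state = State::INT;
                else if (ch == '.') state = State::FLOAT;
                else { state = State::INVALID; transCnt[ord(State::INVALID)] += 1; }
                transCnt[ord(State::START)] += 1;
                break;
            case State::INT:
                if (ch == '.') { state = State::FLOAT; transCnt[ord(State::INT)] += 1; }
                else if (!isDigit(ch)) { state = State::INVALID; transCnt[ord(State::INT)] += 1; }
                break;
            case State::FLOAT:
                if (!isDigit(ch)) { state = State::INVALID; transCnt[ord(State::FLOAT)] += 1; }
                break;
            case State::INVALID:
                break;
        }
    }
    *pStr = str;
    return state;
}
```

on "5012,3.14" this returns INT, leaves the pointer on the '3', and records exactly one transition out of START…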

i’ve cross-compiled this code for a cortex-m0+ MCU (ReleaseSmall), and can easily measure execution time with my logic analyzer; i can also single step through the generated asm using a debugger…

so what’s the problem??? well, i have a c++ version of the same function compiled for the same MCU – albeit using segger’s LLVM-based compiler… and the c++ version performs SLIGHTLY BETTER for reasons i don’t fully understand…

here’s the c++ code, which i should note is actually GENERATED by a higher-level programming tool i’m in the process of “zigify-ing”:

State nextState(em_char** pStr, em$ArrC< em_uint32 > transCnt) {
    auto str = *pStr;
    auto state = State::START;
    for ( ; *str && state != State::INVALID; str++) {
        asm ("bkpt");
        auto ch = *str;
        if (ch == ',') {
            str++;
            break;
        }
        switch (state) {
            case State::START: {
                if (isDigit(ch)) {
                    state = State::INT;
                }
                else if (ch == '+' || ch == '-') {
                    state = State::S1;
                }
                else if (ch == '.') {
                    state = State::FLOAT;
                }
                else {
                    state = State::INVALID;
                    transCnt[ord(State::INVALID)] += 1;
                }
                transCnt[ord(State::START)] += 1;
                break;
            }
            case State::S1: {
                if (isDigit(ch)) {
                    state = State::INT;
                    transCnt[ord(State::S1)] += 1;
                }
                else if (ch == '.') {
                    state = State::FLOAT;
                    transCnt[ord(State::S1)] += 1;
                }
                else {
                    state = State::INVALID;
                    transCnt[ord(State::S1)] += 1;
                }
                break;
            }
            case State::INT: {
                if (ch == '.') {
                    state = State::FLOAT;
                    transCnt[ord(State::INT)] += 1;
                }
                else if (!isDigit(ch)) {
                    state = State::INVALID;
                    transCnt[ord(State::INT)] += 1;
                }
                break;
            }
            case State::FLOAT: {
                if (ch == 'E' || ch == 'e') {
                    state = State::S2;
                    transCnt[ord(State::FLOAT)] += 1;
                }
                else if (!isDigit(ch)) {
                    state = State::INVALID;
                    transCnt[ord(State::FLOAT)] += 1;
                }
                break;
            }
            case State::S2: {
                if (ch == '+' || ch == '-') {
                    state = State::EXPONENT;
                    transCnt[ord(State::S2)] += 1;
                }
                else {
                    state = State::INVALID;
                    transCnt[ord(State::S2)] += 1;
                }
                break;
            }
            case State::EXPONENT: {
                if (isDigit(ch)) {
                    state = State::SCIENTIFIC;
                    transCnt[ord(State::EXPONENT)] += 1;
                }
                else {
                    state = State::INVALID;
                    transCnt[ord(State::EXPONENT)] += 1;
                }
                break;
            }
            case State::SCIENTIFIC: {
                if (!isDigit(ch)) {
                    state = State::INVALID;
                    transCnt[ord(State::INVALID)] += 1;
                }
                break;
            }
        }
    }
    *pStr = str;
    return state;
}

simply counting instructions from "bkpt" to "bkpt", the first five state transitions in the zig version respectively execute [32,26,26,26,29] instructions… by contrast, the c++ version only executes [26,24,24,24,24,19] instructions…

(yes, some of these instructions take multiple cycles; but the disparity in instruction counts does line up with the slightly faster execution time of c++ over zig…)

some very simple experiments with basic string traversal suggest that zig is on par with c/c++… perhaps the problem is in the implementation of the switch statement??? does zig do an extra runtime check of the state variable to ensure it remains in range???

you’ll also note that the c++ switch does NOT handle the INVALID case – since this is already filtered out by the loop condition… zig, of course, requires me to cover the entire domain… is this the problem???

any thoughts/insights/suggestions would be much appreciated…

why is this important??? because the zig version otherwise OUTPERFORMS c++ in the CoreMark list-processing and matrix-manipulation algorithms…

if we can address this problem, our OVERALL CoreMark score will be the clear winner… right now, the impact of the state-machine is putting zig slightly behind c/c++…

help me make the case that zig offers higher-level programming AND higher levels of performance – just like those old miller lite ads – tastes great AND less filling :wink:

3 Likes

Note that you can let the Zig compiler know that INVALID is not actually possible with:

.INVALID => unreachable,
2 Likes

good to know, but there was no improvement in performance…

There were discussions in the Zig issue tracker regarding state machines using computed goto:

(in introduce labeled continue syntax inside a switch expression · Issue #8220 · ziglang/zig · GitHub, change extern enum to enum(c_int))

1 Like

I think it would help a lot if you could set up the two versions on godbolt.
There are also a few functions missing from your code that might be relevant (isDigit() and ord()).

1 Like

good suggestion…

i also tried reducing the state space to just START, INVALID, INT, FLOAT…

you can still see the problem…

and here are the missing definitions (which are effectively inlined within the generated code):

fn isDigit(ch: u8) bool {
    return ch >= '0' and ch <= '9';
}

fn ord(state: State) usize {
    return @intFromEnum(state);
}

FWIW – std.ascii.isDigit uses a switch on ch, with '0'...'9'… while this code also inlines, it is one instruction larger than my if-based version… something about those switch statements :wink:

2 Likes

here’s a godbolt version, with just three states…

output is (obviously) native intel asm; i was targeting arm-cortex-m0+ …

does this give you enough information??? i can try to create a c/c++ version, though getting the exact version of LLVM might be a little tricky for me…

one thing that i have noticed (in both the zig and c/c++ versions) is that the NUMBER of instructions executed in an INT -> INT transition increases as i add more states – even though these states are never entered…

not sure i grok the current state of #8220, but is there some alternative way to express the state machine using an enveloping while (true) loop with continue statements added to my switch arms???
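the "enveloping loop" shape i have in mind can at least be sketched in c++: one outer loop whose body is a single switch, where every arm updates the state and continues back to the dispatch point (a hypothetical toy helper, not the CoreMark code):

```cpp
// toy example of the "enveloping loop" dispatch shape: each switch arm sets
// the next state and `continue`s, so control always re-enters at the switch
// (hypothetical countLeadingDigits helper, not part of CoreMark)
inline int countLeadingDigits(const char* s) {
    enum State { START, DIGITS, DONE } state = START;
    int n = 0;
    for (;;) {
        switch (state) {
            case START:
                state = (*s >= '0' && *s <= '9') ? DIGITS : DONE;
                continue;  // back to dispatch
            case DIGITS:
                n++;
                s++;
                if (!(*s >= '0' && *s <= '9')) state = DONE;
                continue;  // back to dispatch
            case DONE:
                return n;
        }
    }
}
```
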

Perhaps isDigit is a macro in C, and reduces to ch >= '0' && ch <= '9'?

it’s actually a small, one-line c++ function (like zig) that is effectively inlined at the call site…

i’m more convinced the problem lies with the switch statement

A C/C++ version would be helpful, otherwise it’s kind of hard to figure out what actually makes the C++ version faster. To avoid any compiler-version differences you could also just use zig cc or zig c++; both are supported in godbolt.

Also you can target that specific CPU in godbolt by adding something like
-mcpu=cortex_m0plus -target arm-linux to the command.

3 Likes

here’s my updated zig version…

let me see what i can do on the c++ end…

1 Like

and here’s a c++ version, using a recent version of LLVM…

the generated asm matches what i’ve produced on my end, using segger’s LLVM-based compiler…

i also captured a “trace” of the exact instructions executing on each state transition…

i appreciate any insights you might offer; and let me know if there’s anything else you might need…

1 Like

One difference that I can see is that the C++ version uses a u8 for the enum.

If I also use a u8 in the Zig version, like this:

const State = enum(u8) {
...
};

then the total number of instructions shrinks significantly.
Upon further inspection the actual size doesn’t really matter, enum(usize) seems to produce the same code.
I think by default the backing integer of the enum is an odd-width integer, like a u2 in this case. For these odd-width integers there is a performance issue introducing unnecessary code: LLVM: Non power of two integer arithmetic emits slow assembly · Issue #19616 · ziglang/zig · GitHub
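On the C++ side, pinning the backing integer roughly corresponds to an explicitly sized underlying type (a sketch — the real generated code uses its own em_* typedefs):

```cpp
#include <cstdint>
#include <type_traits>

// With an explicitly sized underlying type, loads/stores of the state are
// plain byte accesses, with none of the masking LLVM emits for odd-width
// (i2/i3) integer values.
enum class State : std::uint8_t { START, INVALID, INT, FLOAT };

static_assert(sizeof(State) == 1, "state fits in one byte");
static_assert(std::is_same_v<std::underlying_type_t<State>, std::uint8_t>,
              "backing integer is u8");
```
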

7 Likes

My understanding is that the code is not bloated when using 8, 16, 32, 64, … bits for the backing integer.

you nailed it !!!

to give you a sense of the impact, 10 iterations of my rendition of CoreMark written in c++ took 124.5 ms on a 48 MHz M0+… my earlier zig version took 132.6 ms…

forcing State to be a u8 now brings the zig version down to 117.6 ms :partying_face:

as i mentioned earlier in this post, the other two parts of the CoreMark mix were already beating c++ by ~10%… and now, the overall score is following suit…

this is HUGE… thank you, zig :pray:

11 Likes

That’s a great catch. Honestly, that almost deserves a footgun post because I would have never guessed.

8 Likes