To share an additional data point, I grabbed the perftest and ran it on an M3 (multiple runs show the same result):
⚡hyperfine "./perftest std"
Benchmark 1: ./perftest std
Time (mean ± σ): 42.5 ms ± 0.3 ms [User: 42.2 ms, System: 0.2 ms]
Range (min … max): 41.9 ms … 43.6 ms 67 runs
⚡hyperfine "./perftest custom"
Benchmark 1: ./perftest custom
Time (mean ± σ): 43.3 ms ± 0.4 ms [User: 43.1 ms, System: 0.2 ms]
Range (min … max): 42.7 ms … 45.0 ms 65 runs
This seems to slightly favor std, by about 0.8 ms (roughly 2%) on the mean.
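For anyone who wants to sanity-check how that gap compares to the run-to-run noise, here is a rough sketch that plugs the hyperfine numbers above into a Welch-style t statistic. The `compare` helper is my own hypothetical name, and it assumes the runs are independent and roughly normally distributed, which benchmark timings often are not, so treat the result as a ballpark figure only.

```python
import math

def compare(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (absolute diff, relative diff, Welch t statistic) of b versus a.

    Hypothetical helper: assumes independent, roughly normal samples.
    """
    diff = mean_b - mean_a
    rel = diff / mean_a
    # Standard error of the difference between the two sample means.
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return diff, rel, diff / se

# Means, standard deviations, and run counts copied from the hyperfine output:
# std first, then custom.
diff, rel, t = compare(42.5, 0.3, 67, 43.3, 0.4, 65)
print(f"custom - std = {diff:.2f} ms ({rel:.1%}), Welch t = {t:.1f}")
```

Under those assumptions the gap is many standard errors wide, so it is not just timing jitter within a single build. That still says nothing about why it exists: a systematic effect such as code or data layout would produce exactly this kind of small but repeatable offset.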
Using a large input file like Sema.zig shows no meaningful difference.
Compiled with zig trunk.
Maybe the variance just comes down to system-dependent changes in process data layout between the two implementations?