Writing a Brainf**k compiler in Zig (solved)

Brief aside: I apologize if f-bombs go against Ziggit’s TOS, but given that is the name of the language I am targeting, I am not sure what else I can do except for refer to it as BF for the remainder of the post.

I’ve decided that for my next exercise I’d like to try to write a compiler for the BF programming language in Zig.

Ideally I’d like it to be a freestanding binary with no dependencies, but given I’ve never written something like this before, that may require more effort than I’m willing to invest.

I’m not interested in cross-platform support, just targeting either x86_64 linux, or potentially arm linux if the assembly is eaiser.

With that in mind, here are the options I’ve come up with, in descending order of preference:

  1. Hand-roll an ELF at the machinecode level.
  2. Translate the BF code into Zig, and then build the Zig file.
  3. Translate the BF code into assembly, and then build that (with something).
  4. Translate the BF code into LLVM IR.

#2 seems like a natural starting point, but given that I’d like to avoid runtime dependencies, if I can transition into #1 I think that would be ideal. But again, having never done this, it could be that it is way more complicated than I’m giving it credit for.

Any advice would be greatly appreciated! So far I’ve written a simple (naive) tokenizer:

pub fn char_to_token(src: u8) ?Token {
    return switch (src) {
        '<' => .left,
        '>' => .right,
        '+' => .inc,
        '-' => .sub,
        ',' => .input,
        '.' => .output,
        '[' => .loop,
        ']' => .loopback,
        else => null,
    };
}

const Token = enum {
    //
    left,
    right,
    inc,
    dec,
    input,
    output,
    loop,
    loopback,
};

const std = @import("std");

pub fn tokenize(src: []u8) std.ArrayList(Token) {
    var tokens: std.ArrayList(Token) = undefined;
    tokens.initCapacity(src.len / 2);

    for (src) |char| {
        if (char_to_token(char)) |token| {
            tokens.addOne(token);
        }
    }

    return tokens;
}

Since it’s such a simple language, another cool approach is to JIT it straight to machine code, as Luuk did here:

Do NOT run this on macOS without adapting the machine code, because of the way syscalls numbers are mapped, this will forkbomb on macOS :grinning_face_with_smiling_eyes:

2 Likes

Thanks for the suggestion, this is a great starting point! As this is purely an educational exercise for me, I’d like to take it a step further, but this should get me on the right track.

1 Like

Alright, 30 seconds of googling found me a detailed article about this exact topic.

I need to stop posting problems to Ziggit until I’ve done my due diligence :sweat_smile:

1 Like