Optimizing a bytecode interpreter in Zig

Link to repo:
https://github.com/mscott9437/ChronoVM

So it’s a long story, but basically I was working on a bytecode interpreter in Zig a while back. I took a break from it for a while, but now I’m ready to approach the project from a fresh perspective. I’m hoping these notes will be useful for anybody reading this. My ultimate goal is to write something like a Java compiler in Zig, so that in the same way Zig is already used as a drop-in replacement compiler for C/C++, we could have something similar for Java.

Since this project is still in the very early research phase, I’m going to take the opportunity to go more into the theory, with code examples supporting those ideas, as opposed to working directly on the source. So just a disclaimer: I don’t have any official code or repo for this project yet.

Now for some history. I originally had the idea to write a scripting language about a year ago, and eventually set about trying to put together a simple interpreter which could parse input strings and execute a function based on the sequence of characters entered. I got a pretty good start, but didn’t touch it for a while after that. However, when I went to do my research on the Java compiler idea, I realized that what I was working on for the scripting language was already essentially a bytecode interpreter. I just didn’t know what to call it at the time.

2 Likes

I accidentally posted this while writing the draft, but I’m just going to leave it open, since apparently the history is already there anyway even if you delete it. I will update it later with the code examples.

The back half of Crafting Interpreters by Robert Nystrom has some great content about optimizing bytecode interpreters in general, and as far as I know, there are some good examples of bytecode dispatch in the Zig source - I studied code from this issue for a while, and it helped solidify some stuff about how Zig manages control flow around the trampoline.

3 Likes

Thanks! I actually came across that book as well, courtesy of somebody else here on the Ziggit forums. It’s really good to have, especially since it’s a lot harder to find resources on interpreters than on compilers. Just browsing through it a little bit really put everything I had been working on previously into perspective, and provided me with a much clearer path going forward. Also, thanks for linking the GitHub issue. It does look interesting.

2 Likes

So I am still trying to put some code together, based on what I was working on before. But first I think it will be useful to talk a little bit about how I am approaching the concept of bytecode. This is based mostly on my logical understanding of what bytecode should be, drawing on a minimum of outside sources, so I could be off a little bit here.

In its simplest form, bytecode is essentially code with all the strings stripped out and replaced with token identifiers. So while the source itself would be a byte array containing all of the code, the bytecode would be all tokens, and thus all integers, with each token representing a change in the state of the running program. The tokens which represent strings act as references to the strings themselves. The benefit of this is that when the code is run through an interpreter, it will be much faster, since a string can be referenced as a single integer or pointer, as opposed to, say, 10 integers for a 10-letter word.

In order for this to work, each integer representing a string must point to the referenced string, so that it can be accessed quickly. This reference table containing all the strings must be stored along with the bytecode in the program to be executed. The interpreter or virtual machine will first load the reference data into RAM, and then step through the bytecode one integer at a time, pulling in the existing references as needed.
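To make that concrete, here is a minimal sketch of what I mean. The opcode names and layout are invented purely for illustration (nothing here is from an actual repo): the bytecode is a flat array of integers, and any string operand is just an index into a constant table that gets loaded alongside it.

```zig
const std = @import("std");

// Hypothetical opcodes, for illustration only.
const Op = enum(u8) {
    push_str, // operand: one byte, an index into the string table
    print,
    halt,
};

pub fn main() void {
    // The reference table: every string the program needs, stored once.
    const strings = [_][]const u8{ "hello", "world" };

    // The bytecode itself is nothing but integers; string operands are
    // indices into `strings` rather than the characters themselves.
    const code = [_]u8{
        @intFromEnum(Op.push_str), 0,
        @intFromEnum(Op.print),
        @intFromEnum(Op.push_str), 1,
        @intFromEnum(Op.print),
        @intFromEnum(Op.halt),
    };

    var ip: usize = 0; // instruction pointer
    var current: []const u8 = ""; // a one-slot "stack", to keep the sketch short
    while (ip < code.len) {
        const op: Op = @enumFromInt(code[ip]);
        ip += 1;
        switch (op) {
            .push_str => {
                current = strings[code[ip]]; // resolve the reference
                ip += 1;
            },
            .print => std.debug.print("{s}\n", .{current}),
            .halt => return,
        }
    }
}
```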

So, getting back to the title of this post, there are two methods I will be comparing to see what I can learn about optimizing a bytecode interpreter.

I will start with the method that seems to be more commonly used, from what I can tell, which is to store the references in a hashmap, with each key being the index that the interpreter reads while running the program. When it gets to a string reference, it takes that integer and looks up the string in the hashmap, which will already be loaded into RAM. I have not actually tested this method out yet. There is a nice article here which goes into optimizing hashmaps for a string lookup table, so I will be going into that as well to see what else I can learn about the optimization. It also contains some fairly advanced, but not impossible, Zig code, so that’s another reason why I’m waiting to go into it.
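Just to illustrate the shape of this first method, here is a rough sketch of the lookup side. The names are my own placeholders and the exact std APIs may vary a bit between Zig versions:

```zig
const std = @import("std");

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // The lookup table: integer keys mapping to the strings they stand for.
    var string_table = std.AutoHashMap(u32, []const u8).init(allocator);
    defer string_table.deinit();

    // Loaded once, up front, alongside the bytecode.
    try string_table.put(0, "data");
    try string_table.put(1, "print");

    // Later, while stepping through the bytecode, a string reference is just
    // an integer that gets resolved with a single lookup.
    const ref: u32 = 0;
    const resolved = string_table.get(ref) orelse return error.BadStringRef;
    std.debug.print("reference {d} -> \"{s}\"\n", .{ ref, resolved });
}
```

Since the keys end up being dense integers starting at zero, a plain slice or ArrayList of strings could serve the same purpose; the hashmap version above just mirrors how I described the method.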

The second method of storing bytecode is the one that I came up with logically when I was working on my interpreter from scratch a while back. This method is extremely simple, and involves loading the entire byte array into RAM along with the bytecode. So instead of using hashmaps you are simply using the original code itself. And for the bytecode, instead of using index numbers which reference the hashmap keys, you simply use a slice which points into the byte array at the specified offset. So something like byteArray[3..7] would point to the string “data” in the byte array. I believe this still fits the definition of bytecode, since everything is still an integer or pointer when it is interpreted. A slice is simply a pointer plus a length, so it takes up two machine words instead of one, but it still avoids copying over the byte data. In fact, you could represent your entire bytecode as single-length slices for tokens, i.e. byteArray[0..1] for a ‘+’ sign, and multi-length slices for strings, such as byteArray[2..4] for a string value, which also helps isolate the byte array, sort of like a region of protected kernel memory.
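Here is a rough sketch of that second method, again with made-up source text and offsets, just to show that the tokens are nothing more than slices back into the original buffer:

```zig
const std = @import("std");

pub fn main() void {
    // The original source stays loaded for the lifetime of the program.
    const source: []const u8 = "x + data";

    // Instead of indices into a separate table, the bytecode stream is a list
    // of slices: single-byte slices for operators, longer ones for names.
    const tokens = [_][]const u8{
        source[0..1], // "x"
        source[2..3], // "+"
        source[4..8], // "data"
    };

    for (tokens) |tok| {
        std.debug.print("token at offset {d}, len {d}: {s}\n", .{
            @intFromPtr(tok.ptr) - @intFromPtr(source.ptr), tok.len, tok,
        });
    }

    // Each slice is a pointer plus a length (two machine words), but none of
    // the underlying bytes are copied anywhere.
}
```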

The benefit of the second way, I think, is that it keeps the code as simple as possible. I was able to build a rudimentary loop function which can validate my entire AST. And this function itself is basic and predictable enough that I can easily change the syntax I want to validate, simply by tracing through the loop logic and adjusting as necessary. This is basically how I prefer to build my applications, in different levels of complexity. It allows me to focus on the tricky aspects which are hard to get right, while the code is relatively simple. Then once I’m ready to start on the production-grade code, it’s easier to swap out those placeholder ideas (e.g. just using a page allocator by itself) with the more proper techniques which cover things like error reporting and thread safety.

The drawback of this second method is that it will be hard to expand on and add more features to, and it might possibly use more RAM in certain situations. So that’s why I’d like to use the second method as sort of a testing environment for trying out other ideas.

I also want to point out that real bytecode will typically be a lot more complex than this, I’m sure, but by focusing on the idea that most code will consist of token operators or strings, I think this is a good place to start. So hopefully, once I put these 2 or 3 ideas together, I will be able to run some tests which will let me see the advantages and disadvantages of each method, as well as which is the fastest and which is the most memory-efficient approach at this level. Then, as I make the code more and more complex, I can build on these 2 or 3 tests to get a more analytical view of exactly what I can do with these functions.

I’m also hoping that by maintaining these different templates, I will be able to build a strong boilerplate which I can adjust and fine-tune depending on the type of program I am writing. This will also let me avoid pulling in too many outside dependencies, which is a must for me at all times. Sure, it’s not the easiest or fastest way to get a product out the door. But I’m not in a rush anyway, and it’s more satisfying to know what’s going on under the hood. I’m hoping these ideas will improve my future projects by providing a strong foundation. I also hope this will benefit coders who already have a large portfolio, by giving them the opportunity to provide some feedback, as well as alternate suggestions or ideas for testing different concepts.

So yeah hopefully on the next post I will have my examples ready.

Just a minor update. I started working on this a few days ago, and every way I looked at it, it just kept getting more complex, due to the number of options available.

So basically I have 3 different versions right now. I really want to take the opportunity at this level to do benchmarks comparing a pure C version to a pure Zig version, and also a Zig version which interfaces with the C ABI. There are also some interesting things you can see here with the code style.

I made a repo with 4 src files. Since I also want to develop my unit tests along with the main code, the C version is split into a main and a test file. The other 2 src files are for the Zig and Zig/C versions.

Right now all the source files with a main function do the same thing, basically, which is to duplicate a file and rename it, without changing its contents. I wanted to have all the I/O set up properly before I get into the actual compiling steps. So from here I can add the logic to do some rudimentary compiling. For this case I am going to be compiling a JSON-formatted file to a simple bytecode format, which strips out all the strings and replaces them with references to either a string table or a byte array. So that’s an extra 2 versions for each of the 3 versions I already have.

A few other things: I am using fgetc/fputc for the C code, and readByte/writeByte for the Zig code, for now. It seems to be the easiest and most direct way to visualize the process. Also, for the Zig/C version, I still have to use the Zig stdlib if I want to support command-line arguments; I couldn’t find another practical way to do this.
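For reference, the Zig copy loop I have in mind looks roughly like this. It is written against the pre-0.15 std.fs reader/writer interfaces, and the file names are placeholders, so treat it as a sketch rather than the actual source:

```zig
const std = @import("std");

pub fn main() !void {
    // Open the source file and create the destination; names are placeholders.
    const in_file = try std.fs.cwd().openFile("input.json", .{});
    defer in_file.close();
    const out_file = try std.fs.cwd().createFile("output.copy", .{});
    defer out_file.close();

    const reader = in_file.reader();
    const writer = out_file.writer();

    // Copy one byte at a time until the reader signals end of stream.
    while (true) {
        const byte = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        try writer.writeByte(byte);
    }
}
```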

Finally, I also had a thought to make this project a cloud compiler, to keep things interesting. This would also open the door for other people to try it out quickly in a sandbox environment. So that’s something else I’m going to be looking into for next week.

Link for the repo:
https://github.com/mscott9437/ChronoVM

1 Like

The way I see it, bytecode is essentially machine code specialized for a “higher-level” virtual machine (higher-level as in one level above native code, i.e. the JVM or Python’s virtual machine). Usually in these VMs, local variables are known at compile time, meaning that for each function, you could map each local variable to an index. When the code is interpreted and a function is called, the virtual machine allocates an array large enough to hold every local variable defined; and when a local variable is used, the VM only needs to access the index corresponding to that variable in the array.
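A tiny illustration of that slot idea (the slot assignments and sizes below are made up):

```zig
const std = @import("std");

pub fn main() void {
    // Pretend the compiler decided: slot 0 = "a", slot 1 = "b", slot 2 = "sum",
    // and that this function needs at most 4 locals.
    var locals = [_]i64{0} ** 4;

    locals[0] = 2; // something like `store_local 0`
    locals[1] = 40; // `store_local 1`
    locals[2] = locals[0] + locals[1]; // `load_local 0`, `load_local 1`, `add`, `store_local 2`

    std.debug.print("sum = {d}\n", .{locals[2]});
}
```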

I believe the “Crafting Interpreters” book also talks about this. It’s good reference material for building VMs.

2 Likes

Thanks. I will admit my definition of bytecode here is not exactly conventional, and by some standards is not really bytecode at all. But it’s also true that the “compiled” file will be parsed at a 1 instruction/1 byte ratio. Also, this is giving me a reason to work with the ideas I am already familiar with. And I have some other stuff I want to test out which isn’t necessarily related to interpreters, so that’s why I’m taking this approach for now. I suppose it would be more accurate to call it a pseudo-bytecode interpreter, but the goal is to eventually have a genuine interpreter dealing with machine code, JIT, etc.