So I am still trying to put some code together, based on what I was working on before. But first I think it will be useful to talk a little bit about how I am approaching the concept of bytecode. This is based mostly on my logical understanding of what bytecode should be, drawing on a minimum of outside sources, so I could be off a little bit here.
In its simplest form, bytecode is essentially code with all the strings stripped out and replaced with token identifiers. So while the source code itself would be a byte array containing all of the text, the bytecode would be all tokens, and thus all integers, with each token representing a change in the state of the running program. The tokens which represent strings act as references to the strings themselves. The benefit of this is that when the code is run through an interpreter, it will be much faster, since each string can be referenced as a single integer or pointer, as opposed to, say, 10 integers for a 10-letter word.
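To make that concrete, here is a rough sketch in Zig (the token values and the encoding scheme are just made up for illustration):

```zig
const std = @import("std");

// Source line: "total = price + tax"
//
// Hypothetical encoding, just for illustration:
//   0 = '+' operator, 1 = '=' operator,
//   100, 101, 102 = references to the strings "price", "tax", "total"
pub fn main() void {
    const bytecode = [_]u32{ 102, 1, 100, 0, 101 };
    // Five integers instead of nineteen bytes of source text.
    std.debug.print("{d} tokens\n", .{bytecode.len});
}
```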
In order for this to work, each integer representing a string must point to the referenced string, so that it can be quickly accessed. This reference table containing all the strings must be stored along with the bytecode in the program to be executed. The interpreter or virtual machine will first load the reference data into RAM, and then step through the bytecode one integer at a time, pulling in the stored references as needed.
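Here is a minimal sketch of that load-then-step flow, using a plain array as the reference table for now (the hashmap version comes up next), with the same made-up encoding as above:

```zig
const std = @import("std");

pub fn main() void {
    // The reference table, shipped alongside the bytecode.
    // In a real program this would be loaded from the compiled file.
    const strings = [_][]const u8{ "price", "tax", "total" };

    // Same hypothetical encoding as above:
    // 0 = '+', 1 = '=', 100 + n = strings[n]
    const bytecode = [_]u32{ 102, 1, 100, 0, 101 };

    // Step through one integer at a time, resolving references as needed.
    for (bytecode) |tok| {
        switch (tok) {
            0 => std.debug.print("op +\n", .{}),
            1 => std.debug.print("op =\n", .{}),
            else => std.debug.print("str {s}\n", .{strings[tok - 100]}),
        }
    }
}
```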
So, getting back to the title of this post, there are two methods I will be comparing to see what I can learn about optimizing a bytecode interpreter.
I will start with the method that, from what I can tell, is more commonly in use: storing the references in a hashmap, with each key being the index that the interpreter reads while it's running the program. When it gets to a string reference, it takes that integer and looks up the string in the hashmap, which will already be loaded into RAM. I have not actually tested this method out yet. There is a nice article here which goes into optimizing hashmaps for a string lookup table, so I will be going through that as well to see what else I can learn about the optimization. It also contains some fairly advanced, but not impossible, Zig code, which is another reason why I'm waiting to dig into it.
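Since I haven't tested it yet, take this with a grain of salt, but I imagine the shape of it in Zig would be something like the following, using std.AutoHashMap and the page allocator as a stand-in:

```zig
const std = @import("std");

pub fn main() !void {
    // Placeholder allocator -- a real interpreter would use
    // something more deliberate.
    const allocator = std.heap.page_allocator;

    // Load the reference data into the map up front...
    var strings = std.AutoHashMap(u32, []const u8).init(allocator);
    defer strings.deinit();
    try strings.put(100, "price");
    try strings.put(101, "tax");
    try strings.put(102, "total");

    // ...then the interpreter loop looks up strings by their token id.
    const bytecode = [_]u32{ 102, 1, 100, 0, 101 };
    for (bytecode) |tok| {
        if (tok >= 100) {
            std.debug.print("str {s}\n", .{strings.get(tok).?});
        } else {
            std.debug.print("op {d}\n", .{tok});
        }
    }
}
```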
The second method of storing bytecode is the one that I came up with logically when I was working on my interpreter from scratch a while back. This method is extremely simple, and involves loading the entire byte array into RAM along with the bytecode. So instead of using a hashmap, you are simply using the original code itself. And for the bytecode, instead of using index numbers which reference hashmap keys, you simply use a slice which points into the byte array at the specified offset. So something like byteArray[3..7] would point to the string "data" in the byte array (Zig ranges are half-open, so that slice covers bytes 3 through 6). I believe this still fits the definition of bytecode, since everything is still an integer or pointer when it is interpreted. A slice is simply a pointer with a length, so it is a bit larger than a bare pointer (a pointer plus a usize in Zig), but it still avoids copying over the byte data. In fact you could represent your entire bytecode as single-length slices for tokens, i.e. byteArray[0..1] for a '+' sign, and multi-length slices for strings, such as byteArray[2..4] for a string value. This also helps keep the byte array itself isolated, sort of like a kernel's protected region of memory.
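Here is a small sketch of what I mean, with the byte array contents and offsets made up for illustration:

```zig
const std = @import("std");

// A token is just a slice into the original byte array:
// a pointer plus a length, no copying.
pub fn main() void {
    // The original code stays loaded in RAM, untouched.
    const byteArray = "x = data + y";

    // "Bytecode" as slices: single-length slices for operator tokens,
    // multi-length slices for strings. Zig ranges are half-open, so
    // byteArray[4..8] is the four bytes "data".
    const tokens = [_][]const u8{
        byteArray[0..1], // "x"
        byteArray[2..3], // "="
        byteArray[4..8], // "data"
        byteArray[9..10], // "+"
        byteArray[11..12], // "y"
    };

    for (tokens) |tok| {
        std.debug.print("{s}\n", .{tok});
    }
}
```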
The benefit of the second way, I think, is that it keeps the code as simple as possible. I was able to build a rudimentary loop function which can validate my entire AST. And this function itself is basic and predictable enough that I can easily change the syntax I want to validate, simply by tracing through the loop logic and adjusting as necessary. This is basically how I prefer to build my applications: in different levels of complexity. It allows me to focus on the tricky aspects which are hard to get right while the code is still relatively simple. Then once I'm ready to start on the production-grade code, it's easier to swap out those placeholder ideas (i.e. just using a page allocator by itself) for the more proper techniques which cover things like error reporting and thread safety.
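My actual validation function isn't ready to show yet, but the general shape I'm describing would be something like this sketch, which walks the slice tokens and checks one made-up rule (operands and operators must alternate):

```zig
const std = @import("std");

// A rough sketch of a single-pass validation loop over slice tokens --
// not my real function, just the general shape. Tracing through this
// and adjusting the rules is how I'd change the syntax being validated.
fn validate(tokens: []const []const u8) bool {
    var expectOperand = true;
    for (tokens) |tok| {
        const isOperator = tok.len == 1 and (tok[0] == '+' or tok[0] == '=');
        if (expectOperand == isOperator) return false;
        expectOperand = !expectOperand;
    }
    // A well-formed expression ends on an operand.
    return !expectOperand;
}

pub fn main() void {
    const byteArray = "x = data + y";
    const tokens = [_][]const u8{
        byteArray[0..1], byteArray[2..3], byteArray[4..8],
        byteArray[9..10], byteArray[11..12],
    };
    std.debug.print("valid: {}\n", .{validate(&tokens)});
}
```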
The drawback of this second method is that it will be harder to expand on and add more features to, and it might use more RAM in certain situations. That's why I like to use the second method as a sort of testing environment for trying out other ideas.
I also want to point out that real bytecode will typically be a lot more complex than this, I'm sure, but by focusing on the idea that most code will consist of token operators or strings, I think this is a good place to start. So hopefully once I put these 2 or 3 ideas together, I will be able to run some tests which will show me the advantages and disadvantages of each method, as well as which is the fastest and which is the most memory-efficient approach at this level. Then as I make the code more and more complex, I can build on these 2 or 3 tests to get a more analytical view of exactly what I can do with these functions.

I'm also hoping that by maintaining these different templates, I will be able to build a strong boilerplate which I can adjust and fine-tune depending on the type of program I am writing. This will also let me avoid pulling in too many outside dependencies, which is a must for me at all times. Sure, it's not the easiest or fastest way to get a product out the door. But I'm not in a rush anyway, and it's more satisfying to know what's going on under the hood. I'm hoping these ideas will improve my future projects by providing a strong foundation. I also hope this will benefit coders who already have a large portfolio, by giving them the opportunity to provide some feedback, as well as alternate suggestions or ideas for testing different concepts.
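For the speed side of those tests, the simplest tool I know of in Zig's standard library is std.time.Timer, which would wrap each method roughly like this:

```zig
const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();

    // ... run one of the two interpreter loops here ...

    // Elapsed wall-clock time in nanoseconds.
    const elapsed = timer.read();
    std.debug.print("took {d} ns\n", .{elapsed});
}
```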
So yeah, hopefully in the next post I will have my examples ready.