The World Meteorological Organization (WMO) has a standard binary file format for meteorological data called BUFR: the Binary Universal Form for the Representation of meteorological data. It is a heavily structured, table-driven format that traditionally requires decoding libraries to load lookup tables at runtime.
For a number of reasons, I am attempting to write a BUFR decoder from scratch that requires zero runtime tables, so that it can be used in sandboxed WebAssembly environments and interface easily with existing C++ visualization code. Zig’s implementation of comptime and reflection, as well as its C/C++ interop and WebAssembly compile target, made it really enticing over something like C++ or Rust.
I’m still deep in the weeds of it all, but I have been writing about the project and also using it as an opportunity to introduce Zig to the wider atmospheric science community. I thought I would share it here to show others how Zig can be used for some relatively “niche” use cases in software-adjacent fields such as meteorology. I also welcome feedback, if there is any!
I’m writing for a software-adjacent field where software expertise, experience, and language exposure vary wildly… so my framing may be a bit different than you’d expect. I’m also relatively new to Zig, so there may be some things I got wrong. Please let me know if so!
I have the major structural decoding of BUFR messages done, but the actual data to be accessed is encoded in a byte stream using a series of “descriptors”. Those descriptors correspond to table elements that can define sequences, individual elements, replication factors, or mathematical operations. My strategy right now has been to parse the CSV tables and turn them into compiled Zig code… though I’m still trying to work out exactly how it should be structured. Let’s just say this file format is not for the faint of heart…
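To make the comptime-table idea concrete, here’s a rough sketch of the direction I mean. The CSV layout, field names, and bit packing here are simplified stand-ins for illustration, not the real WMO table format:

```zig
const std = @import("std");

// Simplified stand-in for a Table B CSV; the real WMO tables have
// more columns (units, names) and a different layout.
const table_b_csv =
    \\0,12,101,2,0,16
    \\0,10,4,-1,0,14
;

const Element = struct {
    fxy: u16, // descriptor packed as F (2 bits), X (6 bits), Y (8 bits)
    scale: i8,
    reference: i32,
    bit_width: u8,
};

fn parseRow(line: []const u8) Element {
    var cols = std.mem.tokenizeScalar(u8, line, ',');
    const f = std.fmt.parseInt(u16, cols.next().?, 10) catch unreachable;
    const x = std.fmt.parseInt(u16, cols.next().?, 10) catch unreachable;
    const y = std.fmt.parseInt(u16, cols.next().?, 10) catch unreachable;
    return .{
        .fxy = (f << 14) | (x << 8) | y,
        .scale = std.fmt.parseInt(i8, cols.next().?, 10) catch unreachable,
        .reference = std.fmt.parseInt(i32, cols.next().?, 10) catch unreachable,
        .bit_width = std.fmt.parseInt(u8, cols.next().?, 10) catch unreachable,
    };
}

// Runs entirely at compile time; the finished table is baked into
// the binary as constant data, so nothing is loaded at runtime.
pub const table_b: []const Element = blk: {
    @setEvalBranchQuota(100_000);
    var out: []const Element = &.{};
    var lines = std.mem.tokenizeScalar(u8, table_b_csv, '\n');
    while (lines.next()) |line| {
        out = out ++ &[_]Element{parseRow(line)};
    }
    break :blk out;
};
```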
Hello,
I first encountered your blog yesterday and I enjoy your style of writing a lot. The BUFR decoder is a cool project as well; if you need any help, feel free to ask here (although you’ve gotten really far already).
The only thing that caught my eye when reading: you mentioned that all other solutions do runtime loading of the protocol tables (I don’t know, so I take your word for it). It got me really curious why that is, since it is possible in other languages (although not as nicely) to generate decoding code statically based on the tables. Might it be a deliberate choice? The main benefit I see is that it allows you to just update the protocol tables on a system where the decoder is already present, without requiring a recompile. What do you think?
Your intuition is correct – the main benefit and design intention behind the runtime loading of tables is that it allows for the tables to be updated without necessarily having to change the code/decoder logic. No recompile or updates needed.
It’s definitely not a bad design decision, but not every environment makes it easy to ship and load tables. I admittedly could be wrong, but in my time messing with WebAssembly and Emscripten (with WASM ultimately one of the environments I intend to target), file IO is a bit of a royal pain in the rear. I’m sure I could find a way to ship the tables as JSON and pass them into the program, but that just kind of… sucks, for lack of a better description. I also need this decoder to be easy to embed and ship with a very lightweight C++ GUI built on Dear ImGui, and I would really prefer not to have to touch the user environment and install tables somewhere. With previous libraries I’ve written, that’s always been a pain point and footgun, particularly when shipping cross-platform. I know Zig has @embedFile, but that would ultimately mean updating and recompiling anyway!
Something else worth noting is that the tables are only officially updated twice a year, and usually it is only to add new entries for new data and observation types. In fact, most of the data I need to interact with are generated using older table versions anyway. Updating the WebAssembly binary twice a year doesn’t seem too bad!
It’s worth mentioning, too, that a more fully-fledged version of this library could (and should) allow for runtime loading of tables as well. That data could be parsed into the same internal data structures and used dynamically, serving both use cases. My use case is a bit of an “abuse” of the standard, but still compliant… and the hope is that it makes the format far more portable and usable for developers than what is currently available.
For something like this, I don’t think you need a full WASI-style file IO environment. I would suggest exposing functions from the WebAssembly module that allow the host to store the table contents in memory and then call an updateTable (or similar) function that gets a pointer and a length. Then you do the parsing from there.
Then the host is responsible for file or network IO, however the new tables are fetched. And since you have to write the table parsing logic anyway, this can just be a thin wrapper around that logic.
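To sketch what I mean (assuming a wasm target for std.heap.wasm_allocator; the names alloc, updateTable, and parseTables are hypothetical, and the table parsing itself is stubbed out):

```zig
const std = @import("std");

// Stand-in for the library's internal table representation; in the
// real decoder this would be whatever the comptime path also fills.
const Tables = struct { raw: []const u8 };

fn parseTables(allocator: std.mem.Allocator, bytes: []const u8) !Tables {
    // The shared CSV-parsing logic would live here; this stub just
    // keeps a copy of the bytes.
    return .{ .raw = try allocator.dupe(u8, bytes) };
}

var active_tables: ?Tables = null;

const gpa = std.heap.wasm_allocator;

// The host calls this first to get a buffer inside guest memory,
// then copies the fetched table file into it.
export fn alloc(len: usize) ?[*]u8 {
    const buf = gpa.alloc(u8, len) catch return null;
    return buf.ptr;
}

// The host hands back the pointer and length; all parsing happens
// inside the module, so no file IO is needed in the sandbox.
export fn updateTable(ptr: [*]const u8, len: usize) bool {
    active_tables = parseTables(gpa, ptr[0..len]) catch return false;
    return true;
}
```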
I still think your current plan of parsing at compile time is a good idea. Shipping with one already in the binary allows for the module to spin up without needing to do any loading from the embedder.
I think you are on the right track design-wise. Embed default tables with @embedFile, and if no tables are provided to you on initialization, use the embedded ones.
Then parsing is just the same for the embedded or provided table.
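Roughly like this, reusing the same hypothetical Tables/parseTables shape as the sketch above (the embedded file path is a stand-in too):

```zig
const std = @import("std");

// Hypothetical default tables baked in at build time.
const default_tables_csv = @embedFile("tables/table_b.csv");

const Tables = struct { raw: []const u8 };

fn parseTables(allocator: std.mem.Allocator, bytes: []const u8) !Tables {
    return .{ .raw = try allocator.dupe(u8, bytes) };
}

pub const Decoder = struct {
    tables: Tables,

    // If the caller hands us table bytes, parse those; otherwise fall
    // back to the embedded copy. One parser serves both paths.
    pub fn init(allocator: std.mem.Allocator, user_tables: ?[]const u8) !Decoder {
        const bytes = user_tables orelse default_tables_csv;
        return .{ .tables = try parseTables(allocator, bytes) };
    }
};
```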