Comptime XML parsing?

I’m building a library for a real-time industrial Ethernet protocol called EtherCAT (typical latency requirement: < 20-50 us).

The library should be usable on the following targets:

  1. Linux
  2. Windows
  3. Microcontrollers

An EtherCAT controller (called the MainDevice) requires significant information about the subdevices it intends to control. This information can be provided in an EtherCAT Network Information (ENI) file, which is an XML file.

But for a large network (about 200 subdevices) the XML file can be very large, reaching 400,000 lines or 20 MB.

I’m considering the following to support all of these platforms:

  1. For embedded, it is reasonable to require users to re-compile their application when their network changes, so perhaps we parse the XML files at comptime?
  2. For Linux/Windows, we would like users to be able to build an application that accepts arbitrary ENI.xml files, parsed at runtime.

My questions are:

  1. How should I be thinking about the advantages and disadvantages of doing things at build time, comptime, and runtime? I imagine that for extremely latency-critical applications, users will stack-allocate their memory (sized at comptime, but typically limited to < 8 MB).
  2. What are the differences between build time and comptime?
  3. What Zig XML libraries do you recommend?
  4. Can I parse XML files at comptime?

One wrench to throw in:
The XML files are encoded in ISO-8859-1, not UTF-8 :)

Personally, I would start by building an application that uses a C XML library to read the XML and generate an optimized internal representation that is hopefully a lot smaller, looking at what is possible to reduce the size.

Then you can use that as a build step to generate the data and include it in the build; here is a simple example of that: Zig Build System ⚡ Zig Programming Language

Then you could optionally expand that later (on platforms where it is supported): instead of embedding the data directly, put it into a dynamic library that you can link at runtime, use memory mapping to map a file, or simply read the file. That covers cases where you want to update the data while the application is running, without recompiling it.

I don’t know how feasible that would be on the more memory-constrained devices, but maybe you could instead have something that connects to those and carries out the update, either by replacing the full application or by patching the running application, if you can figure out how to do that (if you do, let us know).

I don’t think comptime fits this use case; I would especially avoid dealing with XML parsing at comptime. Instead I would transform the XML into something less verbose: intern strings for attributes/tags you can’t get rid of (if they are used too randomly/sporadically), eliminate strings where possible (for example by making them required struct fields, which compile down to an offset), and generate enums for repeating values, tags, attributes, etc. A whole other research area would be exploring different compression methods.

The link also has an example of generating source code (next section); you could use that to also generate the code for using/reading the data you generated from the XML.

I think the encoding doesn’t matter that much unless you have specific requirements; for example, if you want to use UTF-8 internally, then it would probably be helpful to convert to UTF-8 while you are reading the XML.
But if you have no reason to change the encoding, you could keep it as is (at least I think so; are there other concerns you have about the encoding?).

I guess the XML parsing might need to deal with some details about the encoding; I am not exactly sure whether XML libraries support arbitrary encodings or only specific ones. Maybe you can specify your encoding-related constraints in more detail, e.g. is the code required to use ISO-8859-1? If so, converting it to UTF-8 and back might not be worth it. Also, keeping it in ISO-8859-1 might make the data more compact, because UTF-8 uses multi-byte sequences for the upper half of the byte range.


A year ago or so, it only took minor tweaks to adjust andrewrk’s XML parser to run at comptime… although it seems to me Zig is moving in a direction where you’d transform your XML into something directly usable as a build step.

There is zig-xml by @nektro. The parser expects UTF-8 encoded XML.
To convert ISO-8859-1 to UTF-8 you don’t need a mapping table, because each ISO-8859-1 byte value is equal to the corresponding Unicode code point.


Maybe you could use the fact that the Zig build system can call programs during compilation steps. The simplest approach might be to make a tiny program in C or Zig whose purpose is to parse the XML file: for the resource-constrained environment, have that program generate an artifact to be available at compilation, and for the other targets simply do a little execve to get the updated config. And I agree with @Sze, I think comptime won’t be very useful in this case :)


Comptime is a bad fit for XML processing due to the difficulties of working with comptime pointers. It’s essential, though, for the actual use case here: handling a diverse range of different hardware in a potentially resource-limited situation.

Bundling the zig compiler with the app and rigging up a sort of JIT compilation mechanism is how I would deal with this.

I’m thinking as a first pass, I will create a CLI tool to translate the XML file into some sort of Zig source file. It could be part of a build, but I think it would generally be done once and then tracked in source control.

Just convert it to some data that you can easily access by embedding it with @embedFile; the link shows exactly how to do it: Zig Build System ⚡ Zig Programming Language

You change the word_select example to read the XML and generate data that can be easily read, which then gets embedded via addAnonymousImport and @embedFile. You can even write some struct definitions in a separate module and share it between the modified word_select example and your application, to both create the data and later read it.

Translating the data to Zig source code is an unnecessary step, when you can use @embedFile instead.


It’s often helpful to see some code doing this kind of thing, so I’ll also point you to zg, @dude_the_builder’s Unicode string library, specifically the codegen folder. What it does is quite similar to your application, just using Unicode chart data instead of XML.

Just in case somebody didn’t see it on the linked page, there is:

tools/word_select.zig (click to expand/collapse)

Clicking on that shows the application that does the transform from input to output, which is then used with @embedFile.

The reason I find that preferable to manually executing it once and adding the output as source is that the start of the pipeline (the XML file) then still remains the source of truth, and the derived steps can be verified more easily. Also, if the XML file ever changes, the derived steps can be updated more automatically and straightforwardly, instead of having to manually update the generated artifact that was added to version control.

But having a version of the XML committed to some repository would probably be good, just in case it ever isn’t reachable from its URL or suddenly changes in an unexpected way.