Lexbor in Zig

This is my second Zig (0.15.1) learning project: wrap the lexbor DOM emulator / HTML parser.

The core is 100% lexbor, to which I intended to add a DOM-only sanitizer, a bit like DOMPurify. It works on the DOM rather than on strings.

https://github.com/ndrean/z-html

Feedback welcome!

Very nice! I like your API :D

I’m actually working on something similar, mostly a thin wrapper around the API with no new code. It’s unfinished, but the library build is done: cancername/lexbor - Codeberg.org

I might make a PR to zexplorer to compile lexbor in build.zig later!

The idea is to stay close to JavaScript semantics. I did this to build a server rendering framework using Zig (and lexbor!) with HTMX on the client side.
Instead of relying on string interpolation as standard backends do, the server of the webapp loads (parses) the HTML once and for all. For each client request (over HTTP or possibly a websocket), the server sends back an updated HTML string by querySelecting the target from its immutable in-memory DOM. I do this simply by cloning the target.
I tested on a 40 kB HTML string: Zig/lexbor parses it at roughly 700 MB/s (about 50 µs for this 40 kB HTML) and can respond to around 250-280k ops/s. An op is: querySelect in the full HTML (DOM), get the HTML of the targeted element, clone it, interpolate values (potentially from a database or key store) and deliver the modified fragment as serialized text.
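The interpolation step of such an op can be sketched in plain Zig. This is only a minimal sketch and not the z-html API: `renderCartItem` and the `{name}` / `{count}` placeholder syntax are hypothetical, and the fragment is assumed to be the already serialized clone of the target element. The values are assumed to be trusted or already escaped.

```zig
const std = @import("std");

/// Hypothetical helper: fills the `{name}` and `{count}` placeholders of a
/// serialized HTML fragment. The placeholder syntax is an assumption made
/// for this sketch. Caller owns the returned slice.
pub fn renderCartItem(
    allocator: std.mem.Allocator,
    fragment: []const u8,
    name: []const u8,
    count: u32,
) ![]u8 {
    // Format the counter into a small stack buffer (a u32 has at most 10 digits).
    var buf: [16]u8 = undefined;
    const count_text = try std.fmt.bufPrint(&buf, "{d}", .{count});

    // First pass: substitute the name into a temporary copy.
    const with_name = try std.mem.replaceOwned(u8, allocator, fragment, "{name}", name);
    defer allocator.free(with_name);

    // Second pass: substitute the counter and hand the result to the caller.
    return std.mem.replaceOwned(u8, allocator, with_name, "{count}", count_text);
}

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    const html = try renderCartItem(arena.allocator(), "<li>{name}: {count}</li>", "apple", 3);
    std.debug.print("{s}\n", .{html});
}
```

Two passes over the fragment mean two allocations per response. A real implementation would probably write into a reused buffer instead.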
In my tests this is in fact faster than string-based interpolation (Go fmt.Sprintf or Rust). But here I can also sanitize the input if needed, i.e. remove malicious scripts or suspicious “on-xxx” attributes, with the same level of performance.
So it turns out to be a surprisingly performant solution.
I believe the API for responding to HTMX requests can be made easy to use. I am now looking for a ready-to-use Zig webserver.

That’s pretty cool.

Do note that sanitizing HTML is infamously very difficult and complex and you’re better off escaping from the get-go.

For web servers, try http.zig or zzz.

Yes, I saw that it is not easy, but escaping everything just produces useless text. lexbor already escapes some things, but not all. I tried an allowlist (whitelist) approach on DOM attributes, as href and src are XSS vectors. Still very experimental though.
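An attribute allowlist of this kind can be sketched as below. This is a hedged illustration, not the sanitizer in z-html: `keepAttribute`, the list of allowed attribute names and the list of allowed URL schemes are all assumptions chosen for the example. It also assumes the value comes from the parsed DOM, so entities are already decoded.

```zig
const std = @import("std");

// Assumption for this sketch: attribute names that are kept.
// Anything else, including every on* event handler, is dropped.
const allowed_attrs = [_][]const u8{ "class", "id", "href", "src", "alt", "title" };

// Assumption for this sketch: URL schemes accepted in href/src values.
const allowed_schemes = [_][]const u8{ "http:", "https:", "mailto:" };

fn isUrlAttr(name: []const u8) bool {
    return std.ascii.eqlIgnoreCase(name, "href") or std.ascii.eqlIgnoreCase(name, "src");
}

/// True when a ':' appears before any '/', '?' or '#',
/// i.e. the value starts with something that looks like a scheme.
fn hasScheme(value: []const u8) bool {
    for (value) |c| {
        switch (c) {
            ':' => return true,
            '/', '?', '#' => return false,
            else => {},
        }
    }
    return false;
}

/// Decides whether an attribute survives sanitization.
pub fn keepAttribute(name: []const u8, value: []const u8) bool {
    const known = for (allowed_attrs) |allowed| {
        if (std.ascii.eqlIgnoreCase(name, allowed)) break true;
    } else false;
    if (!known) return false;
    if (!isUrlAttr(name)) return true;

    // Relative URLs (no scheme) are kept; absolute ones must use an allowed scheme.
    const url = std.mem.trim(u8, value, " \t\r\n");
    if (!hasScheme(url)) return true;
    for (allowed_schemes) |scheme| {
        if (std.ascii.startsWithIgnoreCase(url, scheme)) return true;
    }
    return false;
}
```

With these lists, `keepAttribute("href", "javascript:alert(1)")` and `keepAttribute("onclick", "steal()")` return false, while `keepAttribute("href", "/cart")` returns true. Checking schemes against an allowlist rather than blocking `javascript:` by name also rejects obfuscated variants, since anything unrecognized is dropped.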

I tried Zap, since zzz only works with 0.14 and I fought with the writer in http.zig.
On my machine, I could not go higher than 70,000 req/s.
I used k6 to simulate concurrent users hitting a toy app: a shopping cart where you select an item from a list, with a +/-/rm counter on each item.
It is all Zig: just a string with HTMX attributes, some templates and a script to load HTMX when the page is rendered, a Zap backend serving the endpoints, and my little library to parse the HTML and extract the templates, followed by simple string interpolation to populate them.
Each concurrent user carries an X-session-id cookie, used to build an in-memory per-user cart. The (fake) session id is set on connection, or by k6 when testing.
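The in-memory per-session cart can be sketched roughly like this. It is an illustration under assumptions, not the code of the toy app: `CartStore` and `adjust` are hypothetical names, the session id is assumed to be parsed from the cookie into a `u64`, items are identified by a `u32`, and a mutex is assumed to be needed because the web server handles requests on several threads.

```zig
const std = @import("std");

const ItemId = u32;
const Cart = std.AutoHashMap(ItemId, u32); // item id -> quantity

/// Hypothetical store: one cart per session id, guarded by a mutex.
pub const CartStore = struct {
    allocator: std.mem.Allocator,
    mutex: std.Thread.Mutex = .{},
    carts: std.AutoHashMap(u64, Cart),

    pub fn init(allocator: std.mem.Allocator) CartStore {
        return .{
            .allocator = allocator,
            .carts = std.AutoHashMap(u64, Cart).init(allocator),
        };
    }

    pub fn deinit(self: *CartStore) void {
        var it = self.carts.valueIterator();
        while (it.next()) |cart| cart.deinit();
        self.carts.deinit();
    }

    /// Adds `delta` (for example +1 or -1) to an item count and returns the
    /// new count. An item whose count drops to zero is removed from the cart.
    pub fn adjust(self: *CartStore, session: u64, item: ItemId, delta: i32) !u32 {
        self.mutex.lock();
        defer self.mutex.unlock();

        // Create the cart lazily on the first request of a session.
        const entry = try self.carts.getOrPut(session);
        if (!entry.found_existing) entry.value_ptr.* = Cart.init(self.allocator);
        const cart = entry.value_ptr;

        const current: i64 = cart.get(item) orelse 0;
        const next: u32 = @intCast(@max(current + delta, 0));
        if (next == 0) {
            _ = cart.remove(item);
        } else {
            try cart.put(item, next);
        }
        return next;
    }
};
```

A handler for the "+" button would call something like `store.adjust(session_id, item_id, 1)` and interpolate the returned count into the item template. A single global mutex is the simplest choice and may well become a bottleneck at high request rates.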
I am a bit disappointed, as I could not go higher than 12,500 VUs interacting simultaneously with the app: the stress test makes each VU visit the list of products and play with the item counts.
But this is already serving 60-70k req/s, even though building each response takes a few allocations.
I also had to limit the ramp-up to 12,500 VUs in around 10 s, roughly one new VU every ms. Above this, it fails.
Todo:

  1. Use a JWT instead of a cookie and a HashMap shopping cart. Could be beneficial.
  2. Use SQLite instead of a hardcoded list of items. I guess this will slow things down significantly when I render the grocery list to select products. Having async Zig would be beneficial.