Vind: crazy-fast tiny embeddable full text search engine
I was working on a project that needed fast text searching in the browser. I wanted to use Typesense but its dependencies wouldn’t allow it to easily be compiled for WASM, so I wrote Vind (“find” in Dutch) originally as a library. Since then, it’s grown to have a CLI (originally for testing) and HTTP (originally to compare it to Typesense), in addition to the C ABI and a WASM target.
It indexes on trigrams and each distinct trigram owns a Roaring Bitmap (using CRoaring because ZRoaring is not fully there yet) of the documents that contain it, and every tag owns one too. Roaring does the heavy lifting, handling the set algebra and keeping the posting lists compact in memory. The blob is smaller still because it stores only the documents and interned strings and rebuilds the bitmaps on load (1M docs in 143 MB).
It supports text, tags, and dates. It handles fuzzy matching (“vice” to “voice”) simply by virtue of scoring mechanism. It’s pretty damn fast. In benchmarks, it handled 100% of the requests (using httpz) vs Typesense failing about half the time, and was still 8x-13x faster than Typesense.
I cannot claim this project to be AI-free (sorry), but it’s at a relatively good point now that AI is no longer used to write code, and PRs and bug reports are warmly accepted (be direct, but be kind).
Supported Zig versions
0.16.0
AI / LLM usage disclosure
LLM used as part of the code writing, through review of code and design, utilities for benchmarking, and proposing fixes to the oh so many bugs I had which I refused to put in a commit in shame. 
12 Likes
I’ve been working on my audio fingerprint search index for a long time, and have been planning to fork it into a text search engine at some point, so these kind of projects interest me.
However, I found something something that I don’t understand in your project. You are taking trigrams, essentially u24 and you are expanding them to u32 using FNV hashing. I guess this is for better distribution of the tokens, but since your engine is static, I don’t see how it improves anything. My intuition from my search engine is that the less data you index, the better. Did you find it to work better with the hashes?
2 Likes
I’m ashamed to say I honestly didn’t think of just packing them. I went with instinct of hashing to a size.
Roaring forces the doc ids in the bitmaps to be u32, so I just ended up using u32 to match the rest of the code (interner ids, doc ids, roaring bitmap functions). Then my brain said “well, key to u32, I should just hash it.” I went with FNV specifically since the collisions are supposed to be so rare and I never used it. But really, it’s pointless. u24 or u32, it stores all of them, so might as well pack to u24 with padding to u32.
Do you think there’s a benefit in keeping them strict u24 (then inflate to u32 on the spot) vs. just keeping them as u32 (u24+padding)?
I want to support Unicode too, but that’ll kick it up to u64.
I didn’t read the rest of the code, so I don’t know, this just jumped at me. The hashes make sense if you are sharding/segmenting the index somehow, and want more uniform distribution across the u32 space. If you have just one big table, then the three characters plus padding seem better to me, as you are not wasting CPU time on the hashes that just increase the chance of collisions over using the trigrams directly. And obviously if this was a larger index, using the fact that you only have three characters, not four, would be important.
1 Like