I’ve been working on a CSV parser for Zig, which ended up with a few different variants (including zero-allocation and allocating variants). It can be downloaded with Zon. This is my first Zig library, so feedback on how to improve my usage of Zig is appreciated.
I am not yet qualified enough to look at the code, but quite interested.
I have one question though (because the code reminds me of something I wrote myself in another language):
Do you handle fields enclosed in quotes that span multiple lines because the field itself contains linefeeds?
That is where almost every program goes wrong, I believe. My list so far: Excel, LibreOffice Writer, SQL Server BULK INSERT… I never found a way to preserve correct data in these.
Yes, I handle fields embedded in quotes, including newlines. So the field `"hello \r\n""world"""` would be interpreted as `hello \r\n"world"`. The parser also accepts both CRLF and LF line endings for rows.
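For anyone curious what that rule looks like in code, here is a minimal, standalone sketch of the doubled-quote unescaping described above. It is not the library's actual implementation, and it assumes a well-formed quoted field:

```zig
const std = @import("std");

// Sketch only: collapse RFC 4180 style doubled quotes ("") inside a quoted
// field into single quotes. Assumes the input starts and ends with a quote
// and is well formed.
fn unquoteField(allocator: std.mem.Allocator, quoted: []const u8) ![]u8 {
    std.debug.assert(quoted.len >= 2 and quoted[0] == '"' and quoted[quoted.len - 1] == '"');
    const buf = try allocator.alloc(u8, quoted.len);
    errdefer allocator.free(buf);

    var n: usize = 0;
    var i: usize = 1;
    const end = quoted.len - 1; // stop before the closing quote
    while (i < end) : (i += 1) {
        if (quoted[i] == '"') i += 1; // "" becomes a single "
        buf[n] = quoted[i];
        n += 1;
    }
    return allocator.realloc(buf, n);
}

test "doubled quotes and embedded CRLF survive" {
    const out = try unquoteField(std.testing.allocator, "\"hello \r\n\"\"world\"\"\"");
    defer std.testing.allocator.free(out);
    try std.testing.expectEqualStrings("hello \r\n\"world\"", out);
}
```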
I’m curious (I will take a read through your code, which will probably answer my own question): is your implementation generic enough that you could use it for any kind of delimited tabular data?
I would imagine something akin to the way Zig does it in the stdlib: the caller provides the element type of the data being parsed and a delimiter that is an instance of that type, for example:
```zig
fn splitScalar(comptime T: type, buffer: []const T, delimiter: T) SplitIterator(T, .scalar)
```
And if not, I suppose that would be my suggestion to you! I have written a non-generic CSV parsing implementation in Zig for working with some data that happened to be tab-delimited. I definitely had the thought to make it generic for the future, but I haven’t found the time yet.
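For reference, the stdlib pattern being described looks roughly like this when pointed at tab-delimited data (just `std.mem.splitScalar`, with none of the quote handling a real CSV/TSV parser needs on top):

```zig
const std = @import("std");

test "naive tab-delimited split with std.mem.splitScalar" {
    // No quote handling at all here, which is exactly the gap a real
    // CSV/TSV parser has to fill on top of this.
    const line = "alpha\tbeta\tgamma";
    var it = std.mem.splitScalar(u8, line, '\t');
    try std.testing.expectEqualStrings("alpha", it.next().?);
    try std.testing.expectEqualStrings("beta", it.next().?);
    try std.testing.expectEqualStrings("gamma", it.next().?);
    try std.testing.expect(it.next() == null);
}
```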
No, I don’t currently support changing delimiters.
It is something I’ve thought about adding. I’m in a similar boat where I haven’t made it generic quite yet.
Since there is no standard formal definition of the format, all you have is a variety of implementations that differ in the details of what they support. Plenty of implementations are nothing more than line-by-line string split on tab/comma, which is good enough for purely numeric data.
@jules I just pushed a new release to allow for customizing delimiters, quotes, etc. The change is for both parsing and writing data. Thank you for the suggestion!
That is true, there isn’t anything formal about CSV files (or their variants). And yes, for pure numerical data doing a split is often enough.
For my use case, I needed something with more advanced quoting, such as line endings and quotes inside quoted text. Those edge cases are accounted for by the related RFC (RFC 4180). I did differ from the RFC when it came to parsing line endings, since I allow both LF and CRLF while the RFC specifies only CRLF. When generating CSVs, I output CRLF. That means I can parse and generate RFC-compliant CSVs, and I can also parse CSVs that have non-compliant line endings.
The main reason I differed on how I parse line endings is I’m often on Unix systems, so most of my hand-rolled files end up getting LF endings from editors instead of CRLF. Instead of trying to fix all of my CSV files, I opted to just have a more flexible parser.
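As an illustration of that leniency (a sketch of the general technique, not necessarily how this library implements it), accepting both endings can be as simple as splitting records on LF and trimming an optional trailing CR, with newlines inside quoted fields still requiring a real parser:

```zig
const std = @import("std");

// Strip an optional trailing CR so LF-only and CRLF rows look the same.
fn trimRecord(line: []const u8) []const u8 {
    return if (std.mem.endsWith(u8, line, "\r")) line[0 .. line.len - 1] else line;
}

test "LF and CRLF rows both parse the same" {
    var it = std.mem.splitScalar(u8, "a,b\r\nc,d\n", '\n');
    try std.testing.expectEqualStrings("a,b", trimRecord(it.next().?));
    try std.testing.expectEqualStrings("c,d", trimRecord(it.next().?));
}
```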
Sounds great.
I am often astounded at how badly big applications handle CSV with line endings inside quotes. I also had to write my own parser at work.
oooOOOoooo cool, thanks for letting me know, I will check it out!! And you’re very welcome for the suggestion, I hope it was a fun one to work on and implement. How did you find writing your implementation? If there was anything interesting to note, I’d love to hear!
I think I will actually pull in your project as a dependency now, for the program I mentioned that requires TSV parsing. I had been meaning to expand it to handle multiple delimiters so it could also support CSV and other similar formats. The program basically has to churn through millions of entries of chemical data in SDF, CSV, TSV, MDL, etc., and unify the data before it can work with it, and I didn’t really want to spend my time writing parsers for formats that aren’t really relevant and are generic enough to have standard implementations. And now you’ve kindly helped me out with that, so thanks lol
I swear my suggestion wasn’t entirely selfishly motivated, hehe. I just thought it was a fun bit of code to work on, which was recently on my mind because of that project.
I had another thought to improve my implementation, which was to add comptime header detection and reflection to build a type to store the resulting parsed data. I ended up doing something quicker and much simpler though, just to get it done: I went with passing a bool and making the parse fn take a type parameter of the struct to return. But I feel like you could write something pretty elegant with Zig’s comptime. I love getting a chance to work with comptime; I’m not the best at taking advantage of it in my code just yet, but I want to get better, since every time I do use it I find it super satisfying and useful.
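Something along the lines of this hypothetical sketch is what I had in mind. The `rowToStruct` helper and its field handling are made up for illustration (and it targets a recent Zig with the lowercase type-info tags), but it shows how `std.meta.fields` plus `@field` can fill a caller-supplied struct from a row of string fields:

```zig
const std = @import("std");

/// Hypothetical sketch: fill a caller-supplied struct from one row of
/// already-split string fields, in declaration order. Only integers,
/// floats, and []const u8 fields are handled here.
fn rowToStruct(comptime T: type, fields: []const []const u8) !T {
    std.debug.assert(fields.len == std.meta.fields(T).len);
    var result: T = undefined;
    inline for (std.meta.fields(T), 0..) |f, i| {
        const raw = fields[i];
        if (f.type == []const u8) {
            @field(result, f.name) = raw;
        } else switch (@typeInfo(f.type)) {
            .int => @field(result, f.name) = try std.fmt.parseInt(f.type, raw, 10),
            .float => @field(result, f.name) = try std.fmt.parseFloat(f.type, raw),
            else => @compileError("unsupported field type: " ++ @typeName(f.type)),
        }
    }
    return result;
}

test "parse a row into a user struct" {
    const Record = struct { name: []const u8, age: u32, score: f64 };
    const rec = try rowToStruct(Record, &.{ "ada", "36", "99.5" });
    try std.testing.expectEqualStrings("ada", rec.name);
    try std.testing.expectEqual(@as(u32, 36), rec.age);
    try std.testing.expectEqual(@as(f64, 99.5), rec.score);
}
```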
Hey, I just wanted to say: I read the code and used the library for the use case I was talking about, and I thought it was very well done! It’s clear you put a fair bit of effort into the design, performance, and usability of this library. I found it really easy to work with, and far more feature-complete and customizable than the TSV parser I implemented (bodged).
If you are still interested in suggestions, I feel like there’s a design that would unify the many different implementations of your data structures like Row, Field, etc. I am not sure whether that would actually be better, though. Maybe it’s something for comptime: it could provide enough metaprogramming capability to construct and return the appropriate types/structs based on the params given to a builder fn?
When I was using the library, it stood out to me that it’s organized into many different Parser() implementations, often each with its own Row, Field, RowIter, etc. Now I understand why you did it this way, and FWIW they are pretty significantly different in what they do. I noticed you have parsers that write to HashMaps, some that stream to another writer, some that deserialize the data into structs and some that don’t, and so on. I guess it took me a minute to realize which parser I needed and how they differed. There’s nothing wrong with that, but I do enjoy when code is structured in such a way that it guides you through how to use it through its design and structure. That is hard to do, obviously, and I struggle to make APIs that clean.
I suppose your design choice is similar to the many different ArrayLists and HashMaps in Zig’s stdlib. But I did feel like it took me a second to figure out what each one did, so maybe it’s just a matter of naming them better rather than refactoring the actual code. I’m not sure. And to be clear, this is not criticism at all; I actually really like your design, and I absolutely understand this is just a learning project for you.
I am actually using your code as a dependency for my project right now and it’s going great! So tbh I am really just discussing for the sake of it; I like talking about Zig code hehe. That’s all.
I think it is good to tend towards less generic code; code that uses more generic types than needed becomes less readable and tends to have worse documentation.
So I think it is good to write some things out manually instead of trying to generate everything. Just wanted to say that I see a moderate amount of duplication in a library as a good thing, because it means the author resisted the temptation to turn everything into generic types.
That said, I think how much generic code is appropriate can change based on the project and what the author has planned or wants to explore.
Yeah you’re right.
I maybe didn’t communicate what I was thinking very well. I was mostly trying to get at the idea that it was a bit confusing to be presented with a sizeable number of equally valid parsers; I suppose I was not really bothered by the code duplication but rather by the API being a bit unclear. Well, I’m not upset at all lol. I’m just chatting XD
So my thought was, maybe a builder function or something could be made to unify the API, so that you just kind of walk through the builder and get out the parser you want.
Definitely was not suggesting to try and make it hyper-generic or abstract. My bad if my rambles read like that.
Thoughts?
IMO a CSV parsing library should stick to the CSV parsing domain, extracting rows of textual columns, and leave everything else to the app or another library.
That is good feedback. Going into it, I was mostly writing the parsers in a way that made sense as stepping stones while writing the library. I did the raw parser, then the stream parser, then the slices, then the fast stream parser, since each step became more complex, and I wanted to make sure I was addressing the complex problems individually (e.g. CSV parsing, using a reader, SIMD) rather than everything at once. I haven’t thought about a way to really unify things, but I’ll give that a shot. I also might move some of the parsers that only existed for my own benefit while writing the library into a separate repository, just to clean things up.
I think there is room for a CSV parser to add semantic meaning to the things it parses through Zig’s type system. I understand your point, and I am generally on the side of less abstract and bloated code just for the sake of having a really long feature list in your git README.
That said, I feel like a parser is inherently meant to do this kind of semantic analysis, much the same as the parser/lexer of a programming language, where each pass, generally speaking, adds more meaning and structure to input that is initially just a text file with a specific syntax. The same is true of a CSV parser, and in that case I think it’s entirely reasonable for the user of such a library to provide a structure or other data type via whatever API is provided, so that the CSV parser can turn the textual format into the meaningful structure you have created for that data.

@mtolman’s implementation can also give you just a map of key/value pairs representing the CSV if that’s what you want. But since in most cases a CSV will be parsed into a struct of some kind, that is provided too, and @mtolman’s code actually does a good job of giving control over how and where allocations are made, though I think it could still be more clearly documented.

You can also look at a JSON parser as a similar example of adding semantic meaning and structure to a textual data format: you would want a JSON parser to give you the ability to parse integers, arrays, and so on, instead of just handing you the raw bytes for each field. The Zig JSON parser does exactly this, and it also gives pretty good control over that behaviour if the default isn’t what you need.
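To make that stdlib comparison concrete, this is roughly what it looks like with std.json in recent Zig versions, where the caller hands over a type and gets typed values back rather than raw bytes (shown purely as the parallel, not as this CSV library’s API):

```zig
const std = @import("std");

test "std.json parses straight into a caller-provided struct" {
    const Point = struct { x: i32, y: i32 };
    const parsed = try std.json.parseFromSlice(Point, std.testing.allocator, "{\"x\":1,\"y\":2}", .{});
    defer parsed.deinit();
    try std.testing.expectEqual(@as(i32, 1), parsed.value.x);
    try std.testing.expectEqual(@as(i32, 2), parsed.value.y);
}
```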
Yeah, I assumed that is how it happened; I write code in a very similar way. And tbh I personally think you did a great job of breaking down the different parsing strategies into smaller, simpler, single-purpose implementations of a more abstract parser interface. This approach is idiomatic Zig, with many similar examples in the stdlib. My suggestion was specifically about making it clearer what each one does, perhaps through some kind of builder interface that lets you easily create a parser based on the inputs and data in your specific use case: do you have a reader and a writer and want to stream, do you just have a reader, how is your data delimited, what structure do you want as output, etc. I suggested that, maybe in a bit of a rambling and round-about way (thanks @Sze for helping clarify my point), because I found it created a tiny bit of friction where it didn’t really need to exist, in trying to determine what exactly each parser did and which was the most relevant to the parsing I needed done in my code.
I’ve made an update (with breaking changes) to how the parsers are organized. I also went back and revisited some of the methods around the Row and Field structs and made them more unified (though Field still differs).
I moved the type conversions (asInt, asFloat, etc.) off the Field structs and put them in a separate namespace (renaming them to fieldToInt, fieldToFloat, etc.). These conversions can be used on both Field structs (with two niche exceptions).
I also added a ParserBuilder. It doesn’t create the zero-allocation stream parser, since that parser’s usage is really wonky due to the zero-allocation constraint. However, the map, column, and zero-allocation slice parsers can all be made from the builder. The builder also has methods for customizing the CSV options (like delimiters, line endings, etc.).
One thing I noticed was that each parser has different rules for when memory needs to be deinitialized (zero-allocation doesn’t deinitialize anything, column only deinitializes rows, and map deinitializes rows and fields). To help with that, the builder provides several methods to clean up parsers and rows, and those methods intelligently become no-ops when nothing is needed. That way there’s a uniform way of cleaning up memory for all three main parsers. I found this helps quite a bit: when the code is first written, all of the defer statements are added as normal. Then, when a parser switch happens (e.g. column → zero-allocating slice or vice versa), it’s generally just a matter of fixing build errors, and the memory cleanup is already handled (as opposed to fixing build errors and then remembering to add defer statements).
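The "cleanup that becomes a no-op" idea is a neat pattern in its own right. Here is a rough sketch of one way something like it can be expressed; this is the general technique, not the library’s actual code:

```zig
const std = @import("std");

// Sketch of a uniform cleanup helper: call deinit only when the value's
// type actually declares one, so swapping parser/row types doesn't force
// rewriting the defer statements.
fn deinitIfNeeded(value: anytype) void {
    const T = @TypeOf(value.*);
    if (comptime @hasDecl(T, "deinit")) {
        value.deinit();
    }
    // Otherwise this compiles away to nothing.
}

test "no-op for allocation-free types, real cleanup otherwise" {
    const NoAlloc = struct { n: usize };
    const a: NoAlloc = .{ .n = 3 };
    defer deinitIfNeeded(&a); // harmless no-op

    const Owns = struct {
        allocator: std.mem.Allocator,
        buf: []u8,
        fn deinit(self: *@This()) void {
            self.allocator.free(self.buf);
        }
    };
    var b: Owns = .{
        .allocator = std.testing.allocator,
        .buf = try std.testing.allocator.alloc(u8, 8),
    };
    defer deinitIfNeeded(&b); // actually frees
}
```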
New version is published as 0.7.0. Release notes go into more details.