\n vs \r in readUntilDelimiterOrEofAlloc

ssilnicki-dev · November 2, 2024, 9:49pm

Hi all! I’m looking for a workaround for a Windows compatibility problem with stdin’s readUntilDelimiterOrEofAlloc method. The delimiter char (\n for *nix, \r for win) makes the codebase platform dependent.

dimdin · November 2, 2024, 10:31pm

The line separator on Unix is \n and on Windows is the sequence \r\n. It is common to find files on any system with both separators; handling both of them on all systems as separators is beneficial for any program.

The easiest solution I can think of is to reduce the slice size by one if the last character is a \r.

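    // `.?` asserts that the read did not hit EOF; the parentheses make `try` apply first.
    var slice: []u8 = (try readUntilDelimiterOrEofAlloc(...)).?;
    // Guard against an empty line before looking at the last byte.
    if (slice.len > 0 and slice[slice.len - 1] == '\r') {
        slice = slice[0 .. slice.len - 1];
    }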

ssilnicki-dev · November 2, 2024, 11:20pm

We’ve converged on the same solution. Yeah, it is a sad story that this kludge keeps traveling across all codebases…

gonzo · November 3, 2024, 8:46am

And on some other systems (was it on old Macs?), the line separator would be \r. So yeah, better handle all possible combinations!

(Shakes fist at Microsoft for using \ instead of / as path separator.)

jmc · November 3, 2024, 3:56pm

Text is hard. Some runtimes offer automatic conversion of byte sequences into the expected line terminators for the platform, but they can also lead to weird results.

I wrote three identical-ish tiny programs (in Zig, in C and in Python) that will read the first line from a text file and dump it to stdout. The C and Python ones also allow the user to override the read mode string passed to fopen() / open(), since they allow distinguishing between textual mode (t modifier, which is typically on by default) and binary mode (b modifier, for when you don’t want the runtime to mess with your bytes).

Here are their outputs piped through od -x -c so you can see the hex dump / C-style escapes. Each program is run on a text file with CRLF endings and on one with LF endings.

bin/main_zig data/test_crlf.txt              | od -x -c
0000000      6966    7372    2074    696c    656e    000d                
           f   i   r   s   t       l   i   n   e  \r                    
0000013
bin/main_c data/test_crlf.txt rt             | od -x -c
0000000      6966    7372    2074    696c    656e    0a0d                
           f   i   r   s   t       l   i   n   e  \r  \n                
0000014
bin/main_c data/test_crlf.txt rb             | od -x -c
0000000      6966    7372    2074    696c    656e    0a0d                
           f   i   r   s   t       l   i   n   e  \r  \n                
0000014
python src/main.py data/test_crlf.txt rt  | od -x -c
0000000      6966    7372    2074    696c    656e    000a                
           f   i   r   s   t       l   i   n   e  \n                    
0000013
python src/main.py data/test_crlf.txt rb  | od -x -c
0000000      6966    7372    2074    696c    656e    0a0d                
           f   i   r   s   t       l   i   n   e  \r  \n                
0000014
bin/main_zig data/test_lf.txt              | od -x -c
0000000      6966    7372    2074    696c    656e                        
           f   i   r   s   t       l   i   n   e                        
0000012
bin/main_c data/test_lf.txt rt             | od -x -c
0000000      6966    7372    2074    696c    656e    000a                
           f   i   r   s   t       l   i   n   e  \n                    
0000013
bin/main_c data/test_lf.txt rb             | od -x -c
0000000      6966    7372    2074    696c    656e    000a                
           f   i   r   s   t       l   i   n   e  \n                    
0000013
python src/main.py data/test_lf.txt rt  | od -x -c
0000000      6966    7372    2074    696c    656e    000a                
           f   i   r   s   t       l   i   n   e  \n                    
0000013
python src/main.py data/test_lf.txt rb  | od -x -c
0000000      6966    7372    2074    696c    656e    000a                
           f   i   r   s   t       l   i   n   e  \n                    
0000013
wine bin/main_zig.exe data/test_crlf.txt  | od -x -c
0000000      6966    7372    2074    696c    656e    000d                
           f   i   r   s   t       l   i   n   e  \r                    
0000013
wine bin/main_c.exe data/test_crlf.txt rt | od -x -c
0000000      6966    7372    2074    696c    656e    0a0d                
           f   i   r   s   t       l   i   n   e  \r  \n                
0000014
wine bin/main_c.exe data/test_crlf.txt rb | od -x -c
0000000      6966    7372    2074    696c    656e    0d0d    000a        
           f   i   r   s   t       l   i   n   e  \r  \r  \n            
0000015
wine bin/main_zig.exe data/test_lf.txt  | od -x -c
0000000      6966    7372    2074    696c    656e                        
           f   i   r   s   t       l   i   n   e                        
0000012
wine bin/main_c.exe data/test_lf.txt rt | od -x -c
0000000      6966    7372    2074    696c    656e    0a0d                
           f   i   r   s   t       l   i   n   e  \r  \n                
0000014
wine bin/main_c.exe data/test_lf.txt rb | od -x -c
0000000      6966    7372    2074    696c    656e    0a0d                
           f   i   r   s   t       l   i   n   e  \r  \n                
0000014

Some notable things:

Zig’s readUntilDelimiter*() functions return content up to the delimiter, but without including the delimiter itself
POSIX getline() returns up to and including the delimiter
Python actually does conversion of CRLF into LF (“universal newlines”) in text mode
I have no idea what’s up with that \r\r\n sequence in the test_crlf.txt/rb case for the Win32 build of the C program.

I can provide the sources if people want to play around with that (and potentially fix any mistakes).

david_vanderson · November 3, 2024, 4:51pm

I just recently ran into an issue where copying text with only \n into the Windows clipboard (via SDL) gives \r\n when pasting it back out. So now I’m discarding \r characters, and I’m sure at some point in the future it’s going to come back to haunt me somehow.

jmc · November 3, 2024, 4:58pm

I guess I forgot to add my takeaway from this experiment: if you only care about text files and you don’t care about the line terminator(s), then read the line, and trim any \r and \n characters from the end of the string. This will work on all platforms / runtimes.
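
A minimal Zig sketch of that trim, written as a plain loop so it does not depend on any particular std version. The helper name trimLineEnding is hypothetical and does not come from the thread or from the standard library.

```zig
const std = @import("std");

/// Hypothetical helper (not part of std): returns `line` without any
/// trailing '\r' or '\n' bytes. The result is a sub-slice of the input,
/// so nothing is allocated or copied.
fn trimLineEnding(line: []const u8) []const u8 {
    var end = line.len;
    // Walk backwards while the last remaining byte is a line terminator.
    while (end > 0 and (line[end - 1] == '\r' or line[end - 1] == '\n')) {
        end -= 1;
    }
    return line[0..end];
}

/// Hypothetical example of the use case: parse an integer from a line that
/// may still carry a terminator left behind by the reader.
fn parseLine(line: []const u8) !i32 {
    return std.fmt.parseInt(i32, trimLineEnding(line), 10);
}
```

If the line was allocated by the reader, keep the original slice around for freeing and use the trimmed view only for parsing.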

ssilnicki-dev · November 6, 2024, 9:36pm

What actually triggered this topic was some rather unexpected behavior when I tried to run my code in a Windows context. The code itself just reads the user’s input from stdin and passes it to an int parser… And yes, Zig does not return the delimiter, but we can only specify a single-char delimiter (‘\n’ in my case). So when it receives ‘\r’, ‘\n’ in the input, it “eats” the ‘\n’ and innocently returns the expected input + ‘\r’, which the parser rejects with an error… I think it is simply impossible (and maybe overkill) to implement workarounds in the language itself, but it is really annoying to handle all this legacy stuff. Maybe it is worth migrating the single-char delimiter argument to something like [][]u8, which would describe all the options, be it single char, two char (three char!!!), etc. So, imagine passing all the potential options we expect to handle as an array of slices… In that case, it would be handy to pass something like this, covering all those weirdos’ past ideas about which method of delimiting is kosher and which is not:

    const options = [_][]const u8{ &.{ '\r', '\n' }, &.{'\r'}, &.{ '\n', '\r' }, &.{'\n'} };

and then send it to the reader… dreams, dreams…
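
A rough sketch of how such a multi-delimiter match could look over an in-memory buffer. This is not an existing std API: Terminator and findTerminator are hypothetical names, and the sketch assumes that when several options match at the same position the longest one should win, so that \r\n is consumed as one terminator rather than as \r followed by an empty line.

```zig
const std = @import("std");

/// Hypothetical result type: where the terminator starts and how many bytes it spans.
const Terminator = struct { start: usize, len: usize };

/// Hypothetical helper: scans `buf` for the earliest position at which any
/// entry of `options` matches, preferring the longest match at that position.
/// Returns null when no option occurs in `buf`.
fn findTerminator(buf: []const u8, options: []const []const u8) ?Terminator {
    var i: usize = 0;
    while (i < buf.len) : (i += 1) {
        var best: usize = 0;
        for (options) |opt| {
            // Only consider options longer than the best match found so far.
            if (opt.len > best and std.mem.startsWith(u8, buf[i..], opt)) {
                best = opt.len;
            }
        }
        if (best > 0) return Terminator{ .start = i, .len = best };
    }
    return null;
}
```

With the options written as string literals, for example `[_][]const u8{ "\r\n", "\r", "\n\r", "\n" }`, the line content is `buf[0..t.start]` and the next line begins at `t.start + t.len`.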