How to read an entire text file line by line into memory efficiently?

andrewrk · October 21, 2024, 8:05pm

Above some limit, it doesn’t make sense to attempt to load the entire file into memory, and the application should be designed differently. You should only load the entire file into memory if you know an upper bound on how big the file will be. Otherwise, a database would be more appropriate.
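As a rough illustration of the "known upper bound" idea, here is a minimal sketch in Python (not Zig; the thread itself has no code). The function name `read_lines_bounded` and the 64 MiB limit are assumptions made up for this example, not anything from the standard library or from the posts above.

```python
MAX_BYTES = 64 * 1024 * 1024  # hypothetical upper bound; pick one that fits the application


def read_lines_bounded(path, max_bytes=MAX_BYTES):
    """Return all lines of a text file, refusing files larger than max_bytes."""
    with open(path, "rb") as f:
        # Read one byte past the limit so an oversized file is detected
        # without ever holding more than max_bytes + 1 bytes in memory.
        data = f.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError(f"{path} is larger than {max_bytes} bytes")
    # str.splitlines() splits on "\n", "\r\n" and a few other Unicode
    # line separators, and drops the line endings.
    return data.decode("utf-8").splitlines()
```

The caller decides what to do when the limit is exceeded; the point is only that the limit is checked before the program commits to holding everything in memory.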

Also, many systems will crash the application or, worse, other applications rather than return an out-of-memory error.

andrewrk · October 21, 2024, 8:09pm

mmap is not a magic “make it go faster” button.

mmap should be avoided unless you have a good reason for it because it turns handleable errors into signals that are very difficult to handle correctly - especially if you are writing library code that has no business installing global signal handlers.

It’s also less portable than file system APIs.

JPL · October 22, 2024, 6:12am

4 bytes per character (UTF-8 worst case) * 120 characters per line = 480 bytes per line; * 1_000_000 lines (my test is OK at that size) = 480,000,000 bytes, or about 0.48 gigabytes (GB).
It’s a very big book perhaps, I have a headache.

Don’t take it the wrong way
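The worst-case estimate in this post (4 bytes per character, 120 characters per line, 1,000,000 lines) can be checked with a few lines of Python. The constant names are invented for this sketch, and newline bytes are ignored, as in the post.

```python
BYTES_PER_CHAR = 4      # worst case for one UTF-8 encoded character
CHARS_PER_LINE = 120
LINES = 1_000_000

bytes_per_line = BYTES_PER_CHAR * CHARS_PER_LINE  # 480 bytes
total_bytes = bytes_per_line * LINES              # 480_000_000 bytes

# Uses decimal gigabytes (1 GB = 10**9 bytes).
print(f"{total_bytes:,} bytes = {total_bytes / 1e9:.2f} GB")
# prints "480,000,000 bytes = 0.48 GB"
```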

pierrelgol · October 22, 2024, 6:45am

I agree with you. In fact, even the example I provided showed that the performance benefit was negligible at best. Beyond speed, as you mentioned, mmap is also very error prone: you have to know about msync and that updates to disk are asynchronous, which can lead to data loss if you crash; another process can truncate your file while you have it mapped; and there are potential latency issues, plus the kernel’s preemptive paging behaviour, which can cause problems. So it’s definitely not the simplest approach, and as I said in my first response, the standard library is more than good enough. But I think it’s still good to share the knowledge around mmap.
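To make the msync point concrete, here is a small sketch using Python's `mmap` module, whose `flush()` method asks the OS to write the mapped pages back to the file (msync on Unix). The helper name `uppercase_first_byte` is made up for the example, and it assumes a non-empty file; it does not show the truncation or signal problems, which are much harder to demonstrate safely.

```python
import mmap


def uppercase_first_byte(path):
    """Modify a file through a memory mapping and flush the change to disk."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0) as m:
            # This write only changes the mapped memory for now.
            m[0:1] = m[0:1].upper()
            # Without an explicit flush, the OS writes the dirty pages
            # back to the file at some later time of its own choosing.
            m.flush()
```

With ordinary file writes, a failure shows up as an error from the write call; with a mapping, the write is just a memory store, so the explicit flush is the only place a write-back failure can be reported.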

MarvellouslyExist · October 22, 2024, 7:58am

I agree. Considering Zig is still in development, such examples of how to do things would be very valuable.

MarvellouslyExist · October 22, 2024, 8:40am

I have added a limit on how large the file can be. Any other suggestions?

kavika13 · October 22, 2024, 10:46am

Memory-mapped files apparently don’t really run any faster on Windows, because Windows doesn’t actually map them into memory until they are accessed, and then it brings in one page at a time, via page faults.
But they might run faster on Linux.
So, the answer of whether they’re even worth the time or not is OS dependent.

GigaGrunch · October 22, 2024, 11:59am

Depending on what you want to do with the result, the fastest way to read a file can be not to read everything at once, but to read into a smaller buffer (like 1 MB) and process the file in those chunks continuously. This way you greatly reduce the number of page faults (because you only get page faults until that 1 MB is mapped) and might also benefit more from the processor’s cache. It’s obviously more complicated to do things this way, but if you really want to optimize, you should look into it.
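A minimal sketch of that chunked approach, again in Python rather than Zig: one fixed buffer is reused for every read, and each chunk is processed before the next one is loaded. Counting lines stands in for "processing" here; the name `count_lines_chunked` and the default buffer size are assumptions for the example.

```python
CHUNK_SIZE = 1024 * 1024  # roughly the 1 MB buffer suggested above


def count_lines_chunked(path, chunk_size=CHUNK_SIZE):
    """Count lines by streaming the file through one reusable buffer."""
    buf = bytearray(chunk_size)
    lines = 0
    last_byte = ord("\n")  # so an empty file counts as zero lines
    # buffering=0 gives a raw file object, so readinto() fills buf directly.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)
            if not n:  # 0 bytes read means end of file
                break
            # Only the first n bytes are valid; the rest is stale data
            # left over from an earlier, larger read.
            lines += buf.count(b"\n", 0, n)
            last_byte = buf[n - 1]
    # A final line without a trailing newline still counts as a line.
    if last_byte != ord("\n"):
        lines += 1
    return lines
```

Work that needs whole lines rather than a count has to carry the partial line at the end of each chunk over to the next one, which is where most of the extra complexity comes from.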