Sortcp - Tool that sorts files prior to copying to lower disk fragmentation

I wanted to share this in case others find it interesting.
It’s nowhere near perfect or finished; mostly I’m struggling to find a way to stat a file in 0.15.2 so that it does not follow symlinks by default. It currently works only if you pass a directory as an argument instead of a file… :frowning:
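For anyone unfamiliar with the distinction, what's wanted here is the POSIX `lstat()` behaviour rather than `stat()`. A minimal Python sketch of the concept (the Zig std API will look different; the file names below are just for illustration):

```python
import os
import stat
import tempfile

# Set up a regular file and a symlink pointing at it.
d = tempfile.mkdtemp()
target = os.path.join(d, "target")
link = os.path.join(d, "link")
with open(target, "w") as f:
    f.write("data")
os.symlink(target, link)

# stat() follows the link and reports on the target;
# lstat() reports on the link itself.
followed = os.stat(link)       # -> regular file
not_followed = os.lstat(link)  # -> symlink

print(stat.S_ISREG(followed.st_mode))      # True
print(stat.S_ISLNK(not_followed.st_mode))  # True
```

A copy tool needs the `lstat()` view, otherwise a symlink gets silently replaced by a copy of its target.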
The 0.16 branch uses a linear approach when processing directories: it recurses first, then creates the paths, then copies symlinks and files to the destination. In theory that should be faster, but it’s slower than the 0.15.2 mixed approach, which creates paths and copies within the recursion itself.
Please write your thoughts either in the repo or here: code suggestions, logic changes, whether I use inlining too much or should use it at all, etc. I’m always learning and always feel bad about the gaps in my knowledge :joy:
My goal is to lower disk fragmentation as much as possible while keeping the speed comparable to GNU cp and rsync. Once it gets somewhere, I’m thinking of adding parallelism as well.
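The core idea — sort the file list before copying — can be sketched roughly like this (a hypothetical largest-first ordering in Python; the actual tool's ordering and its symlink handling may differ):

```python
import os
import shutil
import tempfile

def sorted_copy(src_dir, dst_dir):
    """Copy a tree, writing files in sorted order. Illustrative
    sketch only; regular files only, no symlink handling."""
    # Walk the tree once, collecting (size, path) pairs and
    # recreating the directory layout at the destination.
    entries = []
    for root, dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        os.makedirs(os.path.join(dst_dir, rel), exist_ok=True)
        for name in files:
            p = os.path.join(root, name)
            entries.append((os.lstat(p).st_size, p))
    # Write the largest files first, while free space is still in
    # big contiguous runs. (One heuristic among several; smallest-
    # first or grouping by directory are equally testable orders.)
    for _, p in sorted(entries, reverse=True):
        rel = os.path.relpath(p, src_dir)
        shutil.copy2(p, os.path.join(dst_dir, rel))

# Demo on a throwaway tree:
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
os.makedirs(os.path.join(src, "sub"))
with open(os.path.join(src, "big.bin"), "wb") as f:
    f.write(b"x" * 4096)
with open(os.path.join(src, "sub", "small.txt"), "w") as f:
    f.write("hi")
sorted_copy(src, dst)
```

Which ordering actually wins is exactly the filesystem-dependent question — largest-first is just one plausible guess.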


I managed to add hashing options, swapped to almost fully branchless execution within loops, and used a bit of threading to parallelize hashing of the source file and writing to the destination.
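The hash/write overlap can be illustrated like this (a hedged Python sketch, not the tool's actual code; CPython's hashlib releases the GIL for large buffers, so the hashing thread genuinely runs in parallel with the write):

```python
import hashlib
import os
import tempfile
import threading

def copy_with_hash(src, dst, chunk=1 << 20):
    """Copy src to dst, hashing each chunk in a worker thread
    while the main thread writes it out. Sketch only."""
    h = hashlib.sha256()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            # Hash this chunk concurrently with writing it.
            t = threading.Thread(target=h.update, args=(buf,))
            t.start()
            fout.write(buf)
            t.join()  # keep updates in order before the next chunk
    return h.hexdigest()

# Demo: round-trip a small file and keep the digest.
d = tempfile.mkdtemp()
src, dst = os.path.join(d, "src"), os.path.join(d, "dst")
data = b"some data" * 1000
with open(src, "wb") as f:
    f.write(data)
digest = copy_with_hash(src, dst)
```

The `join()` before the next read keeps the hash updates sequential, so the digest matches a plain single-threaded hash of the file.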

I need to start fixing the Windows functions; currently not even a regular copy works there, don’t ask me why :joy:
Fallocate still needs to be implemented on macOS and Windows, because setEndPos is not yet in 0.16.0-dev.
Take a look, write your thoughts :slight_smile:

I wonder what makes it “reduce disk fragmentation”? Isn’t the filesystem’s commit interval more important in that regard?

Is there a specific filesystem type this project is intended for? For example, will there be differences between NTFS, EXT4, BTRFS, etc.?

I don’t profess to understand the complexity of how each type decides how/when/where to write data; my naive assumption was simply that the benefits of sorting may depend heavily on whether the filesystem uses journaling, copy-on-write, compression, etc.

File systems are much like allocators: certain access patterns and sizes can lead to significantly different behavior. Some file systems use very different strategies for where small and large files are placed on the actual storage medium.

Extremely file system dependent.


From the testing I’ve done so far, pre-allocating disk space in advance and calling sync() gave me roughly 40% less fragmentation, but sync() slows down the overall process significantly. fallocate() showed improvements by itself, lowering fragmentation without slowing the process down as much.
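For reference, the pre-allocate-then-flush pattern from that test looks roughly like this (a Python sketch assuming Linux, where `os.posix_fallocate` is available; it reserves extents rather than Linux's richer `fallocate(2)` modes, and the function name is mine):

```python
import os
import tempfile

def preallocate_copy(src, dst, chunk=1 << 20):
    """Copy src to dst, reserving the full destination size up
    front and flushing at the end. Sketch of the experiment, not
    the tool's code; posix_fallocate is absent on macOS/Windows."""
    size = os.stat(src).st_size
    fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        if size and hasattr(os, "posix_fallocate"):
            # Reserve the extents before any data is written, so
            # the allocator can pick one contiguous run.
            os.posix_fallocate(fd, 0, size)
        with open(src, "rb") as fin:
            while True:
                buf = fin.read(chunk)
                if not buf:
                    break
                os.write(fd, buf)
        os.fsync(fd)  # the flush that cost throughput in the test
    finally:
        os.close(fd)

# Demo round trip:
d = tempfile.mkdtemp()
src, dst = os.path.join(d, "a"), os.path.join(d, "b")
with open(src, "wb") as f:
    f.write(b"z" * 10000)
preallocate_copy(src, dst)
```

Note that `fsync(fd)` only flushes this one file, which is already much cheaper than a whole-filesystem `sync()`.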


I do intend to test on different filesystems. I just finished something that copies the data in the patterns shown in the README.md file. It’s a first step at least. Once that’s stable, which I think it is, I’ll focus on testing for fragmentation.
From investigating and brief testing on EXT4, write order and pre-allocation of space seem to matter the most.
I intend to test it on XFS, BTRFS, bcachefs, BeeGFS, and others to see which improvements are worth doing and which ones can be made.


I’m surprised to hear that userland file operations matter that much for fragmentation. As @knightpp mentioned, the commit interval lets the filesystem itself defragment transactions to disk (though the larger the interval, the greater the risk of data loss).
