How to get a hash of a file?

What I was trying was something like this:

            var reader = src_file.readerStreaming(ctx.io, ctx.buf);
            var writer_wrap = dest_file.writer(ctx.io, ctx.buf);
            const writer = &writer_wrap.interface;

            var src_hash: [Blake3.digest_length]u8 = undefined;
            const verify = true;
            if (verify) {
                var hasher = Blake3.init(.{});
                while (true) {
                    const amt = try reader.interface.stream(writer, .limited(ctx.buf.len));
                    if (amt == 0) break;
                    hasher.update(ctx.buf[0..amt]);
                }
                hasher.final(&src_hash);

                // Is this needed? Resetting reader state
                reader.interface.seek = 0;
                reader.interface.end = 0;
            }
            std.log.info("Hash: {x}", .{src_hash});

But the hash is the same no matter which file I pass, so something is not correct here, and I don’t quite understand why. Maybe the issue is that I’m passing a writer for a destination file? In that case, is there a generic writer that behaves like /dev/null?

All in all, my intent is to use hashes to check whether the source and destination files are the same. Later I’m using a while loop with sendFileAll(), and I’d check the hash of the source:

            while (try writer.sendFileAll(&reader, .limited(@intCast(ctx.max_chunk))) > 0) {}
            try writer.flush();
            try dest_file.setPermissions(ctx.io, stat.permissions);

            if (acls) try copyMetadata(
                ctx.acl_list,
                ctx.acl_value,
                src_file.handle,
                dest_file.handle,
            );
        }

Also, I’m not sure if I need to flush after sendFileAll() as well.

std.Io.Reader.VTable:

stream:
Returns the number of bytes written, which will be at minimum 0 and at most limit. The number returned, including zero, does not indicate end of stream.

You break when you receive 0, but it doesn’t mean the stream has ended.
For example, the stream could return 0 in order to internally reconfigure itself to better satisfy the next call to stream, or because it doesn’t have data available yet and is expecting to receive more.
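
A minimal sketch of the pattern, assuming the reader/writer setup from your original post: treat error.EndOfStream, not a 0 return, as the termination condition.

// Sketch only: stop on error.EndOfStream rather than on a 0 return,
// since 0 is a legal return value mid-stream.
while (true) {
    _ = reader.interface.stream(writer, .limited(ctx.buf.len)) catch |err| switch (err) {
        error.EndOfStream => break,
        else => |e| return e,
    };
}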

2 Likes

Thanks. I wrote this now just for testing:

            var reader = src_file.readerStreaming(ctx.io, ctx.buf);
            var writer_wrap = dest_file.writer(ctx.io, ctx.buf);
            const writer = &writer_wrap.interface;

            var src_hash: [Blake3.digest_length]u8 = undefined;
            const verify = true;
            if (verify) {
                var offset: usize = 0;
                var hasher = Blake3.init(.{});
                while (true) {
                    const new_count = reader.interface.stream(writer, .limited(ctx.buf.len)) catch |err| switch (err) {
                        error.EndOfStream => 0,
                        else => |e| return e,
                    };
                    offset += new_count;
                    if (offset == stat.size) break;
                    hasher.update(ctx.buf[0..new_count]);
                }

                hasher.final(&src_hash);

                // Is this needed? Resetting reader state
                reader.interface.seek = 0;
                reader.interface.end = 0;
            }
            std.log.info("Hash: {x}", .{src_hash});

but still the hash is the same, which should not be the case:

j.markovic@MacBook-Pro-2 ~/GIT/sortcp (git)-[0.16-mixed] % zig build run -- README.md testhash -v
debug: field_name: sym_link
items.len: 0
debug: field_name: file
items.len: 1
debug: Copying README.md to testhash
info: Hash: 8d72b7b2b0db788318dd9f1268eb5c3202a09c83223f7df4fd8d5475b962e539
j.markovic@MacBook-Pro-2 ~/GIT/sortcp (git)-[0.16-mixed] % rm testhash
j.markovic@MacBook-Pro-2 ~/GIT/sortcp (git)-[0.16-mixed] % zig build run -- LICENSE testhash -v 
debug: field_name: sym_link
items.len: 0
debug: field_name: file
items.len: 1
debug: Copying LICENSE to testhash
info: Hash: 8d72b7b2b0db788318dd9f1268eb5c3202a09c83223f7df4fd8d5475b962e539

EDIT: Fixed the slicing and the while-loop count. The hash was different afterwards, but still the same for two different files.

I’m not exactly an expert on std.crypto.Blake3, but my guess is that the problem is that you’re always initialising it with the same key for every single file.
It doesn’t seem to self-modify its key no matter what methods you call, and it does seem to affect its output.

I tried Sha256 and got the same issue: a different hash, of course, but the same one for different files. The readers clearly use different data, and the files are even of different lengths.

EDIT: The docs for Blake3’s .init() say that the key is optional.

Another thing that sticks out is that you are sharing the same buffer between the reader and the writer; I think that would cause trouble.
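
Something like this (a rough sketch based on your snippet; the buffer sizes are placeholders) would keep them separate:

// Give the reader and writer their own buffers instead of sharing ctx.buf.
var read_buf: [4096]u8 = undefined;
var write_buf: [4096]u8 = undefined;
var reader = src_file.readerStreaming(ctx.io, &read_buf);
var writer_wrap = dest_file.writer(ctx.io, &write_buf);
const writer = &writer_wrap.interface;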

1 Like

I tried various combinations: &.{} for the reader, ctx.buf for the reader/writer, and an internal_buf: [proper_len]u8 = undefined; All the same.
But you are right. At first I was using &.{} for the reader, which seemed correct since only a writer needs to have an actual buffer; then I tried having the same buffer in both places.

Your slicing of ctx.buf[0..amt] is most likely incorrect, as the reader is not guaranteed to have written to the start of the buffer. Use std.io.Reader.buffered() instead to get the correct slice:

pub fn buffered(r: *Reader) []u8 {
    return r.buffer[r.seek..r.end];
}

But I would hazard a guess: maybe your reader buffer is generously large and you’re testing against a file that’s smaller than that, so your reads immediately hit EndOfStream and your hash-update loop effectively doesn’t run?

For example, see how std.io.Reader.takeDelimiter() handles EndOfStream: by still handling whatever had been buffered up to that point.

pub fn takeDelimiter(r: *Reader, delimiter: u8) error{ ReadFailed, StreamTooLong }!?[]u8 {
    const inclusive = r.peekDelimiterInclusive(delimiter) catch |err| switch (err) {
        error.EndOfStream => {
            const remaining = r.buffer[r.seek..r.end];
            if (remaining.len == 0) return null;
            r.toss(remaining.len);
            return remaining;
        },
        else => |e| return e,
    };
    r.toss(inclusive.len);
    return inclusive[0 .. inclusive.len - 1];
}

For a more complete test case, I modified the code from your original post for local testing:

const std = @import("std");
const Blake3 = std.crypto.hash.Blake3;

fn hash(instream: *std.io.Reader) !void {
    var hasher: Blake3 = .init(.{});
    while (true) {
        const data = instream.take(instream.buffer.len) catch |err| switch (err) {
            error.EndOfStream => {
                const buffered = instream.buffered();
                if (buffered.len != 0) hasher.update(buffered);
                break;
            },
            else => return err,
        };
        hasher.update(data);
    }

    var src_hash: [Blake3.digest_length]u8 = undefined;
    hasher.final(&src_hash);

    std.log.info("Hash: {x}", .{src_hash});
}

pub fn main() !u8 {
    var gpa: std.heap.DebugAllocator(.{}) = .init;
    defer if (gpa.deinit() != .ok) std.log.err("memory leak detected!", .{});
    const allocator = gpa.allocator();

    const args = try std.process.argsAlloc(allocator);
    defer std.process.argsFree(allocator, args);

    if (args.len - 1 != 1) { // ignore argv0
        std.log.err("invalid number of args: {d} (wanted: 1)\n", .{args.len - 1});
        return 1;
    }

    var src = try std.fs.cwd().openFile(args[1], .{});
    defer src.close();

    var buf: [4096]u8 = undefined;
    var reader = src.reader(&buf);

    try hash(&reader.interface);

    return 0;
}

And it does seem to yield different hashes for different files, as expected:

$ zig run blake3.zig -- blake3.zig
info: Hash: cdbd325bd4464cc06eb2dab81b93571f5f0cab970b5b94d4801019afbfc68911
$ zig run blake3.zig -- ../6488/build.zig 
info: Hash: 9d6a4797c43706970e820d679c9cd18f91afdd13f50dd32c7238f6da6a12bbf4
$ zig run blake3.zig -- ../6488/zigc.zig 
info: Hash: f453903335f660a6a8da024faeeabdf739c4a322028c7d37be8a00ee183d63e2
3 Likes

…and for good measure, since I don’t have tools installed that can produce a Blake3 hash, I’ve updated the code to use SHA256 instead to verify it does work correctly:

$ zig run blake3.zig -- blake3.zig 
info: Hash: 9f570c4aaf669f2bdccf92f6baa2e317353147501f6d549428434916065954cd
$ sha256 blake3.zig 
SHA256 (blake3.zig) = 9f570c4aaf669f2bdccf92f6baa2e317353147501f6d549428434916065954cd
$ zig run blake3.zig -- ../6349/server.zig
info: Hash: d76adb998b81f70ca2a44d055fe70c0df72b6faf9143a5a9c1d18a79a024b428
$ sha256 ../6349/server.zig
SHA256 (../6349/server.zig) = d76adb998b81f70ca2a44d055fe70c0df72b6faf9143a5a9c1d18a79a024b428

If your goal is simply to get the hash of a file, you don’t need to use std.Io.Reader/Writer at all. You can just mmap the file and compute the hash in a single hash() call. This is tested with 0.16.0-dev.2349+204fa8959:

const std = @import("std");
const Io = std.Io;

pub fn main(init: std.process.Init) !void {
    const io = init.io;
    const args = try init.minimal.args.toSlice(init.arena.allocator());
    const dir = std.Io.Dir.cwd();
    const path = args[1];

    const file = try dir.openFile(io, path, .{ .allow_directory = false, .mode = .read_write });
    defer file.close(io);
    var file_map = try file.createMemoryMap(io, .{ .len = try file.length(io) });
    defer file_map.destroy(io);

    // Ensure the contents of the map are populated
    try file_map.read(io);

    var src_hash: [std.crypto.hash.Blake3.digest_length]u8 = undefined;
    std.crypto.hash.Blake3.hash(file_map.memory, &src_hash, .{});

    std.debug.print("hash: {x}\n", .{&src_hash});
}

Output:

$ zig build run -- ./src/main.zig 
hash: bd0a0de9738517f4c062ad0f6d781bcb3e3a7d1285581ab2bc6c7f539f1e0438
$ zig build run -- ./build.zig
hash: 50c65a7c4efe3d9ad0a478bc106db3a42dfd92b1ce627b88101ec05542c6ba0b

It’s probably possible to open the file as read only too, but I couldn’t get it to work. If I tried:

const file = try dir.openFile(io, path, .{ .allow_directory = false }); // Default is read only
defer file.close(io);
var file_map = try file.createMemoryMap(io, .{ .len = try file.length(io) });
defer file_map.destroy(io);

I got:

error: AccessDenied
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/Io/Threaded.zig:16548:27: 0x1088635 in createFileMap (std.zig)
                .ACCES => return error.AccessDenied,
                          ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/Io/Threaded.zig:16389:73: 0x106c3aa in fileMemoryMapCreate (std.zig)
            error.Unseekable, error.Canceled, error.AccessDenied => |e| return e,
                                                                        ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/Io/File/MemoryMap.zig:67:5: 0x11d4942 in create (std.zig)
    return io.vtable.fileMemoryMapCreate(io.userdata, file, options);
    ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/Io/File.zig:819:5: 0x11c52df in createMemoryMap (std.zig)
    return .create(io, file, options);
    ^
/tmp/hash-test/src/main.zig:12:20: 0x11c215b in main (main.zig)
    var file_map = try file.createMemoryMap(io, .{ .len = try file.length(io) });
                   ^
run
└─ run exe hash_test failure
error: process exited with error code 1
failed command: /tmp/hash-test/zig-out/bin/hash_test ./build.zig

Build Summary: 3/5 steps succeeded (1 failed)
run transitive failure
└─ run exe hash_test failure

error: the following build command failed with exit code 1:
.zig-cache/o/8e4d2a7c1391bbc16b86b37a4ffe20f1/build /home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/zig /home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib /tmp/hash-test .zig-cache /home/robby/.cache/zig --seed 0x9757c8af -Z4ce972384ce331ee run -- ./build.zig

Because it’s trying to create the memory map with read / write access, but if I try:

const file = try dir.openFile(io, path, .{ .allow_directory = false });
defer file.close(io);
var file_map = try file.createMemoryMap(io, .{ .len = try file.length(io), .protection = .{ .write = false } });
defer file_map.destroy(io);

I get:

Segmentation fault at address 0x7f4c3e568000
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/compiler_rt/memcpy.zig:170:17: 0x12599b3 in copyFixedLength (compiler_rt)
        d[i] = s[i];
                ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/crypto/blake3.zig:870:57: 0x11eb57e in fillBuf (std.zig)
        @memcpy(self.buf[self.buf_len..][0..take], input[0..take]);
                                                        ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/crypto/blake3.zig:890:38: 0x11e0189 in update (std.zig)
            const take = self.fillBuf(inp);
                                     ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/crypto/blake3.zig:1154:30: 0x11d576e in update (std.zig)
            self.chunk.update(inp);
                             ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/crypto/blake3.zig:989:17: 0x11c5562 in hash (std.zig)
        d.update(b);
                ^
/tmp/hash-test/src/main.zig:19:32: 0x11c22f3 in main (main.zig)
    std.crypto.hash.Blake3.hash(file_map.memory, &src_hash, .{});
                               ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/start.zig:718:30: 0x11c2ea0 in callMain (std.zig)
    return wrapMain(root.main(.{
                             ^
/home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib/std/start.zig:190:5: 0x11c1cf1 in _start (std.zig)
    asm volatile (switch (native_arch) {
    ^
run
└─ run exe hash_test failure
error: process terminated with signal ABRT
failed command: /tmp/hash-test/zig-out/bin/hash_test ./src/main.zig

Build Summary: 3/5 steps succeeded (1 failed)
run transitive failure
└─ run exe hash_test failure

error: the following build command failed with exit code 1:
.zig-cache/o/8e4d2a7c1391bbc16b86b37a4ffe20f1/build /home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/zig /home/robby/downloads/zig-x86_64-linux-0.16.0-dev.2349+204fa8959/lib /tmp/hash-test .zig-cache /home/robby/.cache/zig --seed 0x815dc2f5 -Zd28f190fd79a4981 run -- ./src/main.zig

which… seems like it might be a bug? It seems like using @memcpy to copy out of a read only memory region is triggering a segfault, and I don’t think that should happen. Either way, if you’re fine opening with .read_write access then the above code should work fine.

1 Like

Be aware that mmap() typically works for regular files and other resources you can do random access on. For a stream, that’s not going to work (you might get ENODEV or EACCES).

@jmc I wanted to experiment with buffered() as well to see if that could be the case. What I like about your solution is that it uses the returned slices directly to update the hash.

@Zambyte Your solution looks more efficient for my case, since I generally don’t plan to process streams, just actual files on disk. This worked on macOS with the .read_only flag set on src_file:

/// Uses mmap() to calculate the hash of a file, writing it into `buf`
fn hashMmap(io: Io, buf: []u8, file: Io.File, file_size: u64) !void {
    var file_map = try file.createMemoryMap(io, .{
        .len = file_size,
        .protection = .{ .read = true, .write = false },
    });
    // Register cleanup before the fallible read() so the map isn't leaked on error
    defer file_map.destroy(io);

    // Ensure the contents of the map are populated
    try file_map.read(io);

    Blake3.hash(file_map.memory, buf, .{});
}
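
A call site looks roughly like this (digest and stat come from the copy loop shown earlier):

var digest: [Blake3.digest_length]u8 = undefined;
try hashMmap(ctx.io, &digest, src_file, stat.size);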

The only thing I had to change was setting .read = true for the dest_file, since I was using createFile().
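
i.e. something like this (dest_path is a placeholder for my actual destination path):

const dest_file = try dir.createFile(io, dest_path, .{ .read = true });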

Actually, after reading about mmap(), it seems that the streaming function is more efficient and safer. Anyway, both are working solutions. Thank you for both answers.

The other answers cover what was wrong with your original code, but since nobody mentioned std.Io.Writer.Hashed, I wanted to bring it up since it’s perfect for your use-case. Here’s a simple example of how you might use it:

const std = @import("std");

pub fn main(init: std.process.Init) !void {
    const args = try init.minimal.args.toSlice(init.arena.allocator());
    if (args.len != 3) {
        std.log.err("usage: hash_copy INPUT OUTPUT", .{});
        std.process.exit(1);
    }

    const input = try std.Io.Dir.cwd().openFile(init.io, args[1], .{});
    defer input.close(init.io);
    const output = try std.Io.Dir.cwd().createFile(init.io, args[2], .{});
    defer output.close(init.io);
    var reader = input.reader(init.io, &.{});
    var writer = output.writer(init.io, &.{});

    var buf: [1024]u8 = undefined;
    var hasher: std.crypto.hash.Blake3 = .init(.{});
    var hashed_writer = std.Io.Writer.hashed(&writer.interface, &hasher, &buf);
    _ = try reader.interface.streamRemaining(&hashed_writer.writer);
    try hashed_writer.writer.flush();
    try writer.interface.flush();

    var hash: [std.crypto.hash.Blake3.digest_length]u8 = undefined;
    hasher.final(&hash);
    std.log.info("hash: {x}", .{hash});
}

Also,

If you need a writer end that doesn’t actually write to anything, use std.Io.Writer.Discarding, or, for the purpose of hashing, std.Io.Writer.Hashing, which hashes anything written to it but discards the written data (not to be confused with Hashed above, which passes through to another writer as well).
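
For example, a Discarding writer can stand in for the /dev/null destination asked about earlier; a minimal sketch, assuming a reader set up like the one in the program above:

// Everything streamed into this writer is thrown away.
var discard_buf: [1024]u8 = undefined;
var discarding: std.Io.Writer.Discarding = .init(&discard_buf);
_ = try reader.interface.streamRemaining(&discarding.writer);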

9 Likes

I’d caution against jumping to mmap immediately: it’ll break on 32-bit systems with files larger than 4 GiB. While not common on desktop these days, they’re still reasonably common on embedded Linux systems. I still have to deal with 32-bit embedded Linux at work, and will have to until we end support for the devices (~5-10 years from now).

It also complicates error handling, as you now have to deal with signals when a read error occurs rather than getting a return value. That’s okay if your response to an IO error is to abort immediately without providing additional information to the user, but if not, think twice before using it, or about how to design a robust error-reporting system when using mmap.

If memory bandwidth isn’t your bottleneck, then read(2) will be more or less as performant for a sufficiently large block size.

I did see a hashed writer, but I missed that it discards the data. I’m guessing it works like this: the hasher reads the data, computes a hash, then discards the data, and then the writer flushes it to a file.
So as I understand it, it generates a hash from the read output? Then if I want to confirm that the copy of the file has the same hash, I’d have to read that file once again. Correct me if I’m wrong on that last one.

Basically, std.Io.Writer.Hashed wraps around another writer, and its implementation of drain (which is the main function writers use to write data) writes data to the underlying writer and hashes it. So the data ends up in the underlying writer and in the hash. The other one, std.Io.Writer.Hashing, has an implementation of drain with no underlying writer, just a hash, so the written data is discarded.
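
In code terms, a rough sketch of the two constructors (assuming a writer, hasher, and buf like those in the example above; each would want its own buffer in real use):

// Hashes the data and also forwards it to the underlying writer:
var pass_through = std.Io.Writer.hashed(&writer.interface, &hasher, &buf);
// Hashes the data and discards it; the hasher lives inside the struct:
var discard_only: std.Io.Writer.Hashing(std.crypto.hash.Blake3) = .initHasher(.init(.{}), &buf);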

Yes, if you want to confirm that the copy of the file has the same hash, you would need to read back the copy and calculate the hash from it. The hash that gets printed in my sample program above is calculated from the copy process. Of course, barring a filesystem issue or a different process modifying the copy of the file after it’s copied, the hash calculated from reading back the copy should always be the same as what was calculated while copying (it would just be checking whether the data you wrote out is still the contents of the copy).
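
A rough sketch of that read-back verification, reusing the pattern from the sample program above (args[2] is the copy’s path, and hash is the digest computed during the copy):

// Re-open the copy and hash it again, this time into a discarding hasher.
const copy = try std.Io.Dir.cwd().openFile(init.io, args[2], .{});
defer copy.close(init.io);
var copy_reader = copy.reader(init.io, &.{});

var verify_buf: [1024]u8 = undefined;
var verify: std.Io.Writer.Hashing(std.crypto.hash.Blake3) = .initHasher(.init(.{}), &verify_buf);
_ = try copy_reader.interface.streamRemaining(&verify.writer);
try verify.writer.flush();

var copy_hash: [std.crypto.hash.Blake3.digest_length]u8 = undefined;
verify.hasher.final(&copy_hash);
std.log.info("copy matches: {}", .{std.mem.eql(u8, &hash, &copy_hash)});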

5 Likes

I’m experimenting with std.Io.Writer.Hashing and think I may be doing something wrong.

There’s a file lorem_ipsum.txt with content:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

on which I’m using the below code to get its sha256:

test "get the hash of a file and print it" {
    const input = try std.Io.Dir.cwd().openFile(io, "lorem_ipsum.txt", .{});
    defer input.close(io);

    var reader = input.reader(io, &.{});

    var buf: [4096]u8 = undefined;
    var hasher: std.crypto.hash.sha2.Sha256 = .init(.{});
    var hashed_writer: std.Io.Writer.Hashing(std.crypto.hash.sha2.Sha256) = .initHasher(hasher, &buf);
    _ = try reader.interface.streamRemaining(&hashed_writer.writer);
    try hashed_writer.writer.flush();

    var hash: [std.crypto.hash.sha2.Sha256.digest_length]u8 = undefined;
    hasher.final(&hash);

    std.debug.print("{x}\n", .{hash});
}

const std = @import("std");
const io = std.testing.io;

The output is:

e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
All 1 tests passed.

However, running echo -n lorem_ipsum.txt | sha256sum and printf lorem_ipsum.txt | sha256sum produce:

c1adfec13efcc8b432f70ecba8d1fc31d929421b876bc7ec3ca1b8eccfc9a278

and sha256sum lorem_ipsum.txt produces bd284d3c50ef82bf57b9cf8144447d0fac68b7c0c59e7ae6eb81c002820bee41.

What’s the right way to compare the hash from the Zig program with the one from the terminal?

I experimented with that, and it was working for me here:

/// Unused
/// Hashed writer that returns a hash and optionally writes data
fn hashWriter(
    comptime T: type,
    io: Io,
    dest_file: anytype,
    buf: []u8,
    reader: *Io.File.Reader,
    max_chunk: u64,
    comptime pipe: bool,
) ![T.digest_length]u8 {
    // Writer setup; don't pass a buffer here.
    var writer_wrap = if (pipe) dest_file.writer(io, &.{});
    var discard_init = if (!pipe) Io.Writer.Discarding.init(&.{});

    // Select the pointer to the writer interface
    const writer = if (pipe)
        &writer_wrap.interface
    else
        &discard_init.writer;

    var src_hash: [T.digest_length]u8 = undefined;
    var h = if (@hasDecl(T, "Options")) T.init(.{}) else T.init();
    var h_writer = Io.Writer.hashed(writer, &h, buf);
    std.log.debug("Started writing loop.", .{});
    while (try h_writer.writer.sendFileAll(reader, .limited(@intCast(max_chunk))) > 0) {}

    try h_writer.writer.flush();
    try writer.flush();
    std.log.debug("Flushed writers.", .{});

    h.final(&src_hash);
    return src_hash;
}

But in the end I decided to just use a reader.

1 Like

.initHasher makes a copy of the hasher when it takes it as an argument, so that variable is never updated by the writer (the e3b0c442… value you got is the SHA-256 of empty input). Try this instead:

    var buf: [4096]u8 = undefined;
    var hashed_writer: std.Io.Writer.Hashing(std.crypto.hash.sha2.Sha256) = .initHasher(.init(.{}), &buf);
    _ = try reader.interface.streamRemaining(&hashed_writer.writer);
    try hashed_writer.writer.flush();

    var hash: [std.crypto.hash.sha2.Sha256.digest_length]u8 = undefined;
    hashed_writer.hasher.final(&hash);

Also,

echo -n lorem_ipsum.txt | sha256sum
printf lorem_ipsum.txt | sha256sum

These commands get the sha256sum of the literal string “lorem_ipsum.txt”, not the contents of the file. sha256sum lorem_ipsum.txt (your last command) is the one that hashes the file’s contents, so once the hasher copy is fixed, that’s the value your Zig program should match.

3 Likes