Autohashing of floats

pdoane · October 25, 2023, 5:50pm

Continuing discussion from Discord about hashing structures with floats in them. A common use case is finding exactly the same object for reuse, e.g. consider a struct like this:

typedef struct D3D11_SAMPLER_DESC {
  D3D11_FILTER               Filter;
  D3D11_TEXTURE_ADDRESS_MODE AddressU;
  D3D11_TEXTURE_ADDRESS_MODE AddressV;
  D3D11_TEXTURE_ADDRESS_MODE AddressW;
  FLOAT                      MipLODBias;
  UINT                       MaxAnisotropy;
  D3D11_COMPARISON_FUNC      ComparisonFunc;
  FLOAT                      BorderColor[4];
  FLOAT                      MinLOD;
  FLOAT                      MaxLOD;
} D3D11_SAMPLER_DESC;

When creating samplers, descriptors with exactly the same values can reuse existing objects. Using std.mem.asBytes() can have issues with embedded padding, and many descriptor objects have embedded pointers that need to follow.

The algorithm for autoHash works well for this use case, if it had an implementation of float hashing. I am not concerned about NaN behavior - these aren’t floats that are part of mathematical operations (although a debug.assert when encountered would be very reasonable).

AndrewCodeDev · October 25, 2023, 7:58pm

Hey @pdoane, thanks for joining us on Ziggit.

For context, this is the file being referred to: zig/lib/std/hash/auto_hash.zig at master · ziglang/zig · GitHub

And this is the line of code that’s being discussed:

.Float,
 => @compileError("unable to hash type " ++ @typeName(Key)),

dude_the_builder · October 26, 2023, 1:00am

I’m not sure hashing of floats is possible given the problems with precision and numbers that can’t be represented exactly as floating point. I once thought I could bit cast the float to an unsigned integer and hash that, but it still doesn’t work:

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.print("{}\n", .{gpa.deinit()});
    const allocator = gpa.allocator();
    var map = std.AutoHashMap(u64, u64).init(allocator);
    defer map.deinit();

    const float_1: f64 = 3.1415;
    var key: u64 = @bitCast(float_1);
    std.debug.print("float_1 key: {b}\n", .{key});

    try map.put(key, 42);
    var value = map.get(key) orelse @panic("WAT?");
    std.debug.assert(value == 42);

    const float_2: f64 = 3.1415_0000000000_1; // 15 decimal places
    key = @bitCast(float_2);
    std.debug.print("float_2 key: {b}\n", .{key});
    std.debug.assert(map.get(key) == null);

    const float_3: f64 = 3.1415_00000000000_1; // 16 decimal places
    key = @bitCast(float_3);
    std.debug.print("float_3 key: {b}\n", .{key});
    value = map.get(key) orelse @panic("WAT?");
    std.debug.assert(value == 42);
}

output:

float_1 key: 100000000001001001000011100101011000000100000110001001001101111
float_2 key: 100000000001001001000011100101011000000100000110001001001110001
float_3 key: 100000000001001001000011100101011000000100000110001001001101111
heap.general_purpose_allocator.Check.ok

pdoane · October 26, 2023, 1:24am

Your example is with 3 keys that are initialized with different values though. Hashing floats is a commonly defined operation in other programming languages, e.g. see:

llvm-project/libcxx/include/__functional/hash.h at main · llvm/llvm-project (github.com)

which hashes floats through bitcasting outside of dealing with negative zero. I’m not arguing existence as a reason to do it for Zig, but this does work (at least in C/C++, but maybe Zig has different rules?).

There are many ways in which a float value will be equal, e.g. (reading the same memory location, literal values) that are important to the use case of detecting equivalent keys. Use of floating-point numbers as integers that are in range is completely fine too (e.g. both Lua and Javascript runtimes do this).

I’m not making up the use case about needing to detect duplicates. The Windows runtime takes instances of those structures I posted above hashes them and returns references to existing objects.

AndrewCodeDev · October 26, 2023, 2:43am

It took me a minute to read through the templated inheritance tree… but it looks like it essentially boils down to:

    size_t operator()(_Tp __v) const _NOEXCEPT
    {
        union
        {
            _Tp    __t;
            size_t __a;
        } __u;
        __u.__a = 0;
        __u.__t = __v;
        return __u.__a;
    }

Which, in zig, roughly translates to…

pub fn float64Hash(x: f64) usize {
    const HashUnion = extern union {
        source: f64,
        target: usize
    };
    var h = HashUnion { .target = 0 };
    h.source = x;
    return h.target;
}

pub fn main() !void {

    const x: f64 = 1982731982.9283742; // or whatever

    const h = float64Hash(x);

    std.debug.print("\nHash: {}\n", .{h});
}

Arithmetically speaking, floating points are inaccurate because of the inability to represent values because they are approximations. That’s why they are considered to be non-associative in the general case (if I subtract a really tiny number from a really big number, it may not be able to adequately represent that).

Returning the bits though is, to my knowledge, reliable.

This whole thing becomes a bit scary if you think about the case where arithmetic is involved. Because of the inability to precisely represent things in the general case, you could have two values that are (on paper) mathematically provable to calculate to different values but they end up with the same bit-set because of floating point representation issues.

@pdoane, you seem to be knowledge about this domain. Can you speak to this issue?

pdoane · October 26, 2023, 3:08am

Consider building a mesh vertex & index buffer for rendering. As input, you have an array of vertices and are trying to find duplicates, and putting the vertices into a hash table is a reasonable way to do it. If the bits are different, then they are different vertices. Maybe you want to quantize the floats ahead of time, but that’s unrelated to the duplicate finding process.

In many cases you could hash the struct memory directly but that’s scary to me because of padding bytes that might come along later.

Or back to the use case I listed originally, we’re looking for cases where a program fills them out in the same exact way. Consider:

  desc.MipLODBias = (a + b) / 2;

vs

  desc.MipLODBias = 0.5 *a + 0.5 * b;

I have no expectations that these are going to be the same values. But if someone writes:

   desc.MipLODBias = 1.0;

That had better work.

pdoane · October 26, 2023, 3:11am

Basically I’m just looking to avoid the hundreds (probably even thousands) lines of manual struct hashing functions I’ve had to write and maintain on previous projects in C++.

I can just fork Autohash and move along, but I’d be surprised I’m the only one to benefit here.

AndrewCodeDev · October 26, 2023, 3:17am

Okay - that’s the same hashing behaviour I anticipated as well.

I did a cursory search through the issues via github including “autohash” and “float hash” and I didn’t find any relevant issues. That said, I did notice that with the autohash search that there weren’t that many entries: Issues · ziglang/zig · GitHub

I think you have firm ground to make a proposal here. At the very least, you could draft one and see what the push-back is (if there is any) before making that decision.

pdoane · October 26, 2023, 3:17am

The proposal I made on the Discord channel was an opt-in, e.g.:

const FloatHashStrategy = enum {
   Disabled, // current behavior - float hashing is not allowed
   Value,    // satisfies property x == y => hash(x) = hash(y), resolving -0.0/0.0 issue
   Bitcast,  // not sure it's a good idea, but it's commonly done
   // Any others?
};

And then merge the existing Pointer/Float strategies into a struct. Compile errors/documentation can include plenty of comments about pitfalls.

And regarding NaNs, yeah they are a problem… but that’s pervasive and I don’t think we’re generally throwing out useful functionality. I’m not aware of an AutoLess but hopefully that wouldn’t skip support for floats because it might cause issues with a sorting function when encountering a NaN?

AndrewCodeDev · October 26, 2023, 3:23am

You could also add an epsilon compare of some sort (not sure if that should be parameterized). Something like ye ol’ |x - y| < e but that could be a hassle.

pdoane · October 26, 2023, 3:38am

For hash table operations to be well-defined, eq(x,y) => hash(x) == hash(y), so an approximate equality like that is challenging for writing the corresponding hash function. For my use cases, I really just want exact equality (which only requires handling the two representations of 0 and then going to the bitwise representation).

What’s surprising to me is that std.meta.eql supports float and all the same arguments/concerns apply there too, perhaps more so. While it would suck, return 0 is a valid hash implementation, so the concerns seem to really be about the equality side of this, and there are many valid ways to make use of exact floating point equality.

gonzo · October 26, 2023, 8:16am

I just wanted to chime in with a bit of “wow, surprising”. I never thought about hashing floats before, and it seemed obvious to me you could just use the bits representing the float, the rest be damned – I guess my PoV is using hashing to find equal values, so I agree that eq(x,y) => hash(x) == hash(y).

But from this discussion I gather there are (at least) two strange cases:

+0.0 == -0.0 – groan
NaN != NaN – although their bits are obviously identical; do we really want to cater for this, or do we want to be able to locate NaNs in our hash table?

I see std.meta.eql simply uses == for scalar types – it doesn’t work for NaN then?

andrewrk · October 26, 2023, 8:42am

AndrewCodeDev · October 26, 2023, 8:59am

Aha! There was one… and only 3 years ago lol. Somehow I missed that going through the list of issues.

Interesting, so it looks like what actually happened here is that the person making the PR just didn’t respond to the code review on the PR after 28 days. Also, there wasn’t a lot of review-suggestions being made… it was a fairly light code-review up to that point. I’m surprised they didn’t follow up.

So, @pdoane - at the very least, there was some interest in this topic at some point and it wasn’t outright shot down. It may be productive for you to look into this.

pdoane · October 26, 2023, 1:34pm

Yeah, so this has been an interesting path and the discussion has been helpful. I’ve been focused on hashing, but equality is the bigger issue where I can see people wanting different behaviors:

NaNs can never be retrieved
Bitwise equal NaNs can be retrieved
All NaNs are equal for key comparison

I haven’t been focused on it as std.meta.eql behavior is fine by me, and the only issue with hashing for it is dealing with negative zero.

So, @pdoane - at the very least, there was some interest in this topic at some point and it wasn’t outright shot down. It may be productive for you to look into this.

I’m not sure that PR is particularly interesting. Floats as keys is the easy case and I can just make my own context for it. Floats embedded in structures where Autohash is leveraging reflection is the useful part.

I’m happy to put up a PR if there’s a design people are happy with.

andrewrk · October 26, 2023, 5:47pm

I think you are bringing something new to the discussion which is the point about a nested structure containing floats. I think that if you take the time to come up with a good proposal, make a pull request, and follow-through on requested changes, there is a high likelihood of your use case being resolved.

Of course, only if putting in this effort is something you want to do.

mcadamy · October 30, 2023, 1:38pm

Would just be odd, asking a data structure to take this item and store, don’t toss is, and never let me retrieve it. I don’t understand the point.
Is the only sensible way of doing this. A hash table is made to return the same (or identical) object keys that share the same representation. Don’t turn it into something it isn’t There are plenty of other tree and hash like structures for proximity and fuzzy searches.
This is probably out since they have different bit representations, and that would be similar to asking a generic hasher to correctly hash “Robert” and “John” for you.

pdoane · October 30, 2023, 3:32pm

NaNs are weird. They do not compare equal by default so this behavior is a consequence of using std.meta.eql. NaNs cause all sorts of other problems too if you use default comparison (<) like breaking most sorting functions, tree based data structures, etc.

What I’ll be putting up later in the PR will just assert on NaNs which sidesteps the issue.

Hash tables work based on their definition of equivalence. The default implementation uses std.meta.eql() which uses equality and not bitwise comparison for floats. You could absolutely make a different default, but that’s not currently what Zig (or any other language that I’m familiar with) does for hash tables.

Maybe it got lost in the discussion here, but I have zero interest in proximity or fuzzy searching so those other data structures are not useful. I am looking for exact equality between objects (that happen to contain floating point values). A hash table is a great choice for that.

All three of those proposals handle -0.0 which has a different bit representation than 0.0.

It sounds like you may be advocating for a 4th option, which is just bitwise equality. It doesn’t exactly satisfy my use case although it’s close enough that i’d use it. Handling negative zero is one extra if-check so it’s not clear that it’s a good tradeoff, and it requires either a new std.meta.eql or options to it.

The requirements on the hashing function are dictated by the equality function, so you work backwards from there. I’m fine with std.meta.eql so am just looking to make the minimal change that supports floats.

I had previously proposed a more complicated option to the hashing function, but it’s not clear that complexity brings any value without also tackling equality questions.