Continuing discussion from Discord about hashing structures with floats in them. A common use case is finding exactly the same object for reuse, e.g. consider a struct like this:
When creating samplers, descriptors with exactly the same values can reuse existing objects. Using std.mem.asBytes() can have issues with embedded padding, and many descriptor objects have embedded pointers that need to follow.
The algorithm for autoHash works well for this use case, if it had an implementation of float hashing. I am not concerned about NaN behavior - these arenāt floats that are part of mathematical operations (although a debug.assert when encountered would be very reasonable).
Iām not sure hashing of floats is possible given the problems with precision and numbers that canāt be represented exactly as floating point. I once thought I could bit cast the float to an unsigned integer and hash that, but it still doesnāt work:
Your example is with 3 keys that are initialized with different values though. Hashing floats is a commonly defined operation in other programming languages, e.g. see:
which hashes floats through bitcasting outside of dealing with negative zero. Iām not arguing existence as a reason to do it for Zig, but this does work (at least in C/C++, but maybe Zig has different rules?).
There are many ways in which a float value will be equal, e.g. (reading the same memory location, literal values) that are important to the use case of detecting equivalent keys. Use of floating-point numbers as integers that are in range is completely fine too (e.g. both Lua and Javascript runtimes do this).
Iām not making up the use case about needing to detect duplicates. The Windows runtime takes instances of those structures I posted above hashes them and returns references to existing objects.
Arithmetically speaking, floating points are inaccurate because of the inability to represent values because they are approximations. Thatās why they are considered to be non-associative in the general case (if I subtract a really tiny number from a really big number, it may not be able to adequately represent that).
Returning the bits though is, to my knowledge, reliable.
This whole thing becomes a bit scary if you think about the case where arithmetic is involved. Because of the inability to precisely represent things in the general case, you could have two values that are (on paper) mathematically provable to calculate to different values but they end up with the same bit-set because of floating point representation issues.
@pdoane, you seem to be knowledge about this domain. Can you speak to this issue?
Consider building a mesh vertex & index buffer for rendering. As input, you have an array of vertices and are trying to find duplicates, and putting the vertices into a hash table is a reasonable way to do it. If the bits are different, then they are different vertices. Maybe you want to quantize the floats ahead of time, but thatās unrelated to the duplicate finding process.
In many cases you could hash the struct memory directly but thatās scary to me because of padding bytes that might come along later.
Or back to the use case I listed originally, weāre looking for cases where a program fills them out in the same exact way. Consider:
desc.MipLODBias = (a + b) / 2;
vs
desc.MipLODBias = 0.5 *a + 0.5 * b;
I have no expectations that these are going to be the same values. But if someone writes:
Basically Iām just looking to avoid the hundreds (probably even thousands) lines of manual struct hashing functions Iāve had to write and maintain on previous projects in C++.
I can just fork Autohash and move along, but Iād be surprised Iām the only one to benefit here.
Okay - thatās the same hashing behaviour I anticipated as well.
I did a cursory search through the issues via github including āautohashā and āfloat hashā and I didnāt find any relevant issues. That said, I did notice that with the autohash search that there werenāt that many entries: Issues Ā· ziglang/zig Ā· GitHub
I think you have firm ground to make a proposal here. At the very least, you could draft one and see what the push-back is (if there is any) before making that decision.
The proposal I made on the Discord channel was an opt-in, e.g.:
const FloatHashStrategy = enum {
Disabled, // current behavior - float hashing is not allowed
Value, // satisfies property x == y => hash(x) = hash(y), resolving -0.0/0.0 issue
Bitcast, // not sure it's a good idea, but it's commonly done
// Any others?
};
And then merge the existing Pointer/Float strategies into a struct. Compile errors/documentation can include plenty of comments about pitfalls.
And regarding NaNs, yeah they are a problem⦠but thatās pervasive and I donāt think weāre generally throwing out useful functionality. Iām not aware of an AutoLess but hopefully that wouldnāt skip support for floats because it might cause issues with a sorting function when encountering a NaN?
You could also add an epsilon compare of some sort (not sure if that should be parameterized). Something like ye olā |x - y| < e but that could be a hassle.
For hash table operations to be well-defined, eq(x,y) => hash(x) == hash(y), so an approximate equality like that is challenging for writing the corresponding hash function. For my use cases, I really just want exact equality (which only requires handling the two representations of 0 and then going to the bitwise representation).
Whatās surprising to me is that std.meta.eql supports float and all the same arguments/concerns apply there too, perhaps more so. While it would suck, return 0 is a valid hash implementation, so the concerns seem to really be about the equality side of this, and there are many valid ways to make use of exact floating point equality.
I just wanted to chime in with a bit of āwow, surprisingā. I never thought about hashing floats before, and it seemed obvious to me you could just use the bits representing the float, the rest be damned ā I guess my PoV is using hashing to find equal values, so I agree that eq(x,y) => hash(x) == hash(y).
But from this discussion I gather there are (at least) two strange cases:
+0.0 == -0.0 ā groan
NaN != NaN ā although their bits are obviously identical; do we really want to cater for this, or do we want to be able to locate NaNs in our hash table?
I see std.meta.eql simply uses == for scalar types ā it doesnāt work for NaN then?
Aha! There was one⦠and only 3 years ago lol. Somehow I missed that going through the list of issues.
Interesting, so it looks like what actually happened here is that the person making the PR just didnāt respond to the code review on the PR after 28 days. Also, there wasnāt a lot of review-suggestions being made⦠it was a fairly light code-review up to that point. Iām surprised they didnāt follow up.
So, @pdoane - at the very least, there was some interest in this topic at some point and it wasnāt outright shot down. It may be productive for you to look into this.
Yeah, so this has been an interesting path and the discussion has been helpful. Iāve been focused on hashing, but equality is the bigger issue where I can see people wanting different behaviors:
NaNs can never be retrieved
Bitwise equal NaNs can be retrieved
All NaNs are equal for key comparison
I havenāt been focused on it as std.meta.eql behavior is fine by me, and the only issue with hashing for it is dealing with negative zero.
So, @pdoane - at the very least, there was some interest in this topic at some point and it wasnāt outright shot down. It may be productive for you to look into this.
Iām not sure that PR is particularly interesting. Floats as keys is the easy case and I can just make my own context for it. Floats embedded in structures where Autohash is leveraging reflection is the useful part.
Iām happy to put up a PR if thereās a design people are happy with.
I think you are bringing something new to the discussion which is the point about a nested structure containing floats. I think that if you take the time to come up with a good proposal, make a pull request, and follow-through on requested changes, there is a high likelihood of your use case being resolved.
Of course, only if putting in this effort is something you want to do.
Would just be odd, asking a data structure to take this item and store, donāt toss is, and never let me retrieve it. I donāt understand the point.
Is the only sensible way of doing this. A hash table is made to return the same (or identical) object keys that share the same representation. Donāt turn it into something it isnāt There are plenty of other tree and hash like structures for proximity and fuzzy searches.
This is probably out since they have different bit representations, and that would be similar to asking a generic hasher to correctly hash āRobertā and āJohnā for you.
NaNs are weird. They do not compare equal by default so this behavior is a consequence of using std.meta.eql. NaNs cause all sorts of other problems too if you use default comparison (<) like breaking most sorting functions, tree based data structures, etc.
What Iāll be putting up later in the PR will just assert on NaNs which sidesteps the issue.
Hash tables work based on their definition of equivalence. The default implementation uses std.meta.eql() which uses equality and not bitwise comparison for floats. You could absolutely make a different default, but thatās not currently what Zig (or any other language that Iām familiar with) does for hash tables.
Maybe it got lost in the discussion here, but I have zero interest in proximity or fuzzy searching so those other data structures are not useful. I am looking for exact equality between objects (that happen to contain floating point values). A hash table is a great choice for that.
All three of those proposals handle -0.0 which has a different bit representation than 0.0.
It sounds like you may be advocating for a 4th option, which is just bitwise equality. It doesnāt exactly satisfy my use case although itās close enough that iād use it. Handling negative zero is one extra if-check so itās not clear that itās a good tradeoff, and it requires either a new std.meta.eql or options to it.
The requirements on the hashing function are dictated by the equality function, so you work backwards from there. Iām fine with std.meta.eql so am just looking to make the minimal change that supports floats.
I had previously proposed a more complicated option to the hashing function, but itās not clear that complexity brings any value without also tackling equality questions.