Casting floats to ints for discrete quantization

AndrewCodeDev · July 2, 2023, 11:55pm

TLDR: Is there an idiomatic Zig convention for casting from wide floating-point types to small integer types when the values are already guaranteed to be representable in the destination integer type?

I’m working on a problem for discrete value quantization to reduce overall memory and compute for tensor storage and operations. Here’s a primer on the idea: A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes

This is a very common practice and is well defined and quite necessary for modern tensor systems.

More formally, the guarantee comes from the following: round(s * v / ||v||inf), i.e. the infinity-normalized vector multiplied by a scaling constant s equal to the max value representable by some integer type, then rounded.

So then, we have a vector where all values lie within some range (for an i8, we could choose [-127, 127], ignoring -128 for simplicity).

Now, I need to cast this range to a vector of integers. One possible solution is to do a manual manipulation via the mantissa and exponent and populate the integers that way, but I’m curious if Zig has a more idiomatic solution to this.

AndrewCodeDev · July 3, 2023, 12:39am

Turns out the safety checks don’t complain as long as the guarantee holds, so good ol’ @intFromFloat works great here. The checks do run at runtime though, so I’ll want the option to turn them off, but yeah, this works - even for f64 → i8.
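
For anyone landing here later, a minimal sketch of that approach, assuming Zig 0.11 or later (where the builtin is spelled @intFromFloat and takes its result type from context). The function name quantizeAbsmax and the choice of 127 as the scale are hypothetical, picked only to match the i8 example above; the input is assumed to contain no NaN or infinite values.

```zig
const std = @import("std");

/// Hypothetical helper (not from any library): absmax-quantize `src` into `dst`.
/// Assumes `src` holds only finite values.
fn quantizeAbsmax(src: []const f64, dst: []i8) void {
    std.debug.assert(src.len == dst.len);

    // Infinity norm: the largest absolute value in the input.
    var max_abs: f64 = 0.0;
    for (src) |v| {
        const a = if (v < 0.0) -v else v;
        max_abs = @max(max_abs, a);
    }

    // An all-zero input quantizes to all zeros (and avoids dividing by zero).
    if (max_abs == 0.0) {
        @memset(dst, 0);
        return;
    }

    // Map [-max_abs, max_abs] onto [-127, 127].
    const scale = 127.0 / max_abs;

    for (src, dst) |v, *q| {
        // Optional: `@setRuntimeSafety(false);` here would drop the range
        // check for this scope. Only do that if the guarantee really holds.
        q.* = @as(i8, @intFromFloat(@round(v * scale)));
    }
}

pub fn main() void {
    const src = [_]f64{ -2.0, 0.5, 0.0, 2.0 };
    var dst: [src.len]i8 = undefined;
    quantizeAbsmax(&src, &dst);
    std.debug.print("{any}\n", .{dst});
}
```

With safety checks on, an out-of-range float would trip a panic at the cast rather than silently wrapping, which is a handy way to validate the scaling step while developing. In build modes without safety checks, an out-of-range value at the cast is undefined behavior, so the guarantee has to hold.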