Looking for resources for fixed point and quantization

For context, I’ve been working on quantizing floating point numbers to 8-bit for math operations (think matrix multiplies for example). I’ve found a few projects that I’m considering rewriting in Zig. Here’s one that I’ve been reading:

That got me thinking about gathering more resources to dig deeper into this subject in general. So, I’m curious - does anyone have any resources/projects they particularly like on this subject? I’m happy to hear any explanations and experiences people have as well.


I’ll also add that this looks like a good blog post: Fixed-point representation for quantization

I’ve had the Mike Acton (of data-oriented design fame) series on floating point numbers on my reading list for a while. He goes into fixed point numbers as well, and how fixed point is a form of compression. Of course, he’s approaching it from a game programming perspective and I’m not sure how applicable it is to an ML context, but if you’re just looking for intuition on fixed point numbers I think it’s a good resource.
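
For intuition on the "fixed point as compression" idea, here is a minimal sketch of a 16-bit fixed-point format with 8 fractional bits (often written Q8.8). It is not taken from the Acton series; the format choice and the names `frac_bits`, `toFixed`, `toFloat`, and `mulFixed` are assumptions made for illustration. Each value is stored as an integer count of 1/256 steps, so an f32 shrinks to an i16 at the cost of range and precision.

```zig
const std = @import("std");

// Hypothetical Q8.8 layout: 16-bit signed integer, 8 fractional bits.
const frac_bits = 8;
const one: f32 = 1 << frac_bits; // 256.0, the fixed-point representation of 1.0

// Convert a float to Q8.8. The caller must keep x within roughly [-128, 128);
// out-of-range values are not handled in this sketch.
pub fn toFixed(x: f32) i16 {
    return @intFromFloat(@round(x * one));
}

// Convert a Q8.8 value back to a float.
pub fn toFloat(q: i16) f32 {
    return @as(f32, @floatFromInt(q)) / one;
}

// Multiply two Q8.8 values: widen to i32 so the product fits, then shift
// right by frac_bits to drop the extra fractional bits. The final cast
// assumes the result fits in an i16.
pub fn mulFixed(a: i16, b: i16) i16 {
    const wide: i32 = @as(i32, a) * @as(i32, b);
    return @intCast(wide >> frac_bits);
}
```

With these definitions, `toFixed(1.5)` is 384 (that is, 1.5 * 256), and `toFloat(mulFixed(toFixed(1.5), toFixed(2.0)))` gives back 3.0.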


Awesome - and I’ll always make time to listen to a Mike Acton talk (haven’t read much of his writing but I can imagine it’s probably good).

My impression so far is that there isn’t anything particularly special about floating point quantization in ML, but it does have certain techniques for getting scaling values that may not make much sense outside that context. For instance, it’s common to run a small sample of the operations first, find min-max values for each tensor based on the average over that sample, and then just use those instead of requantizing each time during evaluation/training.
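
A rough sketch of that calibration idea, assuming each sample run yields one batch of f32 values for a given tensor: average the per-batch minimum and maximum, derive a single scale from the averaged range, and reuse it afterwards. The names `Range`, `calibrate`, `symmetricScale`, and `quantizeWithScale` are hypothetical, and real frameworks differ in how they average (moving averages, percentiles, and so on).

```zig
const std = @import("std");

pub const Range = struct { min: f32, max: f32 };

// Average the per-batch min and max over a set of calibration batches.
// Empty batches are skipped; with no usable batches the range is [0, 0].
pub fn calibrate(batches: []const []const f32) Range {
    var min_sum: f32 = 0.0;
    var max_sum: f32 = 0.0;
    var count: f32 = 0.0;

    for (batches) |batch| {
        if (batch.len == 0) continue;
        var lo = batch[0];
        var hi = batch[0];
        for (batch[1..]) |x| {
            lo = @min(lo, x);
            hi = @max(hi, x);
        }
        min_sum += lo;
        max_sum += hi;
        count += 1.0;
    }

    if (count == 0.0) return .{ .min = 0.0, .max = 0.0 };
    return .{ .min = min_sum / count, .max = max_sum / count };
}

// Symmetric i8 scale for the calibrated range, mapping it onto [-127, 127].
pub fn symmetricScale(r: Range) f32 {
    return @max(@abs(r.min), @abs(r.max)) / 127.0;
}

// Quantize one value with a precomputed scale. Values outside the averaged
// range can occur at evaluation time, so the result is clamped.
pub fn quantizeWithScale(x: f32, scale: f32) i8 {
    if (scale == 0.0) return 0;
    const q = @min(@max(@round(x / scale), -127.0), 127.0);
    return @intFromFloat(q);
}
```

The clamp matters here: because the range is an average, individual values seen later can exceed it, and they saturate at the ends of the i8 range instead of overflowing.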

I’m probably going to do an example project in Zig to get some practical experience. If you (or others) are interested in that I’ll make a public repo and post it here if anyone wants to write some code.


I’ve got a bit of example code for signed 8-bit quantization. I’ll work out a version for unsigned at some point. Here it is:

const std = @import("std");

// assume symmetric quantization
pub fn Quantized(comptime T: type) type {
    return struct {
        slice: []T,
        scale: f32,
        zero_point: T,
    };
}

pub fn min_max(slice: []const f32) struct { min: f32, max: f32 } {
    var min: f32 = std.math.inf(f32);
    var max: f32 = -std.math.inf(f32);

    for (slice) |x| {
        min = @min(min, x);
        max = @max(max, x);
    }

    return .{ .min = min, .max = max };
}

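pub fn abs_max(slice: []const f32) f32 {
    // Magnitudes are never negative, so 0 is a safe starting value
    // (and a sensible result for an empty slice).
    var max: f32 = 0.0;

    for (slice) |x| {
        max = @max(max, @abs(x));
    }

    return max;
}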

pub fn quantize(
    comptime T: type,
    slice: []const f32,
    allocator: std.mem.Allocator,
) !Quantized(T) {
    if (slice.len == 0) {
        return .{ .slice = &.{}, .scale = 0.0, .zero_point = 0 };
    }

    const qslice = try allocator.alloc(T, slice.len);

    switch (T) {
        i8 => {
            // this is symmetric quantization where the range of
            // the i8 is between [-127, 127]. This omits the lowest
            // value of -128 to keep min and max equivalent.
            const scale: f32 = blk: {
                const cap: f32 = std.math.maxInt(i8);
                break :blk abs_max(slice) / cap;
            };

            // An all-zero input gives a scale of 0. Dividing by it would
            // produce NaN, which @intFromFloat cannot convert.
            if (scale == 0.0) {
                @memset(qslice, 0);
                return .{ .slice = qslice, .scale = 0.0, .zero_point = 0 };
            }

            for (slice, qslice) |x, *y| {
                y.* = @intFromFloat(@round(x / scale));
            }

            return .{ .slice = qslice, .scale = scale, .zero_point = 0 };
        },
        u8 => {
            @compileError("Todo: unsigned 8-bit");
        },
        else => {
            @compileError("Unsupported quantization output type.");
        },
    }
}

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    const q = try quantize(i8, &.{ -0.3, -0.2, -0.1, 0, 0.1, 0.2, 0.3 }, arena.allocator());

    std.log.info("q: {any}", .{q.slice});
    std.log.info("s: {}", .{q.scale});
    std.log.info("z: {}", .{q.zero_point});

    // Dequantize each value to check the round trip.
    for (q.slice) |x| {
        const u: i8 = x + q.zero_point;
        const v: f32 = @as(f32, @floatFromInt(u)) * q.scale;
        std.log.info("{}", .{v});
    }
}
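
For the unsigned case, which the post leaves as a to-do, one common approach is asymmetric (affine) quantization: map [min, max] onto [0, 255] and store a zero point that marks where 0.0 lands. The following is only a sketch under that assumption, not the version planned above. The names `Affine`, `affineParams`, `quantizeU8`, and `dequantizeU8` are hypothetical, and it uses the convention q = round(x / scale) + zero_point, so dequantization subtracts the zero point.

```zig
const std = @import("std");

pub const Affine = struct { scale: f32, zero_point: u8 };

// Derive scale and zero point for mapping [min_in, max_in] onto [0, 255].
pub fn affineParams(min_in: f32, max_in: f32) Affine {
    // Widen the range to include 0 so that 0.0 maps exactly to an integer.
    const lo = @min(min_in, 0.0);
    const hi = @max(max_in, 0.0);

    // Degenerate range (all zeros): any non-zero scale works.
    if (hi == lo) return .{ .scale = 1.0, .zero_point = 0 };

    const scale = (hi - lo) / 255.0;
    const zp: u8 = @intFromFloat(@min(@max(@round(-lo / scale), 0.0), 255.0));
    return .{ .scale = scale, .zero_point = zp };
}

// q = clamp(round(x / scale) + zero_point, 0, 255)
pub fn quantizeU8(x: f32, p: Affine) u8 {
    const q = @round(x / p.scale) + @as(f32, @floatFromInt(p.zero_point));
    return @intFromFloat(@min(@max(q, 0.0), 255.0));
}

// x is approximately (q - zero_point) * scale. The subtraction is done in
// i32 because the difference of two u8 values can be negative.
pub fn dequantizeU8(q: u8, p: Affine) f32 {
    const centered = @as(i32, q) - @as(i32, p.zero_point);
    return @as(f32, @floatFromInt(centered)) * p.scale;
}
```

For the range [-1.0, 3.0] this gives a zero point of 64, so -1.0 maps to 0, 0.0 maps to 64, and 3.0 maps to 255.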