Casting float to float, unexpected behavior?

Sorry for the beginner question, but can someone enlighten me, please?

In short, I want to convert one type of float to another and I used @floatCast.

Assuming the value is good:

  • conversion from a big type to a small type works
  • conversion from a small type to a big type works, but returns wrong results (in my opinion, of course)

I tested with several numbers (arrays) and different float bit widths; in the best cases the first few numbers are converted correctly, i.e. they keep the original value, while the rest change value.

Just a simple example of how I used the function:

const std = @import("std");
const print = std.debug.print;

pub fn main() !void {
    const val_f32: f32 = 100_000_000_000_000_000_000_000_000_000_000_000_000;
    print("f32  to f64 {d}\n", .{@as(f64, @floatCast(val_f32))});

    const val_f128: f128 = 100_000_000_000_000_000_000_000_000_000_000_000_000;
    print("f128 to f64 {d}\n", .{@as(f64, @floatCast(val_f128))});
}

Output:

$ zig build run
f32  to f64 99999996802856920000000000000000000000
f128 to f64 100000000000000000000000000000000000000

Please help me understand what I am doing wrong!

As always, many thanks.

The problem is not the cast. f32 doesn’t have enough precision to store such a high number: converting integers to f32 is only lossless if the number fits in 24 bits (u24 or i25), because f32 has a 24-bit significand. The loss of precision is already happening on the line where you assign the literal to val_f32.
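
For example, a quick check of where exact integer representation in f32 ends (just a sketch):

const std = @import("std");

pub fn main() void {
    // Every integer up to 2^24 = 16_777_216 fits in f32 exactly;
    // 16_777_217 is the first one that does not, so it rounds back down.
    const a: f32 = 16_777_216;
    const b: f32 = 16_777_217;
    std.debug.print("a = {d}, b = {d}, a == b: {}\n", .{ a, b, a == b });
}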


Thanks for the quick response.

Let me explain how I understood the problem:

  • that value is within the range supported by f32, but the problem also happens at much lower values
  • I don’t understand where the precision is lost, since the value is defined directly as f32
  • the conversion from f32 to f32 does not lose precision; it is done correctly even at high values

However, am I to understand that before being f32, that value is an integer somewhere “behind the scenes” and is converted to f32?

It’s important to understand that “inside the range of” and “exactly representable by” are different things for floating point numbers. There are many numbers within the range of floats that cannot be represented exactly.

From GNU:

Floating point numbers are approximate. They are based on a sign bit, a mantissa, and an exponent.

I just worked on a quantization utility that takes values from large float types down to 8 bits with a scale and zero-point representation. This is exactly the type of behavior one can expect.
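
As a rough illustration, here is a small sketch that pulls those three fields out of the f32 value from the example above with @bitCast:

const std = @import("std");

pub fn main() void {
    const x: f32 = 100_000_000_000_000_000_000_000_000_000_000_000_000;
    // IEEE-754 binary32 layout: 1 sign bit, 8 exponent bits, 23 mantissa bits.
    const bits: u32 = @bitCast(x);
    const sign = bits >> 31;
    const exponent = (bits >> 23) & 0xff;
    const mantissa = bits & 0x7f_ffff;
    std.debug.print("sign = {d}, exponent = {d}, mantissa = 0x{x}\n", .{ sign, exponent, mantissa });
}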

This is also why floating point arithmetic is not associative. If I do something like this:

giant_number  - (giant_number + small_number)

It may appear that the small number has completely disappeared from the calculation: adding it pushed the sum to a value the float cannot represent any more precisely, so it was rounded away. However, this will make the small value appear again:

(giant_number - giant_number) + small_number

note: subtraction is not associative over addition… this is just meant to demonstrate precision loss. That loss does affect associativity, though, for example in long chains of additions in unordered reductions.
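
Concretely, a small sketch of the two orderings (the giant and small values here are arbitrary picks):

const std = @import("std");

pub fn main() void {
    const giant_number: f32 = 1.0e30;
    const small_number: f32 = 1.0;

    // giant_number + small_number rounds back to giant_number,
    // so the small value seems to vanish and the first result is 0.
    const lost = giant_number - (giant_number + small_number);

    // Cancelling the giant values first leaves the small value intact.
    const kept = (giant_number - giant_number) + small_number;

    std.debug.print("giant - (giant + small) = {d}\n", .{lost});
    std.debug.print("(giant - giant) + small = {d}\n", .{kept});
}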

In the words of Gerald Jay Sussman: “Nothing brings fear to my heart more than a floating point number.”


Ok, in conclusion, I need to learn more about how floating point types are represented and stored in memory.

I knew they were approximate, but I still didn’t think there were such big differences, i.e. a kilogram of sugar is after all a kilogram (+/- 1%), not a hundred grams. :laughing:

Anyway, thank you all for pointing me in the right direction!

The literal you used is of type comptime_int, which is being coerced (i.e. implicitly cast) to comptime_float. This conversion is lossless, as comptime_float has infinite precision. The comptime_float is then being coerced into an f32, which is lossy. Like @AndrewCodeDev mentioned, being in the supported range does not mean the value can be converted losslessly. After all, there is only so much information that can be stored inside 32 bits. Such a high number wouldn’t even fit into a u32.
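
One way to see this (just a sketch): the widening cast itself is exact, and the difference comes entirely from that initial coercion of the literal:

const std = @import("std");

pub fn main() void {
    // The literal is rounded once, when it is coerced to f32.
    const val_f32: f32 = 100_000_000_000_000_000_000_000_000_000_000_000_000;

    // Widening f32 -> f64 is lossless: every f32 value exists exactly in f64.
    const widened: f64 = @floatCast(val_f32);
    std.debug.print("widening is exact: {}\n", .{widened == val_f32});

    // Coercing the same literal straight to f64 keeps more digits,
    // which is why the two lines in the original output differ.
    const val_f64: f64 = 100_000_000_000_000_000_000_000_000_000_000_000_000;
    std.debug.print("via f32: {d}\n", .{widened});
    std.debug.print("as f64 : {d}\n", .{val_f64});
}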


Now I understand and it makes sense to me; sorry, I have limited programming knowledge.

Thank you very much!


These numbers (binary floating point) are approximations of real numbers.
The problem is that the number 100000000000000000000000000000000000000 cannot be represented exactly; it becomes 99999996802856920000000000000000000000.

The solution to this problem is to use decimal floating point numbers; these can accurately represent real numbers. Unfortunately, a lot of people find binary floating point numbers good enough and decimal floating point numbers hard to implement :frowning:

At the risk of being “that guy”, real numbers aren’t representable in the general case. π is the usual example.

Decimal floats do have advantages when dealing with naturally-decimal values, notably currencies, since even a number as simple as 0.1 is actually 0.1000000000000000055511151231257827021181583404541015625 in f64.
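
A quick way to see that in f64 (just a sketch):

const std = @import("std");

pub fn main() void {
    const a: f64 = 0.1;
    const b: f64 = 0.2;
    // Neither 0.1 nor 0.2 is exact in binary, so the rounding shows up in the sum.
    std.debug.print("0.1 + 0.2 == 0.3: {}\n", .{a + b == 0.3});
    std.debug.print("0.1 + 0.2 = {d}\n", .{a + b});
}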


For the sake of being that guy, I’ll also add that floating point cannot represent the reals (transcendental numbers like pi or e, or any irrational). We work entirely in the field Q, not R, within some absolute epsilon. The only way to approach the reals would be a mantissa and exponent of infinite bit width, and even then only for numbers that are computable (people like Cantor had things to say about this: Cantor's diagonal argument - Wikipedia) or representable as an infinite sum, because floating point presents as a finite significand times a finite base raised to a finite exponent… that said… let’s let sleeping dogs lie :slight_smile:


I mean that decimal floating point numbers are exact representations of real numbers, not that all real numbers can be represented by decimal floating point numbers.


My problems came from trying to find the number of digits of the integer part of a floating point number. I’ve used functions like @abs() and @floor(), and in theory I don’t care whether 1.555 is represented as 1.55 or 1.56, but if 1000 ends up as 90 then it’s a problem.

So at this point, without further investigation, the obvious solution seems to be to use f128 even though it is not optimal for all numbers.

I will look for more information on how floating point numbers are used to correctly represent the integer part at least and how they can be converted to integers.
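
For what it’s worth, one way to count the digits of the integer part without going through strings is repeated division. A rough sketch (intDigits is just a name I made up, and it is only as reliable as the integer part the float can actually hold, roughly up to 2^24 for f32 and 2^53 for f64):

const std = @import("std");

// Count the decimal digits of the integer part of a finite float.
fn intDigits(x: f64) u32 {
    var n = @floor(@abs(x));
    var digits: u32 = 1;
    while (n >= 10) : (digits += 1) {
        n = @floor(n / 10);
    }
    return digits;
}

pub fn main() void {
    std.debug.print("{d}\n", .{intDigits(1.555)}); // 1
    std.debug.print("{d}\n", .{intDigits(1000.25)}); // 4
    std.debug.print("{d}\n", .{intDigits(-987654.5)}); // 6
}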

@7R35C0, I’ll send you a dm to follow up on this. I may have some resources to help you.


Both types of floating point are actually exact representations of rational numbers. Which are a subset of the reals, yes, but so are integers.

Every floating point value, decimal or binary, is a rational. It can be expressed as an exact fraction of some other value, whether that fraction is big and messy, or not. So, just like we wouldn’t say that i64 expresses rational numbers (even though integers are rationals), we wouldn’t say that f64 expresses real numbers (even though rationals are reals).

The difference between the two is that, since decimal floats use base 10, more of the rationals which we happen to write using decimal notation can be represented exactly. That’s a useful property sometimes.
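
To make the “every float is an exact rational” point concrete, here is a sketch for binary f64 using std.math.frexp (if I’m remembering its significand/exponent result fields right); numerator and shift are just names for this example:

const std = @import("std");

pub fn main() void {
    // Every finite binary float is exactly some integer divided by a power of two.
    const x: f64 = 0.1;
    const r = std.math.frexp(x); // x == r.significand * 2^r.exponent, with 0.5 <= |significand| < 1
    const numerator: u64 = @intFromFloat(r.significand * 0x1p53); // scale the 53-bit significand to an integer
    const shift = 53 - r.exponent;
    std.debug.print("0.1 is exactly {d} / 2^{d}\n", .{ numerator, shift });
}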