# Casting float to float, unexpected behavior?

Sorry for my beginner problem, but can someone enlighten me, please.

In short, I want to convert one type of float to another and I used @floatCast.

Assuming the value is good:

• conversion from a big type to a small type works
• conversion from a small type to a big type works but returns wrong results, imo ofc

I tested with several numbers (arrays) and different float bit widths, in the best cases the first few numbers are converted correctly, i.e. keep the original value, the rest change the value.

Just a simple example of how I used the function:

```zig
const std = @import("std");
const print = std.debug.print;

pub fn main() !void {
    const val_f32: f32 = 100_000_000_000_000_000_000_000_000_000_000_000_000;
    print("f32  to f64 {d}\n", .{@as(f64, @floatCast(val_f32))});

    const val_f128: f128 = 100_000_000_000_000_000_000_000_000_000_000_000_000;
    print("f128 to f64 {d}\n", .{@as(f64, @floatCast(val_f128))});
}
```

Output:

```
$ zig build run
f32  to f64 99999996802856920000000000000000000000
f128 to f64 100000000000000000000000000000000000000
```

As always, many thanks.

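The problem is not the cast. `f32` doesn't have enough precision to store such a high number. Converting integers to `f32` is only guaranteed to be lossless if the number fits in 24 bits (`u24` or `i25`), because `f32` has a 24-bit significand. The loss of precision is already happening at this line: `const val_f32: f32 = 100_000_000_000_000_000_000_000_000_000_000_000_000;`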

Thanks for the quick response.

I will say how I understood the problem:

• that value is in the range supported by f32, but it also happens at much lower values
• I don't understand where the precision is lost, the value is defined directly as f32
• the conversion from f32 to f32 does not lose precision, it is done correctly even at high values

However, am I to understand that before being f32, that value is an integer somewhere "behind the scenes" and is converted to f32?

It's important to understand that "inside the range of" and "exactly representable by" are different things for floating point numbers. There are many numbers within the range of floats that cannot be represented exactly.
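To make the range-versus-exactness point concrete, here is a minimal sketch (the helper names `survivesF32` and `nextUp` are made up for illustration). It checks whether an integer survives a round trip through `f32`, and measures the distance between two neighbouring `f32` values near 1e38 by bumping the bit pattern by one. It assumes a Zig version with the `@floatFromInt` / `@intFromFloat` builtins.

```zig
const std = @import("std");

/// Hypothetical helper: round-trips an integer through f32 and
/// reports whether the same integer comes back.
fn survivesF32(n: u32) bool {
    const as_float: f32 = @floatFromInt(n);
    // u64 so that a value rounded up to 2^32 still fits.
    const back: u64 = @intFromFloat(as_float);
    return back == n;
}

/// Hypothetical helper: next representable f32 above a positive,
/// finite `x`, found by incrementing its bit pattern.
fn nextUp(x: f32) f32 {
    const bits: u32 = @bitCast(x);
    return @bitCast(bits + 1);
}

pub fn main() void {
    std.debug.print("{}\n", .{survivesF32(16_777_216)}); // 2^24: true
    std.debug.print("{}\n", .{survivesF32(16_777_217)}); // 2^24 + 1: false

    const x: f32 = 1.0e38;
    const gap: f64 = @as(f64, nextUp(x)) - @as(f64, x);
    // The gap is 2^103, on the order of 1e31.
    std.debug.print("gap between neighbours near 1e38: {e}\n", .{gap});
}
```

So every value between two such neighbours, including 1e38 itself, gets rounded to one of them.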

From GNU:

Floating point numbers are approximate. They are based on a sign bit, a mantissa, and an exponent. I just worked on a quantization utility that takes values from large float types to 8-bit with scale and zero-point representation. This is exactly the type of behavior that one can expect.

This is also why floating point arithmetic is not associative. If I do something like so:

```
giant_number - (giant_number + small_number)
```

It may appear that the small number has completely disappeared from the calculation. The small number may have pushed the float into a value that it cannot properly represent. However, this will make the small value appear again:

```
(giant_number - giant_number) + small_number
```

Note: subtraction is not associative over addition… this is just to demonstrate precision loss. This has effects on associativity, however, such as long chains of addition in unordered reductions.
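The two expressions above can be tried directly. This is just a sketch with made-up function names (`smallLost`, `smallKept`) and arbitrarily chosen values: `1.0e20` is far beyond the point where an `f32` can still resolve a difference of `1.0`.

```zig
const std = @import("std");

/// giant - (giant + small): the addition rounds back to `giant`
/// when `small` is below the spacing of floats near `giant`.
fn smallLost(giant: f32, small: f32) f32 {
    return giant - (giant + small);
}

/// (giant - giant) + small: the subtraction is exact, so `small` survives.
fn smallKept(giant: f32, small: f32) f32 {
    return (giant - giant) + small;
}

pub fn main() void {
    const giant: f32 = 1.0e20;
    const small: f32 = 1.0;
    // Mathematically this would be -1; in f32 it comes out as 0.
    std.debug.print("giant - (giant + small) = {d}\n", .{smallLost(giant, small)});
    // Here the small value reappears as 1.
    std.debug.print("(giant - giant) + small = {d}\n", .{smallKept(giant, small)});
}
```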

In the words of Gerald Jay Sussman: "Nothing brings fear to my heart more than a floating point number."

Ok, in conclusion, I need to learn more about how floating point types are represented and stored in memory.

I knew they were approximate, but I still didn't think there were such big differences, i.e. a kilogram of sugar is after all a kilogram (+/- 1%), not a hundred grams.

Anyway, thank you all for pointing me in the right direction!

The literal you used is of type `comptime_int`, which is being coerced (i.e. implicitly cast) to `comptime_float`. This conversion is lossless here, as `comptime_float` has the precision of `f128`, which can hold this value exactly. The `comptime_float` is then being coerced into an `f32`, which is lossy. Like @AndrewCodeDev mentioned, being in the supported range does not mean the value can be converted losslessly. After all, there is only so much information that can be stored inside 32 bits. Such a high number wouldn't even fit into a `u32`.
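A small sketch of where the rounding happens, assuming the same builtins as the original example. To keep it simple it uses the float literal `1.0e38` rather than the integer literal from the question, and `roundTripsThroughF64` is a made-up helper name. The point is that the value is already rounded when it is stored as `f32`; widening to `f64` afterwards just carries the rounded value along.

```zig
const std = @import("std");

/// Hypothetical helper: widens an f32 to f64 and narrows it back.
/// Every f32 value is also an f64 value, so this should always be exact.
fn roundTripsThroughF64(x: f32) bool {
    const widened: f64 = x; // implicit widening coercion, no rounding
    const narrowed: f32 = @floatCast(widened);
    return narrowed == x;
}

pub fn main() void {
    const from_f32: f32 = 1.0e38; // rounded to the nearest f32 right here
    const from_f64: f64 = 1.0e38; // rounded to the nearest f64 right here
    const widened: f64 = from_f32;

    // false: the f32 rounding is still visible after widening.
    std.debug.print("widened == from_f64: {}\n", .{widened == from_f64});
    // true: the cast itself did not change anything.
    std.debug.print("round trip exact:    {}\n", .{roundTripsThroughF64(from_f32)});
}
```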

Now I understand and it makes sense to me, sorry I have limited programming knowledge.

Thank you very much!

These numbers (binary floating point) are approximations of real numbers.
The problem is that the number `100000000000000000000000000000000000000` cannot be represented exactly; it becomes `99999996802856920000000000000000000000`.

The solution to this problem is to use decimal floating point numbers, which can accurately represent real numbers. Unfortunately, a lot of people find binary floating point numbers good enough and decimal floating point numbers hard to implement.

At the risk of being "that guy", real numbers aren't representable in the general case. `π` is the usual example.

Decimal floats do have advantages when dealing with naturally-decimal values, notably currencies, since even a number as simple as `0.1` is actually `0.1000000000000000055511151231257827021181583404541015625` in `f64`.
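A quick way to see the `0.1` problem in practice is to add ten dimes. The sketch below uses made-up helper names; the second function uses plain integer cents, which is a common workaround and not the decimal floating point discussed above (Zig's builtin float types are all binary).

```zig
const std = @import("std");

/// Hypothetical helper: adds 0.1 to an f64 total `count` times.
fn sumDimes(count: usize) f64 {
    var total: f64 = 0.0;
    var i: usize = 0;
    while (i < count) : (i += 1) total += 0.1;
    return total;
}

/// Hypothetical helper: the same sum, kept as an integer number of cents.
fn sumDimesInCents(count: usize) u64 {
    var total: u64 = 0;
    var i: usize = 0;
    while (i < count) : (i += 1) total += 10;
    return total;
}

pub fn main() void {
    // false: the rounding error of each 0.1 accumulates.
    std.debug.print("ten dimes == 1.0: {}\n", .{sumDimes(10) == 1.0});
    // true: integers have no such rounding.
    std.debug.print("ten dimes == 100 cents: {}\n", .{sumDimesInCents(10) == 100});
}
```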

For the sake of being that guy, I'll also add that floating point cannot represent the reals (such as transcendental numbers like `pi` or `e`, or any irrationals). We work entirely in the field of Q, not R, within some absolute epsilon. The only way to approach that would be with a mantissa and exponent using an infinite bit width, and that's only for numbers that are computable (people like Cantor had things to say about this: Cantor's diagonal argument - Wikipedia) or representable as an infinite sum, because floating point presents as a finite significand times a finite base raised to a finite exponent… that said… let's let sleeping dogs lie.

I mean that decimal floating numbers are exact representations of real numbers, not that all real numbers can be represented by decimal floating point numbers.

My problems came from trying to find the number of digits for the integer part of a floating point number. I've used functions like `@abs()` and `@floor()` and in theory I don't care if 1.555 is represented as 1.55 or 1.56, but if 1000 is 90 then it's a problem.

So at this point, without further investigation, the obvious solution seems to be to use `f128` even though it is not optimal for all numbers.

I will look for more information on how floating point numbers are used to correctly represent the integer part at least and how they can be converted to integers.
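One possible approach to the digit-counting problem, as a sketch only: take the magnitude, drop the fraction, convert to an integer, and count digits with integer division. The function name `integerDigits` is made up, and limiting the range to what fits in a `u64` is a simplification chosen here. It also assumes a Zig version where `@abs` accepts floats.

```zig
const std = @import("std");

/// Hypothetical helper: number of decimal digits in the integer part of `x`.
/// Returns null for NaN, infinities, and magnitudes that do not fit in a u64.
fn integerDigits(x: f64) ?u32 {
    const magnitude = @floor(@abs(x));
    // 2^64 is exactly representable; the negated comparison also rejects NaN.
    if (!(magnitude < 18446744073709551616.0)) return null;

    var n: u64 = @intFromFloat(magnitude);
    var digits: u32 = 1; // 0 counts as one digit
    while (n >= 10) : (n /= 10) digits += 1;
    return digits;
}

pub fn main() void {
    if (integerDigits(1000.0)) |d| std.debug.print("1000.0 -> {d}\n", .{d}); // 4
    if (integerDigits(-999.9)) |d| std.debug.print("-999.9 -> {d}\n", .{d}); // 3
    if (integerDigits(0.5)) |d| std.debug.print("0.5 -> {d}\n", .{d}); // 1
}
```

The count describes the value the float actually holds, so the input still has to be exactly representable in the chosen float type for the result to match what was typed.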

@7R35C0, I'll send you a DM to follow up on this. I may have some resources to help you.

Both types of floating point are actually exact representations of rational numbers. Which are a subset of the reals, yes, but so are integers.

Every floating point value, decimal or binary, is a rational. It can be expressed as an exact fraction of some other value, whether that fraction is big and messy, or not. So, just like we wouldn't say that `i64` expresses rational numbers (even though integers are rationals), we wouldn't say that `f64` expresses real numbers (even though rationals are reals).

The difference between the two is that, since decimal floats use base 10, more of the rationals which we happen to write using decimal notation can be represented exactly. That's a useful property sometimes.
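To illustrate the "every float is an exact fraction" point, here is a sketch that pulls an `f64` apart into an integer numerator and a power of two, using the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52 fraction bits). The `Dyadic` type and `toDyadic` function are made up for this example, and only positive normal values are handled.

```zig
const std = @import("std");

/// Hypothetical type: an exact value of the form numerator * 2^exp2.
const Dyadic = struct { numerator: u64, exp2: i32 };

/// Hypothetical helper: splits a positive, normal f64 into its exact fraction.
/// Zero, subnormals, infinities, NaN, and negative values are not handled.
fn toDyadic(x: f64) Dyadic {
    const bits: u64 = @bitCast(x);
    const fraction_bits: u64 = bits & ((@as(u64, 1) << 52) - 1);
    const biased_exp: i32 = @intCast((bits >> 52) & 0x7ff);
    return .{
        // Restore the implicit leading 1 of a normal number.
        .numerator = (@as(u64, 1) << 52) | fraction_bits,
        // Remove the exponent bias (1023) and account for the 52 fraction bits.
        .exp2 = biased_exp - 1075,
    };
}

pub fn main() void {
    const d = toDyadic(0.1);
    // 0.1 is stored as 7205759403792794 * 2^-56, a rational close to 1/10.
    std.debug.print("0.1 is stored as {d} * 2^{d}\n", .{ d.numerator, d.exp2 });
}
```

The denominator is always a power of two, which is why `1/10` has no exact binary representation while it would have one in a base-10 format.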