Sorry for my beginner problem, but can someone enlighten me, please.
In short, I want to convert one type of float to another and I used @floatCast.
Assuming the value is good:
conversion from a big type to a small type works
conversion from a small type to a big type works, but returns wrong results (imo, of course)
I tested with several arrays of numbers and different float bit widths; in the best cases the first few numbers are converted correctly, i.e. they keep their original value, while the rest end up with a different value.
The problem is not the cast. f32 doesn’t have enough precision to store such a high number. Converting integers to f32 is only guaranteed to be lossless if the number fits in 24 bits (u24 or i25). The loss of precision already happens on the line where the value is first stored as an f32.
It’s important to understand that “inside the range of” and “exactly representable by” are different things for floating point numbers. There are many numbers within the range of floats that cannot be represented exactly.
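For example (a minimal sketch, assuming a recent Zig where @floatCast infers its result type; the exact printed digits depend on the formatter):

```zig
const std = @import("std");

pub fn main() void {
    // 1e38 is inside the range of f32 (whose max is about 3.4e38),
    // but it is not exactly representable by f32.
    const as_f64: f64 = 1.0e38;

    const casted: f32 = @floatCast(as_f64); // rounds to the nearest f32
    const direct: f32 = 1.0e38; // same rounding, no cast involved

    std.debug.print("f64:    {d}\n", .{as_f64});
    std.debug.print("casted: {d}\n", .{casted});
    std.debug.print("direct: {d}\n", .{direct});
    // casted and direct print the same value, slightly below 1e38:
    // the cast is not the problem, f32 just cannot hold 1e38 exactly.
}
```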
As the GNU documentation puts it, floating point numbers are approximate: they are based on a sign bit, a mantissa, and an exponent. I just worked on a quantization utility that takes values from large float types down to 8 bits with a scale and zero-point representation, and this is exactly the type of behavior one can expect.
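For a sense of how aggressive that is, here is a rough sketch of affine (scale + zero-point) quantization; the function name and parameters are just illustrative, not from the actual utility, and the builtin spellings assume a recent Zig:

```zig
const std = @import("std");

/// Illustrative affine quantization: map an f32 onto a u8 grid defined by
/// a scale and a zero point. Most of the original precision is discarded.
fn quantize(x: f32, scale: f32, zero_point: u8) u8 {
    const zp: f32 = @floatFromInt(zero_point);
    const q = @round(x / scale) + zp;
    return @intFromFloat(std.math.clamp(q, 0.0, 255.0));
}

pub fn main() void {
    // 1.234567 and 1.23 land on the same 8-bit value at this scale.
    std.debug.print("{} {}\n", .{ quantize(1.234567, 0.01, 0), quantize(1.23, 0.01, 0) });
}
```

The precision that gets thrown away there never comes back.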
This is also why floating point arithmetic is not associative. If I do something like so:
giant_number - (giant_number + small_number)
It may appear that the small number has completely disappeared from the calculation. The small number may have pushed the float into a value that it cannot properly represent. However, this will make the small value appear again:
(giant_number - giant_number) + small_number
note: subtraction isn’t associative over addition, so this isn’t a strict associativity example; it’s just to demonstrate precision loss. The same loss does affect associativity in practice, though, e.g. in long chains of additions in unordered reductions.
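A minimal sketch of that effect (the magnitudes are chosen so the small value falls below one ulp of the giant one):

```zig
const std = @import("std");

pub fn main() void {
    const giant: f32 = 1.0e20;
    const small: f32 = 1.0;

    // small is absorbed when added to giant first, so the difference is 0.
    const lost = giant - (giant + small);
    // Cancelling the giant values first lets small survive.
    const kept = (giant - giant) + small;

    std.debug.print("lost = {d}, kept = {d}\n", .{ lost, kept });
}
```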
In the words of Gerald Jay Sussman: “Nothing brings fear to my heart more than a floating point number.”
Ok, in conclusion, I need to learn more about how floating point types are represented and stored in memory.
I knew they were approximate, but I still didn’t expect such big differences; a kilogram of sugar is, after all, a kilogram (+/- 1%), not a hundred grams.
Anyway, thank you all for pointing me in the right direction!
The literal you used is of type comptime_int, which is being coerced (i.e. implicitly cast) to comptime_float. This conversion is lossless, as comptime_float has infinite precision. The comptime_float is then being coerced into an f32, which is lossy. Like @AndrewCodeDev mentioned, being in the supported range does not mean the value can be converted losslessly. After all, there is only so much information that can be stored inside 32 bits. Such a high number wouldn’t even fit into a u32.
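Spelled out with explicit intermediate constants (a sketch; the names are just for illustration):

```zig
const std = @import("std");

// comptime_int literal: exact, arbitrary precision.
const big_int = 100000000000000000000000000000000000000;
// Coerced to comptime_float: still exact, still arbitrary precision.
const big_float: comptime_float = big_int;
// Coerced to f32: only 32 bits are available, so the nearest f32 is stored.
const lossy: f32 = big_float;

pub fn main() void {
    std.debug.print("{d}\n", .{lossy});
}
```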
These numbers (binary floating point) are approximations of real numbers.
The problem is that the number 100000000000000000000000000000000000000 cannot be represented exactly; it becomes 99999996802856920000000000000000000000.
The solution to this problem is to use decimal floating point numbers; these can accurately represent real numbers. Unfortunately, a lot of people find binary floating point numbers good enough and decimal floating point numbers hard to implement.
At the risk of being “that guy”, real numbers aren’t representable in the general case. π is the usual example.
Decimal floats do have advantages when dealing with naturally-decimal values, notably currencies, since even a number as simple as 0.1 is actually 0.1000000000000000055511151231257827021181583404541015625 in f64.
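The classic way to see that bite (a quick f64 sketch):

```zig
const std = @import("std");

pub fn main() void {
    const a: f64 = 0.1;
    const b: f64 = 0.2;

    // Neither 0.1 nor 0.2 is exact in binary, so their sum is not exactly 0.3.
    std.debug.print("0.1 + 0.2 == 0.3 ? {}\n", .{a + b == 0.3});
    std.debug.print("0.1 + 0.2 == {d}\n", .{a + b});
}
```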
For the sake of being that guy, I’ll also add that floating point cannot represent the reals (transcendental numbers like pi or e, or any irrationals). We work entirely in Q, not R, within some absolute epsilon. The only way to approach the reals would be a mantissa and exponent of infinite bit width, and even that only covers numbers that are computable (people like Cantor had things to say about this: Cantor's diagonal argument - Wikipedia) or representable as an infinite sum, because floating point presents as a finite significand times a finite base raised to a finite exponent… that said, let’s let sleeping dogs lie.
I mean that decimal floating point numbers are exact representations of real numbers, not that all real numbers can be represented by decimal floating point numbers.
My problems came from trying to find the number of digits in the integer part of a floating point number. I’ve used functions like @abs() and @floor(), and in theory I don’t care whether 1.555 is represented as 1.55 or 1.56, but if 1000 becomes 90 then it’s a problem.
So at this point, without further investigation, the obvious solution seems to be to use f128 even though it is not optimal for all numbers.
I will look for more information on how floating point numbers can be used to represent at least the integer part correctly, and how they can be converted to integers.
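For what it’s worth, a minimal digit-counting sketch along those lines (f64, using the @abs/@floor builtins mentioned above; the function name is just illustrative):

```zig
const std = @import("std");

/// Count the decimal digits of the integer part of a float.
fn integerDigits(x: f64) usize {
    var n = @floor(@abs(x));
    var count: usize = 1;
    while (n >= 10) : (count += 1) {
        n = @floor(n / 10);
    }
    return count;
}

pub fn main() void {
    std.debug.print("{}\n", .{integerDigits(1234.56)}); // 4
    std.debug.print("{}\n", .{integerDigits(-0.5)}); // 1
}
```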
Both types of floating point are actually exact representations of rational numbers. Which are a subset of the reals, yes, but so are integers.
Every floating point value, decimal or binary, is a rational. It can be expressed as an exact fraction of some other value, whether that fraction is big and messy, or not. So, just like we wouldn’t say that i64 expresses rational numbers (even though integers are rationals), we wouldn’t say that f64 expresses real numbers (even though rationals are reals).
The difference between the two is that, since decimal floats use base 10, more of the rationals which we happen to write using decimal notation can be represented exactly. That’s a useful property sometimes.
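One way to poke at that from code (a sketch; the 2^56 scaling works here because the numerator f64 actually stores for 0.1 fits in 53 bits):

```zig
const std = @import("std");

pub fn main() void {
    // What f64 actually stores for 0.1 is an exact rational:
    // an integer numerator over a power-of-two denominator.
    const x: f64 = 0.1;

    const two_pow_56: f64 = 72057594037927936.0; // 2^56; multiplying by it is exact here
    const numerator: u64 = @intFromFloat(x * two_pow_56);

    std.debug.print("0.1 is stored as exactly {d} / 2^56\n", .{numerator});
}
```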