double precision: subtleties leading to incorrectness
Pavel Gurkov
Posted on July 4, 2020
I was solving some entry-level problems on codewars, one of the many resources dedicated to helping programmers practice. There was one very easy problem that I solved in 5 minutes... As it turned out, the solution was incorrect, yet all the tests passed. :) It got me thinking, and I couldn't move on for the next couple of hours.
It's stated in the following way:

Given a long long integer N, write a function that:
- if N is a perfect square (meaning there's an integer n which satisfies n^2 = N), returns (n + 1)^2;
- if N is not a perfect square, returns -1.
Sounds really simple, right? Just as simple as FizzBuzz. But it's actually a bit more complicated than this.
The actual correct solution is not the aim of this article; it's relatively easy to come up with one once you understand the shortcomings of the quick-and-dirty version.
So how would one solve that?
The vast majority, I believe (at least based on what I saw on codewars itself), will come up with something like this (let's think c++):
#include <cmath>

long long f(long long N) {
    double n = std::sqrt(N);
    return (long long)n == n ? (n + 1) * (n + 1) : -1;
}
What can possibly go wrong here?
First, let's make a couple of sensible assumptions: double and long long are both 64 bits wide, and double is implemented adhering to the IEEE 754 standard.
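Both assumptions are easy to verify at compile time; std::numeric_limits reports IEEE 754 (a.k.a. IEC 559) conformance:

#include <limits>

static_assert(sizeof(double) == 8, "double is 64 bits wide");
static_assert(sizeof(long long) == 8, "long long is 64 bits wide");
static_assert(std::numeric_limits<double>::is_iec559, "double follows IEEE 754");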
When, in the first line of our function, we invoke std::sqrt(N), N gets implicitly converted to double. Due to how double is implemented, this may lead to precision loss.
The number encoded by the standard may be written as (-1)^s * 2^e * (1 + m), where s equals the sign bit value, e is encoded by the exponent bits, and m is encoded by the significand bits. We're omitting some special cases here, such as denormalized numbers and special values like Inf and NaN, as they're irrelevant to the problem. m is encoded as the fractional part of the number with an implied 1 before the binary point; e.g. significand bits 10000...000 encode decimal 1.5.
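To make the encoding concrete, here's a small sketch that pulls those three fields out of a double by reinterpreting its bits (the layout is as described above: 1 sign bit, 11 exponent bits, 52 significand bits):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    double d = 1.5;
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);          // reinterpret the 64 bits of the double

    std::uint64_t s = bits >> 63;                 // 1 sign bit
    std::uint64_t e = (bits >> 52) & 0x7FF;       // 11 exponent bits, biased by 1023
    std::uint64_t m = bits & ((1ULL << 52) - 1);  // 52 significand bits

    // For 1.5: s = 0, e = 1023 (meaning 2^0), m = 100...0 in binary
    std::printf("s=%llu e=%llu m=%013llx\n",
                (unsigned long long)s, (unsigned long long)e, (unsigned long long)m);
}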
Now the question is: what is the largest integer that may be represented by a double without loss of precision, such that all smaller integers may also be represented exactly?
Let's look at 2^53 (52 bits + 1 leading 1). It can certainly be stored: all zeroes in the significand bits and e representing 53.
What about 2^53 + 1? We still have e giving us 2^53, but now we have to add a 1 to the right of the significand bits. The thing is, 52 bits are not sufficient for that; we'd need 53. An intuitive way to think of it is to treat the exponent as a right shift of the binary point: significand bits 00000...000 with e = 53 encode 2^53, but to encode 2^53 + 1, there isn't enough width available. There's no problem with 2^53 + 2, however: the rightmost available bit is sufficient for that.
Similarly, all the numbers below 2^53 have sufficient bit width to be encoded without any loss of precision.
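A minimal demonstration of that edge:

#include <cstdio>

int main() {
    double a = 9007199254740992.0;       // 2^53: all-zero significand, e = 53
    std::printf("%d\n", a == a + 1.0);   // 1: 2^53 + 1 is not representable and rounds back to 2^53
    std::printf("%d\n", a == a + 2.0);   // 0: 2^53 + 2 is representable again
}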
Guess now you get the idea: if the N supplied to our function is greater than 2^53, we're not guaranteed at all to correctly determine whether it's a perfect square, due to possible precision loss.
But it gets even funnier!
If we pass a "magical number" (2^26 + 1) * (2^26 + 1) - 1 to our function, it'll happily follow the "perfect square" path: the fractional part of sqrt(N) will be zero. Why? This number is well below 2^53, so there's no precision loss due to the conversion.
The trick is that the delta of 1 against the perfect square reduces the square root by only about 2^-27 (the derivative of f(x) = x^(1/2) is f'(x) = 1/(2 * x^(1/2)), and here x^(1/2) is just above 2^26). That error happens to be about 2^53 times smaller than the actual result, so we're facing the limit of double precision here again, right on the edge: the nearest representable double to the true root is the integer 2^26 + 1 itself.
As you could guess, it's not the only "magical number" leading to incorrect results. Try replacing 2^26 + 1 with 2^26 + 2 or 2^26 + 3 in the same construction.
So, our little program there is incorrect. As stated before, providing the correct solution is not the point here, but here's a thought: we could stick strictly to integer arithmetic to make it correct. Making it efficient, however, is a whole other problem.
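Still, for illustration, here's one possible sketch of such an integer-only check (f_exact is a name made up for this post, and overflow of n * n near the very top of the long long range is ignored for brevity; the std::sqrt call is only an initial guess, and the verification itself never leaves integers):

#include <cmath>

long long f_exact(long long N) {
    if (N < 0) return -1;
    long long n = (long long)std::sqrt((double)N);  // guess, may be off by a little
    while (n > 0 && n * n > N) --n;                 // correct a guess that's too high
    while ((n + 1) * (n + 1) <= N) ++n;             // correct a guess that's too low
    return n * n == N ? (n + 1) * (n + 1) : -1;
}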
There's also the long double type, which may or may not be implemented as an 80-bit (or wider) floating-point number. Its 63 fraction bits are sufficient for round-tripping 64-bit integer conversions; furthermore, they're sufficient to expose the errors that make not-so-perfect squares look perfect. But that's only if you're lucky and long double isn't just a synonym for double.
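A quick demonstration, assuming a platform where long double is the 80-bit x87 format (e.g., gcc on x86-64 Linux; under MSVC, long double is the same as double and this shows nothing new):

#include <cmath>
#include <cstdio>

int main() {
    long long N = 4503599761588224LL;        // (2^26 + 1) * (2^26 + 1) - 1 again
    long double r = std::sqrt((long double)N);

    std::printf("%.25Lg\n", r);              // ~67108864.99999999254941...
    std::printf("%d\n", r == std::floor(r)); // 0: correctly not a perfect square
}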
The main point here is that if a program is incorrect even for only a handful of corner cases, it can't be considered correct as a whole. And sometimes, to identify those corner cases, you have to think about the implementation of low-level details all the way down to the hardware.
Bonus (if you think it's somehow restricted to c/c++):
Python:
Python 2.7.10 (default, Oct 6 2017, 22:29:07)
>>> import math
>>> (2**26 + 1) * (2**26 + 1) - 1
4503599761588224
>>> math.sqrt(4503599761588224)
67108865.0
Golang:
package main

import (
    "fmt"
    "math"
)

func main() {
    fmt.Printf("%f", math.Sqrt(4503599761588224))
}
Output:
67108865.000000
JavaScript:
console.log(Math.sqrt(4503599761588224))
67108865
etc.