YouTube gave me the video “Finding the BEST sine function for Nintendo 64” yesterday, and while I thought the title was very intriguing, I came away a bit disappointed; the video is hardly about the specifics of Nintendo 64 at all, and the author mostly spends time comparing straight-up LUTs with some hand-made splines under a proxy metric (no real comparative benchmarks, hardly any numerics).
I have no idea what the optimal solution is, but it got me thinking: How good can we get with the same basic formulas, just better constants? The video is pretty sparse with actual source code and data, but I think I've extracted enough info to go work on his “5th order sine” and “4th order cosine” methods (both straight-up polynomials).
Let's start with the 5th order sine; it's basically an odd polynomial (he does simple range reduction to [-pi/2,pi/2)). The implementation goes roughly like this, taking in a value between 0 and 65535 inclusive:
float x = i * (float)(0.0003 / M_PI); // Prescaler reverse-engineered from the video
float sin = x * (1.0f + x * x * (b + x * x * a));
He uses 1.0f specifically because it's ready in a register (so you don't need to load the constant), but it's not hard to see that this is exactly the same as dropping the prescaler and evaluating ax⁵ + bx³ + cx. So let's see what Sollya thinks:
> fpminimax(sin(x*2*Pi/65536.0),[|x,x^3,x^5|],[|SG...|],[0;16384]);
Warning: For at least 2 of the constants displayed in decimal, rounding has happened.
x * (9.58634263952262699604034423828125e-5 + x^2 * (-1.46252571143166976153082714517950080335140228271484e-13 + x^2 * 6.1585575698811928868593794069233315902067715796875e-23))
We can bruteforce-evaluate this over all 16384 relevant inputs (corresponding to 0 through pi/2), giving maximum absolute error 0.000108. Given that he claims maximum absolute error of 0.000395 for his own version (nearly 4x as high), and we saved the initial multiplication (though not the load), this is a good start!
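For the record, that brute-force check is nothing fancy; here is a sketch of mine (not his code), with the coefficients pasted straight from the Sollya output above and the reference value computed in double precision:
#include <math.h>
#include <stdio.h>

int main(void) {
    // Coefficients from the Sollya run; x is the raw index, no prescaler.
    const float a = 6.1585575698811928868593794069233315902067715796875e-23f;
    const float b = -1.46252571143166976153082714517950080335140228271484e-13f;
    const float c = 9.58634263952262699604034423828125e-5f;

    double max_err = 0.0;
    int worst_i = 0;
    for (int i = 0; i < 16384; ++i) {
        float x = (float)i;
        float approx = x * (c + x * x * (b + x * x * a));
        double ref = sin(i * 2.0 * M_PI / 65536.0);
        double err = fabs(approx - ref);
        if (err > max_err) { max_err = err; worst_i = i; }
    }
    // Prints roughly 0.000108, plus which input is worst (that will come in handy later).
    printf("max abs error %.9f at i=%d\n", max_err, worst_i);
    return 0;
}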
Nevertheless, we're not looking for better. We're looking for BESTEST. And we want to prove it. So we must unleash… the SAT solver. Usually I use OR-Tools, but for a project this bizarre, we need the big guns. We need Z3.
Z3 has full support for IEEE floats of any size, so we can start off by declaring that a,b,c are floats (the N64 supports double just fine, but his assembly strongly suggests he's working in single precision):
(declare-const a Float32)
(declare-const b Float32)
(declare-const c Float32)
We also need a maximum tolerable error. Since we're trying to beat Sollya, let's start off at 0.0001:
(declare-const eps Float32)
(assert (= eps (fp #b0 #x71 #b10100011011011100010111))) ; 0.0001
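If you're wondering where these bit patterns come from: they're just the IEEE 754 single-precision encoding, split into sign, exponent and mantissa fields. A small helper along these lines (my own sketch, not part of his scripts or of Z3) prints any float in the form Z3 expects:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Prints a float as an SMT-LIB Float32 literal, e.g.
// 0.0001f -> (fp #b0 #x71 #b10100011011011100010111)
static void print_f32(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // type-pun without undefined behavior
    printf("(fp #b%u #x%02x #b", (unsigned)(bits >> 31), (unsigned)((bits >> 23) & 0xff));
    for (int i = 22; i >= 0; --i)
        putchar(((bits >> i) & 1) ? '1' : '0');
    putchar(')');
}

int main(void) {
    print_f32(0.0001f);
    putchar('\n');
    return 0;
}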
Now we want to ask the solver to keep the error in check. We need to do that point by point, so let's start with i=16383 (testing i=0 is meaningless, since the result will be zero no matter what the constants are):
(declare-const x16383 Float32)
(declare-const x16383sq Float32)
(declare-const y16383 Float32)
(declare-const yy16383 Float32)
(assert (= x16383 (fp #b0 #x8c #b11111111111110000000000)))
(assert (= x16383sq (fp #b0 #x9a #b11111111111100000000000)))
(assert (= y16383 (fp #b0 #x7f #b00000000000000000000000))) ; 1.0
(assert (= yy16383
(fp.mul roundNearestTiesToEven x16383
(fp.add roundNearestTiesToEven c
(fp.mul roundNearestTiesToEven x16383sq
(fp.add roundNearestTiesToEven b
(fp.mul roundNearestTiesToEven a x16383sq)))))))
(assert (fp.lt (fp.abs (fp.sub roundNearestTiesToEven y16383 yy16383)) eps))
So we're defining a constant for 16383.0f, and then one for the same value squared (precomputing this creates fewer SAT clauses, and we're really creating enough already!), and then the expected output value (y16383); the true sine at i=16383 is about 0.999999995, which rounds to exactly 1.0f in single precision. We then define that yy16383 is the output of the formula x * (c + xsq * (b + xsq * a)), done with the standard IEEE 754 round-to-nearest (ties to even) mode. Finally, we say that the absolute difference between yy16383 and the expected answer (where the rounding mode doesn't matter much) must be less than our tolerance.
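Blocks like this get tedious to write by hand, so in practice they come from a small generator. Here's a sketch of what that could look like (my own reconstruction, not his actual wrapper; it assumes a, b, c and eps are declared once at the top, as above, and reuses the same bit-printing helper):
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Same helper as before: print a float as an SMT-LIB Float32 literal.
static void print_f32(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    printf("(fp #b%u #x%02x #b", (unsigned)(bits >> 31), (unsigned)((bits >> 23) & 0xff));
    for (int i = 22; i >= 0; --i)
        putchar(((bits >> i) & 1) ? '1' : '0');
    putchar(')');
}

// Emit the declare/assert block for one test point i, mirroring the
// hand-written i=16383 block; the reference value is sin() in double,
// rounded to single precision.
static void emit_point(int i) {
    float x = (float)i;
    float xsq = x * x;
    float y = (float)sin(i * 2.0 * M_PI / 65536.0);

    printf("(declare-const x%d Float32)\n", i);
    printf("(declare-const x%dsq Float32)\n", i);
    printf("(declare-const y%d Float32)\n", i);
    printf("(declare-const yy%d Float32)\n", i);
    printf("(assert (= x%d ", i); print_f32(x); printf("))\n");
    printf("(assert (= x%dsq ", i); print_f32(xsq); printf("))\n");
    printf("(assert (= y%d ", i); print_f32(y); printf("))\n");
    printf("(assert (= yy%d\n"
           " (fp.mul roundNearestTiesToEven x%d\n"
           "  (fp.add roundNearestTiesToEven c\n"
           "   (fp.mul roundNearestTiesToEven x%dsq\n"
           "    (fp.add roundNearestTiesToEven b\n"
           "     (fp.mul roundNearestTiesToEven a x%dsq)))))))\n",
           i, i, i, i);
    printf("(assert (fp.lt (fp.abs (fp.sub roundNearestTiesToEven y%d yy%d)) eps))\n", i, i);
}

int main(void) {
    emit_point(16383);  // ...plus however many other points we want to pin down
    return 0;
}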
Behind the scenes, Z3 will create a bazillion SAT clauses for us, presumably with adders and normalization and whatever. The beauty of this is that we should be able to match hardware exactly; no matter what may happen with rounding or overflow or denormals or whatever, our simulation will give us the right error. We could even model tricks like “we only care to load the upper 16 bits of the float, to save on a MIPS instruction” and see if we could still get a reasonable approximation. There are no limits (especially not if P=NP and SAT admits a solution in polynomial time!).
There's one more constraint we want (plus a constant for 1.0f to compare against):
(declare-const one Float32)
(assert (= one (fp #b0 #x7f #b00000000000000000000000))) ; 1.0
(assert (fp.leq yy16383 one))
We don't want any value to exceed 1.0f, since his code relies on using sqrt(1 - x²) to calculate the cosine from sine, and we can't take the square root of a negative number. (I think using the square root here is rather dubious performance-wise, but we'll roll with it; again, we only want to change constants.)
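In C terms, that companion step presumably looks something like this (my paraphrase, not his actual code):
#include <math.h>

// Given the sine approximation s for an angle in the first quadrant,
// recover the cosine. If s*s ever exceeds 1.0f, the argument to sqrtf
// goes negative and we get a NaN; hence the extra constraint above.
static float cos_from_sin(float s) {
    return sqrtf(1.0f - s * s);
}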
Now, we could repeat this for all the other 16382 points. However, that would create literally millions of SAT clauses and take forever to work out. So I just put in 16 roughly equidistant points and let it roll:
(check-sat)
(get-value (a b c))
And after four minutes or so of calculation, Z3 (well, my script wrapper around Z3) told me that it was indeed possible to get under 0.0001 maximum absolute error:
sat
((a (fp #b0 #x35 #b00100101101100010000110)) ; 0.000000000000000000000060733997
(b (fp #b1 #x54 #b01001000110000100101110)) ; -0.00000000000014599842
(c (fp #b0 #x71 #b10010010000001010001001))) ; 0.00009584899
Now, of course, this was only for those points, but there's a limit to how fast the error polynomial can ripple, so it turns out that the error over the full input range wasn't far away (it was 0.00010049343).
But better still isn't what we want! BESTEST. Let me take a little detour towards the other simple polynomial, namely the “4th order cosine” (which calculates sin from cos, instead of the other way round). We can implement this in exactly the same way:
(fp.add roundNearestTiesToEven one
(fp.mul roundNearestTiesToEven x16383sq
(fp.add roundNearestTiesToEven b
(fp.mul roundNearestTiesToEven a x16383sq))))
That is, 1 + x_sq * (b + a * x_sq). Here there's no prescaler in his version (and the coefficients are visible in the video), so we've kept the constraint that one of the constants must be 1.0f, and thus Z3 will have less work to do. This time, let's ask it not whether it can beat a fixed target, but to push the error as low as possible:
(assert (fp.isPositive eps))
(declare-const eps_e (_ BitVec 8))
(declare-const eps_m (_ BitVec 23))
(assert (= eps (fp #b0 eps_e eps_m)))
(minimize eps_e)
(minimize eps_m)
It seems roundabout that we need to declare separate variables for the exponent and mantissa and then minimize those, instead of just minimizing the float directly, but evidently, this is the standard trick. So we let Z3 churn for half an hour, and lo and behold:
sat
((a (fp #b0 #x44 #b11001110111101011111101)) ; 0.0000000000000000031371447
(b (fp #b1 #x63 #b00111001101011100110110)) ; -0.000000004564664
(eps (fp #b0 #x74 #b10000001110110000000001))) ; 0.0007359386
So now it's found the optimal epsilon all by itself! Of course, it still is over that set of points. But the truth is not far behind; the actual worst case has error 0.0007380843. So it's way less than a percent off.
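Checking the full range is the same kind of brute force as for the sine, just with the cosine formula; a sketch, using the two constants Z3 just handed us:
#include <math.h>
#include <stdio.h>

int main(void) {
    // Constants from the minimization run above.
    const float a = 0.0000000000000000031371447f;
    const float b = -0.000000004564664f;

    double max_err = 0.0;
    int worst_i = 0;
    for (int i = 0; i < 16384; ++i) {
        float xsq = (float)i * (float)i;
        float approx = 1.0f + xsq * (b + a * xsq);
        double ref = cos(i * 2.0 * M_PI / 65536.0);
        double err = fabs(approx - ref);
        if (err > max_err) { max_err = err; worst_i = i; }
    }
    // Should report roughly 0.000738, and tell us which point to constrain next.
    printf("max abs error %.10f at i=%d\n", max_err, worst_i);
    return 0;
}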
Unfortunately Z3 is much slower when we ask to minimize; seemingly, it switches to a more powerful but slower solver (SMT compared to SAT). The alternative would be to drive it ourselves by repeated queries with binary search or similar for epsilon, but OK, half an hour per run is entirely plausible. But we want the bestest. So what can we do?
Well, we find that the worst case for this polynomial is i=7231. So we just add that to our list of points to check. This is rather similar to how the Remez exchange algorithm (used by, among others, Sollya) works; it is bound to fix at least that single point, and hopefully the procedure will converge long before we add all 16383 of them. (That sounds very natural even without any deep proof; again, the error function can only really oscillate so fast, so there are bound to be tons of values that will never be close to the max error.) And indeed, after twelve iterations or so, we get:
((a (fp #b0 #x44 #b11001110111110101000110)) ; 0.0000000000000000031372656
(b (fp #b1 #x63 #b00111001101011101110110)) ; -0.0000000045646926
(eps (fp #b0 #x74 #b10000010010111011110000))) ; 0.00073693599
This has max error 0.00073693593 (remember, we asked for the error to be less than eps, and that eps is indeed the next possible float after the max error), and that error is in i=16383, which was already a point we asked it to optimize, and so this is THE BESTEST 4th-order cosine approximation under the given constraints. With his best cosine being 0.002787, it's a nice 3–4x improvement on error just by swapping out constants.
The sine is much slower to minimize (five hours or so per iteration),
so it hasn't converged yet; still waiting for the absolute truth there.
So for now, the one above is what we've got. At least we have a lower bound saying that it's less than 1% off the best possible one (asking for anything lower produces an unsatisfiability result). But the bestest 4th-order cosine approximation is indeed 1.0f + (x * x) * (-0.000000004564664f + (x * x) * 0.0000000000000000031372656f). On the Nintendo 64.