interesting; it takes 2 independent FP adds and a compare (in C) to detect whether
nearest rounding is in effect (and their latency can overlap with the float->double
conversion), so if there's an option to reduce the algorithm by more than that for a
fast path...
(also, some CPUs (like newer Intel) support an instruction prefix encoding to force
a rounding mode on an FP instruction independently of the global rounding mode,
which at some point should maybe become a gcc pragma or attribute or something,
and could then be used in such C code)
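
a minimal sketch of the kind of check meant above, not necessarily the exact code:
the function name and the 0x1p-80 constant are made up for illustration, and the
volatile objects are only there so the compiler does not fold the adds at compile
time under the default rounding assumption. in nearest mode both sums round back
to 1.0; in any directed mode exactly one of them moves away from 1.0.

```c
/* Sketch: detect round-to-nearest with two independent adds and one compare.
 * Names and constants are illustrative assumptions. */

/* volatile forces the adds to happen at run time, under the current mode */
static volatile double one = 1.0;
static volatile double tiny = 0x1p-80; /* far below half an ulp of 1.0 */

int rounding_is_nearest(void)
{
    double up = one + tiny;   /* 1.0 unless rounding upward */
    double down = one - tiny; /* 1.0 unless rounding downward or toward zero */
    /* the two adds do not depend on each other, so they can issue in parallel */
    return up == down;
}
```

a caller could branch on rounding_is_nearest() once at the top of the function and
only take the shorter path when it returns nonzero.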
i don't think reducing the polynomial (from order 3 to order 2)
is possible without a bigger lookup table; if less accuracy is
enough, then reducing the table size is possible though:
poly order / table len / ulp error / non-nearest ulp error (rounded)
2          / 64        / 0.61      /
2          / 128       / 0.51      /
2          / 256       / 0.502     /
3          / 8         / 0.91      / > 10
3          / 16        / 0.526     / 2
3          / 32        / 0.502     / 1
3          / 64        / 0.5001    / 1
4          / 8         / 0.54      /
4          / 16        / 0.501     /
4          / 32        / 0.50004   /
4          / 64        / 0.5       /
the C code uses order=3/table=32, the x86_64 asm uses order=4/table=64
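
to make the order/table trade-off concrete, here is a rough sketch of the
table + polynomial structure for a 2^x style function with order=3/table=32.
everything in it is an assumption for illustration: the names are made up, the
coefficients are plain Taylor terms rather than the fitted ones a real
implementation would use (so the ulp numbers above do not apply to it), the
table is filled at run time with a series instead of being precomputed, and
there is no handling of overflow, underflow, NaN or non-nearest rounding.

```c
/* Sketch of table + polynomial evaluation: 2^x = 2^q * 2^(i/N) * 2^r.
 * Illustrative only; assumes the result is a normal float. */
#include <stdint.h>
#include <string.h>

#define N 32                     /* table len */
#define LN2 0.6931471805599453

static double tab[N];            /* tab[i] ~= 2^(i/N) */
static int tab_ready;

/* plain Taylor series for e^x, used only to fill the table */
static double exp_series(double x)
{
    double sum = 1.0, term = 1.0;
    for (int n = 1; n <= 24; n++) {
        term *= x / n;
        sum += term;
    }
    return sum;
}

float exp2f_sketch(float x)
{
    if (!tab_ready) {
        for (int i = 0; i < N; i++)
            tab[i] = exp_series(i * LN2 / N);
        tab_ready = 1;
    }

    double xd = x;
    /* k = x*N rounded to an integer, so |r| <= 1/(2N) */
    long k = (long)(xd * N + (xd >= 0 ? 0.5 : -0.5));
    double r = xd - (double)k / N;
    long i = k & (N - 1);        /* table index, 0..N-1 (two's complement) */
    long q = (k - i) / N;        /* power-of-two exponent, exact division */

    /* order 3 polynomial for 2^r (Taylor coefficients, not minimax) */
    const double c1 = LN2, c2 = LN2 * LN2 / 2, c3 = LN2 * LN2 * LN2 / 6;
    double p = 1.0 + r * (c1 + r * (c2 + r * c3));

    /* build 2^q directly from the exponent bits of a double */
    uint64_t bits = (uint64_t)(q + 1023) << 52;
    double scale;
    memcpy(&scale, &bits, sizeof scale);

    return (float)(scale * tab[i] * p);
}
```

a bigger table shrinks |r|, which is why a lower order polynomial can then reach
the same error; dropping c3 here with N=32 would leave a visibly larger
truncation term than with N=256.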