[PATCH v3 5/7] math: Remove powerpc e_hypot
Adhemerval Zanella
adhemerval.zanella@linaro.org
Thu Nov 11 20:54:43 GMT 2021
On 11/11/2021 16:48, Wilco Dijkstra wrote:
> Hi Adhemerval,
>
>>>> Another option is to use the powerpc implementation, which favors FP over
>>>> integer, as the default one.
>>>
>>> That is the fastest implementation. It is less accurate though (~1.04ULP with FMA
>>> and ~1.21ULP without FMA), so I'm not sure that would be acceptable.
>>
>> This should not be worse than the current default (the powerpc one is essentially
>> the same as default using FP operations).
>
> The generic version carefully computes x * x + y * y with higher accuracy so that
> the sqrt stays below 1.0ULP. The powerpc version doesn't and so goes over 1.0ULP.
For *hypotf* they are essentially the same; the powerpc one just tries to
optimize the isinf/isnan checks because of the FP->GPR hazards.  I think
there is currently no justification for TEST_INF_NAN, so it would be better
to apply your suggestion to the default algorithm and just remove the
powerpc one:
  if (!isfinite (x) || !isfinite (y))
    {
      a = x; b = y;
      if ((isinf (x) || isinf (y))
          && !issignaling_inline (x) && !issignaling_inline (y))
        return INFINITY;
      return x + y;
    }
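Just as a sanity check on my side (a standalone snippet, not part of the
patch), these are the special cases that branch needs to preserve per C99
Annex F: hypot (±inf, y) is +inf even when y is a NaN, otherwise a NaN
just propagates:

#include <math.h>
#include <stdio.h>

int
main (void)
{
  printf ("%g\n", hypot (INFINITY, NAN));  /* inf */
  printf ("%g\n", hypot (NAN, 2.0));       /* nan */
  printf ("%g\n", hypot (INFINITY, 2.0));  /* inf */
  return 0;
}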
>
>>> I did some quick optimizations on the new algorithm, on Neoverse N1 my fastest
>>> version is less than 10% slower than the powerpc version, and has ~0.94 ULP error.
>>
>> Do you mean besides the optimized nan/inf checks? I can check if it helps on
>> powerpc.
>
> Yes. I avoid the unnecessary checks at the end by doing everything in the 3 main
> cases. The division can be made independent of the sqrt so they run in parallel on
> modern cores.
>
> However we can do even better with FMA and remove the division entirely by
> special casing the difficult case where x and y are really close. This has only 3.5%
> higher latency than the powerpc version, so that's the fastest option below 1.0ULP.
> I'll see whether it could work without FMA too and send you something to benchmark
> if it passes the testsuite.
The original paper does have a version that uses fma, but it aims to be correctly
rounded:
  double h2 = h * h;
  double ax2 = ax * ax;
  h -= (__builtin_fma (-ay, ay, h2 - ax2)
        + __builtin_fma (h, h, -h2)
        - __builtin_fma (ax, ax, -ax2)) / (2.0 * h);
  return h * scale;
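For reference, my reading of that correction (my own interpretation, not
text from the paper): the last two fma terms recover the exact rounding
errors of h * h and ax * ax, so the whole parenthesized expression is
effectively the residual of the squared result:

  fma (-ay, ay, h2 - ax2) ~= h2 - ax2 - ay*ay
  fma (h, h, -h2)          = h*h - h2      (exact when no underflow occurs)
  fma (ax, ax, -ax2)       = ax*ax - ax2   (exact when no underflow occurs)
  sum                     ~= h*h - (ax*ax + ay*ay)

and h -= sum / (2.0 * h) is then one Newton step for sqrt, since
sqrt (s) ~= h - (h*h - s) / (2*h) for s = ax*ax + ay*ay.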
However, at least on recent x86_64 I did not see much improvement over the
non-fma version.  Maybe we can come up with a version that is not correctly
rounded but still leverages fma when __FP_FAST_FMA is defined.
(Also, this version does not fully pass the testsuite; it triggers some
underflow exceptions that I did not investigate.)
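Something like the following is what I have in mind for the __FP_FAST_FMA
case, purely as a sketch (untested; ax >= ay and scale follow the generic
implementation's variable names, and I have not measured the resulting ULP
error):

#ifdef __FP_FAST_FMA
  /* With ax >= ay already scaled, the fma adds the exact ax*ax to the
     rounded ay*ay, so the dominant term loses no bits before the final
     rounding.  There is no correction step, so this is not expected to
     be correctly rounded.  */
  return __ieee754_sqrt (__builtin_fma (ax, ax, ay * ay)) * scale;
#endif

The idea is just to drop one rounding from the sum of squares without
paying for the full correction step above.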