This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
- From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- To: "Sekhar, Ashwin" <Ashwin dot Sekhar at cavium dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: nd at arm dot com
- Date: Tue, 13 Jun 2017 14:23:51 +0100
- References: <20170613071707.43396-1-ashwin.sekhar@caviumnetworks.com> <593FC77A.6050609@arm.com> <1497358590.4998.34.camel@caviumnetworks.com>
On 13/06/17 13:56, Sekhar, Ashwin wrote:
>>> SINF
>>> ---------------------------------------------------------
>>> Input        ThunderX88   ThunderX99   CortexA57
>>> ---------------------------------------------------------
>>> 0.0          1.88x        1.18x        1.17x
>>> 2.0^-28      1.33x        1.12x        1.03x
>>> 2.0^-6       1.48x        1.28x        1.27x
>>> 0.6*Pi/4     0.94x        1.14x        1.21x
>>> 13*Pi/8      1.41x        2.00x        2.16x
>>> 17*Pi/8      1.45x        1.93x        2.23x
>> based on these numbers my current c implementation is faster,
>> but it will take time to polish that for submission.
>
> Are these going to be aarch64-specific C implementations or changes in
> generic code?
>
> Also, could you please let me know when you plan to submit your patches?
>
> I also don't want to duplicate effort. But if you don't plan to
> submit your changes in the near future, I guess I will go ahead,
> address the other comments and work on submitting a v2 patch.
>
the plan is to submit in the next release cycle (i plan to post
powf first, then work on sinf/cosf, possibly sin/cos too, then
look at vector versions once the vector abi is in gcc).
the c implementation is generic
(sometimes the instruction scheduling is suboptimal and
i found that union-based bithacks don't always generate good
code, but those are issues we can fix on the gcc side).
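For context on the "union-based bithacks" mentioned above: libm code commonly reinterprets a float's bits as an integer to classify its magnitude cheaply. A minimal sketch of two equivalent ways to do this in C follows. The helper names `asuint_union`, `asuint_memcpy` and `abstop12` are illustrative assumptions, not a glibc API, and which variant yields better machine code depends on the compiler version and target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret float bits via a union.
   Union type punning is permitted in C (not in ISO C++), but some
   compiler versions may route the value through memory. */
static inline uint32_t asuint_union(float f)
{
    union { float f; uint32_t i; } u = { f };
    return u.i;
}

/* Hypothetical helper: the same reinterpretation via memcpy, which
   compilers usually lower to a single register move. */
static inline uint32_t asuint_memcpy(float f)
{
    uint32_t i;
    memcpy(&i, &f, sizeof i);
    return i;
}

/* Typical bithack for range checks: the top 12 bits with the sign
   bit cleared, i.e. the 8 exponent bits plus the 3 leading mantissa
   bits, so |x| can be compared against thresholds with integer ops. */
static inline uint32_t abstop12(float f)
{
    return (asuint_union(f) >> 20) & 0x7ff;
}
```

Both helpers return the same bits; the difference is only in the code a given compiler generates for them.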
one issue is fma vs non-fma code. i haven't solved that
yet, but it will probably work either way (since we compute
in double precision); if it makes a difference i will add
#ifdef'd code paths for the two cases (this might affect the
fast argument reduction).
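A sketch of what an #ifdef'd fma/non-fma code path could look like, assuming the choice is keyed on the C99 `FP_FAST_FMA` macro from `<math.h>` (defined by the implementation when `fma` is about as fast as a separate multiply and add). `MULADD` and `sinf_kernel` are hypothetical names, and the plain Taylor coefficients stand in for a real minimax polynomial. Because the arithmetic is in double, the two paths typically differ only in the last bits of the double result, which rarely changes the final float.

```c
#include <math.h>

/* Hypothetical macro: fused multiply-add when the target advertises
   a fast one, otherwise a separate multiply and add. */
#ifdef FP_FAST_FMA
# define MULADD(a, b, c) fma((a), (b), (c))
#else
# define MULADD(a, b, c) ((a) * (b) + (c))
#endif

/* Illustrative sinf kernel for |r| <= pi/4, evaluated in double.
   Taylor terms up to r^9 leave a truncation error below about 2e-9,
   well under float precision. */
static float sinf_kernel(double r)
{
    const double s1 = -1.0 / 6.0;
    const double s2 = 1.0 / 120.0;
    const double s3 = -1.0 / 5040.0;
    const double s4 = 1.0 / 362880.0;
    double r2 = r * r;
    /* Horner evaluation of p(r2) = s1 + s2*r2 + s3*r2^2 + s4*r2^3 */
    double p = MULADD(r2, s4, s3);
    p = MULADD(r2, p, s2);
    p = MULADD(r2, p, s1);
    /* sin(r) ~= r + r^3 * p(r^2), rounded once to float */
    return (float) MULADD(r * r2, p, r);
}
```

Note that GCC may contract the non-fma branch's `a * b + c` into an fma anyway, depending on `-ffp-contract`, so an implementation that relies on the two paths being distinct has to account for that.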
> Thanks
> Ashwin
>
>>
>>>
>>> 1000*Pi/4    19.68x       37.46x       27.99x
>>> 2.0^51       12.00x       13.58x       13.49x
>> this is a bug in the current generic code: it falls back
>> to slow argument reduction even though single-precision arg
>> reduction can be done in a few cycles over the entire range.
>>
>> i think the x86_64 sse code could still be simpler and faster
>> (not that it matters much as these are rare cases).
>>
>>>
>>> Inf          1.04x        1.05x        1.12x
>>> NaN          0.95x        0.87x        0.82x
>>> ---------------------------------------------------------