This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
- From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- To: libc-alpha at sourceware dot org, Ashwin Sekhar T K <ashwin dot sekhar at caviumnetworks dot com>
- Cc: nd <nd at arm dot com>
- Date: Tue, 13 Jun 2017 12:07:38 +0100
- Subject: Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
- References: <20170613071707.43396-1-ashwin.sekhar@caviumnetworks.com>
On 13/06/17 08:17, Ashwin Sekhar T K wrote:
> This patchset adds optimized ASIMD versions of sinf/cosf
> for AArch64. The algorithm and code flow are based on the SSE versions
> of the same in sysdeps/x86_64/fpu.
>
> The ASIMD versions are used only if the CPU supports the ASIMD
> feature; ifuncs and HWCAP are used to identify the ASIMD capability.
>
I thought this was a vector version because of "ASIMD", but it's
just scalar sinf/cosf.

There are many issues with this patch, but most importantly it
duplicates work, as I also happen to be working on single-precision
math functions (sorry).

I plan to work on vector math functions and double-precision math
functions too; before anybody jumps on those, please coordinate to
avoid wasted effort like this.
Issues:

- asm code won't be accepted: generic C code can be just as fast.
- ifunc won't be accepted: all the instructions used are available on
  all CPUs.
- math code should not exist only as FSF-assigned LGPL code, but be
  universally available: post it under a non-restrictive license first,
  then assign it to the FSF, so it can be used everywhere without
  legal issues.
- document the worst-case ulp error and the number of misrounded
  cases: for single-argument scalar functions you can easily test
  all possible inputs in all rounding modes, and that information
  helps to decide whether the algorithm is good enough.
- benchmark measurements should ideally provide latency and throughput
  numbers for the various input ranges, or use a realistic workload;
  in this case there are many branches for the various input ranges,
  so it is useful to have a benchmark that can show their effect.
> The patchset was tested using "make check" for the math sub-directory.
> The tests were run on linux 4.4.0-45-generic on ThunderX88 platform.
>
> The following are the approximate speedups observed over the
> existing implementation on different AArch64 platforms for
> different input values.
>
> SINF
> ---------------------------------------------------------
> Input ThunderX88 ThunderX99 CortexA57
> ---------------------------------------------------------
> 0.0 1.88x 1.18x 1.17x
> 2.0^-28 1.33x 1.12x 1.03x
> 2.0^-6 1.48x 1.28x 1.27x
> 0.6*Pi/4 0.94x 1.14x 1.21x
> 13*Pi/8 1.41x 2.00x 2.16x
> 17*Pi/8 1.45x 1.93x 2.23x
Based on these numbers my current C implementation is faster,
but it will take time to polish it for submission.
> 1000*Pi/4 19.68x 37.46x 27.99x
> 2.0^51 12.00x 13.58x 13.49x
This is a bug in the current generic code: it falls back to slow
argument reduction even though single-precision argument reduction
can be done in a few cycles over the entire range. I think the
x86_64 SSE code could still be simpler and faster
(not that it matters much, as these are rare cases).
> Inf 1.04x 1.05x 1.12x
> Nan 0.95x 0.87x 0.82x
> ---------------------------------------------------------