This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf


On 13/06/17 08:17, Ashwin Sekhar T K wrote:
> This patchset adds the optimized ASIMD version of sinf/cosf
> for Aarch64. The algorithm and code flow is based on the SSE versions
> of the same in sysdeps/x86_64/fpu.
> 
> The ASIMD versions are used only if the cpu supports asimd feature.
> It uses ifuncs and HWCAP to identify the ASIMD capability.
>

i thought it was a vector version because of ASIMD, but it's
just scalar sinf/cosf.

there are many issues with this patch, but most importantly it
duplicates work as i also happen to work on single precision
math functions (sorry).

i plan to work on vector math functions and double precision
math functions too; before anybody jumps on that, please
coordinate to avoid wasted effort like this.

issues:

- asm code won't be accepted: generic c code can be just as fast.

- ifunc won't be accepted: all instructions are available on all cpus.

- math code should not be fsf assigned lgpl code, but universally
available: post it under a non-restricted license first, then assign
it to fsf so it can be used everywhere without legal issues.

- document the worst case ulp error and number of misrounded
cases: for single argument scalar functions you can easily test
all possible inputs in all rounding modes and that information
helps to decide if the algorithm is good enough.

- benchmark measurements should ideally provide both latency and
throughput numbers for the various ranges, or use a realistic
workload. in this case there are many branches for the various
input ranges, so it is useful to have a benchmark that can show
the effect of that.

> The patchset was tested using "make check" for the math sub-directory.
> The tests were run on linux 4.4.0-45-generic on ThunderX88 platform.
> 
> The following are the approximate speedups observed over the
> existing implementation on different Aarch64 platforms for
> different input values.
> 
>   SINF
>   ---------------------------------------------------------
>   Input           ThunderX88      ThunderX99      CortexA57
>   ---------------------------------------------------------
>   0.0              1.88x           1.18x           1.17x
>   2.0^-28          1.33x           1.12x           1.03x
>   2.0^-6           1.48x           1.28x           1.27x
>   0.6*Pi/4         0.94x           1.14x           1.21x
>   13*Pi/8          1.41x           2.00x           2.16x
>   17*Pi/8          1.45x           1.93x           2.23x

based on these numbers my current c implementation is faster,
but it will take time to polish that for submission.

>   1000*Pi/4       19.68x          37.46x          27.99x
>   2.0^51          12.00x          13.58x          13.49x

this is a bug in the current generic code: it falls back
to slow argument reduction even though single precision arg
reduction can be done in a few cycles over the entire range.

i think the x86_64 sse code could still be simpler and faster
(not that it matters much as these are rare cases).

>   Inf              1.04x           1.05x           1.12x
>   Nan              0.95x           0.87x           0.82x
>   ---------------------------------------------------------

