This is the mail archive of the libc-alpha at sourceware dot org mailing list for the glibc project.
Re: Re: [PATCH] x86-64: Optimize e_expf with FMA [BZ #21912]
- From: Arjan van de Ven <arjan at linux dot intel dot com>
- To: Carlos O'Donell <carlos at redhat dot com>, "H.J. Lu" <hjl dot tools at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 16 Aug 2017 07:31:36 -0700

On 8/16/2017 7:04 AM, Carlos O'Donell wrote:
> On 08/16/2017 09:34 AM, H.J. Lu wrote:
>> FMA optimized e_expf improves performance by more than 50% on Skylake.
>
> Exactly how much of e_expf-fma.S do you need to achieve that 50% speedup?

The core "fast path"
(the part after /* Main path: here if 2^(-28)<=|x|<125*log(2) */ ).

> How does this algorithm compare to what is already implemented for e_expf?

I started with the SSE version of e_expf, converted it to AVX, used FMA
where possible, and fixed a few glass jaws in the fast path that you hit
on Skylake. The slow path is more of a direct 1:1 translation from SSE to
AVX (because mixing SSE and AVX is generally a bad idea).

> My questions are basically leading to this:
>
> (a) Can we write a generic e_expf in C with the special cases written
> in C, and...

The special cases are tiny for expf() (compared to, say, powf()).

> (b) Is there a core kernel computation that we can then implement in assembly?

Some of the hot-path instructions have pretty long latency in the CPU, and
to get the performance, those are issued before the special case tests
(so that their execution overlaps, effectively making all the special case
tests free). If you do the first part in C, you lose all of that. The whole
expf() (when hitting the fast path) has a throughput cost of somewhere
between 10 and 12 cycles (depending on the CPU generation). A big chunk of
achieving that requires overlapping some of the expensive things with more
mundane instructions (like the special case tests). Giving that up to do
these checks in C first likely doubles the cost of expf(), or something
very close to that.

(Note that this new code does not change the basics; the existing SSE code
already has all of this in assembly.)