On Tue, Oct 2, 2012 at 7:50 AM, Andreas Jaeger <aj@suse.com> wrote:
On 10/02/2012 04:34 PM, H.J. Lu wrote:
On Tue, Oct 2, 2012 at 12:03 AM, Andreas Jaeger <aj@suse.com> wrote:
On 10/01/2012 08:49 PM, H.J. Lu wrote:
On Mon, Oct 1, 2012 at 10:56 AM, Andreas Jaeger <aj@suse.com> wrote:
On 10/01/2012 05:14 PM, H.J. Lu wrote:
Hi,
This patch adds multiarch FMA support to x86-64 libm. Tested on
FMA machine. OK for master?
What kind of performance benefits does it bring us? Are you sure that all
I don't have any performance numbers. My patch just
enables FMA optimization, similar to FMA4 optimization.
Could you test at least one of these functions to see whether it makes a
difference at all, please?
It works correctly on FMA machine. I will send a separate patch
to update x86-64 ULPs due to FMA instructions. FMA functions
are a little bit smaller than SSE/AVX version.
What about performance? For such a change I don't think it's unreasonable to
ask for some numbers...
Liubox, Kirll, can you get hjl/fma/master branch vs master branch
performance numbers on Haswell for those libm functions optimized
for FMA?
the functions you enhance are really using fma and thus benefit from the
change?
Not all FMA/FMA4 functions have FMA/FMA4 instructions. We should
take a look and use AVX functions instead.
So, let's only add those functions that really benefit from this.
Since functions in libm are implemented by calling each other,
all functions called from a libm function compiled for FMA must
also be compiled for FMA with _fma as the suffix in their symbol
names. Otherwise, wrong functions may be called. One way
Really?
If func a calls b, then a can be fma optimized but b does not need to be.
Why does a_fma need to call b_fma instead of b?
Take e_pow for example, when we optimize it for FMA, we must also optimize
__slowpow for FMA since e_pow calls __slowpow. Although __slowpow itself
doesn't use any FMA instructions, it calls other functions which use FMA:
[hjl@gnu-tools-1 math]$ nm slowpow-fma4.o
U __add_fma4
U __dbl_mp_fma4
0000000000000000 r eps.3048
U __halfulp_fma4
0000000000000000 r .LC0
U __mp_dbl
U __mpexp_fma4
U __mplog_fma4
U __mul_fma4
0000000000000000 T __slowpow_fma4
U __sub_fma4
[hjl@gnu-tools-1 math]$
So even if __slowpow doesn't use FMA, we must compile __slowpow
with FMA so that it can call other functions with FMA. One way to
fix it is to make all those internal functions IFUNC. Their references will
be resolved to the proper versions at run-time. Instead of calling
__slowpow_fma4, we just call __slowpow, which is an IFUNC function
optimized for SSE2 and AVX. Other internal functions can be
optimized for SSE2, AVX, FMA and FMA4.
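
To illustrate the idea, here is a minimal sketch of run-time selection through a
function pointer that is resolved once. It is a portable analog of what an IFUNC
resolver does at load time, not the actual glibc mechanism. All names here
(mul_add, mul_add_sse2, mul_add_fma, cpu_has_fma, slow_path) are hypothetical
stand-ins and do not appear in the patch. The point is that the caller is built
once, without an ISA suffix, and still reaches the right variant of the helper.

```c
#include <stddef.h>

/* Hypothetical per-ISA builds of one internal helper. In a real library
   these would be the same source compiled with different ISA flags. */
static double mul_add_sse2(double a, double b, double c) { return a * b + c; }
static double mul_add_fma(double a, double b, double c) { return a * b + c; }

/* Assumption: GCC/Clang CPU feature builtins on x86; elsewhere report "no FMA". */
static int cpu_has_fma(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma") != 0;
#else
    return 0;
#endif
}

typedef double (*mul_add_fn)(double, double, double);

/* Plays the role of the resolver: pick one variant for this CPU. */
static mul_add_fn resolve_mul_add(void)
{
    return cpu_has_fma() ? mul_add_fma : mul_add_sse2;
}

static mul_add_fn mul_add_impl; /* NULL until first use */

/* Unsuffixed entry point that callers reference. */
static double mul_add(double a, double b, double c)
{
    if (mul_add_impl == NULL)
        mul_add_impl = resolve_mul_add();
    return mul_add_impl(a, b, c);
}

/* A caller that uses no FMA itself: one build, no suffixed copy needed. */
static double slow_path(double x)
{
    return mul_add(x, x, 1.0);
}
```

With a real IFUNC the pointer check disappears, because the dynamic linker runs
the resolver and binds the symbol before the first call.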