[PATCHv4 2/2] powerpc64le: ifunc (almost) all *f128 routines in multiarch mode

Tue Jun 23 16:19:05 GMT 2020

On 6/22/20 6:04 PM, Paul E Murphy via Libc-alpha wrote:
> 
> 
> On 6/22/20 11:57 AM, Adhemerval Zanella via Libc-alpha wrote:
>>
>>
>> On 15/06/2020 17:59, Paul E. Murphy via Libc-alpha wrote:
>>> See the Makefile changes for high-level design/commentary.
>>>
>>> V4 changes -
>>>    * Drop patch to add libm_alias_exclusive_ldouble.  After
>>>      recent refactoring of fmaf128, it showed some unfixable
>>>      flaws.  Instead, use macro renaming for nextafterf128 to
>>>      generate the needed symbols, and rework.
>>>
>>> V3 changes -
>>>    * Cleanup comments.
>>>    * Rebase against fmaf128 cleanup
>>>    * Use Makeconfig trick to set var in le/power9 sysdep dir to
>>>      determine if ifunc support is necessary.  This works with
>>>      the upcoming CPU detection patch.
>>>    * fmaf128 patch is no longer needed.
>>>
>>> V2 changes -
>>>    * move duplicate redirect macros into
>>>      float128-ifunc-redirect-macros.h
>>>    * replace subshell usage with command sequencing
>>>    * Add more instructive documentation in Makefile about how all
>>>      these ugly pieces work together
>>>    * Minor comment cleanup throughout
>>>    * Improve inline documentation/commentary throughout
>>>
>>> ---8<---
>>>
>>> Programmatically generate simple wrappers for most libm *f128
>>> objects and a set of ifunc objects to unify them.
>>>
>>> A second set of implementation files is generated which simply
>>> include the first implementation encountered along the search
>>> path.  This usually works, except when a wrapper is overridden
>>> and makefile search order slightly diverges from include order.
>>>
>>> A set of additional headers are included which primarily rely
>>> on asm redirects to rename, and less frequently macro renames
>>> where an asm redirect is not possible.  These intercept several
>>> common headers to install redirect and disable macros at specific
>>> times.  This works surprisingly well.  Notably, some ugliness
>>> occurs when header inclusion must be coerced at certain times
>>> before turning off aliasing and plt bypass wrappers.
>>>
>>> Notably, the only special case is s_significandf128.c.  It is
>>> doubly special as it exists to support ldouble redirects, and
>>> exposes a subtle difference between makefile rules and search path
>>> orders.  Commentary is inlined.
>>>
>>> Admittedly, this makes shared maintenance a tiny bit more
>>> difficult, but lays groundwork for supporting more optimized
>>> float128 routines which very overtly assume a soft-fp runtime.
>>> Changes to internal float128 API should fail at compile time,
>>> thus build-many-glibcs.py should readily catch any divergence.
>>>
>>> Finally, don't build this support if the requested CPU is newer
>>> than power8.
>>>

> 
> This is refactoring noise, and while not wrong is not meant to be
> in the final commit message.
> 
>>
>> I am trying to digest the requirements to add such complexity to the
>> powerpc64le build rules, especially the internal Makefile hackery
>> required.
> 
> This is addressed in the notes. Mildly speaking, soft-fp code
> generation on P8 is quite limited.  This is pretty easy to identify in
> any non-trivial binary128 function, e.g. expf128 is almost 1/3 the
> size on P9.  Likewise many complex functions are almost 1/2 the size.
> Anything soft-fp touches massively increases code size and impedes
> instruction scheduling.
> 
> I can get some more concrete numbers, but my hope is this enables us
> to make even more meaningful improvements to common code when hardware
> support is available.

I did a quick test for expf128.  It's around a 2.5x speedup on the fast
path (building a table of 1M small values).  The soft-fp variant loses
this much because it needs an expensive PLT call for every FP operation
and cannot use FMA.  That hurts.  Quite a bit of libm centers around
series approximation like expf128.
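
To make the FMA point concrete, here is a rough sketch of the inner loop that series approximations tend to reduce to: Horner evaluation, where every step is one multiply-add.  It uses double and plain arithmetic so it builds anywhere; that is an assumption for illustration, and horner_eval is a hypothetical name, not a glibc routine.  With hardware FMA each step can be a single fused instruction, whereas under soft-fp each multiply and each add becomes a separate library call.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate coef[0] + coef[1]*x + ... + coef[n-1]*x^(n-1) in Horner
   form.  Each loop iteration is exactly one multiply followed by one
   add, which is the shape a fused multiply-add instruction covers.  */
static double
horner_eval (const double *coef, size_t n, double x)
{
  assert (n > 0);
  double acc = coef[n - 1];
  for (size_t i = n - 1; i > 0; i--)
    acc = acc * x + coef[i - 1];	/* One multiply-add per term.  */
  return acc;
}
```

A degree-d polynomial therefore costs d multiply-adds, so the per-operation overhead is multiplied by the length of the series.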