[PATCHv4 2/2] powerpc64le: ifunc (almost) all *f128 routines in multiarch mode

Wed Jun 24 20:41:06 GMT 2020

On 22/06/2020 20:04, Paul E Murphy wrote:
> 
> 
> On 6/22/20 11:57 AM, Adhemerval Zanella via Libc-alpha wrote:
>>
>>
>> On 15/06/2020 17:59, Paul E. Murphy via Libc-alpha wrote:
>>> See the Makefile changes for high level design/commentary.
>>>
>>> V4 changes -
>>>    * Drop patch to add libm_alias_exclusive_ldouble.  After
>>>      recent refactoring of fmaf128, it showed some unfixable
>>>      flaws.  Instead, use macro renaming for nextafterf128 to
>>>      generate the needed symbols, and rework.
>>>
>>> V3 changes -
>>>    * Cleanup comments.
>>>    * Rebase against fmaf128 cleanup
>>>    * Use Makeconfig trick to set var in le/power9 sysdep dir to
>>>      determine if ifunc support is necessary.  This works with
>>>      the upcoming CPU detection patch.
>>>    * fmaf128 patch is no longer needed.
>>>
>>> V2 changes -
>>>    * move duplicate redirect macros into float128-ifunc-redirect-macros.h
>>>    * replace subshell usage with command sequencing
>>>    * Add more instructive documentation in Makefile about how all
>>>      these ugly pieces work together
>>>    * Minor comment cleanup throughout
>>>    * Improve inline documentation/commentary throughout
>>>
>>> ---8<---
>>>
>>> Programmatically generate simple wrappers for most libm *f128
>>> objects and a set of ifunc objects to unify them.
>>>
>>> A second set of implementation files are generated which simply
>>> include the first implementation encountered along the search
>>> path.  This usually works, excepting when a wrapper is overridden
>>> and makefile search order slightly diverges from include order.
>>>
>>> A set of additional headers are included which primarily rely
>>> on asm redirects to rename, and less frequently macro renames
>>> where an asm redirect is not possible.  These intercept several
>>> common headers to install redirect and disable macros at specific
>>> times.  This works surprisingly well.  Notably, some ugliness
>>> occurs when header inclusion must be coerced at certain times
>>> before turning off aliasing and plt bypass wrappers.
>>>
>>> Notably, the only special case is s_significandf128.c.  It is
>>> doubly special as it exists to support ldouble redirects, and
>>> exposes a subtle difference between makefile rules and search path
>>> orders.  Commentary is inlined.
>>>
>>> Admittedly, this makes shared maintenance a tiny bit more
>>> difficult, but lays groundwork for supporting more optimized
>>> float128 routines which very overtly assume a soft-fp runtime.
>>> Changes to internal float128 API should fail at compile time,
>>> thus build-many-glibcs.py should readily catch any divergence.
>>>
>>> Finally, don't build this support if requested CPU is newer
>>> than power8.
>>>
> 
> 
>>> fixup f128 ifunc
>>>
>>> drop the patch to introduce the new macro to assist simplification of
>>> s_nextafter.c.  It wasn't thought out well enough.  Instead just add
>>> the ugly macro redirections needed to generate the appropriate
>>> nexttoward symbols.
> 
> This is refactoring noise, and while not wrong is not meant to be
> in the final commit message.
> 
>>
>> I am trying to digest the requirements to add such complexity on the
>> powerpc64le build rules, especially the internal Makefile hackery
>> required.
> 
> This is addressed in the notes. Mildly speaking, soft-fp code
> generation on P8 is quite limited.  This is pretty easy to identify
> in any non-trivial binary128 function, e.g. expf128 is almost 1/3 the
> size on P9. Likewise many complex functions are almost 1/2 the size.
> Anything soft-fp touches massively increases code size and impedes
> instruction scheduling.
> 
> I can get some more concrete numbers, but my hope is this enables us
> to make even more meaningful improvements to common code when hardware
> support is available.

Indeed, soft-fp is most likely bloated and incurs a lot of libcalls
for most operations.

> 
>>
>> So if I understood correctly, let's say we have these targets:
>>
>>    1. powerpc64le-linux-gnu
>>    2. powerpc64le-linux-gnu with --with-cpu=power9
>>
>> The ifunc mechanism to build optimized versions for power9 will be
>> built only for 1, while for 2 only versions that use hardware
>> instructions for __float128 (-mfloat128-hardware gcc option)
>> will be used.
> 
> In case 2 (and with any newer cpu), this patch is a no-op.

Ack, this was my understanding.
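
To spell out the two cases, here is a rough sketch in portable C of what
the dispatch amounts to.  It is only an analogue: the real mechanism is a
GNU ifunc resolved by the dynamic loader rather than an explicit call, and
every name below (the variant enum, resolve_expf128, call_expf128 and the
two flags) is hypothetical and not taken from the patch.

```c
/* Which build of the routine ended up running.  */
enum variant { VARIANT_SOFT_FP, VARIANT_HW_FLOAT128 };

/* Hypothetical stand-ins for one libm routine compiled twice: once for
   POWER8 (soft-fp libcalls) and once with hardware binary128
   instructions.  They only report which build they represent.  */
static enum variant expf128_power8 (void) { return VARIANT_SOFT_FP; }
static enum variant expf128_power9 (void) { return VARIANT_HW_FLOAT128; }

typedef enum variant (*expf128_fn) (void);

/* Stand-in for an ifunc resolver: choose an implementation from a
   (hypothetical) "CPU has hardware binary128" flag.  */
expf128_fn
resolve_expf128 (int cpu_has_hw_float128)
{
  return cpu_has_hw_float128 ? expf128_power9 : expf128_power8;
}

/* Case 1 (default build) dispatches through the resolver.
   Case 2 (--with-cpu=power9) has a single build and no dispatch.  */
enum variant
call_expf128 (int built_with_cpu_power9, int cpu_has_hw_float128)
{
  if (built_with_cpu_power9)
    return expf128_power9 ();
  return resolve_expf128 (cpu_has_hw_float128) ();
}
```

With built_with_cpu_power9 set the resolver is never consulted, which is
the "no-op" of case 2; otherwise the result depends on what the running
CPU supports.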

> 
>>
>> So all the redirection machinery done in the float128-ifunc-* is to
>> list and redirect internal libm symbols to their float128 counterparts.
>> One initial issue is that this tends to be fragile: it requires changing
>> arch-specific code when generic code is changed (for instance by
>> changing the internal symbol name or the caller implementation)
> 
> The interesting symbol names are likely to see less change, and those
> that do should mostly be hidden via local calls.  This is the price
> the ppc64le maintainers pay to support multiarch for a large swath
> of libm.  This greatly simplifies the most mundane and error prone
> pieces.
> 
>>
>> Another issue is the rule exceptions (such as s_totalorderf128) that
>> require additional care to check if they result in correct code.
> 
> Such is already tested via the existing test suite.

The issue is not really a lack of testing, but the added complexity in
the Makefile to handle such specific cases.

> 
>>
>> Another possible maintenance issue is to keep updating the exported
>> symbol list at float128-ifunc.c, float128-ifunc.h, and
>> float128_private.h for each new possible symbol in future versions.
>> It again means correcting/changing arch-specific code for generic
>> changes.
> 
> Note that float128-ifunc.c only defines compat symbols for the old
> finite entry points. That set should never grow.
> 
>>
>> It also increases code size considerably, with the potential to keep
>> increasing with the addition of new libm functions.
> 
> Stripping debug info, the code size increase of libm is about 220kb
> added to a 1210kb library.  Not trivial, but not overwhelming.
> 
>>
>> Finally the question is how useful would be this change on real
>> world cases to justify this huge build and permutation complexity.
> 
> Code size is an interesting metric to measure.  The P9 variants
> are substantially smaller where soft-fp is involved. expf128 is almost
> 1/3 the size.
> 
>>
>> What I would expect in real-world cases is, if the workload really
>> uses float128 extensively, for it to be built with -mcpu=power9 and/or
>> -mfloat128/-mfloat128-hardware. It should cover most of the required
>> hotspots and glibc can focus on providing only cases where adding
>> a specialized ifunc variant does make sense (as for the x86_64
>> sysdeps/x86_64/fpu/multiarch/mp*) for instance.
>>
>> Also, if an optimized float128 glibc build is paramount, a much
>> simpler solution would be to just provide a -mcpu=power9 built one.
> 
> That kicks the can to the distros.  I think few ship such libraries.
> The whole value of multiarch is to expose these benefits without
> having to make the end user jump through such hurdles.  I don't think
> the x86 comparison holds.  Adding a couple of helpful instructions is
> tame compared to going from soft to hard fp.

My main issue with this approach is twofold: it basically tries to
provide a soft and hard fp variant of libm in the same library
(adding build complexity, code bloat, and extra maintainability burden)
and it relies heavily on ifunc (which has its own issues that bite
us now and then).

The x86 comparison is sound because we could make something similar
and start to provide libm variants for AVX, AVX256, etc in the same
manner.  Instead the approach used was to profile and provide specific
ifunc variants for hotspots.

I give you that this ISA change is somewhat more intrusive than a vector
extension, but other ABI examples (armhf with its multiple fp variants)
usually take the approach of relying on the toolchain target to provide
such optimizations.

I am not sure if the best option would be to provide an easier way
to configure and build just libm or to add an option to build libm for
multiple configurations. And I understand that distros want to minimize
the libc.so variants (that's why ifunc was pushed initially afaik).

That's why I suggested providing hardware float128 optimized variants
when real-world use cases give us feedback that this might be a gain.
Besides the limited current usage of float128, I also expect in most
scenarios that symbols the compiler implements as builtins (such as sqrt)
won't be called at all. Even for more complex math functions, most likely
only a subset will be extensively used, and these are the ones that
I think we should focus on instead of just pushing for the bigger hammer
and optimizing everything (which would be just simpler by providing
a specific libm anyways).