[PATCHv4 2/2] powerpc64le: ifunc (almost) all *f128 routines in multiarch mode

Mon Jun 22 23:04:25 GMT 2020

On 6/22/20 11:57 AM, Adhemerval Zanella via Libc-alpha wrote:
> 
> 
> On 15/06/2020 17:59, Paul E. Murphy via Libc-alpha wrote:
>> See the Makefile changes for high level design/commentary.
>>
>> V4 changes -
>>    * Drop patch to add libm_alias_exclusive_ldouble.  After
>>      recent refactoring of fmaf128, it showed some unfixable
>>      flaws.  Instead, use macro renaming for nextafterf128 to
>>      generate the needed symbols, and rework.
>>
>> V3 changes -
>>    * Cleanup comments.
>>    * Rebase against fmaf128 cleanup
>>    * Use Makeconfig trick to set var in le/power9 sysdep dir to
>>      determine if ifunc support is necessary.  This works with
>>      the upcoming CPU detection patch.
>>    * fmaf128 patch is no longer needed.
>>
>> V2 changes -
>>    * move duplicate redirect macros into float128-ifunc-redirect-macros.h
>>    * replace subshell usage with command sequencing
>>    * Add more instructive documentation in Makefile about how all
>>      these ugly pieces work togethor
>>    * Minor comment cleanup throughout
>>    * Improve inline documentation/commentary throughout
>>
>> ---8<---
>>
>> Programatically generate simple wrappers for most libm *f128
>> objects and a set of ifunc objects to unify them.
>>
>> A second set of implementation files are generated which simply
>> include the first implementation encountered along the search
>> path.  This usually works, excepting when a wrapper is overriden
>> and makefile search order slightly diverges from include order.
>>
>> A set of additional headers are included which primarily rely
>> on asm redirects to rename, and less frequently macro renames
>> where an asm redirect is not possible.  These intercept several
>> common headers to install redirect and disable macros at specific
>> times.  This works surprisingly well.  Notably, some ugliness
>> occurs when header inclusion must be coerced at certain times
>> before turning off aliasing and plt bypass wrappers.
>>
>> Notably, the only special case is s_significandf128.c.  It is
>> doubly special as exists to support ldouble redirects, and
>> exposes subtle difference between makefile rules and search path
>> orders.  Commentary is inlined.
>>
>> Admittedly, this makes shared maintenance a tiny bit more
>> difficult, but lays groundwork for supporting more optimized
>> float128 routines which very overtly assume a soft-fp runtime.
>> Changes to internal float128 API should fail at compile time,
>> thus build-many-glibcs.py should readily catch any divergence.
>>
>> Finally, don't build this support if requested CPU is newer
>> than power8.
>>

>> fixup f128 ifunc
>>
>> drop the patch to introduce the new macro to assist simplification of
>> s_nextafter.c.  It wasn't thought out well enough.  Instead just add
>> the ugly macro redirections needed to generate the appropriate >> nexttoward symbols.

This is refactoring noise, and while not wrong is not meant to be
in the final commit message.

> 
> I am trying to digest the requirements to add such complexity on the
> powerpc64le build rules, specially the internally Makefile hackery
> required.

This is addressed in the notes. Mildly speaking, soft-fp code
generation on P8 is quite limited.  This is pretty easy to identify in 
any non-trivial binary128 function.  e.g expf128 is almost 1/3 the
size on P9. Likewise many complex functions are almost 1/2 the size. 
Anything soft-fp touches massively increases code size and impedes 
instruction scheduling.

I can get some more concrete numbers, but my hope is this enables us
to make even more meaningful improvements to common code when hardware
support is available.

> 
> So if I understood correctly, let say we have these targets:
> 
>    1. powerpc64le-linux-gnu
>    2. powerpc64le-linux-gnu with --with-cpu=power9
> 
> The ifunc mechanism to build optimized versions for power9 will be
> built only for 1, while for 2. only versions that uses hardware
> instruction for __float128 (-mfloat128-hardware gcc option)
> will be used.

In case 2 (and with any newer cpu), this patch is a no-op.

> 
> So all the rediretion machinery done in the float128-ifunc-* are to
> list and redirect internal libm symbols to its float128 counterparts.
> One initial issue is this tend to be fragile: it requires to change
> arch-specific code when generic code is changed (for instance by
> changing the internal symbol name or the caller implementation)

The interesting symbol names are likely to see less change, and those
that do should mostly be hidden via local calls.  This is the price
the ppc64le maintainers pay to support multiarch for a large swath
of libm.  This greatly simplifies the most mundane and error prone
pieces.

> 
> Another issue the rules exceptions (such as s_totalorderf128) that
> require additional care to check if they result in correct code.

Such is already tested via the existing test suite.

> 
> Another possible mantainance issue is to keep updating the exported
> symbol list at float128-ifunc.c, float128-ifunc.h, and
> float128_private.hfor each new possible symbol in future version.
> It against means to correct/change arch-specific code for generic
> changes.

Note that float128-ifunc.c only defines compat symbols for the old
finite entry points. That set should never grow.

> 
> It also increases code size considerable with the potential to keep
> increasing with the addition on new libm functions.

Stripping debug info, the code size increase of libm is about 220kb
added 1210kb library.  Not trivial, but not overwhelming.

> 
> Finally the question is how useful would be this change on real
> world cases to justify this huge build and permutation complexity.

Code size is an interesting metric to measure.  The P9 variants
are substantially smaller where soft-fp is involved. expf128 is almost
1/3 the size.

> 
> What I would expect in realword cases is if the workload really
> uses float128 extensivelly to be built with -mcpu=power9 and/or
> -mfloat128/-mfloat128-hardware. It should cover most the required
> hotspots and glibc can focus on providing only cases where adding
> an specialized ifunc variant does make sense (as for the x86_64
> sysdeps/x86_64/fpu/multiarch/mp*) for instance.
> 
> Also, if an optimized float128 glibc build is paramount, a much
> simpler solution would be to just provide a -mcpu=power9 built one.

That kicks the can to the distros.  I think few ship such libraries. 
The whole value of multiarch is to expose these benefits without having 
to make the end user jump through such hurdles.  I don't think the x86 
comparison holds.  Adding a couple of helpful instructions is tame 
compared to going from soft to hard fp.