This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] S390: Refactor ifunc handling



On 06/12/2018 14:54, Stefan Liebler wrote:
> On 12/05/2018 08:32 PM, Adhemerval Zanella wrote:
>>
>>
>> On 05/12/2018 14:15, Stefan Liebler wrote:
>>> On 12/03/2018 06:30 PM, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 03/12/2018 06:48, Stefan Liebler wrote:
>>>>> On 11/30/2018 07:31 PM, Adhemerval Zanella wrote:
>>>>>>
>>>>>>
>>>>>> On 30/11/2018 13:57, Stefan Liebler wrote:
>>>>>>> This patch series is mainly refactoring the s390 specific ifunc handling.
>>>>>>> The idea is to omit ifunc variants or ifunc at all if the used compile
>>>>>>> options are already building for newer cpus. The glibc internal calls
>>>>>>> and ld.so will then use "newer" ifunc variants as before.
>>>>>>>
>>>>>>> In case of the memcpy, memset and memcmp functions, the newest ifunc variant
>>>>>>> is for z196 and there are two further ones for older cpus, but the current usual
>>>>>>> compile options are e.g. building for zEC12.
>>>>>>> In case of the string / wcsmbs functions, there are variants for z13 and
>>>>>>> "before z13".  After switching to z13 as default cpu level, there won't
>>>>>>> be IFUNC symbols in s390 libc.so.
>>>>>>>
>>>>>>> Furthermore new z13 specific ifunc variants are introduced for
>>>>>>> memmove, strstr and memmem.
>>>>>>>
>>>>>>> Some functions like the mem* functions are duplicated twice for 31 and 64 bit.
>>>>>>> In fact they are nearly the same. Thus those implementations are now unified
>>>>>>> and adjusted in order to be usable for 31 and 64bit.
>>>>>>>
>>>>>>> I've build and tested these patches with different -march levels
>>>>>>> and with / without multiarch and checked the symbols with readelf - e.g. if
>>>>>>> IFUNC is used or not and if the __GI_ symbols are targeting the correct
>>>>>>> ifunc variant.
>>>>>>>
>>>>>>> If no one objects, I plan to commit this series in the next one or two weeks.
>>>>>>>
>>>>>>
>>>>>> The only issue I have for this change is it would require another
>>>>>> ABI variant for testing and validation, which would require more
>>>>>> coverage from build-many-glibcs.py (similar to armv7 for instance).
>>>>> The ABI itself is not changed, but you are right tests could be done for e.g. -march=zEC12 and -march=z13
>>>>
>>>> I meant 'variant' in the sense it would require multiple glibc build to
>>>> actually validate a change in s390 implementations.  If I am reading it
>>>> correctly, on memcpy, for instance, we might have:
>>>>
>>>>     1. HAVE_MEMCPY_IFUNC
>>>>       1.1. HAVE_S390_MIN_Z196_ZARCH_ASM_SUPPORT
>>>>       1.2. HAVE_S390_MIN_Z10_ZARCH_ASM_SUPPORT
>>>>       1.3. Z900
>>>>     2. !HAVE_MEMCPY_IFUNC
>>>>     3. Disabled multi-arch
>>>>
>>>> So is is basically 4 different glibc builds one will need to actually
>>>> test and check to fully validate s390.
>>>>
>>> It depends on the change. If you change the implementation of the ifunc variants and build with the oldest march level, then due to ifunc-impl-list the tests will run all ifunc variants.
>>
>> You validate the ifunc implementation itself, you need to also check if it
>> builds on all the variants (even when you are sure that the ifunc variant
>> itself works). That's what I am trying to avoid, to have multiple possible
>> glibc build depending of how you configure the compiler.
>>
> You always have to build and test glibc with your used compiler-configuration of the target distro. E.g. using z13 as default architecture level set can also introduce other bugs which are not triggered if the test was previously done with zEC12 as als.
> 
> A caller of a function is just calling __GI_XYZ. If you change the caller side then you rely on the correct __GI_XYZ symbol.
> If you change one existing implementation of XYZ, then the __GI_ mechanism is not touched and you have to test the changed implementation.
> If you add a new implementation for a new machine, then you have to extend and test the mechanism. But then you can also afford to build multiple glibcs. If __GI_XYZ exists twice or does not exist, then you'll get a link error without running the testsuite. If __GI_XYZ points to a false implementation, this is really bad. Then you'll either see it with readelf or it fails as you are running on an too old machine. But a change in the ifunc-resolver can also lead to such an issue.

I am not really following your rationale here, the idea is to trade off
between add build complexity to squeeze some performance versus have a 
more simple and predictable configure builds.

My point is I would like to have some measurements if this micro-optimization
really pays for the effort of adding even more build and validation
complexity.

> Are all the power[4 to X], multiarch and co variants tested on every change?

I would expected so, however I also expect that most distribution just use
default build option (which use ifunc).

The powerpc situation is not really optimal imho and it is an artefact from
how distro usually deployed glibc for different powerpc chips (multiple
builds with loader selecting the path depending of the hwcap value). I really
would like we we move to deprecate and remove the --with-cpu option, so we 
have basically just two variants (enable and disable multiarch).

The power4-only multiarch support was motivated mainly because it is the 
minimum chip with support for hp-timing. I am not sure it was the best option
back then, but it indeed simplified how to internally handle the minimum ABI
requirements. One option might indeed use a similar strategy to enable disable
hp-timing.h based on compiler support.

Now for powerpc64le I do see that this kind of refactor make sense, since it 
uses the old powerpc64 internal organization.  Since the minimum ISA requirement
for powerpc64le is 2.07, it does not make sense pack all old ISA optimization
for ifunc variants.

> 
>>>
>>> The patchset also do some clean up and unifies 31 vs 64bit implementations from s390-32 and s390-64 folders. This simplifies maintenance.
>>
>> I do agree with these kind of changes.
>>
>>>>>
>>>>>>
>>>>>> The gains I see is a slight reduction is loading time (due no ifunc)
>>>>>> and less code side. Is is what is driving you for this change? Does
>>>>>> it worth the extra testing and validation it might require?
>>>>>>
>>>>> The current s390_vx_libc_ifunc macros are not flexible and do not allow future enhancements as they only use the *_c or *_vx variant.
>>>>
>>>> Right, this is one issue which is not really tied to setting up different
>>>> build variants depending of the minimum ISA level.
>>>>
>>>>>
>>>>> As soon as future ditros will use z13 as default architecture,
>>>>> vector instructions can be used by default.  But the current __GI_* functions won't use the vector variants.  Instead the fallback is always used for internal calls and the vector variants are always used for external calls via IFUNC.
>>>>
>>>> Some ABI does impose restriction for ifunc with internal hidden symbols
>>>> (powerpc32 and i686 if I recall correctly), is it the case for s390
>>>> and/or s390x?
>>> On s390 the compiler will generate e.g. a PC32DBL relocation instead of PLT32DBL one and the GOT pointer might not be setup in r12. But the called PLT stub relies on a setup r12.
>>>
>>> And on older binutils (unfortunately I'm not sure if it was version >= or < 2.25) bugs regarding ifunc lead to a segmentation fault while linking libc.so.
>>
>> Do you consider this a deal-break issue? One option is to check the minimum
>> supported binutils for this change and set it as the minimum required one,
>> another is disable internal ifunc for such ABI (as for powerpc32, powerpc64
>> does work).
>>
> The "not setuped r12 register" is no binutils-bug. This is such an restriction with ifunc and internal hidden symbols you've mentioned.
> Thus I can't use IFUNC __GI_* symbols.
> 
> I disable internal ifunc by setting the __GI_* symbols to the "oldest" implementation. Or is there another mechanism?

Is is an issue for both s390 and s390x? 

> 
>>>>
>>>> At least for some ABI, we can still route some internal symbols through
>>>> ifunc. On some x86 ifunc selector implementation the comments state it
>>>> might show no performance gain, but I am not sure about the validity of
>>>> the claims. So if s390 or s390x does not have any impeding reason, a
>>>> possibility might to use ifunc internally instead of relying on compiler
>>>> default or used options to setup the minimum ABI.
>>>>
>>> If I start routing those symbols via IFUNC then an additional PLT stub is introduced in libc.so. If we are running on >=z13 then the __XYZ_vx is called via PLT. On machines <z13 __XYZ_c is also called via PLT. Compared to non-ifunc __GI_ symbols we have introduced the extra PLT stub overhead without any advantage on <z13 machines as the __XYZ_c implementation is called.
>>
>> Do we have real usercases where they are stressing libc symbols that are
>> doing intra-symbol calls and which PLT overhead is dominant?
>>
> Usually we try to avoid PLT calls. I think this is one idea behind the __GI_* symbols. Shall we remove those symbols and always jump over PLT?

It really depends of the implementation that is calling and if it might be
a hotspot.  Again I really want to understand if this change is just for the
sake of a micro-optimization or if you really targetting a real usercase
where internal __GI_ symbols are being a hotspot.

In any case, what do you think about reworking the patch to refactor s390
ifunc to be able to handle more variants, unify the code, and add z13
variants and start a separate thread about how to handle the ifunc and
compiler default options? I really want to check with distro focal points
and other maintainers if my objections does indeed does make sense.

> 
>>>
>>>> So the question is whether the possible internal symbols call optimization
>>>> worth that added complexity for build and validation.
>>>>
>>> E.g. for strstr there is an advantage. On s390 we are jumping from the common-code implementations to the vector implementations. On other archs we would jump from one vector implementation to another vector implementation and you perhaps don't see so much difference.
>>>
>>
>> One option maybe is maybe to trade some code size for a specific strstr
>> variant for z13, which calls strstr/strnlen/memcpy locally.  Specially
>> for strstr, is the difference really worth to add this build and validation
>> complexity?
>>
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]