This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 3/4] sparc: Use default memcpy for rtld objects



On 05/10/2017 15:02, Adhemerval Zanella wrote:
> 
> 
> On 05/10/2017 13:49, David Miller wrote:
>> From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
>> Date: Thu,  5 Oct 2017 10:51:11 -0300
>>
>>> Both SPARC platforms with multiarch support (sparcv9 and sparc64) have
>>> a default assembly implementation of memcpy.  Since there should be no
>>> restriction on using them in the loader object, and assuming they are
>>> faster than the generic ones, this patch uses them for rtld objects.
>>>
>>> Also, neither the original patch [1] nor its commit message indicates
>>> why the generic one was used instead of the SPARC-optimized ones.
>>
>> The ultra1 memcpy is really an extremely non-ideal variant to use as
>> the default for anything.
>>
>> It's much slower on newer CPUs, as the block loads and stores used in
>> the ultra1 version aren't optimized the same way they were in those
>> older chips.
>>
>> The C version is faster on newer CPUs and definitely a better choice
>> as a default, especially because it doesn't use any CPU-specific
>> instructions like the ultra1 variant does.
>>
>> In the Linux kernel we have an assembler version we use as the default
>> which doesn't use any special instructions.
> 
> Thanks for the explanation, although it does not explain why ultra1
> is currently the default for sparc64 (sysdeps/sparc/sparc64/memcpy.S)
> and also the default multiarch selection.  The C version is currently
> used only for the loader.
> 
> I compared the performance of the C implementation against the ultra1
> one on a niagara5, and the results are:
> 
>   - on bench-memcpy, the C version is slightly slower for sizes up to
>     32 (ultra1 is about 4% faster for sizes up to 16, 40% faster from
>     16 to 32 and 50% faster up to 32).  It is definitely faster for
>     sizes larger than 64 (62% faster for sizes from 64 to 128 and 85%
>     faster for sizes larger than 128).
> 
>   - bench-memcpy-random shows no performance difference; however,
>     bench-memcpy-large shows that the C implementation is indeed
>     faster for all inputs.
> 
> So I think that, instead of using the default memcpy for rtld, the best
> strategy would be to use the C implementation as the default and add
> ultra1 as another option for ifunc resolution.
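The proposal of a portable C memcpy as the default, with ultra1 kept as an extra ifunc choice, could look roughly like the sketch below. It is not glibc's actual sysdeps/sparc code. Every name in it (memcpy_generic, memcpy_ultra1_standin, cpu_is_ultrasparc_i, resolve_sketch_memcpy, sketch_memcpy) is hypothetical. The real ultra1 variant is hand-written assembly and is only stood in for here by a forwarding function. The CPU check is hard-wired because a real resolver would inspect the hardware capability bits, which this sketch does not model. It assumes GCC or Clang on an ELF target with GNU ifunc support (e.g. Linux/glibc).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Function type shared by every memcpy variant.  */
typedef void *memcpy_fn (void *, const void *, size_t);

/* Word type that may alias any object (GCC/Clang extension), so
   word-sized loads and stores through char buffers are well defined.  */
typedef unsigned long __attribute__ ((__may_alias__)) word_t;

/* Hypothetical portable default: plain C, no CPU-specific instructions.
   Copies whole words when DST and SRC share alignment, bytes otherwise.  */
static void *
memcpy_generic (void *dst, const void *src, size_t n)
{
  const size_t w = sizeof (word_t);
  unsigned char *d = dst;
  const unsigned char *s = src;

  if ((uintptr_t) d % w == (uintptr_t) s % w)
    {
      /* Copy leading bytes until both pointers are word aligned.  */
      while (n > 0 && (uintptr_t) d % w != 0)
        {
          *d++ = *s++;
          n--;
        }
      /* Copy the aligned middle one word at a time.  */
      for (; n >= w; n -= w, d += w, s += w)
        *(word_t *) d = *(const word_t *) s;
    }
  /* Copy the remaining tail (or everything, if alignments differ).  */
  while (n > 0)
    {
      *d++ = *s++;
      n--;
    }
  return dst;
}

/* Hypothetical stand-in for the UltraSPARC I assembly variant;
   here it simply forwards to the C library's memcpy.  */
static void *
memcpy_ultra1_standin (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Hypothetical CPU check, hard-wired so the sketch runs on any host.  */
static int
cpu_is_ultrasparc_i (void)
{
  return 0;
}

/* Resolver: run once during relocation processing; its return value
   is the variant that calls to sketch_memcpy are bound to.  */
static memcpy_fn *
resolve_sketch_memcpy (void)
{
  return cpu_is_ultrasparc_i () ? memcpy_ultra1_standin : memcpy_generic;
}

/* Public entry point, bound through a GNU indirect function (ifunc).  */
void *sketch_memcpy (void *dst, const void *src, size_t n)
  __attribute__ ((ifunc ("resolve_sketch_memcpy")));
```

In this arrangement, the portable routine is what every CPU gets unless the resolver positively identifies hardware where the block-load variant is known to win. That matches the argument that ultra1 is a poor choice on newer CPUs.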

One thing I forgot to ask: do you have any data points on how slow the
C implementation would be compared to the current default sparc64
memcpy?  One option would be to simply remove it and use the C version
as the default, without keeping ultra1 as an option.
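For rough data points of the kind asked about here, a minimal timing harness could be sketched as follows. It assumes a POSIX system with clock_gettime and CLOCK_MONOTONIC. It is not glibc's benchtests (bench-memcpy and friends are considerably more careful). byte_copy and libc_copy are hypothetical placeholders for the two implementations being compared; on the SPARC machine one would plug in the real variants instead. percent_faster uses one common convention, (baseline - candidate) / baseline. This is an assumption, since the thread does not say how its percentages were computed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *copy_fn (void *, const void *, size_t);

/* Hypothetical candidate: a plain byte-by-byte loop.  */
static void *
byte_copy (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n-- > 0)
    *d++ = *s++;
  return dst;
}

/* Hypothetical baseline: forwards to the C library's memcpy.  */
static void *
libc_copy (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Current monotonic time in nanoseconds.  */
static double
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return (double) ts.tv_sec * 1e9 + (double) ts.tv_nsec;
}

/* Average nanoseconds per call of FN copying SIZE bytes, over ITERS calls.  */
static double
ns_per_copy (copy_fn *fn, size_t size, int iters)
{
  static unsigned char src[4096], dst[4096];
  volatile unsigned char sink = 0;

  assert (size <= sizeof src && iters > 0);
  double start = now_ns ();
  for (int i = 0; i < iters; i++)
    {
      fn (dst, src, size);
      /* Read back one byte so the copy cannot be optimized away.  */
      sink ^= dst[size ? size - 1 : 0];
    }
  double elapsed = now_ns () - start;
  (void) sink;
  return elapsed / iters;
}

/* Assumed convention: positive means CANDIDATE is faster than BASELINE.  */
static double
percent_faster (double baseline_ns, double candidate_ns)
{
  return (baseline_ns - candidate_ns) / baseline_ns * 100.0;
}
```

Calling percent_faster (ns_per_copy (libc_copy, n, iters), ns_per_copy (byte_copy, n, iters)) for n in the size ranges quoted above (up to 16, 16 to 32, 64 to 128, larger than 128) gives one number per range. Repeated runs are needed before trusting small differences such as the 4% figure.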

