This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH 3/4] sparc: Use default memcpy for rtld objects
On 05/10/2017 15:02, Adhemerval Zanella wrote:
>
>
> On 05/10/2017 13:49, David Miller wrote:
>> From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
>> Date: Thu, 5 Oct 2017 10:51:11 -0300
>>
>>> Both SPARC platforms that support multiarch (sparcv9 and sparc64) have
>>> a default memcpy implemented in assembly. Since there should not be any
>>> restriction on using them in the loader object, and assuming they are
>>> faster than the generic ones, this patch uses them for rtld objects.
>>>
>>> Also, there is no indication, either in the original patch [1] or in the
>>> commit message, of why the generic one was used instead of the
>>> sparc-optimized ones.
>>
>> The ultra1 memcpy is really an extremely non-ideal variant to use as
>> the default for anything.
>>
>> It's much slower on newer cpus, as the block loads and stores used in
>> the ultra1 version aren't optimized the same way they were in those
>> older chips.
>>
>> The C version is faster on newer cpus and definitely a better choice
>> as a default, especially because it doesn't use any cpu specific
>> instructions like the ultra1 variant does.
>>
>> In the Linux kernel we have an assembler version we use as the default
>> which doesn't use any special instructions.
>
> Thanks for the explanation, although it does not explain why ultra1
> is currently the default for sparc64 (sysdeps/sparc/sparc64/memcpy.S)
> and also the default selection for multiarch. The C version is
> currently used only for the loader.
>
> I compared the performance of the C implementation against the
> ultra1 one on a niagara5, and the results are:
>
> - on bench-memcpy the C version is slightly slower for sizes up to
> 32 (about 4% faster for sizes up to 16, 40% from 16 to 32 and
> 50% up to 32). It is definitely faster for sizes higher than
> 64 (62% faster for sizes from 64 to 128 and 85% for sizes
> higher than 128).
>
> - bench-memcpy-random shows no performance difference; however,
> bench-memcpy-large shows the C implementation is indeed faster
> for all inputs.
>
> So I think that instead of using the default memcpy for rtld, the best
> strategy would be to use the C implementation as the default and
> add ultra1 as another option for ifunc resolution.
One thing I forgot to ask is whether you have any data points on
how slow the C implementation would be compared to the current default
sparc64 memcpy, because one option would be to just remove the ultra1
version and use the C one as the default, without providing ultra1 as
an option.
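
To make the "C as default, ultra1 as another option" idea concrete, here
is a rough sketch of the selection logic in plain C. It is not glibc code:
it uses an ordinary function pointer instead of a real IFUNC resolver, and
the names memcpy_generic_c, memcpy_ultra1_stub, select_memcpy and
HYP_CPU_ULTRA1 are all hypothetical. The "ultra1" variant is only a stub
that forwards to the C loop, since the real one is sparc assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Common signature shared by every candidate implementation. */
typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Hypothetical feature bit; a real resolver would look at hwcap data. */
#define HYP_CPU_ULTRA1 0x1u

/* Stand-in for the generic C memcpy: a plain byte-wise copy that
   uses no CPU-specific instructions. */
void *memcpy_generic_c(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stub for a CPU-specific variant such as ultra1.  It only forwards
   to the C loop so that the sketch runs on any machine. */
void *memcpy_ultra1_stub(void *dst, const void *src, size_t n)
{
    return memcpy_generic_c(dst, src, n);
}

/* Selection step: the C version is the default, and the CPU-specific
   variant is chosen only when its feature bit is set. */
memcpy_fn select_memcpy(unsigned int cpu_features)
{
    if (cpu_features & HYP_CPU_ULTRA1)
        return memcpy_ultra1_stub;
    return memcpy_generic_c;
}

/* Sanity check: whichever variant is selected must copy correctly. */
void check_copy(memcpy_fn fn)
{
    const char src[] = "rtld and libc share one default";
    char dst[sizeof src];

    memset(dst, 0, sizeof dst);
    fn(dst, src, sizeof src);
    assert(memcmp(dst, src, sizeof src) == 0);
}
```

A caller would do something like "memcpy_fn f = select_memcpy(features);"
once at startup and then call f(dst, src, n). Dropping ultra1 entirely,
as suggested above, would amount to select_memcpy always returning the
C version.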