This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On 05/10/2017 13:49, David Miller wrote:
> From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
> Date: Thu, 5 Oct 2017 10:51:11 -0300
>
>> Both SPARC support multiarch platforms (sparcv9 and sparc64) have the
>> a default assembly implemented memcpy. Since it should not be any
>> restriction about it them on the loader object and assuming they are
>> faster than generic ones this patch uses them for rtld objects.
>>
>> Also, there is no indication neither on original patch [1] or in commit
>> message why the generic one where used instead of the sparc optimized
>> ones.
>
> The ultra1 memcpy is really an extremely non-ideal variant to use as
> the default for anything.
>
> It's much slower on newer cpus, as the block loads and stores used in
> the ultra1 version aren't optimized the same way they were in those
> older chips.
>
> The C version is faster on newer cpus and definitely a better choice
> as a default, especially because it doesn't use any cpu specific
> instructions like the ultra1 variant does.
>
> In the Linux kernel we have an assembler version we use as the default
> which doesn't use any special instructions.

Thanks for the explanation, although it does not explain why ultra1 is
currently the default for sparc64 (sysdeps/sparc/sparc64/memcpy.S) and
also the default selection for multiarch. The C version is currently
used solely for the loader.

I checked the performance of the C implementation against the ultra1
one on a niagara5, and the results are:

- on bench-memcpy, the C version is slightly slower for sizes up to 32
  (the difference is about 4% for sizes up to 16, 40% from 16 to 32 and
  50% up to 32). It is definitely faster for sizes higher than 64 (62%
  faster for sizes from 64 to 128 and 85% for sizes higher than 128).

- bench-memcpy-random shows no performance difference; however,
  bench-memcpy-large shows the C implementation is indeed faster for
  all inputs.

So I think that instead of using the default memcpy for rtld, the best strategy would be to use the C implementation as the default and add ultra1 as another option for ifunc resolution.
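
To make the proposal concrete, here is a minimal sketch of the selection logic: the generic C routine is the fallback and an optimized variant is chosen only when the CPU is recognized. It is not the glibc code. It uses a plain function pointer rather than the real ifunc machinery, and every name in it (memcpy_generic_c, memcpy_optimized, cpu_kind, select_memcpy) is hypothetical. The "optimized" variant simply forwards to the libc memcpy as a stand-in for an assembly implementation such as ultra1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Signature shared by every candidate implementation. */
typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Hypothetical stand-in for the generic C memcpy: a simple byte loop
   that uses no CPU-specific instructions. */
void *memcpy_generic_c(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (d != NULL && s != NULL));
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical stand-in for an optimized variant such as ultra1.
   It forwards to the libc memcpy only to keep the sketch portable. */
void *memcpy_optimized(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Hypothetical CPU classification; a real resolver would inspect
   hardware capability bits instead. */
enum cpu_kind { CPU_UNKNOWN, CPU_ULTRA1 };

/* Pick an implementation: the optimized variant only for a CPU known
   to benefit from it, the generic C version in every other case. */
memcpy_fn select_memcpy(enum cpu_kind cpu)
{
    switch (cpu) {
    case CPU_ULTRA1:
        return memcpy_optimized;
    default:
        return memcpy_generic_c;
    }
}
```

A caller would resolve the pointer once, for example `memcpy_fn copy = select_memcpy(CPU_UNKNOWN);`, and then call `copy(dst, src, n)`. An ifunc resolver performs the equivalent choice at symbol resolution time, so the default branch is what any unrecognized CPU ends up running.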
Attachment:
bench-memcpy-random-sparc64.out
Description: Text document
Attachment:
bench-memcpy-sparc64.out
Description: Text document
Attachment:
bench-memcpy-large-sparc64.out
Description: Text document