This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[RFC] String optimization workflow for architectures.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: libc-alpha at sourceware dot org
- Cc: Richard Henderson <rth at twiddle dot net>, Joseph Myers <joseph at codesourcery dot com>, Wilco <wdijkstr at arm dot com>
- Date: Sun, 31 May 2015 16:05:06 +0200
- Subject: [RFC] String optimization workflow for architectures.
- References: <20150529190952 dot GA23952 at domone>
Hi,
Now that we have a string skeleton that we can optimize, we need to
adjust the workflow.
With almost zero effort we could probably make these routines ten
percent faster.
There are three main issues; the first two are relatively easy. The
first is autogenerating ifuncs. A machine maintainer would need to write
a function that checks which foo in gcc -march=foo matches the current
cpu. Then we would compile every function for each -march value and the
ifunc would select the variant for the detected architecture.
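To make the ifunc idea concrete, here is a minimal sketch, assuming GCC on an ELF target with ifunc support. The names my_memchr, my_memchr_generic, my_memchr_haswell and my_memchr_resolver are hypothetical, and checking for AVX2 is only a stand-in for "this cpu matches -march=haswell"; a generated resolver would likely need a more careful test.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-march builds of one routine.  In the proposed
   workflow each would be the same C source compiled with a different
   -march=foo; here they are plain wrappers so the sketch stays
   self-contained.  */
static void *my_memchr_generic (const void *s, int c, size_t n)
{
  return memchr (s, c, n);
}

static void *my_memchr_haswell (const void *s, int c, size_t n)
{
  return memchr (s, c, n);
}

typedef void *(*memchr_fn) (const void *, int, size_t);

/* Resolver: the dynamic linker calls this once and binds my_memchr to
   whatever it returns.  The AVX2 check is an assumed proxy for a
   haswell-class cpu.  */
static memchr_fn my_memchr_resolver (void)
{
#if defined (__x86_64__) || defined (__i386__)
  /* Required before __builtin_cpu_supports inside an ifunc resolver.  */
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2"))
    return my_memchr_haswell;
#endif
  return my_memchr_generic;
}

/* Callers see one symbol; the variant is chosen at load time.  */
void *my_memchr (const void *s, int c, size_t n)
  __attribute__ ((ifunc ("my_memchr_resolver")));
```

Callers use my_memchr like any other function. Autogeneration would amount to emitting one such resolver per routine from the maintainer's cpu check.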
The second is profile feedback. That gives the ten percent that I
promised. The problem is that it requires making the string functions
standalone. Unless we allow compiling libc with -fprofile-generate, we
first need to compile them into a standalone library that we run to
collect profile data.
I would use call traces that I collected with dryrun.
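A standalone profiling run could be driven by something like the following sketch. The trace record layout (struct memchr_call) and the function replay_trace are hypothetical; the actual format of the dryrun traces is not described in this mail.

```c
#include <stddef.h>
#include <string.h>

/* One recorded call: where the search starts inside the buffer and how
   many bytes it covers.  Hypothetical layout, not the dryrun format.  */
struct memchr_call
{
  size_t offset;
  size_t len;
};

/* Replay N recorded calls against FN, the routine built with
   -fprofile-generate.  Returns the number of hits so the compiler
   cannot discard the calls.  */
size_t
replay_trace (void *(*fn) (const void *, int, size_t),
              const char *buf, size_t buf_len, int needle,
              const struct memchr_call *trace, size_t n)
{
  size_t hits = 0;
  for (size_t i = 0; i < n; i++)
    {
      /* Skip records that would read outside the buffer.  */
      if (trace[i].offset > buf_len
          || trace[i].len > buf_len - trace[i].offset)
        continue;
      if (fn (buf + trace[i].offset, needle, trace[i].len) != NULL)
        hits++;
    }
  return hits;
}
```

The profile data written when the instrumented binary exits would then be consumed by a second build with -fprofile-use.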
The third issue is that I want to make tuning more generic. I added
several tunable variables and it isn't clear which settings are better.
I wrote a simple evolutionary algorithm to optimize these when there are
too many combinations of tunables.
A maintainer would need to run it for a day on a machine to find the
optimum. With ifuncs he would need to do this for each architecture.
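As an illustration of the kind of search meant here, below is a minimal sketch of a (1+1) evolutionary loop. It is not the algorithm actually used; the tunables struct, evolve, and toy_cost are hypothetical. In real use the cost function would rebuild the routine with the given tunables and time a benchmark, where toy_cost is only a placeholder with a known optimum.

```c
#include <stddef.h>
#include <stdint.h>

#define N_TUNABLES 3

/* Hypothetical tunables; value[i] ranges over 0..max_value[i].  */
struct tunables
{
  int value[N_TUNABLES];
};

typedef double (*cost_fn) (const struct tunables *);

/* Placeholder cost: squared distance from an arbitrary optimum.
   A real cost function would return a measured running time.  */
double toy_cost (const struct tunables *t)
{
  static const int optimum[N_TUNABLES] = { 3, 1, 4 };
  double sum = 0.0;
  for (size_t i = 0; i < N_TUNABLES; i++)
    {
      double d = t->value[i] - optimum[i];
      sum += d * d;
    }
  return sum;
}

/* Small deterministic PRNG (xorshift32) so runs are reproducible.  */
static uint32_t next_random (uint32_t *state)
{
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  *state = x;
  return x;
}

/* Mutate one tunable per generation and keep the child whenever it is
   not slower than the best setting seen so far.  */
struct tunables
evolve (struct tunables start, const int *max_value, cost_fn cost,
        int generations, uint32_t seed)
{
  uint32_t rng = seed ? seed : 1;   /* xorshift state must be nonzero */
  struct tunables best = start;
  double best_cost = cost (&best);

  for (int g = 0; g < generations; g++)
    {
      struct tunables child = best;
      size_t i = next_random (&rng) % N_TUNABLES;
      child.value[i]
        = (int) (next_random (&rng) % (uint32_t) (max_value[i] + 1));
      double c = cost (&child);
      if (c <= best_cost)
        {
          best = child;
          best_cost = c;
        }
    }
  return best;
}
```

Because each cost evaluation would be a full compile plus benchmark run, the number of generations is what turns this into a day-long job per architecture.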
As an example of the possible gains, run the following benchmark. On
Haswell the running times with different compile options are as follows:
gcc -O3 -I.. -c memchr.c
gcc -O3 testm.c memchr.o
./a.out
real 0m0.701s
user 0m0.701s
sys 0m0.000s
real 0m0.700s
user 0m0.701s
sys 0m0.000s
real 0m0.700s
user 0m0.700s
sys 0m0.000s
gcc -O3 -march=native -I.. -c memchr.c
gcc -O3 testm.c memchr.o
./a.out
real 0m0.651s
user 0m0.651s
sys 0m0.000s
real 0m0.650s
user 0m0.647s
sys 0m0.003s
real 0m0.650s
user 0m0.650s
sys 0m0.000s
gcc -O3 -I.. -c memchr.c -fprofile-generate
gcc -O3 testm.c memchr.o -fprofile-generate
./a.out
gcc -O3 -I.. -S memchr.c -fprofile-use
gcc -O3 testm.c memchr.s
real 0m0.705s
user 0m0.705s
sys 0m0.000s
real 0m0.703s
user 0m0.704s
sys 0m0.000s
real 0m0.703s
user 0m0.704s
sys 0m0.000s
gcc -march=native -O3 -I.. -c memchr.c -fprofile-generate
gcc -march=native -O3 testm.c memchr.o -fprofile-generate
./a.out
gcc -march=native -O3 -I.. -S memchr.c -fprofile-use
gcc -O3 testm.c memchr.s
real 0m0.635s
user 0m0.635s
sys 0m0.000s
real 0m0.634s
user 0m0.634s
sys 0m0.000s
real 0m0.633s
user 0m0.634s
sys 0m0.000s