This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

[RFC] String optimization workflow for architectures.

From: Ondřej Bílka <neleai at seznam dot cz>
To: libc-alpha at sourceware dot org
Cc: Richard Henderson <rth at twiddle dot net>, Joseph Myers <joseph at codesourcery dot com>, Wilco <wdijkstr at arm dot com>
Date: Sun, 31 May 2015 16:05:06 +0200
Subject: [RFC] String optimization workflow for architectures.
Authentication-results: sourceware.org; auth=none
References: <20150529190952 dot GA23952 at domone>

Hi,

Now that we have a string skeleton that we can optimize, we need to
adjust the workflow.

With almost zero effort we could probably make these routines ten
percent faster.

There are three main issues; the first two are relatively easy. The
first is autogenerating ifuncs. A machine maintainer would need to write
a function that checks which foo in gcc -march=foo matches the current
CPU.

Then we would compile every function for each -march value, and the
ifunc would select the variant built for the detected architecture.
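
As a rough sketch of the dispatch idea only: the code below uses a plain
function-pointer table rather than a real ifunc resolver, and every name
in it (memchr_generic, memchr_haswell, current_march, select_memchr) is
hypothetical. Both variants simply forward to memchr here; in the scheme
described above they would be the same source compiled with different
-march flags. Treating AVX2 support as "haswell" is a simplifying
assumption.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memchr_fn) (const void *, int, size_t);

/* Hypothetical stand-ins for the same source built with different
   -march flags.  Here they just forward to the library memchr.  */
static void *
memchr_generic (const void *s, int c, size_t n)
{
  return memchr (s, c, n);
}

static void *
memchr_haswell (const void *s, int c, size_t n)
{
  return memchr (s, c, n);
}

/* The maintainer-written check: which foo in -march=foo fits this CPU?
   Assumption: AVX2 support is used as a proxy for "haswell".  */
static const char *
current_march (void)
{
#if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2"))
    return "haswell";
#endif
  return "generic";
}

struct variant
{
  const char *march;
  memchr_fn fn;
};

/* This table is what an autogenerator would emit, one row per -march.  */
static const struct variant variants[] = {
  { "haswell", memchr_haswell },
  { "generic", memchr_generic },
};

/* Pick the variant for MARCH, falling back to the generic build.  */
static memchr_fn
select_memchr (const char *march)
{
  for (size_t i = 0; i < sizeof variants / sizeof variants[0]; i++)
    if (strcmp (variants[i].march, march) == 0)
      return variants[i].fn;
  return memchr_generic;
}
```

A caller would resolve once, for example
memchr_fn f = select_memchr (current_march ());, and then call f. A real
ifunc does the equivalent resolution at relocation time, so no pointer
has to be kept around.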

The second is profile feedback. That gives the ten percent that I
promised.

The problem is that it requires making the string functions standalone.
Unless we allow compiling libc with -fprofile-generate, we first need to
compile them into a standalone library that we run to collect profile
data.

I would use the call traces that I collected with dryrun.

The third issue is that I want to make tuning more generic. I added
several tunable variables, and it isn't clear which settings are better.

I wrote a simple evolutionary algorithm to optimize these when there are
too many combinations of tunables.
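
The algorithm itself is not shown in this message. Purely as an
illustration of the kind of search meant, here is a minimal (1+1)
evolutionary loop over integer tunables. The cost function is a made-up
stand-in for "rebuild with these tunables and time the benchmark", and
NTUNABLES, measure_cost, mutate and evolve are all hypothetical names.

```c
#include <stdlib.h>

#define NTUNABLES 3

/* Hypothetical cost model.  In practice this would rebuild the string
   function with the given tunables and return the measured time.  */
static double
measure_cost (const int t[NTUNABLES])
{
  static const int assumed_optimum[NTUNABLES] = { 4, 16, 2 };
  double cost = 0.0;
  for (int i = 0; i < NTUNABLES; i++)
    {
      double d = t[i] - assumed_optimum[i];
      cost += d * d;
    }
  return cost;
}

/* Change one randomly chosen tunable by +1 or -1, clamped to [lo, hi].  */
static void
mutate (int t[NTUNABLES], int lo, int hi)
{
  int i = rand () % NTUNABLES;
  t[i] += (rand () % 2) ? 1 : -1;
  if (t[i] < lo)
    t[i] = lo;
  if (t[i] > hi)
    t[i] = hi;
}

/* (1+1) evolution: mutate the current candidate and keep the child
   only if it is not worse.  Returns the final cost; T is updated in
   place with the best setting found.  */
static double
evolve (int t[NTUNABLES], int generations, int lo, int hi)
{
  double best_cost = measure_cost (t);
  for (int g = 0; g < generations; g++)
    {
      int child[NTUNABLES];
      for (int i = 0; i < NTUNABLES; i++)
        child[i] = t[i];
      mutate (child, lo, hi);
      double c = measure_cost (child);
      if (c <= best_cost)
        {
          for (int i = 0; i < NTUNABLES; i++)
            t[i] = child[i];
          best_cost = c;
        }
    }
  return best_cost;
}
```

Because a child is accepted only when it is no worse than its parent,
the returned cost never exceeds the starting cost. With real timing
measurements the cost is noisy, so each candidate would likely need
repeated runs before comparing.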

A maintainer would need to run this for a day on a machine to find the
optimum. With ifuncs, he would need to do this for each architecture.

As an example of the possible gains, run the following benchmark. On
Haswell the running times for the different builds are as follows:


gcc -O3 -I.. -c memchr.c
gcc -O3 testm.c memchr.o
./a.out



real	0m0.701s
user	0m0.701s
sys	0m0.000s

real	0m0.700s
user	0m0.701s
sys	0m0.000s

real	0m0.700s
user	0m0.700s
sys	0m0.000s

gcc -O3 -march=native -I.. -c memchr.c
gcc -O3 testm.c memchr.o
./a.out



real	0m0.651s
user	0m0.651s
sys	0m0.000s

real	0m0.650s
user	0m0.647s
sys	0m0.003s

real	0m0.650s
user	0m0.650s
sys	0m0.000s

gcc -O3 -I.. -c memchr.c -fprofile-generate
gcc -O3 testm.c memchr.o -fprofile-generate
./a.out
gcc -O3 -I.. -S memchr.c -fprofile-use
gcc -O3 testm.c memchr.s

real	0m0.705s
user	0m0.705s
sys	0m0.000s

real	0m0.703s
user	0m0.704s
sys	0m0.000s

real	0m0.703s
user	0m0.704s
sys	0m0.000s

gcc -march=native -O3 -I.. -c memchr.c -fprofile-generate
gcc -march=native -O3 testm.c memchr.o -fprofile-generate
./a.out
gcc -march=native -O3 -I.. -S memchr.c -fprofile-use
gcc -O3 testm.c memchr.s

real	0m0.635s
user	0m0.635s
sys	0m0.000s

real	0m0.634s
user	0m0.634s
sys	0m0.000s

real	0m0.633s
user	0m0.634s
sys	0m0.000s

Follow-Ups:
- Re: [RFC] String optimization workflow for architectures.
  - From: Ondřej Bílka

References:
- [PATCH v4] generic string skeleton.
  - From: Ondřej Bílka
