This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: HWCAP is method to determine cpu features, not selection mechanism.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Steven Munroe <munroesj at linux dot vnet dot ibm dot com>
- Cc: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, Szabolcs Nagy <szabolcs dot nagy at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Date: Thu, 11 Jun 2015 06:34:16 +0200
- Subject: Re: HWCAP is method to determine cpu features, not selection mechanism.
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <20150609154223 dot GA20028 at domone> <1433865684 dot 21101 dot 20 dot camel at sjmunroe-ThinkPad-W500> <20150610125047 dot GA10861 at domone> <55783D2A dot 8050703 at linaro dot org> <557846D9 dot 3060909 at arm dot com> <55784802 dot 8070605 at linaro dot org> <1433954753 dot 25475 dot 61 dot camel at sjmunroe-ThinkPad-W500>
On Wed, Jun 10, 2015 at 11:45:53AM -0500, Steven Munroe wrote:
> On Wed, 2015-06-10 at 11:21 -0300, Adhemerval Zanella wrote:
> >
> > On 10-06-2015 11:16, Szabolcs Nagy wrote:
> > > On 10/06/15 14:35, Adhemerval Zanella wrote:
> > >> I agree that adding an API to modify the current hwcap is not a good
> > >> approach. However, the costs you are assuming here are *very* x86 biased,
> > >> where you have only one instruction (movl <variable>(%rip), %<destination>)
> > >> to load an external variable defined in a shared library, where for
> > >> powerpc it is more costly:
> > >
> > > debian codesearch found 4 references to __builtin_cpu_supports
> > > all seem to avoid using it repeatedly.
> > >
> > > multiversioning dispatch only happens at startup (for a small
> > > number of functions according to existing practice).
> > >
> > > so why is hwcap expected to be used in hot loops?
> > >
> >
> > Good question, I do not know and I believe Steve could answer this
> > better than me. I am only advocating here that assuming x86 costs
> > for powerpc is not the way to evaluate this patch.
> >
>
> The trade off is that the dynamic solutions (platform library selection
> via AT_PLATFORM) and STT_GNU_IFUNC require a dynamic call, which in our
> ABI requires an indirect branch and link via the CTR. There is also the
> overhead of the TOC save/reload.
>
Wait, you are using dynamic libraries anyway, which already require that,
so it wouldn't make any difference.
Or are you trying to say that you statically link libraries into one
generic application, instead of building specialized ones and using a
simple wrapper script to run the per-cpu application, like the following one?
if grep -q power11 /proc/cpuinfo; then
  app_power11 "$@"
elif grep -q power10 /proc/cpuinfo; then
  app_power10 "$@"
# ...
fi
> The net is the trade-offs are different for POWER than for other
> platforms. I spend a lot of time looking at performance data from
> customer applications and see these issues (as measurable additional
> path length and forced hazards).
>
> So there is a place for this proposed optimization strategy where we can
> avoid the overhead of the dynamic call and substitute the smaller more
> predictable latency of the HWCAP; load word, and immediate record, and
> branch conditional (3 instructions, low cache hazard, and highly
> predictable branch).
>
But my point is that there should be neither a dynamic call nor a hwcap
branch. As that function is a hot spot, you would gain more by inlining it
and making the decision in the callers.
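
A minimal sketch of "making the decision in the callers", not code from this thread: the capability word is tested once in the caller, and each loop variant is an inlinable static function, so neither a dynamic call nor a per-iteration hwcap branch remains in the hot loop. EXAMPLE_FEATURE_MASK, sum_generic, sum_feature and sum_all are hypothetical names, and the "feature" path is only a manually unrolled stand-in for a real specialized loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit, for illustration only. A real program would
   use a mask defined by the platform headers. */
#define EXAMPLE_FEATURE_MASK 0x1UL

/* Generic loop; static inline so the compiler can fold it into the caller. */
static inline uint64_t sum_generic(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a specialized loop: unrolled by four with a scalar tail. */
static inline uint64_t sum_feature(const uint32_t *v, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}

/* The caller tests the capability word once, outside the hot loop. */
uint64_t sum_all(const uint32_t *v, size_t n, unsigned long hwcap)
{
    if (hwcap & EXAMPLE_FEATURE_MASK)
        return sum_feature(v, n);
    return sum_generic(v, n);
}
```

On Linux with glibc the caller could obtain the word with getauxval(AT_HWCAP) from <sys/auxv.h> and pass it in; it is taken as a parameter here so the sketch does not depend on any particular platform.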
> The concern about the cache foot print does not apply as these fields
> share the cache line with other active TCB fields. This line will be in
> L1 for any active thread.
>
Excellent, you have applications. So you could show that there is a
measurable performance benefit behind your claims.

So Steven, you have several applications from customers that statically
link every library for performance? I assume so, because if the cost of
the GOT on powerpc is as high as you claim, eliminating those accesses
has a better cost/benefit ratio than eliminating just the plt entry of
hwcap.

First, report a benchmark with the unchanged application.

Then use an ifdef to make the check constant, compile the application
with -mcpu=power7, and report the difference versus generic.

When you have this, you could try to measure the difference between plt
and noplt hwcap to see whether it is real, or whether you are just
micromanaging and not improving actual performance because you spend
time on a cold path instead.
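
One way the "ifdef to make it constant" step might look, as a sketch under assumptions: ASSUME_FEATURE is a hypothetical macro that the benchmark build would define itself (for example with -DASSUME_FEATURE next to -mcpu=power7), and BENCH_FEATURE_MASK, HAS_FEATURE and pick_variant are illustrative names, not existing glibc or GCC interfaces.

```c
#include <stdbool.h>

/* Hypothetical feature bit, for illustration only. */
#define BENCH_FEATURE_MASK 0x1UL

#ifdef ASSUME_FEATURE
/* Benchmark build: the check is a compile-time constant, so the compiler
   can drop the branch and the generic path entirely. */
# define HAS_FEATURE(hwcap) ((void)(hwcap), true)
#else
/* Generic build: the check is a runtime test of the capability word. */
# define HAS_FEATURE(hwcap) (((hwcap) & BENCH_FEATURE_MASK) != 0)
#endif

/* Returns 1 when the specialized path would be taken, 0 otherwise. */
int pick_variant(unsigned long hwcap)
{
    return HAS_FEATURE(hwcap) ? 1 : 0;
}
```

Timing the same workload with and without the macro defined would give the two numbers to compare before looking at plt versus noplt access.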