This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: HWCAP is method to determine cpu features, not selection mechanism.


On Wed, Jun 10, 2015 at 11:45:53AM -0500, Steven Munroe wrote:
> On Wed, 2015-06-10 at 11:21 -0300, Adhemerval Zanella wrote:
> > 
> > On 10-06-2015 11:16, Szabolcs Nagy wrote:
> > > On 10/06/15 14:35, Adhemerval Zanella wrote:
> > >> I agree that adding an API to modify the current hwcap is not a good
> > >> approach. However the costs you are assuming here are *very* x86 biased,
> > >> where you have only one instruction (movl <variable>(%rip), %<destination>)
> > >> to load an external variable defined in a shared library, where for
> > >> powerpc it is more costly:
> > > 
> > > debian codesearch found 4 references to __builtin_cpu_supports
> > > all seem to avoid using it repeatedly.
> > > 
> > > multiversioning dispatch only happens at startup (for a small
> > > number of functions according to existing practice).
> > > 
> > > so why is hwcap expected to be used in hot loops?
> > > 
> > 
> > Good question, I do not know and I believe Steve could answer this
> > better than me.  I am only advocating here that assuming x86 costs
> > for powerpc is not the way to evaluate this patch.
> > 
> 
> The trade off is that the dynamic solutions (platform library selection
> via AT_PLATFORM) and STT_GNU_IFUNC require a dynamic call which in our
> ABI requires an indirect branch and link via the CTR. There is also the
> overhead of the TOC save/reload.
> 
Wait, you are using dynamic libraries anyway, which already require that,
so it wouldn't make any difference.

Or are you trying to say that you statically link the libraries into
specialized binaries instead of a generic one, and use a simple wrapper
script to run the per-CPU application, like the following one?

# Pick the per-CPU build by looking for the CPU name in /proc/cpuinfo.
if grep -q power11 /proc/cpuinfo; then
  app_power11 "$@"
elif grep -q power10 /proc/cpuinfo; then
  app_power10 "$@"
  ...
fi

> The net is that the trade-offs are different for POWER than for other
> platforms. I spend a lot of time looking at performance data from
> customer applications and see these issues (as measurable additional
> path length and forced hazards).
> 
> So there is a place for this proposed optimization strategy where we can
> avoid the overhead of the dynamic call and substitute the smaller more
> predictable latency of the HWCAP; load word, and immediate record, and
> branch conditional (3 instructions, low cache hazard, and highly
> predictable branch).
> 
But my point is that there should be neither a dynamic call nor an hwcap
branch. As that function is a hot spot, you would gain more by inlining
it and making the decision in the callers.
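
To illustrate, here is a minimal sketch in C of making the decision in
the caller. The feature bit FEATURE_FAST_PATH and the function names are
hypothetical, and sum_fast is only a stand-in for a variant that would
really be built to use the newer instructions. The capability is tested
once, and the loops themselves contain neither a dynamic call nor an
hwcap branch.

```c
#include <stddef.h>

/* Hypothetical feature bit, for illustration only; real bit values come
   from the platform's hwcap headers. */
#define FEATURE_FAST_PATH (1UL << 0)

/* Generic loop: no feature test inside. */
static long sum_generic(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a specialized variant; here it computes the same result
   with the loop unrolled by two. */
static long sum_fast(const int *v, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)          /* odd element left over */
        s0 += v[i];
    return s0 + s1;
}

/* The caller tests the capability once, outside the hot loop. */
long sum_all(const int *v, size_t n, unsigned long hwcap)
{
    if (hwcap & FEATURE_FAST_PATH)
        return sum_fast(v, n);
    return sum_generic(v, n);
}
```

On Linux with glibc, the hwcap argument could be read once at startup
with getauxval(AT_HWCAP) from <sys/auxv.h> and passed down.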


> The concern about the cache foot print does not apply as these fields
> share the cache line with other active TCB fields. This line will be in
> L1 for any active thread.
>
Excellent, you have applications. So you could show that there is some
measurable performance benefit behind your claims.

So, Steven, you have several applications from customers that statically
link every library for performance? I assume so, because if the cost of
the GOT on powerpc is as high as you claim, eliminating those accesses
has a better cost/benefit ratio than eliminating just the PLT entry for
hwcap.

First, report a benchmark of the unchanged application.
Then report the numbers when you use an ifdef to make the check constant
and compile the application with -mcpu=power7, and report the difference
versus the generic build.
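
A minimal sketch of that ifdef, assuming GCC, which to my knowledge
predefines _ARCH_PWR7 when compiling with -mcpu=power7 or later. The
names MY_HWCAP_VSX, have_vsx and pick_variant are hypothetical and the
bit value is a placeholder. In the power7 build the test folds to a
constant, so the compiler can drop the generic branch entirely.

```c
/* Placeholder bit for illustration; the real value comes from the
   platform's hwcap headers. */
#define MY_HWCAP_VSX 0x80UL

#ifdef _ARCH_PWR7
/* Built for power7 or later: the answer is known at compile time. */
static inline int have_vsx(unsigned long hwcap)
{
    (void)hwcap;
    return 1;
}
#else
/* Generic build: test the bit at run time. */
static inline int have_vsx(unsigned long hwcap)
{
    return (hwcap & MY_HWCAP_VSX) != 0;
}
#endif

/* Returns 1 when the specialized code path should be used, else 0. */
int pick_variant(unsigned long hwcap)
{
    return have_vsx(hwcap) ? 1 : 0;
}
```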

Once you have this, you could try to measure the difference between PLT
and no-PLT hwcap access to see whether it is real, or whether you are
just micromanaging and not improving actual performance because you are
spending time on a cold path instead.


