This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



HWCAP is a method to determine cpu features, not a selection mechanism.


On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > > 
> > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > > > The proposed patch adds a new feature for powerpc. In order to get
> > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > > > This enables users to write versioned code based on the HWCAP bits
> > > > > without going through the overhead of reading them from the auxiliary
> > > > > vector.
> > > 
> > > > i assume this is for multi-versioning.
> > > 
> > > The intent is for the compiler to implement the equivalent of
> > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > is RISC so we use the HWCAP. The trick is to access the HWCAP[2]
> > > efficiently, as getauxv and scanning the auxv are too slow for inline
> > > optimizations.
> > > 
> > > > i dont see how the compiler can generate code to access the
> > > > hwcap bits currently (without making assumptions about libc
> > > > interfaces).
> > > > 
> > > These offsets will become a durable part of the PowerPC 64-bit ELF V2 ABI.
> > > 
> > > The TCB offsets are already fixed and can not change from release to
> > > release.
> > > 
> > I don't have a problem with this, but why do you add tls? How can different
> > threads have different ones when the kernel could move them between cores?
> > 
> > So instead we just add the following two variables to the libc api. These
> > would be initialized by the linker, as we will probably use them internally.
> > 
> > extern int __hwcap, __hwcap2;
> > 
> The Power ABIs address the TCB off a dedicated GPR (R2 or R13). This
> guarantees a one-instruction load from the TCB.
> 
> A static variable would require an indirect load via the TOC/GOT
> (which can be megabytes for a large program/library). I really really
> want to avoid that.
> 
> The point is to make fast decisions about which code to execute.
> STT_GNU_IFUNC is just too complicated for most application programmers
> to use.
> 
> Now if the GLIBC community wants to provide a durable API for static
> access to the HWCAP, I have no problem with that, but it does not solve
> this problem.
> 
That's completely false and outright dangerous advice.

First, if ifuncs are too complicated for them to use, they shouldn't
touch hwcap in the first place. Ifuncs are relatively easy to read if
you take optimizing for a specific cpu seriously and are aware of the
precautions you need to take.

If you let other programmers touch hwcap, you get a disaster. You need
to compile each variant separately with the appropriate gcc flags.
Otherwise, if you just make the decision inline, the compiler is free to
insert newer instructions into the generic code. That could lead to
unexpected crashes caused just by compiling with a different gcc than
the one the original programmer used.

So you need a different file for each enabled capability and must
compile these separately. (Or use assembly, but most programmers don't
qualify.) Or you could try adding pragmas to tell gcc which part of a
file should be compiled with which optimizations, but that's even worse
than ifunc.
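
To make the per-function alternative concrete, here is a minimal sketch
using gcc's target attribute and __builtin_cpu_supports. It is x86-only
and guarded accordingly; the names popcount_generic, popcount_hw and
popcount_dispatch are hypothetical, and on other architectures the
sketch simply falls back to the generic variant.

```c
/* Only use the x86 target attribute where it is known to exist. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_TARGET_ATTR 1
#endif

/* Generic variant: built with the baseline flags of the whole file. */
static unsigned popcount_generic(unsigned x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

#ifdef HAVE_X86_TARGET_ATTR
/* Only this function may use popcnt instructions; gcc will not inline
   it into callers that are compiled for the baseline target. */
__attribute__((target("popcnt")))
static unsigned popcount_hw(unsigned x)
{
    return (unsigned)__builtin_popcount(x);
}
#endif

/* The run-time decision lives in baseline code. */
static unsigned popcount_dispatch(unsigned x)
{
#ifdef HAVE_X86_TARGET_ATTR
    if (__builtin_cpu_supports("popcnt"))
        return popcount_hw(x);
#endif
    return popcount_generic(x);
}
```

Every tuned function needs its own attribute, and anything it calls must
be annotated consistently, which is the maintenance burden meant above.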

So you read hwcap and then need to call a function. That indirection
already costs you more than the GOT access you tried to save.
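
A rough sketch of what such a hwcap check followed by a call looks like,
assuming glibc's getauxval from <sys/auxv.h>. DEMO_FEATURE_BIT is a
made-up mask (real masks are architecture specific), and sum_tuned is
only a stand-in for a variant that would live in its own file compiled
with its own flags.

```c
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP (glibc 2.16 or later) */

/* Hypothetical feature bit, for illustration only. */
#define DEMO_FEATURE_BIT (1UL << 0)

static long sum_generic(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a cpu-specific variant; it computes the same result. */
static long sum_tuned(const int *v, size_t n)
{
    long a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < n)
        a += v[i];
    return a + b;
}

typedef long (*sum_fn)(const int *, size_t);

/* Pick a variant from the hwcap word. */
static sum_fn select_sum(unsigned long hwcap)
{
    return (hwcap & DEMO_FEATURE_BIT) ? sum_tuned : sum_generic;
}

/* Public entry point: resolve once, then always call indirectly.
   The lazy initialization is not thread-safe; it is a sketch only. */
static long sum(const int *v, size_t n)
{
    static sum_fn impl;
    if (!impl)
        impl = select_sum(getauxval(AT_HWCAP));
    return impl(v, n);
}
```

Every call to sum pays for a load of impl plus an indirect call, no
matter how cheaply the hwcap word itself was obtained.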

Also, even if you could handle the previous problems with assembly
functions, you lose more cycles than you save, as you can't compile the
file with -march=native. The best solution I have found would be for
distributions to package on the gentoo model: have a variant of each
package for each cpu, which the package manager fetches based on your
cpu, plus a startup script that checks whether the cpu has changed and,
if so, relinks all packages to the generic versions.

That would allow programmers to use #ifdef _HAS_SSE4, for code that's
easier to maintain.
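
As an illustration of that compile-time style, here is a small sketch of
a CRC-32C routine. It assumes the package build passes -D_HAS_SSE4
together with -msse4.2 for the SSE4 variant of the package; _HAS_SSE4 is
the illustrative macro name from above, not something gcc defines
itself (gcc's own macro for this is __SSE4_2__).

```c
#include <stddef.h>
#include <stdint.h>

#ifdef _HAS_SSE4
#include <nmmintrin.h>   /* SSE4.2 intrinsics, needs -msse4.2 */

/* One CRC-32C update step using the hardware instruction. */
static uint32_t crc32c_step(uint32_t crc, uint8_t byte)
{
    return _mm_crc32_u8(crc, byte);
}
#else
/* Portable bitwise step, reflected Castagnoli polynomial. */
static uint32_t crc32c_step(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    return crc;
}
#endif

/* Callers see one function; the variant is fixed when the package is
   built, so there is no run-time check and no indirect call. */
static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = crc32c_step(crc, *p++);
    return crc ^ 0xFFFFFFFFu;
}
```

Both variants produce the same value, for example 0xE3069283 for the
nine-byte string "123456789".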

Finally, while Florian's solution works, your argument is suspect.
First, it costs tls, so it needs to be frequently used. That makes the
address always be in L1 cache, which makes GOT size irrelevant. And if
you have problems with hwcap not being in cache, duplicating it ten
times when you have ten threads would make the situation worse, not
better.

