This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On 06/29/2015 07:37 PM, Steven Munroe wrote:
On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
On 06/09/2015 04:06 PM, Steven Munroe wrote:
On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:

On 08/06/15 22:03, Carlos Eduardo Seo wrote:
The proposed patch adds a new feature for powerpc. In order to get
faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
This enables users to write versioned code based on the HWCAP bits
without going through the overhead of reading them from the auxiliary
vector.

i assume this is for multi-versioning.

The intent is for the compiler to implement the equivalent of
__builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
is RISC so we use the HWCAP. The trick to access the HWCAP[2]
efficiently as getauxv and scanning the auxv is too slow for inline
optimizations.


There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
the variables private to glibc that already contain this information.  That
ought to be fast enough for the builtin, rather than consuming space in the TCB.


Richard I do not understand how a 38 instruction function accessed via a
PLT call stub (minimum 4 additional instructions) is equivalent or "as
good as" a single in-line load instruction.

Even with best case path for getauxval HWCAP2 we are at 14 instructions
with exposure to 3 different branch miss predicts. And that is before
the application can execute its own __builtin_cpu_supports() test.

Lets look at a real customer example. The customer wants to use the P8
128-bit add/sub but also wants to be able to unit test code on existing
P7 machines. Which results in something like this:

static inline vui32_t
vec_addcuq (vui32_t a, vui32_t b)
{
         vui32_t t;

                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSXâ))
                 {

                         __asm__(
                             "vaddcuq %0,%1,%2;"
                             : "=v" (t)
                             : "v" (a),
                               "v" (b)
                             : );
...

So it is clear to me that executing 14+ instruction to decide if I can
optimize to use new single instruction optimization is not a good deal.

This is a horrible way to use this builtin. In the same way that using ifunc at this level would also be horrible.

Even supposing that this builtin uses a single load, you've at least doubled the overhead of using the insn. The user really should be aware of this and manually hoist this check much farther up the call chain. At which point the difference between 2 cycles for a load and 40 cycles for a call is immaterial.

And if the user is really concerned about unit tests, surely ifdefs are more appropriate for this situation. At the moment one can only test the P7 path on P7 and the P8 path on P8; better if one can also test the P7 path on P8.


r~


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]