On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
On 06/09/2015 04:06 PM, Steven Munroe wrote:
On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
On 08/06/15 22:03, Carlos Eduardo Seo wrote:
The proposed patch adds a new feature for powerpc. In order to get
faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
This enables users to write versioned code based on the HWCAP bits
without going through the overhead of reading them from the auxiliary
vector.
i assume this is for multi-versioning.
The intent is for the compiler to implement the equivalent of
__builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
is RISC so we use the HWCAP. The trick is to access the HWCAP[2]
efficiently, as getauxv and scanning the auxv is too slow for inline
optimizations.
There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
the variables private to glibc that already contain this information. That
ought to be fast enough for the builtin, rather than consuming space in the TCB.
Richard, I do not understand how a 38 instruction function accessed via a
PLT call stub (minimum 4 additional instructions) is equivalent or "as
good as" a single in-line load instruction.
Even with best case path for getauxval HWCAP2 we are at 14 instructions
with exposure to 3 different branch mispredicts. And that is before
the application can execute its own __builtin_cpu_supports() test.
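For reference, here is a minimal sketch of the getauxval() path under discussion, as it looks from application code. The helper names hwcap2_has_bit and cached_hwcap2 are hypothetical and are not part of glibc or GCC. The sketch assumes a Linux/glibc system that provides <sys/auxv.h> and AT_HWCAP2. Note that even the cached variant still pays for a validity check and a load from static storage on every test.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <elf.h>        /* AT_HWCAP2 */
#include <sys/auxv.h>   /* getauxval() */

/* Hypothetical helper: test a single HWCAP2 bit by calling getauxval()
   each time.  Every call goes out of line through the PLT.  */
static bool
hwcap2_has_bit (unsigned int bit)
{
  assert (bit < sizeof (unsigned long) * CHAR_BIT);
  return (getauxval (AT_HWCAP2) >> bit) & 1UL;
}

/* Hypothetical helper: fetch the HWCAP2 word once and keep it in a
   static.  Single-threaded sketch only; no synchronization is done.  */
static unsigned long
cached_hwcap2 (void)
{
  static unsigned long cached;
  static bool valid;

  if (!valid)
    {
      cached = getauxval (AT_HWCAP2);
      valid = true;
    }
  return cached;
}
```

A caller would then test a feature with something like (cached_hwcap2 () & mask), where mask is one of the platform's PPC_FEATURE2_* constants.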
Let's look at a real customer example. The customer wants to use the P8
128-bit add/sub but also wants to be able to unit test code on existing
P7 machines, which results in something like this:
static inline vui32_t
vec_addcuq (vui32_t a, vui32_t b)
{
vui32_t t;
if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))
{
__asm__(
"vaddcuq %0,%1,%2;"
: "=v" (t)
: "v" (a),
"v" (b)
: );