This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: munroesj at linux dot vnet dot ibm dot com
- Cc: Carlos O'Donell <carlos at redhat dot com>, Carlos Eduardo Seo <cseo at linux dot vnet dot ibm dot com>, GLIBC Devel <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>, Richard Henderson <rth at redhat dot com>
- Date: Thu, 9 Jul 2015 21:02:52 +0200
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <559617FF dot 8010100 at redhat dot com> <20150703085542 dot GE32307 at domone> <55968AF8 dot 8060104 at redhat dot com> <20150703171121 dot GA23898 at domone> <1436283324 dot 12188 dot 25 dot camel at oc7878010663>
On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
> > But these could be done without much of our help. We need to keep these
> > writable to support this hack. I don't know exact assembly for powerpc,
> > it should be similar to how do it on x64:
> >
> > int x;
> >
> > int foo()
> > {
> > #ifdef SHARED
> > asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
> > #else
> > asm ("lea x(%rip), %rax; movb $32, (%rax)");
> > #endif
> > return &x;
> > }
> >
>
> Not so simple on PowerISA as we don't have PC-relative addressing.
>
> 1) The global entry requires 2 instruction to establish the TOC/GOT
> 2) Medium model requires two instructions (fused) to load a pointer from
> the GOT.
> 3) Finally we can load the cached hwcap.
>
> None of this is required for the TP+offset.
>
And why didn't you write that when it was first suggested? When you don't
answer, it looks like you don't want to answer because that suggestion is
better.
Here the problem isn't the lack of relative addressing but that you don't
start with the GOT in a register.
You certainly could do a similar hack as you do with the tcb and place the
hwcap bits just after that, so you could do just one load.
That you require so many instructions on powerpc is a gcc bug rather than
a rule. You don't need that many instructions when you place frequently
used symbols in the -32768..32767 range. For example, here you could save
one addition:
int x, y;
int foo()
{
return x + y;
}
original:
00000000000007d0 <foo>:
7d0: 02 00 4c 3c addis r2,r12,2
7d4: 30 78 42 38 addi r2,r2,30768
7d8: 00 00 00 60 nop
7dc: 30 80 42 e9 ld r10,-32720(r2)
7e0: 00 00 00 60 nop
7e4: 38 80 22 e9 ld r9,-32712(r2)
7e8: 00 00 6a 80 lwz r3,0(r10)
7ec: 00 00 29 81 lwz r9,0(r9)
7f0: 14 4a 63 7c add r3,r3,r9
7f4: b4 07 63 7c extsw r3,r3
7f8: 20 00 80 4e blr
new:
addis r2,r12,2
ld r10,-1952(r2)
ld r9,-1944(r2)
lwz r3,0(r10)
lwz r9,0(r9)
add r3,r3,r9
extsw r3,r3
blr
> Telling me how x86 does things is not much help.
That is why we need to know how that would work on powerpc.
> >
> > > Without a concrete implementation I can't comment on one or the other.
> > > It is in my opinion overly harsh to force IBM to go implement this new
> > > feature. They have space in the TCB per the ABI and may use it for their
> > > needs. I think the community should investigate symbol address munging
> > > as a method for storing data in addresses and make a generic API from it,
> > > likewise I think the community should investigate standardizing tp+offset
> > > data access behind a set of accessor macros and normalizing the usage
> > > across the 5 or 6 architectures that use it.
> > >
> > I would like this as with access to that I could improve performance of
> > several inlines.
> >
> >
> > > > Also I now have additional comment with api as if you want faster checks
> > > > wouldn't be faster to save each bit of hwcap into byte field so you
> > > > could avoid using mask at each check?
> > >
> > > That is an *excellent* suggestion, and exactly the type of technical
> > > feedback that we should be giving IBM, and Carlos can confirm if they've
> > > tried such "unpacking" of the bits into byte fields. Such unpacking is
> > > common in other machine implementations.
> > >
> This does not help on Power, Any (byte, halfword, word, doubleword,
> quadword) aligned load is the same performance. Splitting our bits to
> bytes just slow things down. Consider:
>
> if (__builtin_cpu_supports(ARCH_2_07) &&
> __builtin_cpu_supports(VEC_CRYPTO))
>
> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
> byte Boolean.
>
> Again value judgements about that is fast or slow can vary by platform.
Instruction count means nothing if you don't have good intuition about
the powerpc platform. Once you take that into account, your three
instructions are a lot slower than byte Booleans.
Use the following benchmark. You need separate compilation to simulate
many calls of a function that uses hwcap and that are not optimized away
by gcc. I added some computation before the hwcap selection because
without it there wouldn't be much difference: with OoO execution it would
mostly measure the latency of the loads. It would still be slower, but it's
1.90s vs 1.92s.
Adding a third check makes no difference, and the case of one check is
obviously faster.
Also, how are you sure that checking several flags happens often enough to
justify any potential savings with more checks, if there were any savings?
The benchmark is as follows:
[neleai@gcc2-power8 ~]$ echo c.c:;cat c.c; echo x.c:;cat x.c;echo y.c:;
cat y.c; gcc -O3 x.c -c; gcc -O3 x.o c.c -o x; gcc -O3 y.c -c; gcc -O3
c.c y.o -o y; time ./x ; time ./y; time ./x; time ./y
c.c:
volatile int v, w;
volatile int u;
int main()
{
u= -1;
v = 1; w = 1;
long i;
unsigned long sum = 0;
for (i=0;i<500000000;i++)
sum += foo(sum, 42);
return sum;
}
x.c:
extern int v,w;
int __attribute__((noinline))foo(int x, int y){
x= 3 * x - 32 + y;
y = 4 * x + 5;
if (v & w)
return 3 * x;
return 5 * y;
}
y.c:
extern int u;
int __attribute__((noinline))foo(int x, int y){
x= 3 * x - 32 + y;
y = 4 * x + 5;
if (((u&((1<<17)|(1<<21)))==((1<<17)|(1<<21))))
return 3 * x;
return 5 * y;
}
real 0m2.390s
user 0m2.389s
sys 0m0.001s
real 0m2.531s
user 0m2.529s
sys 0m0.001s
real 0m2.390s
user 0m2.389s
sys 0m0.001s
real 0m2.532s
user 0m2.530s
sys 0m0.001s