This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
> > But these could be done without much of our help. We need to keep these
> > writable to support this hack. I don't know exact assembly for powerpc,
> > it should be similar to how do it on x64:
> > 
> > int x;
> > 
> > int foo()
> > {
> > #ifdef SHARED
> > asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
> > #else
> > asm ("lea x(%rip), %rax; movb $32, (%rax)");
> > #endif
> > return &x;
> > }
> > 
> 
> Not so simple on PowerISA as we don't have PC-relative addressing.
> 
> 1) The global entry requires 2 instruction to establish the TOC/GOT
> 2) Medium model requires two instructions (fused) to load a pointer from
> the GOT.
> 3) Finally we can load the cached hwcap.
> 
> None of this is required for the TP+offset.
>
And why didn't you write that when it was first suggested? When you don't
answer, it looks like you don't want to answer because that suggestion is better.

The problem here isn't the lack of relative addressing but that you don't
start with the GOT in a register.

You certainly could do a hack similar to the one you do with the TCB and
place the hwcap bits just after it, so that a single load suffices.

That you require so many instructions on powerpc is a gcc bug rather than
the rule. You don't need that many instructions when you place frequently
used symbols in the -32768..32767 range. For example, here you could save
one addition.

int x, y;
int foo()
{
  return x + y;
}

original

00000000000007d0 <foo>:
 7d0:	02 00 4c 3c 	addis   r2,r12,2
 7d4:	30 78 42 38 	addi    r2,r2,30768
 7d8:	00 00 00 60 	nop
 7dc:	30 80 42 e9 	ld      r10,-32720(r2)
 7e0:	00 00 00 60 	nop
 7e4:	38 80 22 e9 	ld      r9,-32712(r2)
 7e8:	00 00 6a 80 	lwz     r3,0(r10)
 7ec:	00 00 29 81 	lwz     r9,0(r9)
 7f0:	14 4a 63 7c 	add     r3,r3,r9
 7f4:	b4 07 63 7c 	extsw   r3,r3
 7f8:	20 00 80 4e 	blr

new

	addis   r2,r12,2
	ld      r10,-1952(r2)
	ld      r9,-1944(r2)
	lwz     r3,0(r10)
	lwz     r9,0(r9)
	add     r3,r3,r9
	extsw   r3,r3
	blr
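
The two listings reach the same GOT slots, which can be checked with a
little arithmetic: addis adds its immediate shifted left by 16 bits, so the
original sequence addresses x's slot at r12 + 2*65536 + 30768 - 32720 =
r12 + 129120, and the new one at r12 + 2*65536 - 1952, which is the same
address. The C sketch below only illustrates that high/low split. The helper
names (fits_disp16, lo_part, hi_part, recombine) are made up for this
example and are not part of gcc or glibc.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 16-bit displacement range (-32768..32767) mentioned in the text. */
static int fits_disp16(int64_t offset)
{
  return offset >= INT16_MIN && offset <= INT16_MAX;
}

/* Low part of an offset: its low 16 bits, sign-extended.
   This is the value that would go into the load's displacement field. */
static int64_t lo_part(int64_t offset)
{
  return ((offset & 0xffff) ^ 0x8000) - 0x8000;
}

/* High part: what remains once the low part is taken out, in units of
   65536.  This is the value an addis-style instruction would add. */
static int64_t hi_part(int64_t offset)
{
  return (offset - lo_part(offset)) / 65536;
}

/* Rebuild the offset from its two parts; the low part always fits the
   signed 16-bit displacement by construction. */
static int64_t recombine(int64_t offset)
{
  int64_t lo = lo_part(offset);
  assert(fits_disp16(lo));
  return hi_part(offset) * 65536 + lo;
}
```

For the offset 129120 computed above, hi_part gives 2 and lo_part gives
-1952, matching the addis/ld pair in the "new" listing; 129128 gives -1944
for the second load.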

 
> Telling me how x86 does things is not much help.

That is why we need to know how that would work on powerpc.

> > 
> > > Without a concrete implementation I can't comment on one or the other.
> > > It is in my opinion overly harsh to force IBM to go implement this new
> > > feature. They have space in the TCB per the ABI and may use it for their
> > > needs. I think the community should investigate symbol address munging
> > > as a method for storing data in addresses and make a generic API from it,
> > > likewise I think the community should investigate standardizing tp+offset
> > > data access behind a set of accessor macros and normalizing the usage
> > > across the 5 or 6 architectures that use it.
> > >
> > I would like this, since with access to that I could improve the
> > performance of several inlines.
> > 
> > 
> > > > I also have an additional comment on the API: if you want faster checks,
> > > > wouldn't it be faster to save each bit of hwcap into a byte field so you
> > > > could avoid using a mask at each check?
> > > 
> > > That is an *excellent* suggestion, and exactly the type of technical
> > > feedback that we should be giving IBM, and Carlos can confirm if they've
> > > tried such "unpacking" of the bits into byte fields. Such unpacking is
> > > common in other machine implementations.
> > >
> This does not help on Power. Any (byte, halfword, word, doubleword,
> quadword) aligned load has the same performance. Splitting our bits into
> bytes just slows things down. Consider:
> 
> if (__builtin_cpu_supports(ARCH_2_07) &&   
>     __builtin_cpu_supports(VEC_CRYPTO))
> 
> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
> byte Boolean. 
> 
> Again, value judgements about what is fast or slow can vary by platform.

Instruction count means nothing if you don't have good intuition about
the powerpc platform. If you measure instead, your three instructions turn
out slower than byte Booleans: 2.53s versus 2.39s in the benchmark below.
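
To make the two variants being compared concrete, here is a minimal C
sketch of a packed-mask check next to a byte-per-flag check. The bit
positions 17 and 21 are taken from y.c in the benchmark below; the struct
and function names are hypothetical and do not correspond to any glibc or
gcc interface.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions borrowed from y.c; not claimed to match a real HWCAP layout. */
#define FEATURE_A_BIT 17u
#define FEATURE_B_BIT 21u
#define FEATURE_MASK \
  ((UINT32_C(1) << FEATURE_A_BIT) | (UINT32_C(1) << FEATURE_B_BIT))

/* Packed form: load one word, mask it, compare against the mask. */
static int has_both_packed(uint32_t hwcap)
{
  return (hwcap & FEATURE_MASK) == FEATURE_MASK;
}

/* Unpacked form: one byte per feature bit, so a single test needs no mask. */
struct hwcap_bytes
{
  uint8_t flag[32];
};

/* Expand the packed word into bytes; in real use this would run once at
   startup, not on every check. */
static struct hwcap_bytes unpack_hwcap(uint32_t hwcap)
{
  struct hwcap_bytes out;
  for (unsigned i = 0; i < 32; i++)
    out.flag[i] = (uint8_t) ((hwcap >> i) & 1u);
  return out;
}

/* Test one feature: a plain byte load. */
static int has_feature_byte(const struct hwcap_bytes *caps, unsigned bit)
{
  assert(bit < 32);
  return caps->flag[bit];
}

/* Test both features on the unpacked form.  Unpacking here on every call
   is only for demonstration, so the two variants can be compared directly. */
static int has_both_unpacked(uint32_t hwcap)
{
  struct hwcap_bytes caps = unpack_hwcap(hwcap);
  return has_feature_byte(&caps, FEATURE_A_BIT)
         & has_feature_byte(&caps, FEATURE_B_BIT);
}
```

Both forms give the same answer for any input; which one is faster is
exactly what has to be measured on the target machine.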

Use the following benchmark. You need separate compilation to simulate
many calls of a function that uses hwcap without gcc optimizing them away.
I put a computation before the hwcap selection because without it there
wouldn't be much difference: with OoO execution it would mostly measure
the latency of the loads. It would still be slower, but it's 1.90s vs 1.92s.

Adding a third check makes no difference, and the case of a single check is
obviously faster.

Also, how are you sure that checking several flags at once happens often
enough to justify any potential savings, if there were any savings at all?

The benchmark is as follows:

[neleai@gcc2-power8 ~]$ echo c.c:;cat c.c; echo x.c:;cat x.c;echo y.c:;
cat y.c; gcc -O3 x.c -c; gcc -O3 x.o c.c -o x; gcc -O3 y.c -c; gcc -O3
c.c y.o -o y; time ./x ; time ./y; time ./x; time ./y

c.c:
volatile int v, w;
volatile int u;
int main()
{
  u= -1;
  v = 1; w = 1;
  long i;
  unsigned long sum = 0;
  for (i=0;i<500000000;i++)
    sum += foo(sum, 42);
  return sum;

}
x.c:
extern int v,w;
int __attribute__((noinline))foo(int x, int y){
 x= 3 * x - 32 + y;
 y = 4 * x + 5;
 if (v & w)
   return 3 * x;
 return 5 * y;
}

y.c:
extern int u;
int __attribute__((noinline))foo(int x, int y){
 x= 3 * x - 32 + y;
 y = 4 * x + 5;
 if (((u&((1<<17)|(1<<21)))==((1<<17)|(1<<21))))
   return 3 * x;
 return 5 * y;
}


real	0m2.390s
user	0m2.389s
sys	0m0.001s

real	0m2.531s
user	0m2.529s
sys	0m0.001s

real	0m2.390s
user	0m2.389s
sys	0m0.001s

real	0m2.532s
user	0m2.530s
sys	0m0.001s

