This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: libc-alpha at sourceware dot org
- Date: Wed, 1 Jul 2015 13:55:35 +0200
- References: <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <5591239A dot 9030907 at twiddle dot net> <1435603025 dot 5485 dot 23 dot camel at oc7878010663> <20150629211831 dot GA23965 at domone> <5591BD23 dot 6090501 at linaro dot org> <20150630031409 dot GA28953 at domone> <5592A310 dot 9010902 at linaro dot org> <20150630211515 dot GA26880 at domone> <55930E26 dot 3050203 at linaro dot org>
On Tue, Jun 30, 2015 at 06:46:14PM -0300, Adhemerval Zanella wrote:
> >> Again this is something, as Steve has pointed out, you only assume without
> >> knowing the subject in depth: it is operating on *vector* registers and
> >> thus it will be more costly to move to and back GPR than just do in
> >> VSX registers. And as Steven has pointed out, the idea is to *validate*
> >> on POWER7.
> > If that is really case then using hwcap for that makes absolutely no sense.
> > Just surround these builtins by #ifdef TESTING and you will compile
> > power7 binary. When you are releasing production version you will
> > optimize that for power8. A difference from just using correct -mcpu
> > could dominate speedups that you try to get with these builtins. Slowing
> > down production application for validation support makes no sense.
> That is a valid point, but as Steve has pointed out the idea is exactly
> to avoid multiple builds.
And that's exactly the problem: you just ignore the solution. Seriously,
when having a single build is more important than a -mcpu that would give
you a 1% performance boost, do you think that a 1% boost from hwcap
selection matters? I could come up with easy suggestions like changing the
makefile to create app_power7 and app_power8 in a single build. Then
app_power7 could check whether the cpu supports power8 instructions and
exec app_power8. I really don't see why you insist on a single build when
best practice is to separate testing and production.
Insisting that you need a single binary would mean that you should stick
with power7 optimization and not bother with hwcap instruction
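A minimal sketch of the launcher idea above, assuming a Linux/glibc
system with getauxval(). The binary name "./app_power8" and the helper
names are hypothetical; the PPC_FEATURE2_ARCH_2_07 fallback define is
only there so the sketch compiles where the powerpc headers are absent.

```c
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */
#include <unistd.h>     /* execv */

/* ISA 2.07 (POWER8) bit in AT_HWCAP2; fallback for non-powerpc hosts. */
#ifndef PPC_FEATURE2_ARCH_2_07
#define PPC_FEATURE2_ARCH_2_07 0x80000000UL
#endif

/* Returns 1 if the given AT_HWCAP2 word advertises ISA 2.07, else 0. */
int wants_power8(unsigned long hwcap2)
{
    return (hwcap2 & PPC_FEATURE2_ARCH_2_07) != 0;
}

/* Meant to be called first thing in app_power7's main().
   Returns 0 if we should keep running the power7 build,
   -1 if the exec of the power8 build was attempted and failed.
   On success execv() does not return. */
int maybe_exec_power8(char **argv)
{
    if (!wants_power8(getauxval(AT_HWCAP2)))
        return 0;
    execv("./app_power8", argv);   /* hypothetical path */
    perror("execv");
    return -1;
}
```

The check runs once per process, so its cost does not matter, and each
binary can then be compiled with the matching -mcpu.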
> > Also you didn't answered my question, it works in both ways.
> > From that example his uses vector register doesn't follow that
> > application should use vector registers. If user does
> > something like in my example, the cost of gpr -> vector conversion will
> > harm performance and he should keep these in gpr.
> And again you make assumptions that you do not know: what if the program
> is made with vectors in mind and they want to process it as uint128_t if
> it is the case? You do know that neither the program constraints so
> assuming that it would be better to use GPR may not hold true.
I didn't make that assumption.
I just said that your assumption that one must use vector
registers is wrong again. From my previous mail:
> Customer just wants to do 128 additions. If a fastest way
> is with GPR registers then he should use gpr registers.
> My claim was that this leads to slow code on power7. Fallback above
> takes 14 cycles on power8 and 128bit addition is similarly slow.
> Yes you could craft expressions that exploit vectors by doing ands/ors
> with 128bit constants but if you mostly need to sum integers and use 128
> bits to prevent overflows then gpr is correct choice due to transfer
Yes, it isn't known, but it's more likely that programmers just used that
as a counter rather than for vector magic. So we need to see the use case
in more detail.
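To illustrate the counter case, here is a sketch of a 128-bit accumulator
kept entirely in general-purpose registers. The type and function names
are made up for this example, and it says nothing about actual cycle
counts on power7 or power8.

```c
#include <stddef.h>
#include <stdint.h>

/* A 128-bit unsigned value held as two 64-bit integer halves. */
typedef struct { uint64_t hi, lo; } u128;

/* Add a 64-bit value into the accumulator using only integer ops. */
u128 u128_add64(u128 acc, uint64_t x)
{
    acc.lo += x;
    acc.hi += (acc.lo < x);   /* carry out of the low half */
    return acc;
}

/* Sum 64-bit integers into 128 bits so the total cannot overflow
   for any array that fits in memory. */
u128 sum_u64(const uint64_t *v, size_t n)
{
    u128 acc = { 0, 0 };
    for (size_t i = 0; i < n; i++)
        acc = u128_add64(acc, v[i]);
    return acc;
}
```

With this shape the values never leave the gprs, so there is no
gpr -> vector transfer to pay for.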
> >>> I am telling all time that there are better alternatives where this
> >>> doesn't matter.
> >>> One example would be write gcc pass that runs after early inlining to
> >>> find all functions containing __builtin_cpu_supports, cloning them to
> >>> replace it by constant and adding ifunc to automatically select variant.
> >> Using internal PLT calls to such mechanism is really not the way to handle
> >> performance for powerpc.
> > No you are wrong again. I wrote to introduce ifunc after inlining. You
> > do inlining to eliminate call overhead. So after inlining effect of
> > adding plt call is minimal, otherwise gcc should inline that to improve
> > performance in first place.
> It is the case if you have the function definition, which might not be
> true. But this is not the case since the code could be in a shared
> library.
Seriously? If it's a function from a shared library then it should use
ifunc and not force every caller to keep its hwcap selection in sync with
the library, and you need the plt indirection anyway.
For a missing function definition, again pick the low-hanging fruit and
use -flto. It is really a preexisting problem, as you would also gain
performance by fixing it in the first place.
Also, it's a bit off topic, but you don't need an internal plt for ifunc,
as that is an implementation detail. You could do it with any ifunc if we
decide that eager resolution is ok.
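A hand-written sketch of what the cloning idea would amount to. This is
not a real ifunc and not output of any existing gcc pass: a plain
function pointer, resolved once, stands in for the ifunc resolver. All
names are hypothetical and both clones have identical placeholder bodies.

```c
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */

/* ISA 2.07 (POWER8) bit in AT_HWCAP2; fallback for non-powerpc hosts. */
#ifndef PPC_FEATURE2_ARCH_2_07
#define PPC_FEATURE2_ARCH_2_07 0x80000000UL
#endif

typedef long (*sum_fn)(const long *, size_t);

/* Clone where __builtin_cpu_supports() was replaced by "false". */
long sum_generic(const long *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Clone where it was replaced by "true"; a real clone would use
   power8-only instructions here. */
long sum_power8(const long *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Plays the role of the ifunc resolver: pick a clone from hwcap. */
sum_fn resolve_sum(unsigned long hwcap2)
{
    return (hwcap2 & PPC_FEATURE2_ARCH_2_07) ? sum_power8 : sum_generic;
}

/* Entry point: hwcap is consulted once, not inside the hot loop. */
long sum_dispatch(const long *v, size_t n)
{
    static sum_fn impl;
    if (!impl)
        impl = resolve_sum(getauxval(AT_HWCAP2));
    return impl(v, n);
}
```

With a real ifunc the selection happens at relocation time, so even the
null check in sum_dispatch goes away.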
If the plt situation is as bad on power as you claim then you should
write plt elision. The idea is that the loader would generate branch
instructions for all used functions instead of plt stubs. For
autogenerated ifuncs gcc could prepare a page for each processor and the
runtime could do a single mmap according to hwcap per process.
> > Also why are you so sure that its code in main binary and not code in
> > shared library?
Could you answer that? One should put reusable parts of a program in
shared libraries.
> >> What does it have to do with vectors? I just saying that in split-core mode
> >> the CPU group dispatches are statically allocated for the eight threads
> >> and thus pipeline gain are lower. And indeed it was not the case for the
> >> example (I rushed without doing the math, my mistake again).
> > And you are telling that in majority of time contested threads would be
> > problem? Do you have statistic how often that happens?
> > Then I would be more worried about vector implementation than gpr one.
> > It goes both ways. A slowdown in gpr code is relatively unlikely for
> > simple economic reasons: As addition, shifts... are frequent
> > instructions, one of the best performance/silicon tradeoffs is to add more
> > execution units that do that until slowdown become unlikely. On other
> > hand for rarely used instructions that doesn't make sense so I wouldn't
> > be much surprised that when all threads would do 128bit vector addition it
> > would get slow as they contest only one execution unit that could do
> > that.
> Seriously, split-core is not really about contested threads, but rather
> a way to set the core specially in KVM mode.
I just tried to understand why your example is relevant. I jumped a bit
to the conclusion that a split core is equivalent to a contested cpu. If
you run three other cpu-intensive threads then you will get similar cpu
dispatches as when you use split-core.
Also you didn't answer my question whether split core is used often or
is just a corner case. If it's less than 1% then we shouldn't optimize
that corner case and you shouldn't post an irrelevant technical detail in
> But we digress here, since
> the idea is not analyse Steve code snippet if this is faster, better, etc;
> but rather if hwcap using TCB access is better way to handle such compiler
It is, as the main objection was whether this helps at all. If you don't
want to show that this is better than the current state, we can conclude:
As this snippet was invalid, no example where one needs to access hwcap
often was offered. Existing applications read hwcap once per run.
So any proposal to optimize hwcap access should be dropped, as the
current code gives reasonable performance.
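For reference, a sketch of the read-once pattern mentioned above,
assuming Linux/glibc getauxval() and the GCC/Clang constructor attribute.
The variable and function names are hypothetical.

```c
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */

/* Filled in once before main() runs. */
static unsigned long cached_hwcap;

__attribute__((constructor))
static void init_hwcap(void)
{
    cached_hwcap = getauxval(AT_HWCAP);
}

/* Later feature checks are a plain load from a static variable. */
unsigned long get_cached_hwcap(void)
{
    return cached_hwcap;
}
```

After startup each check costs one load and one test, which is the
baseline any TCB-based scheme would have to beat.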