This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Steven Munroe <munroesj at linux dot vnet dot ibm dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, libc-alpha at sourceware dot org
- Date: Tue, 07 Jul 2015 10:47:36 -0500
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <5591239A dot 9030907 at twiddle dot net> <1435603025 dot 5485 dot 23 dot camel at oc7878010663> <20150629211831 dot GA23965 at domone> <5591BD23 dot 6090501 at linaro dot org> <20150630031409 dot GA28953 at domone> <5592A310 dot 9010902 at linaro dot org> <20150630211515 dot GA26880 at domone> <55930E26 dot 3050203 at linaro dot org> <20150701115535 dot GA3025 at domone>
- Reply-to: munroesj at linux dot vnet dot ibm dot com
On Wed, 2015-07-01 at 13:55 +0200, Ondřej Bílka wrote:
> On Tue, Jun 30, 2015 at 06:46:14PM -0300, Adhemerval Zanella wrote:
> > >> Again this is something, as Steve has pointed out, that you only assume
> > >> without knowing the subject in depth: it is operating on *vector* registers
> > >> and thus it will be more costly to move to GPR and back than just to do it
> > >> in VSX registers. And as Steven has pointed out, the idea is to *validate*
> > >> on POWER7.
> > >
> > > If that is really the case then using hwcap for that makes absolutely no sense.
> > > Just surround these builtins with #ifdef TESTING and you will compile a
> > > power7 binary. When you release the production version you will
> > > optimize it for power8. The difference from just using the correct -mcpu
> > > could dominate the speedups that you try to get with these builtins. Slowing
> > > down a production application for validation support makes no sense.
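The compile-time switch being proposed could look like the sketch below. This is only an illustration: `TESTING`, `cpu_has_power8`, and `sum` are hypothetical names, and the probe is hard-wired so the sketch builds anywhere (on powerpc it would be `__builtin_cpu_supports("arch_2_07")` or an HWCAP2 bit from the TCB).

```c
#include <stddef.h>

/* Hypothetical stand-in for a runtime feature probe; on powerpc this
   would be __builtin_cpu_supports("arch_2_07") or an HWCAP2 bit test.
   Hard-wired to 0 here so the sketch runs on any machine.  */
static int cpu_has_power8(void)
{
    return 0;
}

static long sum_scalar(const long *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* With -DTESTING the runtime check stays in, so the power7 fallback
   path can be validated on power8 hardware; the production build,
   compiled with the right -mcpu, drops the check entirely.  */
long sum(const long *v, size_t n)
{
#ifdef TESTING
    if (!cpu_has_power8())
        return sum_scalar(v, n);   /* power7 fallback under test */
#endif
    return sum_scalar(v, n);       /* power8-tuned path in production */
}
```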
> > That is a valid point, but as Steve has pointed out the idea is exactly
> > to avoid multiple builds.
> And that's exactly the problem: you just ignore the solution. Seriously, when
> having a single build is more important than a -mcpu that would give you a 1%
> performance boost, do you think that a 1% boost from hwcap selection
> matters? I could come up with easy suggestions, like changing the makefile to
> create app_power7 and app_power8 in a single build. Then app_power7 could
> check if the machine supports power8 instructions and exec app_power8. I really
> wonder why you insist on a single build when best practice is to separate
> testing and production.
> Insisting that you need a single binary would mean that you should stick
> with the power7 optimization and not bother with hwcap selection at all.
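The launcher scheme described above can be sketched as a tiny C helper. The probe and the binary names (`app_power7`, `app_power8` come from the mail; `pick_variant` is a hypothetical name) are placeholders; on Linux/powerpc the probe would be `getauxval(AT_HWCAP2) & PPC_FEATURE2_ARCH_2_07`.

```c
#include <string.h>

/* Hypothetical probe; on Linux/powerpc this would be
   getauxval(AT_HWCAP2) & PPC_FEATURE2_ARCH_2_07.  Hard-wired to 0
   here so the sketch runs anywhere.  */
static int cpu_has_power8(void)
{
    return 0;
}

/* The launcher checks the CPU once and picks the variant built for
   it; both binaries come out of the same build.  */
const char *pick_variant(void)
{
    return cpu_has_power8() ? "./app_power8" : "./app_power7";
}

/* A real app_power7 stub would then do
       execv(pick_variant(), argv);
   and simply continue with its own power7 code if the exec fails.  */
```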
> > >
> > >
> > > Also, you didn't answer my question; it works both ways.
> > > From the fact that his example uses vector registers it doesn't follow that
> > > the application should use vector registers. If the user does
> > > something like in my example, the cost of the gpr -> vector conversion will
> > > harm performance and he should keep these in gpr.
> > And again you make assumptions about things you do not know: what if the
> > program is made with vectors in mind and they want to process the data as
> > uint128_t in that case? You do not know the program's constraints either, so
> > assuming that it would be better to use GPR may not hold true.
> I didn't make that assumption.
> I just said that your assumption that one must use vector
> registers is wrong again. From my previous mail:
> > The customer just wants to do 128-bit additions. If the fastest way
> > is with GPR registers then he should use GPR registers.
> > My claim was that this leads to slow code on power7. The fallback above
> > takes 14 cycles on power8 and 128-bit addition is similarly slow.
> > Yes, you could craft expressions that exploit vectors by doing ands/ors
> > with 128-bit constants, but if you mostly need to sum integers and use 128
> > bits to prevent overflows then gpr is the correct choice due to the
> > transfer cost.
> Yes, it isn't known, but it's more likely that programmers just used that
> as a counter instead of vector magic. So we need to see the use case in more
> detail.
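The counter scenario described here, summing 64-bit values into a 128-bit accumulator to avoid overflow, can be sketched as follows; `sum_u64` is a hypothetical name, and the claim about GPR lowering reflects how GCC handles `unsigned __int128` on powerpc.

```c
#include <stddef.h>
#include <stdint.h>

/* 128-bit accumulation to prevent overflow while summing 64-bit
   counters.  GCC lowers unsigned __int128 addition to GPR
   add-with-carry (addc/adde on powerpc), so keeping the accumulator
   in GPRs avoids any GPR<->VSX transfer cost.  */
unsigned __int128 sum_u64(const uint64_t *v, size_t n)
{
    unsigned __int128 s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
```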
> > >>> I am telling you all the time that there are better alternatives where this
> > >>> doesn't matter.
> > >>>
> > >>> One example would be to write a gcc pass that runs after early inlining to
> > >>> find all functions containing __builtin_cpu_supports, clone them to
> > >>> replace it by a constant, and add an ifunc to automatically select the variant.
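By hand, such a pass could emit something like the sketch below: two clones of a function with the `__builtin_cpu_supports` predicate folded to a constant in each, plus a GNU ifunc resolver that picks one at load time. All names here are hypothetical, and the probe is hard-wired to 0 so the sketch builds anywhere GCC on ELF supports the `ifunc` attribute.

```c
/* Hypothetical stand-in for __builtin_cpu_supports("arch_2_07") or
   an HWCAP2 bit test; hard-wired so the sketch runs anywhere.  */
static int cpu_has_power8(void)
{
    return 0;
}

static int f_power8(void) { return 8; }  /* clone with predicate folded to 1 */
static int f_power7(void) { return 7; }  /* clone with predicate folded to 0 */

/* The resolver runs once, at dynamic-link time; afterwards every call
   to f() goes straight to the selected clone with no runtime check.  */
static int (*resolve_f(void))(void)
{
    return cpu_has_power8() ? f_power8 : f_power7;
}

int f(void) __attribute__((ifunc("resolve_f")));
```

A plain once-initialized function pointer gives the same effect where GNU ifunc is not available.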
> > >>
> > >> Using internal PLT calls for such a mechanism is really not the way to
> > >> handle performance for powerpc.
> > >>
> > > No, you are wrong again. I wrote to introduce the ifunc after inlining. You
> > > do inlining to eliminate call overhead. So after inlining the effect of
> > > adding a plt call is minimal; otherwise gcc should have inlined it to improve
> > > performance in the first place.
> > That holds if you have the function definition available, which might not
> > be true. And here it is not, since the code could be in a shared
> > library.
> Seriously? If it's a function from a shared library then it should use an ifunc
> and not force every caller to keep hwcap selection in sync with the library,
> and you need the plt indirection anyway.
If you believe so strongly that ifunc is the best solution then I
suggest you look at the 1000s of packages in a Linux distro and see how
many of them use IFUNC or any of the other suggested techniques.
My survey shows very few.
So your issue is not with me but with the world at large.
If you want this to be a serious option then you need to convince all of