This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Steven Munroe <munroesj at linux dot vnet dot ibm dot com>
- To: Rich Felker <dalias at libc dot org>
- Cc: Szabolcs Nagy <szabolcs dot nagy at arm dot com>, Carlos Eduardo Seo <cseo at linux dot vnet dot ibm dot com>, GLIBC Devel <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>
- Date: Tue, 09 Jun 2015 13:43:09 -0500
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <55770ABA dot 1010205 at arm dot com> <20150609165018 dot GK17573 at brightrain dot aerifal dot cx> <1433871424 dot 21101 dot 44 dot camel at sjmunroe-ThinkPad-W500> <20150609174202 dot GL17573 at brightrain dot aerifal dot cx>
- Reply-to: munroesj at linux dot vnet dot ibm dot com
On Tue, 2015-06-09 at 13:42 -0400, Rich Felker wrote:
> On Tue, Jun 09, 2015 at 12:37:04PM -0500, Steven Munroe wrote:
> > On Tue, 2015-06-09 at 12:50 -0400, Rich Felker wrote:
> > > On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote:
> > > > >> if hwcap is useful abi between compiler and libc
> > > > >> then why is this done in a powerpc specific way?
> > > > >
> > > > > Other platforms are free to use this technique.
> > > >
> > > > i think this is not a sustainable approach for
> > > > compiler abi extensions.
> > > >
> > > > (it means juggling with magic offsets on the order
> > > > of compilers * libcs * targets).
> > > >
> > > > unfortunately accessing the ssp canary is already
> > > > broken this way, i'm not sure what's a better abi,
> > > > but it's probably worth thinking about one before
> > > > the tcb code gets too messy.
> > >
> > > For the canary I think it makes sense, even though it's ugly -- the
> > > compiler has to generate a reference in every single function (for
> > > 'all' mode, or just most non-trivial functions in 'strong' mode).
> > > That's much different from a feature (hwcap) that should only be used
> > > at init-time and where, even if programmers did abuse it and use it
> > > over and over at runtime, it's only going to be a small constant
> > > overhead in a presumably medium to large sized function, and the cost
> > > is only the need to setup the GOT register and load from the GOT,
> > > anyway.
> >
> > You are entitled to your own opinion, but you are not accounting for
> > the aggressive out-of-order execution of the POWER processors and the
> > specifics of the PowerISA. In the time it takes to load indirect via
> > the TOC (4 cycles minimum) and compare/branch, we could have executed
> > 12-16 useful instructions.
> >
> > Any indirection exposes the sequences to hazards (like cache miss) which
> > only make things worse.
> >
> > As stated before I have thought about this and understand the options in
> > the context of the PowerISA, POWER micro-architecture, and the PowerPC
> > ABIs. This information is publicly available (if a little hard to find)
> > but I doubt you have taken the time to study it in detail, if at all.
> >
> > I suspect you base your opinion on other architectures and hardware
> > implementations that do not apply to this situation.
>
> That's nice but all theoretical. I've seen countless such theoretical
> claims from people who are coming from a standpoint of the vendor
> manuals for the ISA they're working with, and more often than not,
> they don't translate into measurable benefits. (I've been guilty of
> this myself too, going to great lengths to tweak x86 codegen or even
> write the asm by hand, only to find the resulting code to run the
> exact same speed.) Creating a permanent ABI is an extremely high cost,
> and unless you can justify the cost with actual measurements and a
> reason to believe those measurements have anything to do with
> real-world usage needs, I believe it's an unjustified cost.
>
This is not theory. I am thinking at the level of pipeline cycle timing
for P7/P8. I have been at this so long I can do this in my head.
Now experience does tell me that adding an indirection, and the
associated exposure to cache-miss hazards, can mean that the performance
optimization gets lost in the hazard when it is measured.
I have been to this movie; I don't need to see it again.
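
For readers following the thread, the two usage patterns being argued over can be sketched in C. This is only an illustration under stated assumptions, not the mechanism in the patch: it reads the hwcap word with getauxval(AT_HWCAP) (Linux, glibc 2.16 or later) rather than from the TCB, and DEMO_FEATURE_BIT, sum_generic, sum_feature, pick_sum and sum_checked are hypothetical names invented for the example.

```c
#include <assert.h>
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP: Linux, glibc 2.16 or later */

/* Hypothetical feature bit, for illustration only. Real code would test a
   platform constant such as PPC_FEATURE_HAS_ALTIVEC. */
#define DEMO_FEATURE_BIT (1UL << 0)

typedef long (*sum_fn)(const int *v, int n);

/* Baseline implementation. */
static long sum_generic(const int *v, int n)
{
    assert(n >= 0);
    long total = 0;
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Stand-in for a feature-specific variant; it computes the same result. */
static long sum_feature(const int *v, int n)
{
    assert(n >= 0);
    long total = 0;
    for (int i = n - 1; i >= 0; i--)
        total += v[i];
    return total;
}

/* Init-time pattern: query hwcap once and hand back a function pointer.
   The caller keeps the pointer and never tests the bit again. */
static sum_fn pick_sum(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);
    return (hwcap & DEMO_FEATURE_BIT) ? sum_feature : sum_generic;
}

/* Per-call pattern: load a cached hwcap word and branch on every call.
   The load of the cached word is the indirection under discussion.
   The lazy initialisation here is not thread-safe; it is a sketch only. */
static long sum_checked(const int *v, int n)
{
    static unsigned long hwcap;
    static int loaded;
    if (!loaded) {
        hwcap = getauxval(AT_HWCAP);
        loaded = 1;
    }
    return (hwcap & DEMO_FEATURE_BIT) ? sum_feature(v, n) : sum_generic(v, n);
}
```

A caller using the init-time pattern would store the result of pick_sum() at startup and call through it afterwards, while sum_checked() repeats the load and branch on each call. Whether that repeated load matters in practice is the point of disagreement above, and this sketch does not settle it.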