This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Steven Munroe <munroesj at linux dot vnet dot ibm dot com>
- To: Rich Felker <dalias at libc dot org>
- Cc: libc-alpha at sourceware dot org
- Date: Wed, 01 Jul 2015 14:12:20 -0500
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <20150609163835 dot GI17573 at brightrain dot aerifal dot cx>
- Reply-to: munroesj at linux dot vnet dot ibm dot com
On Tue, 2015-06-09 at 12:38 -0400, Rich Felker wrote:
> On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
> > The proposed patch adds a new feature for powerpc. In order to get
> > faster access to the HWCAP/HWCAP2 bits, we now store them in the
> > TCB. This enables users to write versioned code based on the HWCAP
> > bits without going through the overhead of reading them from the
> > auxiliary vector.
> > A new API is published in ppc.h for get/set the bits in the
> > aforementioned memory area (mainly for gcc to use to create
> > builtins).
> Do you have any justification (actual performance figures for a
> real-world usage case) for adding ABI constraints like this? This is
> not something that should be done lightly. My understanding is that
> hwcap bits are normally used in initializing functions pointers (or
> equivalent things like ifunc resolvers), not again and again at
> runtime, so I'm having a hard time seeing how this could help even if
> it does make the individual hwcap accesses measurably faster.
> It would also be nice to see some justification for the magic number
> offsets. Will they be stable under changes to the TCB structure or
> will preserving them require tip-toeing around them?
This discussion has metastasized into so many side discussions, meta
discussions, personal opinions, etc. that I would like to start over at the
point where we were still discussing how to implement something.
First a level set on requirements and goals.
The intent is to allow application developers to develop new applications
for Linux on Power and to simplify the porting of existing Linux applications
to Power, and to encourage them to apply the same level of platform
optimization to Power as they do for other Linux platforms.
There is a near infinity of options (some of which some members
of this community think are stupid), and I have seen them all being used.
As a general rule I find it counterproductive to call the customer (all
Linux application developers are our customers) stupid to their face, so I
try to explain the options and encourage them to use many of the
techniques that this community thinks are not stupid.
But as a rule application developers are busy and don't have much patience
for nonsense like IFUNC and AT_PLATFORM library search strategies. They
tend to use what they already know, apply minimal effort to solve the
immediate problem, and move on!
One of the "things they already know" is the __builtin_cpu_is() and
__builtin_cpu_supports() GCC builtins for x86. The goal of this simple
proposal is to enable that for powerpc, powerpc64 and powerpc64le, based on
the existing AT_HWCAP/AT_HWCAP2 mechanisms.
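For readers who have not seen the x86 idiom being referred to, a minimal
sketch follows. It is an illustration, not part of the patch: the function
names and the "sse4.2" feature string are chosen for the example, and on
targets other than x86 the sketch simply falls back to the generic path.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero when the tuned path may be used.
   Only the existing x86 builtin is assumed; other targets fall back. */
static int have_fast_path(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse4.2") != 0;
#else
    return 0;
#endif
}

/* Portable version. */
static unsigned long sum_generic(const unsigned char *p, size_t n)
{
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Placeholder for a tuned version; it must return the same result. */
static unsigned long sum_tuned(const unsigned char *p, size_t n)
{
    return sum_generic(p, n);
}

/* The feature test runs on every call, so its cost is paid every time. */
unsigned long sum_bytes(const unsigned char *p, size_t n)
{
    assert(p != NULL || n == 0);
    return have_fast_path() ? sum_tuned(p, n) : sum_generic(p, n);
}
```

Because the test sits inline in the hot function rather than in a one-time
resolver, the path length of the feature check itself becomes part of the
cost of every call.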
Another observation is that many of these applications are deployed as
shared object libraries and frequently are not linked directly to the
main application but loaded via dlopen() at runtime. So clever
solutions that are only simple and/or fast from a main program but
difficult and/or slow for a dlopen() library are not an option.
They are very firm about a "single binary build" for all supported
distros and all supported hardware generations.
And finally these applications tend to be massive C++ programs composed
of smallish member functions and byzantine layers of templates. I have
not observed wide use of private/hidden and so these libraries tend to
expose every member function as a PLT entry, which resists most
Net: this is a harder problem than it looks.
So let's write down some requirements:
0) Something the average application developer will understand and use.
1) In any user library, including ones loaded via LD_PRELOAD and
2) Across multiple Distro versions and across Distros (using different
And goals for the Power implementation:
1) As fast as possible accounting for the limits of the ABI, ISA and
1a) Minimal path length to obtain the hwcap bit vector for test
1b) Limited exposure to micro-architecture hazards including
2) Simple and reliable initialization of the cached values.
3) And without relying on .text relocation in libraries.
First let's dispose of the obvious: extern static variables.
This is not horrible for larger grained examples but can be less than
optimal for fine grained C++ examples. As stated above the hwcap will
not be local to the user library. As PowerISA does not have PC-relative
addressing, our ABI requires that R2 (AKA the TOC pointer) is set to
address the local (for this library) GOT/TOC/PLT before we access any
static variable, and an extern requires an indirect load of the extern hwcap
address from the GOT/TOC.
In addition, since we are potentially changing R2 (AKA the TOC pointer)
we are now obligated to save and restore R2.
Now the design of POWER assumes a RISC architecture with lots of
registers. Being designed for massive memory bandwidth and
out-of-order execution, the processor core is not optimized for
programs that store to and then immediately reload from a memory location.
In a machine with 16 pipelines per core and capable of dispatching up to
8 instructions per cycle, "immediate" has an amazingly broad definition
(many 10s of instructions).
So the store and reload of the TOC pointer can hit the Load-hit-store
hazard (essentially the load got issued (out-of-order) before the store
it depended on was complete or at a stage where a bypass was available)
even across the execution of the called function. While the core detects
and corrects this state, it does so in a heavy handed way (instruction
rejects (11 cycles each) or instruction fetch flush (worse)). Let's just
say this is something to avoid if you can.
So introducing a static variable to C++ functions that would not
normally access statics should be avoided. Many C++ member functions are
small enough to execute completely within the available (volatile) registers
and don't even need a stack frame. So a __builtin_cpu_supports() design
based on a non-local extern static would be an unforced error in these
Of course the TCB based proposal avoids all of this because the TCB
pointer (R13) is constant across all functions in a thread (not
saved/restored in the user application).
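To make the shape of the TCB access concrete, here is a small portable model
in C. It is only a sketch under stated assumptions: the struct layout, the
offset macro and the function name are hypothetical, and an ordinary pointer
stands in for the thread pointer that on Power lives in R13. The real offsets
are defined by the ABI and the patch, not by this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a thread control block. */
struct model_tcb {
    uint64_t reserved_hwcap;   /* models the reserved words holding HWCAP/HWCAP2 */
    uint64_t other_fields[3];  /* models unrelated TCB contents */
};

/* Fixed displacement known at compile time (illustrative value only). */
#define MODEL_HWCAP_OFFSET offsetof(struct model_tcb, reserved_hwcap)

/* One load at a fixed displacement from the thread pointer, then a mask.
   No GOT/TOC lookup, no call, no save/restore of another register. */
static inline int model_cpu_supports(const void *thread_pointer, uint64_t mask)
{
    uint64_t bits;
    assert(thread_pointer != NULL);
    memcpy(&bits, (const unsigned char *)thread_pointer + MODEL_HWCAP_OFFSET,
           sizeof bits);
    return (bits & mask) != 0;
}
```

The point of the model is that the feature test needs nothing beyond a
register that is already live in every function plus a constant offset.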
Now for the next obvious case: why not a normal TLS variable?
If you think about the requirements for a while it becomes clear. As the
HWCAP cache would have to be defined and initialized in either libgcc or
libc, access will be non-local from any user library. So all the local
TLS access optimizations are disallowed. Adding the requirement to support
dlopen() libraries leaves the general dynamic TLS model as the ONLY
This requires an up-call to __tls_get_addr plus accessing a couple of
TLS relocations from the GOT. And __tls_get_addr is in
ld64.so.2, which requires a PLT call stub that saves and restores the
TOC pointer. Remember the previous discussion about TOC save/restore and
exposure to the load-hit-store hazard?
Now there were a lot of suggestions to just force the HWCAP TLS
variables into the initial-exec or local-exec TLS model with an attribute.
This would resolve to a direct TLS offset in some special reserved TLS
How does that work with a library loaded with dlopen()? How does that
work with a library linked with one toolchain / GLIBC on Distro X and
run on a system with a different toolchain and GLIBC on Distro Y? With
different versions of GLIBC? Will HWCAP get the same TLS offset? Do we
end up with .text relocations that we are also trying to avoid?
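For reference, the attribute-based suggestion would look roughly like the
following in source. This is a sketch only: the variable and function names
are hypothetical, and the variable is defined in the same file so the example
stands alone, whereas in the scenario under discussion it would be defined in
libc or libgcc and merely declared extern in the user library.

```c
#include <assert.h>

/* Hypothetical per-thread HWCAP cache, forced to the initial-exec model
   with GCC's tls_model variable attribute. */
__thread unsigned long hwcap_tls
    __attribute__((tls_model("initial-exec")));

/* Would be done once per thread by whoever owns the cache. */
void hwcap_tls_set(unsigned long bits)
{
    hwcap_tls = bits;
}

/* The per-call feature test a user library would perform. */
int hwcap_tls_test(unsigned long mask)
{
    assert(mask != 0);
    return (hwcap_tls & mask) != 0;
}
```

The source is short, but the questions above are about what the static and
dynamic linkers have to do with it, which the source does not show.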
Again the TCB avoids all of this as it provides a fixed offset defined
by the ABI and does not require any up-calls or indirection. It also
works in any library without induced hazards. This clearly works
across distros, including previous versions of GLIBC, as the words were
previously reserved by the ABI. Application libraries that need to run
on older distros can add a __builtin_cpu_init() to their library init or,
if threaded, to their thread create function.
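A library-init hook of the kind mentioned above could be sketched as follows.
This is an assumption-laden illustration, not the proposed implementation:
the mylib_* names are hypothetical, a plain static variable stands in for the
reserved TCB words, and getauxval() is used as the source of the bits. Only
the library-init case is shown; a threaded library would need the equivalent
call in its thread start function.

```c
#include <assert.h>
#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#endif

static unsigned long mylib_hwcap;   /* stand-in for the cached HWCAP word */
static int mylib_init_done;

/* Runs when the library is loaded, including via dlopen(). */
__attribute__((constructor))
static void mylib_init(void)
{
#if defined(__linux__)
    mylib_hwcap = getauxval(AT_HWCAP);
#endif
    mylib_init_done = 1;
}

/* Feature test against the cached value. */
int mylib_has(unsigned long mask)
{
    assert(mylib_init_done);
    return (mylib_hwcap & mask) != 0;
}
```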