This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On Tue, 2015-06-30 at 18:01 +0200, Torvald Riegel wrote:
> On Tue, 2015-06-30 at 10:07 -0500, Steven Munroe wrote:
> > On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> > > On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > > > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> > > >> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> > > >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > >>>>
> > > >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > >>>>> The proposed patch adds a new feature for powerpc. In order to get
> > > >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > >>>>> This enables users to write versioned code based on the HWCAP bits
> > > >>>>> without going through the overhead of reading them from the auxiliary
> > > >>>>> vector.
> > > >>>
> > > >>>> i assume this is for multi-versioning.
> > > >>>
> > > >>> The intent is for the compiler to implement the equivalent of
> > > >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > >>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > > >>> efficiently as getauxv and scanning the auxv is too slow for inline
> > > >>> optimizations.
> > > >>>
> > > >>
> > > >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
> > > >> the variables private to glibc that already contain this information.  That
> > > >> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> > > >>
> > > >
> > > > Richard I do not understand how a 38 instruction function accessed via a
> > > > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > > > good as" a single in-line load instruction.
> > > >
snip
> > > This is a horrible way to use this builtin. In the same way that using ifunc at 
> > > this level would also be horrible.
> > > 
> > Yes it is just an example, there are many more that you might find less
> > objectionable.
> 
> Could you give more examples that give a clearer picture why you think
> that the concern that Richard raised isn't valid?  Especially this here
> regarding 2 vs 40 cycles:
> 
I don't see where Richard raise a 2 vs 40 cycle comparison. I think it
was my comment that getauxval was to heavy (38+4 insturctions) for this
and similar cases.

The classic case was Decimal Floating Point when we introduced HW DFP in
POWER6 but needed to support older systems. In many cases we had to
choose between a single DFP instruction and calling software emulation.
The difference are 10-100 to 1 in performance.

DFP was heavily used in DB2, Oracle, SAP, but even though I provided CPU
tuned library (using the AT_PLATFORM dynamic library search capability)
implementation they refused to use it because it was Linux specific.
They did use a cruder version of __builtin_cpu_supports() in the
"portable" implementation they got from IBM research.

There are also enough differences between POWER7 and POWER8 Vector
capabilities (POWER8 added 120 new instructions) to cause the High
Performance Computing Folks fits. While they are more likely to use
#ifdef _ARCH_PWR8 (then most application developers) they still "want
to" provide a "single binary build" supporting multiple machines and
distros. And they don't seem too interested in exotic techniques like
IFUNC.


> > > Even supposing that this builtin uses a single load, you've at least doubled 
> > > the overhead of using the insn. The user really should be aware of this and 
> > > manually hoist this check much farther up the call chain.  At which point the 
> > > difference between 2 cycles for a load and 40 cycles for a call is immaterial.
> > >
> > > And if the user is really concerned about unit tests, surely ifdefs are more 
> > > appropriate for this situation.  At the moment one can only test the P7 path on 
> > > P7 and the P8 path on P8; better if one can also test the P7 path on P8.
> > > 
> > 
> > Yes I know there are better alternatives.

But 4-6 cycles beats 20-40 cycles every time.
> > 
> > This is not intended for use within GLIBC or by knowledgeable folks like
> > yourself and the GLIBC community. 
> > 
> > This about application developers in other communities and users where I
> > would settle for them to just support my platform with any optimization
> > that is somewhat sane.
> > 
> > The __Builtin_cpu_supports exist and I see its use. I don't see much use
> > of the more "complicated" approaches that we use in GLIBC. So it seem
> > reasonable to enable __builtin_cpu_supports for POWER but define the
> > implementation to be optimal for the PowerPC platform.
> > 
> > Most the argument against seems to be based on assumed "moral hazard".
> > Where you think what they are doing is stupid and so you refuse to help
> > them with any mechanisms that might make what they are doing, and will
> > continue to do, a little less stupid.
> 
> I didn't understand Richard's concerns to be about that.  Rather, it
> seemed to me he's concerned about supporting use cases that only mean
> technical debt for us; if there is a much simpler way on the users' side
> to do this right, we have to see whether we get a good balance between
> technical debt and benefits for some users.
> 
I do not understand this.

How is this any different then stack_guard and __private_ss (split
stack) which exist in X86_64 as well. These are non-version accesses to
the TCB from GCC. 

Both have added fields for TM support. And we have already added the EBB
and DSO/TAR support entries specific to PowerISA, while X86_64 contains
a number of TCB fields specific to SSE and AVX extensions.

Nothing new here folks just the Platform extending its own ABI to solve
platform specific problems within the toolchain.

Note I deliberately said ABI and Toolchain (the combined compiler (GCC,
LLVM, ...) linker (Binutils), Dynamic Linker and Posix Runtime (GLIBC))
not just GLIBC.

So the TCB, like the stack layout, register conventions, calling
conventions, etc is full of fixed (non-versioned) offsets; defined,
controlled, and documented by the ABI, known and used as necessarily by
toolchain components to implement the ABI. That is how this stuff works!

Yes TCB field offsets can never change and this was established long ago
in the original ABI and platform supplements.

This is why I argue that this proposal does not imply any "technical
debt" on GLIBC but just part of the ongoing evolution of the ABI as
implemented by the complete toolchain stack.

It is the responsibility of the Platform ABI owner to coordinate
implementation of the ABI to the various tool-chain components and
communities, including GLIBC.

It turns out GLIBC is the best place to get the hwcap fields initialized
for each thread. 


> > I appreciate the concern, but think this is odd position for a community
> > that uses phrases like "Free as in Freedom" to describe what they do.
> 
> I don't think we promise to do everything for everyone.  That does not
> conflict with free software.
> 



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]