This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Steven Munroe <munroesj at linux dot vnet dot ibm dot comcom>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: Richard Henderson <rth at twiddle dot net>, Szabolcs Nagy <szabolcs dot nagy at arm dot com>, Carlos Eduardo Seo <cseo at linux dot vnet dot ibm dot com>, GLIBC Devel <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>
- Date: Tue, 30 Jun 2015 15:02:18 -0500
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <5591239A dot 9030907 at twiddle dot net> <1435603025 dot 5485 dot 23 dot camel at oc7878010663> <55923C17 dot 6000400 at twiddle dot net> <1435676863 dot 10780 dot 24 dot camel at oc7878010663> <1435680067 dot 4216 dot 15 dot camel at localhost dot localdomain>
- Reply-to: munroesj at linux dot vnet dot ibm dot com
On Tue, 2015-06-30 at 18:01 +0200, Torvald Riegel wrote:
> On Tue, 2015-06-30 at 10:07 -0500, Steven Munroe wrote:
> > On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> > > On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > > > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> > > >> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> > > >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > >>>>
> > > >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > >>>>> The proposed patch adds a new feature for powerpc. In order to get
> > > >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > >>>>> This enables users to write versioned code based on the HWCAP bits
> > > >>>>> without going through the overhead of reading them from the auxiliary
> > > >>>>> vector.
> > > >>>
> > > >>>> i assume this is for multi-versioning.
> > > >>>
> > > >>> The intent is for the compiler to implement the equivalent of
> > > >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > > >>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> > > >>> efficiently as getauxv and scanning the auxv is too slow for inline
> > > >>> optimizations.
> > > >>>
> > > >>
> > > >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
> > > >> the variables private to glibc that already contain this information. That
> > > >> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> > > >>
> > > >
> > > > Richard I do not understand how a 38 instruction function accessed via a
> > > > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > > > good as" a single in-line load instruction.
> > > >
snip
> > > This is a horrible way to use this builtin. In the same way that using ifunc at
> > > this level would also be horrible.
> > >
> > Yes it is just an example, there are many more that you might find less
> > objectionable.
>
> Could you give more examples that give a clearer picture why you think
> that the concern that Richard raised isn't valid? Especially this here
> regarding 2 vs 40 cycles:
>
I don't see where Richard raise a 2 vs 40 cycle comparison. I think it
was my comment that getauxval was to heavy (38+4 insturctions) for this
and similar cases.
The classic case was Decimal Floating Point when we introduced HW DFP in
POWER6 but needed to support older systems. In many cases we had to
choose between a single DFP instruction and calling software emulation.
The difference are 10-100 to 1 in performance.
DFP was heavily used in DB2, Oracle, SAP, but even though I provided CPU
tuned library (using the AT_PLATFORM dynamic library search capability)
implementation they refused to use it because it was Linux specific.
They did use a cruder version of __builtin_cpu_supports() in the
"portable" implementation they got from IBM research.
There are also enough differences between POWER7 and POWER8 Vector
capabilities (POWER8 added 120 new instructions) to cause the High
Performance Computing Folks fits. While they are more likely to use
#ifdef _ARCH_PWR8 (then most application developers) they still "want
to" provide a "single binary build" supporting multiple machines and
distros. And they don't seem too interested in exotic techniques like
IFUNC.
> > > Even supposing that this builtin uses a single load, you've at least doubled
> > > the overhead of using the insn. The user really should be aware of this and
> > > manually hoist this check much farther up the call chain. At which point the
> > > difference between 2 cycles for a load and 40 cycles for a call is immaterial.
> > >
> > > And if the user is really concerned about unit tests, surely ifdefs are more
> > > appropriate for this situation. At the moment one can only test the P7 path on
> > > P7 and the P8 path on P8; better if one can also test the P7 path on P8.
> > >
> >
> > Yes I know there are better alternatives.
But 4-6 cycles beats 20-40 cycles every time.
> >
> > This is not intended for use within GLIBC or by knowledgeable folks like
> > yourself and the GLIBC community.
> >
> > This about application developers in other communities and users where I
> > would settle for them to just support my platform with any optimization
> > that is somewhat sane.
> >
> > The __Builtin_cpu_supports exist and I see its use. I don't see much use
> > of the more "complicated" approaches that we use in GLIBC. So it seem
> > reasonable to enable __builtin_cpu_supports for POWER but define the
> > implementation to be optimal for the PowerPC platform.
> >
> > Most the argument against seems to be based on assumed "moral hazard".
> > Where you think what they are doing is stupid and so you refuse to help
> > them with any mechanisms that might make what they are doing, and will
> > continue to do, a little less stupid.
>
> I didn't understand Richard's concerns to be about that. Rather, it
> seemed to me he's concerned about supporting use cases that only mean
> technical debt for us; if there is a much simpler way on the users' side
> to do this right, we have to see whether we get a good balance between
> technical debt and benefits for some users.
>
I do not understand this.
How is this any different then stack_guard and __private_ss (split
stack) which exist in X86_64 as well. These are non-version accesses to
the TCB from GCC.
Both have added fields for TM support. And we have already added the EBB
and DSO/TAR support entries specific to PowerISA, while X86_64 contains
a number of TCB fields specific to SSE and AVX extensions.
Nothing new here folks just the Platform extending its own ABI to solve
platform specific problems within the toolchain.
Note I deliberately said ABI and Toolchain (the combined compiler (GCC,
LLVM, ...) linker (Binutils), Dynamic Linker and Posix Runtime (GLIBC))
not just GLIBC.
So the TCB, like the stack layout, register conventions, calling
conventions, etc is full of fixed (non-versioned) offsets; defined,
controlled, and documented by the ABI, known and used as necessarily by
toolchain components to implement the ABI. That is how this stuff works!
Yes TCB field offsets can never change and this was established long ago
in the original ABI and platform supplements.
This is why I argue that this proposal does not imply any "technical
debt" on GLIBC but just part of the ongoing evolution of the ABI as
implemented by the complete toolchain stack.
It is the responsibility of the Platform ABI owner to coordinate
implementation of the ABI to the various tool-chain components and
communities, including GLIBC.
It turns out GLIBC is the best place to get the hwcap fields initialized
for each thread.
> > I appreciate the concern, but think this is odd position for a community
> > that uses phrases like "Free as in Freedom" to describe what they do.
>
> I don't think we promise to do everything for everyone. That does not
> conflict with free software.
>