This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Steven Munroe <munroesj at linux dot vnet dot ibm dot comcom>
- To: OndÅej BÃlka <neleai at seznam dot cz>
- Cc: "Carlos O'Donell" <carlos at redhat dot com>, Carlos Eduardo Seo <cseo at linux dot vnet dot ibm dot com>, GLIBC Devel <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>, Richard Henderson <rth at redhat dot com>
- Date: Tue, 07 Jul 2015 10:35:24 -0500
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <559617FF dot 8010100 at redhat dot com> <20150703085542 dot GE32307 at domone> <55968AF8 dot 8060104 at redhat dot com> <20150703171121 dot GA23898 at domone>
- Reply-to: munroesj at linux dot vnet dot ibm dot com
On Fri, 2015-07-03 at 19:11 +0200, OndÅej BÃlka wrote:
> On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote:
> > On 07/03/2015 04:55 AM, OndÅej BÃlka wrote:
> > >> At the end of the day it's up to IBM to make the best use of the
> > >> tp+offset data stored in the TCB, but every byte you save is another
> > >> byte you can use later for something else.
> > >>
> > > Carlos a problem with this patch is that they ignored community
> > > feedback. Early in this thread Florian come with better idea to use
> > > GOT+offset that could be accessed as
> > > &hwcap_hack and avoids per-thread runtime overhead.
> >
> > Steven and Carlos have not ignored the community feedback, they just
> > have a different set of priorities and requirements. There is little
> > to discuss if your priorities and requirements are different.
> >
> > The use of tp+offset data is indeed a scarce resource that should be
> > used only when absolutely necessary or when the use case dictates.
> >
> > It is my opinion as a developer, that Carlos' patch is flawed because
> > it uses a finite resource, namely tp+offset data, for what I perceive
> > to be a flawed design pattern that as a free software developer I don't
> > want to encourage. These are not entirely technical arguments though,
> > they are subjective and based on my desire to educate and mentor developers
> > who write such code. I don't present these arguments as sustained
> > opposition to the patch because they are not technical and Carlos
> > has a need to accelerate this use case today.
> >
Value judgments about what is precious can vary.
On Power CPU, cycles and hazard avoidance is more precious then a double
word or two. On a machine with 64KB pages, 128-byte cache-lines, and
supported memory configs up to 32TB, this is a good trade-off.
I am not trying to impose this on any one else.
> > I have only a few substantive technical issues with the patch. Given
> > that the ABI allocates a large block of tp+offset data, I think it is
> > OK for IBM to use the data in this way. For example I think it is much
> > much more serious that such a built application will likely just crash
> > when run with an older glibc. This is a distribution maintenance issue
> > that I can't ignore and I'd like to see it solved by a dependency on a
> > versioned dummy symbol.
> >
We agree to add the symbol check and fail the app it is loading an old
GLIBC.
> > Lastly, the symbol address hack is an incomplete solution because Florian
> > has not provided an implementation. Depending on the implementation it
> > may require a new relocation, and that is potentially more costly to the
> > program startup than the present process for filling in HWCAP/HWCAP2.
>
> Thats valid concern. My idea was checking if hwcap_hack relocation exist.
> I didn't realize that it scales with number of libraries.
>
> One of reasons why I didn't like this proposal is that it harms linux
> ecosystem as it increases startup cost of a bit everything while its
> unlikely that cross-platform projects will use this.
>
> But these could be done without much of our help. We need to keep these
> writable to support this hack. I don't know exact assembly for powerpc,
> it should be similar to how do it on x64:
>
> int x;
>
> int foo()
> {
> #ifdef SHARED
> asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
> #else
> asm ("lea x(%rip), %rax; movb $32, (%rax)");
> #endif
> return &x;
> }
>
Not so simple on PowerISA as we don't have PC-relative addressing.
1) The global entry requires 2 instruction to establish the TOC/GOT
2) Medium model requires two instructions (fused) to load a pointer from
the GOT.
3) Finally we can load the cached hwcap.
None of this is required for the TP+offset.
Telling me how x86 does things is not much help.
>
> > Without a concrete implementation I can't comment on one or the other.
> > It is in my opinion overly harsh to force IBM to go implement this new
> > feature. They have space in the TCB per the ABI and may use it for their
> > needs. I think the community should investigate symbol address munging
> > as a method for storing data in addresses and make a generic API from it,
> > likewise I think the community should investigate standardizing tp+offset
> > data access behind a set of accessor macros and normalizing the usage
> > across the 5 or 6 architectures that use it.
> >
> I would like this as with access to that I could improve performance of
> several inlines.
>
>
> > > Also I now have additional comment with api as if you want faster checks
> > > wouldn't be faster to save each bit of hwcap into byte field so you
> > > could avoid using mask at each check?
> >
> > That is an *excellent* suggestion, and exactly the type of technical
> > feedback that we should be giving IBM, and Carlos can confirm if they've
> > tried such "unpacking" of the bits into byte fields. Such unpacking is
> > common in other machine implementations.
> >
This does not help on Power, Any (byte, halfword, word, doubleword,
quadword) aligned load is the same performance. Splitting our bits to
bytes just slow things down. Consider:
if (__builtin_cpu_supports(ARCH_2_07) &&
__builtin_cpu_supports(VEC_CRYPTO))
This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
byte Boolean.
Again value judgements about that is fast or slow can vary by platform.