This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On Fri, 2015-07-03 at 19:11 +0200, OndÅej BÃlka wrote:
> On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote:
> > On 07/03/2015 04:55 AM, OndÅej BÃlka wrote:
> > >> At the end of the day it's up to IBM to make the best use of the
> > >> tp+offset data stored in the TCB, but every byte you save is another
> > >> byte you can use later for something else.
> > >>
> > > Carlos a problem with this patch is that they ignored community
> > > feedback. Early in this thread Florian come with better idea to use
> > > GOT+offset that could be accessed as 
> > > &hwcap_hack and avoids per-thread runtime overhead.
> > 
> > Steven and Carlos have not ignored the community feedback, they just
> > have a different set of priorities and requirements. There is little
> > to discuss if your priorities and requirements are different.
> > 
> > The use of tp+offset data is indeed a scarce resource that should be
> > used only when absolutely necessary or when the use case dictates.
> > 
> > It is my opinion as a developer, that Carlos' patch is flawed because
> > it uses a finite resource, namely tp+offset data, for what I perceive
> > to be a flawed design pattern that as a free software developer I don't
> > want to encourage. These are not entirely technical arguments though,
> > they are subjective and based on my desire to educate and mentor developers
> > who write such code. I don't present these arguments as sustained
> > opposition to the patch because they are not technical and Carlos
> > has a need to accelerate this use case today.
> > 
Value judgments about what is precious can vary.

On Power CPU, cycles and hazard avoidance is more precious then a double
word or two. On a machine with 64KB pages, 128-byte cache-lines, and
supported memory configs up to 32TB, this is a good trade-off.

I am not trying to impose this on any one else.

> > I have only a few substantive technical issues with the patch. Given
> > that the ABI allocates a large block of tp+offset data, I think it is
> > OK for IBM to use the data in this way. For example I think it is much
> > much more serious that such a built application will likely just crash
> > when run with an older glibc. This is a distribution maintenance issue
> > that I can't ignore and I'd like to see it solved by a dependency on a
> > versioned dummy symbol.
> > 

We agree to add the symbol check and fail the app it is loading an old
GLIBC.

> > Lastly, the symbol address hack is an incomplete solution because Florian
> > has not provided an implementation. Depending on the implementation it
> > may require a new relocation, and that is potentially more costly to the
> > program startup than the present process for filling in HWCAP/HWCAP2.
> 
> Thats valid concern. My idea was checking if hwcap_hack relocation exist. 
> I didn't realize that it scales with number of libraries.
> 
> One of reasons why I didn't like this proposal is that it harms linux
> ecosystem as  it increases startup cost of a bit everything while its 
> unlikely that cross-platform projects will use this.
> 
> But these could be done without much of our help. We need to keep these
> writable to support this hack. I don't know exact assembly for powerpc,
> it should be similar to how do it on x64:
> 
> int x;
> 
> int foo()
> {
> #ifdef SHARED
> asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
> #else
> asm ("lea x(%rip), %rax; movb $32, (%rax)");
> #endif
> return &x;
> }
> 

Not so simple on PowerISA as we don't have PC-relative addressing.

1) The global entry requires 2 instruction to establish the TOC/GOT
2) Medium model requires two instructions (fused) to load a pointer from
the GOT.
3) Finally we can load the cached hwcap.

None of this is required for the TP+offset.

Telling me how x86 does things is not much help.
> 
> > Without a concrete implementation I can't comment on one or the other.
> > It is in my opinion overly harsh to force IBM to go implement this new
> > feature. They have space in the TCB per the ABI and may use it for their
> > needs. I think the community should investigate symbol address munging
> > as a method for storing data in addresses and make a generic API from it,
> > likewise I think the community should investigate standardizing tp+offset
> > data access behind a set of accessor macros and normalizing the usage
> > across the 5 or 6 architectures that use it.
> >
> I would like this as with access to that I could improve performance of
> several inlines.
> 
> 
> > > Also I now have additional comment with api as if you want faster checks
> > > wouldn't be faster to save each bit of hwcap into byte field so you
> > > could avoid using mask at each check?
> > 
> > That is an *excellent* suggestion, and exactly the type of technical
> > feedback that we should be giving IBM, and Carlos can confirm if they've
> > tried such "unpacking" of the bits into byte fields. Such unpacking is
> > common in other machine implementations.
> >
This does not help on Power, Any (byte, halfword, word, doubleword,
quadword) aligned load is the same performance. Splitting our bits to
bytes just slow things down. Consider:

if (__builtin_cpu_supports(ARCH_2_07) &&   
    __builtin_cpu_supports(VEC_CRYPTO))

This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
byte Boolean. 

Again value judgements about that is fast or slow can vary by platform.



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]