This is the mail archive of the mailing list for the glibc project.
Re: powerpc __tls_get_addr call optimization
- From: Rich Felker <dalias at libc dot org>
- To: Alan Modra <amodra at gmail dot com>
- Cc: Carlos O'Donell <carlos at redhat dot com>, libc-alpha at sourceware dot org
- Date: Sat, 21 Mar 2015 00:36:30 -0400
- Subject: Re: powerpc __tls_get_addr call optimization
- References: <20150318061145 dot GE24573 at bubble dot grove dot modra dot org> <5509B0D4 dot 2020903 at redhat dot com> <20150319025631 dot GC28603 at bubble dot grove dot modra dot org> <550B94FC dot 3070903 at redhat dot com> <20150320075502 dot GC26234 at bubble dot grove dot modra dot org> <20150320152712 dot GK23507 at brightrain dot aerifal dot cx> <20150321030702 dot GD26234 at bubble dot grove dot modra dot org>
On Sat, Mar 21, 2015 at 01:37:02PM +1030, Alan Modra wrote:
> On Fri, Mar 20, 2015 at 11:27:12AM -0400, Rich Felker wrote:
> > On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
> > > On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
> > > > On 03/18/2015 10:56 PM, Alan Modra wrote:
> > > > > On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
> > > > >> On 03/18/2015 02:11 AM, Alan Modra wrote:
> > > > >>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
> > > > >>> revisiting an old patch of mine.
> > > > >>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
> > > > >>
> > > > >> I'm not against this patch, but it certainly seems like you would be
> > > > >> better served by just implementing tls descriptors?
> > > > >
> > > > > I think this is one better than tls descriptors, because powerpc
> > > > > avoids the indirect function call used by tls descriptors.
> > > >
> > > > You mean to say it is "faster" than tls descriptors, but at the same
> > >
> > > To be honest, there isn't much difference in the optimized case where
> > > static TLS is available. It boils down to an indirect call to a
> > > function that loads one value vs. a direct call to a stub that loads
> > > two values and compares one against zero. I think what I've
> > > implemented is slightly better for PowerPC, but whether that would
> > > carry over to other architectures is debatable.
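The comparison above can be modeled in C. This is a minimal sketch under stated assumptions, not glibc or binutils code. Every name in it (tls_index_model, opt_stub, slow_tls_get_addr, tlsdesc_model, static_tls_entry, tlsdesc_access, fake_tls_area, fake_dynamic_block) is hypothetical, although the ti_module/ti_offset field names mirror glibc's tls_index. A static array stands in for the thread pointer, and the real fast paths are a handful of hand-written PowerPC instructions rather than C functions. The sketch also assumes that a zero module id in the argument pair marks a variable placed in static TLS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: one array plays the static TLS area addressed
   from the thread pointer, another plays a dynamically allocated block. */
static char fake_tls_area[256];
static char fake_dynamic_block[64];

/* Model of the (module id, offset) pair passed to __tls_get_addr. */
typedef struct {
    size_t ti_module;
    ptrdiff_t ti_offset;
} tls_index_model;

/* Hypothetical slow path standing in for __tls_get_addr itself. */
static void *slow_tls_get_addr(const tls_index_model *ti)
{
    assert(ti->ti_module != 0);         /* only reached for dynamic TLS */
    return fake_dynamic_block + ti->ti_offset;
}

/* powerpc-style linker stub: two loads and one compare against zero.
   Assumption: module id 0 means the offset is thread-pointer relative. */
static void *opt_stub(const tls_index_model *ti)
{
    size_t module = ti->ti_module;      /* load 1 */
    ptrdiff_t offset = ti->ti_offset;   /* load 2 */
    if (module == 0)
        return fake_tls_area + offset;  /* static TLS: return at once */
    return slow_tls_get_addr(ti);       /* dynamic TLS: full call */
}

/* TLSDESC-style descriptor: an entry point plus a single argument. */
typedef struct tlsdesc_model tlsdesc_model;
struct tlsdesc_model {
    ptrdiff_t (*entry)(const tlsdesc_model *);
    ptrdiff_t arg;
};

/* Static-TLS resolver: one load of a thread-pointer-relative offset. */
static ptrdiff_t static_tls_entry(const tlsdesc_model *d)
{
    return d->arg;
}

/* Call site: an indirect call through the descriptor's entry field. */
static void *tlsdesc_access(const tlsdesc_model *d)
{
    return fake_tls_area + d->entry(d);
}
```

In the static TLS case both paths yield the same address; what differs is one indirect call plus one load, versus one direct call plus two loads and a compare.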
> > If the performance difference isn't measurable in real-world
> > applications, I would think uniformity between targets would be a lot
> > more valuable.
> Think of my design as "TLS descriptors version 2". I take the best
> features of TLS descriptors and add one trick, the special linker
> stub, that allows you to omit many of the nasty details of the current
> TLS descriptor design. A target that currently has TLS support but no
> TLS descriptor support and follows the powerpc design:
> 1) won't need to implement gcc changes for tls descriptors,
> 2) won't need to define new relocations,
> 3) won't need to implement linker support for tls descriptors, quite a
> large effort, and
> 4) won't need to implement dl-tlsdesc.S and tlsdesc.c in glibc, also
> not a simple task.
> Another benefit in terms of reliability (and repeatable user timing!)
> is that extended TLS descriptors are not needed, so the locking and
> mallocing in tlsdeschtab.h is avoided.
If the lazy allocation stuff is removed (which it should be; it breaks
AS-safety and other things), the last issue would go away.
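A minimal sketch of what lazy allocation means, assuming a simplified per-thread vector of TLS block pointers indexed by module id. The names dtv_model, module_size, allocations and lazy_tls_get_addr are hypothetical, and glibc's real DTV handling differs in detail. The point is only that the first access to a module's TLS, rather than dlopen, is where the block gets allocated, so an ordinary __thread access can end up in malloc, which is not async-signal-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_MODULES 8

/* Hypothetical per-thread vector of TLS block pointers, indexed by
   module id. NULL means "not allocated yet" under lazy allocation. */
static void *dtv_model[MAX_MODULES];
static size_t module_size[MAX_MODULES] = { 0, 64, 128 };
static unsigned allocations;    /* counts allocations done on access */

/* Lazy variant: the first access to a module's TLS allocates its block.
   Zero-filling via calloc stands in for copying the TLS init image. */
static void *lazy_tls_get_addr(size_t module, size_t offset)
{
    assert(module > 0 && module < MAX_MODULES);
    assert(module_size[module] > 0);
    if (dtv_model[module] == NULL) {
        /* malloc-family call inside a variable access: not AS-safe. */
        void *block = calloc(1, module_size[module]);
        if (block == NULL)
            abort();    /* a __thread access has no way to report failure */
        dtv_model[module] = block;
        allocations++;
    }
    return (char *)dtv_model[module] + offset;
}
```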
> Admittedly, part of the reason a port is so much easier is due to
> omitting lazy TLS resolution. Lazy TLS is complex. What's more, the
> per-target support code is non-trivial. All of tlsdesc.c and half of
> dl-tlsdesc.S is lazy TLS support. I question whether the added
> complexity provides commensurate benefit in real-world applications,
> apart from the degenerate case of loading a shared library that is
> never used. (And even then, you'd need a lot of __thread variables to
> make it worthwhile.)
> In fact, I wouldn't be surprised to find lazy TLS has a net negative
> benefit in real-world applications!
> /me dons asbestos suit. :)
I completely agree. I want to see it removed.
> > I also don't see how your approach is a "direct call". The function
> > being called is in a different DSO so it has to go through a pointer
> > in the GOT or similar, in which case it's just as "indirect" as the
> > TLSDESC call would be.
> It is a direct call to the linker provided stub, which will return
> after a few instructions in the optimized case when static TLS is
That linker-provided stub address is loaded from a "GOT slot" of some
sort, just like the tlsdesc function would be. Either way you have a
PC/GP-relative load followed by a jump to the loaded address. There's
actually one additional level of indirection to load this pointer for
TLSDESC, but for static TLS, the callee returns instantly after
performing a single load.
With non-TLSDESC dynamic TLS, on the other hand, there's an additional
PC/GP-relative address computation (for the module/offset structure's
address to pass) in the caller, which should equal out with the cost
of the extra indirection for TLSDESC. But then there's a fair bit of
additional work to be done in the callee.
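The callee's extra work in the non-TLSDESC dynamic case can be sketched as follows. This is a simplified model, not glibc's __tls_get_addr: the names (dtv_model, global_generation, module_blocks, update_dtv, gd_tls_get_addr) are hypothetical, all blocks are assumed to be allocated already, and the thread's vector is passed explicitly instead of being found through the thread pointer. On the caller side, the only extra step is materializing the address of the module/offset pair, here &ti.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MODULES 8

/* Model of the (module id, offset) pair whose address the caller passes. */
typedef struct {
    size_t ti_module;
    size_t ti_offset;
} tls_index_model;

/* Hypothetical per-thread vector: the generation it was last synced
   with, plus one block pointer per module id (slot 0 unused). */
typedef struct {
    size_t generation;
    char *block[MAX_MODULES];
} dtv_model;

static size_t global_generation = 1;        /* bumped by dlopen/dlclose */
static char module_blocks[MAX_MODULES][64]; /* pre-allocated TLS blocks */
static unsigned dtv_updates;

/* Hypothetical resync: point every slot at its module's block. */
static void update_dtv(dtv_model *dtv)
{
    for (size_t m = 1; m < MAX_MODULES; m++)
        dtv->block[m] = module_blocks[m];
    dtv->generation = global_generation;
    dtv_updates++;
}

/* Work beyond the static-TLS fast path: a generation check, a possible
   resync, then an indexed load and an add. */
static void *gd_tls_get_addr(dtv_model *dtv, const tls_index_model *ti)
{
    assert(ti->ti_module > 0 && ti->ti_module < MAX_MODULES);
    if (dtv->generation != global_generation)
        update_dtv(dtv);
    return dtv->block[ti->ti_module] + ti->ti_offset;
}
```

Even when no resync is needed, each call in this model pays a compare and an indexed load that the static TLS fast path avoids.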
> Control is passed to __tls_get_addr_opt only when no static TLS was
> available for the shared library at the time the library was
> dynamically relocated, i.e. it was dlopen'ed and not enough spare
> static TLS was free.
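The relocation-time decision described above can be sketched as follows, assuming ld.so records its choice in each variable's module/offset argument pair while processing the library's dynamic relocations. The names (tls_index_model, module_model, relocate_tls_pair, in_static_tls, tp_offset) are hypothetical, and real offsets also carry target-specific biases that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-variable argument pair, filled in by ld.so. */
typedef struct {
    size_t ti_module;
    ptrdiff_t ti_offset;
} tls_index_model;

/* Hypothetical view of where a loaded module's TLS block ended up. */
typedef struct {
    size_t module_id;       /* module ids start at 1 */
    bool in_static_tls;     /* did spare static TLS fit this module? */
    ptrdiff_t tp_offset;    /* block's offset from the thread pointer */
} module_model;

/* If the block landed in static TLS, store module id 0 and a
   thread-pointer-relative offset so the stub's fast path fires;
   otherwise store the real module id and the variable's offset within
   the module's block, so the stub falls through to __tls_get_addr. */
static void relocate_tls_pair(tls_index_model *ti, const module_model *m,
                              ptrdiff_t var_offset)
{
    assert(m->module_id != 0);
    if (m->in_static_tls) {
        ti->ti_module = 0;
        ti->ti_offset = m->tp_offset + var_offset;
    } else {
        ti->ti_module = m->module_id;
        ti->ti_offset = var_offset;
    }
}
```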
Where is control passed if static TLS was used? Maybe I'm
misunderstanding your design? How would the dynamic linker resolve
some calls to __tls_get_addr to different places than other calls,
when there's only a single GOT entry for it?