This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Questions about powerpc __tls_get_addr optimization
On Fri, Oct 12, 2018 at 11:38:14AM +1030, Alan Modra wrote:
> On Thu, Oct 11, 2018 at 05:26:54PM -0400, Rich Felker wrote:
> > Alan, can you (or anyone) shed some light on why the DT_PPC_OPT flag
> > is needed for the dynamic linker to be able to apply your
> > __tls_get_addr optimization? Assuming the dynamic linker implements
> > the real function __tls_get_addr, it could do the same modid==0 check
> > itself, without needing assistance from ld.
>
> I think you are correct.
>
> > My concern with doing this would be that there's no relocation on the
> > second (offset) slot when the local-dynamic model is used, and I
> > thought ld would be doing something special to account for this when
> > the optimization is used, but apparently your code in the glibc
> > dynamic linker just ignores this and fills in both slots when
> > processing the R_PPC64_DTPMOD64 relocation.
>
> Right, the slots are filled in for local dynamic at that point.

Do you think it's safe to overwrite these in the dynamic linker?
They're initialized with the 0x8000 offset on ppc and mips, which is
problematic because __tls_get_addr then has to add 0x8000, which takes
2 insns on powerpc. Obviously the DTV entries could be pre-offset by
0x8000 (this is what glibc does, I think) but offsetting DTV entries
isn't a good idea if you have to also support TLSDESC since it doesn't
want them offset (would have to undo the offsets, making it slower).
Right now there is no arch that has both the 0x8000 offset and TLSDESC
support, so it doesn't matter, but I'd rather just be able to patch up
the DTPREL-missing slots to work without the offset.
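
To make the trade-off concrete, here is a minimal C sketch of a
__tls_get_addr that performs the modid==0 check itself. Every name in it
(sketch_tls_index, sketch_tls_get_addr, SKETCH_DTP_BIAS) is hypothetical,
addresses are modeled as plain integers, and the powerpc thread-pointer
bias is ignored. It is an illustration of the idea under those
assumptions, not glibc's or musl's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the DTPREL bias discussed above for ppc/mips. */
#define SKETCH_DTP_BIAS ((uintptr_t)0x8000)

/* Hypothetical model of the two-word GOT entry passed to __tls_get_addr. */
struct sketch_tls_index {
    uintptr_t module;  /* slot[0]: module ID, or 0 once patched */
    uintptr_t offset;  /* slot[1]: biased DTPREL value, or a plain
                          thread-pointer-relative offset once patched */
};

/* tp:  thread pointer, modeled as an integer address.
   dtv: per-thread table of TLS block addresses, indexed by module ID. */
uintptr_t sketch_tls_get_addr(uintptr_t tp, const uintptr_t *dtv,
                              const struct sketch_tls_index *ti)
{
    assert(ti != NULL);
    if (ti->module == 0) {
        /* Fast path: both slots were rewritten at relocation time, so
           slot[1] needs no bias and the DTV is never consulted. */
        return tp + ti->offset;
    }
    /* Slow path: the slot still holds the biased DTPREL value, so the
       bias has to be added back (the extra cost mentioned above). */
    return dtv[ti->module] + ti->offset + SKETCH_DTP_BIAS;
}
```

In this model the patched and unpatched forms of the same entry resolve
to the same address; only the patched form avoids the bias addition and
the DTV load.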

> > Is this valid, i.e. is it
> > valid to assume that a corresponding R_PPC64_DTPREL64 relocation will
> > come after the R_PPC64_DTPMOD64 relocation, if there is one? Is this
> > assumption valid for other targets as well?
>
> Yes, it is a reasonable assumption if you are using a sane assembler
> and linker, and you're not deliberately trying to create out of order
> relocations. The assembler will normally emit the DTPREL reloc after
> the DTPMOD one by virtue of the DTPREL word appearing after the DTPMOD
> word. Linkers generally emit dynamic relocations in the same order as
> source relocations, or sort in a manner that guarantees they are
> ordered sensibly. In the case of GNU ld, -z combreloc will always
> place the DTPREL reloc after the corresponding DTPMOD reloc because
> they have the same symbol and r_offset for the DTPREL is greater than
> r_offset for the DTPMOD. This should be true for other targets too.

OK. It's unfortunate that it doesn't seem to be written out in a spec
anywhere, but it's probably reasonable to assume ld implementations
will preserve this behavior.
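
A small sketch of why the ordering matters, under simplifying
assumptions: the relocation model, the names (sketch_reloc,
sketch_apply_relocs), and the module_tp_offset parameter are all
hypothetical, the 0x8000 bias is omitted, and static TLS is assumed to be
available. The DTPMOD handler fills in both slots, and a later DTPREL
overwrites slot[1]; if the two arrived in the opposite order, the DTPMOD
handler would clobber the DTPREL result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical relocation model; real ELF relocations carry more fields. */
enum sketch_reloc_type { SKETCH_DTPMOD, SKETCH_DTPREL };

struct sketch_reloc {
    enum sketch_reloc_type type;
    size_t slot;          /* index of the GOT word this relocation patches */
    uintptr_t sym_value;  /* symbol offset inside its module's TLS block */
};

/* Apply relocations strictly in table order.
   module_tp_offset: assumed offset of the module's static TLS block
   from the thread pointer (bias omitted in this sketch). */
void sketch_apply_relocs(uintptr_t *got, const struct sketch_reloc *relocs,
                         size_t n, uintptr_t module_tp_offset)
{
    assert(got != NULL && (relocs != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        uintptr_t *where = &got[relocs[i].slot];
        if (relocs[i].type == SKETCH_DTPMOD) {
            where[0] = 0;                 /* modid 0 marks the fast path */
            where[1] = module_tp_offset;  /* covers local-dynamic, which
                                             has no DTPREL of its own */
        } else {
            /* Correct only if the matching DTPMOD was already processed;
               otherwise that handler overwrites this word afterwards. */
            where[0] = module_tp_offset + relocs[i].sym_value;
        }
    }
}
```

With DTPMOD first, slot[1] ends up holding the symbol's
thread-pointer-relative offset. With the order reversed, slot[1] is left
pointing at the start of the module's block, which is the failure the
ordering assumption rules out.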

> > Are there other reasons I'm missing that DT_PPC_OPT is needed in order
> > for it to be valid for the dynamic linker to use this technique? Or is
> > it just that you only wanted to implement the zero check in the PLT
> > stub and not repeat it in the __tls_get_addr function?
>
> I can't remember for sure what I was thinking at the time I
> implemented the feature for PowerPC. It's quite possible I didn't
> even think of the possibility that __tls_get_addr could check the
> modid. But if I had, I may have rejected the idea as costing a little
> extra at run time.
>
> Incidentally, one of the gains in this optimization comes from
> avoiding the __tls_get_addr call (the second and subsequent times) and
> the inevitable load-hit-store on the r2 save for a short duration
> function.

Yes, I noticed this for the powerpc64 case, but the same idea applies
to other archs. I just tested a variant of it on x86 with musl and got
something like a 10% performance boost for global-dynamic TLS access
where the library was loaded at program start. My variant avoided the
issues with local-dynamic and the second slot lacking a relocation:
rather than storing 0 in the modid, it stored a negative modid
containing the negated (or in the x86 case already-negative) offset
from the thread pointer to the module's TLS base, so that
__tls_get_addr could return tp+slot[0]+slot[1]. I'm not sure if this
is a better approach; it involves one more add than yours. If I'm
confident the slot[1] fixup can be done safely I'd probably go with an
exact duplicate of your version instead. Perhaps if the DTPMOD
relocation uses a symbol, skip writing slot[1] and assume there's a
DTPREL coming.
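
For reference, a rough C sketch of the negative-modid variant for an
x86-like layout, where the module's TLS block sits below the thread
pointer so the offset is already negative. The function name and the
integer modeling of addresses are hypothetical; this is a reconstruction
of the idea for illustration, not the code that was benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* tp:   thread pointer, modeled as an integer address.
   dtv:  per-thread table of TLS block addresses, indexed by module ID.
   slot: the two-word GOT entry; slot[0] is either a real (positive)
         module ID or, once patched, the negative offset from tp to the
         module's TLS base. slot[1] is left untouched in both cases. */
uintptr_t sketch_tls_get_addr_neg(uintptr_t tp, const uintptr_t *dtv,
                                  const uintptr_t slot[2])
{
    assert(slot != NULL);
    if ((intptr_t)slot[0] < 0) {
        /* Fast path: unsigned wraparound makes this a signed add.
           It costs one more add than the modid==0 scheme, but slot[1]
           never needs a fixup. */
        return tp + slot[0] + slot[1];
    }
    /* Slow path: ordinary DTV lookup by module ID. */
    return dtv[slot[0]] + slot[1];
}
```

Because slot[1] keeps its original contents, this form works the same
way whether or not a DTPREL relocation exists for it, at the price of the
extra addition.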
Rich