This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RFC: TLS improvements for IA32 and AMD64/EM64T


Over the past few months, I've been working on porting to IA32 and
AMD64/EM64T the interesting bits of the TLS design I came up with for
FR-V, achieving some impressive speedups along with slight code size
reductions in the most common cases.

Although the design is not set in stone yet, it's fully implemented
and functional with patches I'm about to post for binutils, gcc and
glibc mainline, as follow-ups to this message, except that the GCC
patch will go to gcc-patches, as expected.

The specs RFC is attached.  Comments are welcome.

	Thread Local Storage Descriptors for IA32 and AMD64/EM64T

		      Version 0.9.2 - 2005-09-15

     Alexandre Oliva <aoliva@redhat.com, oliva@lsd.ic.unicamp.br>


Introduction
============

While porting NPTL to the FR-V architecture, an idea occurred to me
that would enable significant improvements to the General Dynamic and
Local Dynamic access models to thread-local variables.  These methods
are known to be extremely inefficient because of the need to call a
function to obtain the address of a thread-local variable, a function
that can often be quite inefficient.

The reason for calling such a function is that, when code is compiled
for a dynamic library (the cases in which these access models are
used), it is not generally possible to know whether a thread-local
variable is going to be allocated in the Static Thread Local Storage
Block or not.  Dynamic libraries that are loaded at program start up
can have their thread local storage sections assigned to this static
block, since their TLS requirements are all known before the program
is initiated.  Libraries loaded at a later time, however, may need
dynamically-allocated storage for their TLS blocks.

In the former case, the offset from the Thread Pointer, usually held
in a register, to the thread-local variable is going to be the same
for all threads, whereas in the latter case, such offset may vary, and
it may even be necessary to allocate storage for the variable at the
time it is accessed.

Existing implementations of GD and LD access models did not take
advantage of this run-time constant to speed up the case of libraries
loaded before program start up, a case that is certainly the most
common.

Even though libraries can choose more efficient access models, they
can only do so by giving up the ability for the modules that define
such variables to be loaded after program start up, since different
code sequences have to be generated in this case.

The method proposed here doesn't make a compile-time trade off; it
rather decides, at run time, how each variable can be accessed most
efficiently, without any penalty on code or data sizes in case the
variable is defined in an initially-loaded module, and with some
additional data, allocated dynamically, for the case of late-loaded
modules.  In both cases, performance is improved over the traditional
access models.

Another advantage of this novel design for such access models is that
it enables relocations to thread-local variables to be resolved
lazily.


Background
==========

Thread-local storage is organized as follows: for every thread, two
blocks of memory are allocated: a Static TLS block and a Dynamic
Thread Vector.  A thread pointer, normally a reserved register, points
to some fixed location in the Static TLS block, that contains a
pointer to the dynamic thread vector at some fixed location as well.

TLS for the main executable, if it has a TLS section, is also at a
fixed offset from the thread pointer.  Other modules loaded before the
program starts will also have their TLS sections assigned to the
Static TLS block.

The dynamic thread vector starts with a generation count, followed by
pointers to TLS blocks holding thread-specific copies of the TLS
sections of each module.

If modules are loaded at run time, the dynamic thread vector may need
to grow, and the corresponding TLS blocks may have to be allocated
dynamically.  The generation count is used to control whether the DTV
is up-to-date with regard to newly-loaded or unloaded modules,
enabling the reuse of entries without confusion on whether a TLS block
has been created and initialized for a newly-loaded module or whether
that block was used by a module already unloaded, that is still
waiting for deallocation.


Programs can access thread-local variables by using code that follows
4 different models: Local Exec, Initial Exec, General Dynamic and
Local Dynamic.

Local Exec is only applicable when code in the main executable
accesses a thread-local variable also defined in the main executable.
In such cases, the offset from the Thread Pointer to the variable can
be computed by the linker, so the access model consists in simply
adding this offset to the Thread Pointer to compute the address of the
variable.

Initial Exec is applicable when code in the main executable accesses
thread-local variables that are defined in other loadable modules that
are dependencies of this executable.  Since the dynamic loader will
make sure they are loaded before the program can start running, it can
lay out the Static TLS Block taking their TLS sections' sizes and
alignment requirements into account.  Since the Thread Pointer points
into the same location within the Static TLS Block of each thread, the
offset from the Thread Pointer to the variable one wants to access is
constant for all threads.  However, unlike the Local Exec case, it is
not known at link time, but rather at run time.  So, the compiler
arranges for the offset to be loaded from a Global Offset Table entry,
that the linker creates to hold the TP offset of the variable, and
adds that to the TP to access the variable.  The linker also emits a
dynamic relocation for the dynamic loader to fill in the GOT entry
with the correct value.

Although the Initial Exec access model is intended for use in code
generated for executables, some dynamic libraries may use it as well,
when they know the referenced variable is defined in the main
executable or in another initially-loaded dynamic library.  In fact,
if a dynamic library references a thread-local variable defined in
itself, using the Initial Exec access model, it may be giving up the
ability to be loaded at run time, with mechanisms such as the dlopen()
library call.


General Dynamic is the most general access model, in that it poses no
requirement whatsoever on the variable-accessing or the
variable-defining modules.  This access model has traditionally
involved calling a function, typically called __tls_get_addr(), to
obtain the address of a thread-local variable.  This function takes
two pieces of information to compute the address: a module id, an
index into the dynamic thread vector corresponding to the module in
which the variable is defined, and an offset of the variable from the
beginning of the corresponding TLS block.

These two pieces of information are computed by the dynamic loader, in
GOT locations determined by the linker.  If they're adjacent, it's
possible to pass both pieces of information to __tls_get_addr() by
means of a single pointer; otherwise, they're usually passed as two
separate arguments.  In either case, the code sets up arguments for
__tls_get_addr() and then calls it explicitly, using the returned
value as the address of the variable.

The Local Dynamic access model can only be used to access variables
that are local to the module in which they're used, and it only makes
sense when accessing multiple such variables.  The idea is to use a
single call to __tls_get_addr() to compute the base address of the TLS
block for the module.  Then, to access variables, it suffices to add
to this base address the offset corresponding to each variable.  The
offset can be used as a literal in the code, since it's determined by
the linker when it lays out the TLS section of the module.


Overall Design
==============

The optimization proposed herein intends to improve the performance of
dynamic libraries that access thread-local variables using the dynamic
access models, in cases they're loaded before program start up,
without slowing down (or, if possible, speeding up) the performance
when the same libraries are loaded during run time.

The idea is to use TLS descriptors containing a pointer to a function
and an argument.  This uses two words in the GOT, just like the
arguments traditionally passed to __tls_get_addr(), but uses this
space in a very different way.

A single relocation is to be used to compute the value of the two
words of the TLS descriptor.  The dynamic loader, when resolving the
relocation, can determine whether the TLS block used for the module
that defines the TLS symbol is in the static TLS block or not.

If it is in the static TLS block, the offset from the TP to the symbol
is constant for all threads, so it sets the argument in the TLS
descriptor to this constant offset, and sets the function pointer to a
piece of code that simply returns the argument (plus the TP, in an
alternate design).

If the defining module is in dynamically-allocated TLS, it sets the
argument to a dynamically-allocated extended descriptor containing the
arguments passed to __tls_get_addr() and the module's generation
count, and the function pointer to a piece of code that checks whether
the generation count is sufficiently up to date and that the DTV entry
is set.  If so, this piece of code loads the DTV entry, adds the
symbol offset to it, and subtracts the TP value (unless using the
alternate design), returning the result.  Otherwise, it calls
__tls_get_addr(), and subtracts from its return value the TP (unless
using the alternate design).

The functions defined above use custom calling conventions that
require them to preserve any registers they modify.  This penalizes
the case that requires dynamic TLS, since it must preserve all
call-clobbered registers before calling __tls_get_addr(), but it is
optimized for the most common case of static TLS, and also for the
case in which the code generated by the compiler can be relaxed by the
linker to a more efficient access model: being able to assume no
registers are clobbered by the call tends to improve register
allocation.  Also, the function that handles the dynamic TLS case will
most often be able to avoid calling __tls_get_addr(), thus potentially
avoiding the need for preserving registers.


Lazy relocation is accomplished by setting the argument portion of the
TLS descriptor to point to the relocation, and the function pointer to
point to a lazy relocation function.  Some effort is needed to ensure
that the TLS descriptor is modified and accessed atomically, and that
the lazy relocation function can quickly identify the module that
contains the relocation.


This optimized access model thus consists in setting up a pointer to
the TLS descriptor (or, in another alternate design, loading its
contents atomically), then calling the function at the address given
by the pointer to function in the TLS descriptor, passing to it a
pointer to the descriptor itself (or, in an alternate design, the
argument portion of the TLS descriptor).  The value returned by this
call can then be used as an offset from the thread pointer to access
the variable (or as the address of the variable, in the first
alternate design).

The actual code sequences and implementation details for IA32 and
AMD64/EM64T are depicted in the following subsections.


IA32
----

The general dynamic access model used to be:

        leal    variable@TLSGD(,%ebx,1), %eax
        call    ___tls_get_addr@PLT
	# use %eax as the address, or (%eax) to access the value

After the call instruction, %eax holds the address of variable.
variable@TLSGD is resolved by the linker to the GOT offset holding the
data structure passed by reference to ___tls_get_addr(), containing
the TLS module id and the TLS offset, computed by two separate dynamic
relocations.  The odd addressing mode is needed to make the
instruction longer, such that, in case the linker finds out it can
relax the code sequence, the necessary code fits.


The optimized method proposed here uses:

        leal    tval@TLSDESC(%ebx), %eax
	[...] # any other instructions that preserve %eax
        call    *tval@TLSCALL(%eax)
	# add %gs:0 to %eax to compute the address,
	# or use %gs:(%eax) to access the value

Note that this call instruction is actually emitted without an offset,
so it's 4 bytes shorter than it appears to be.  It's equivalent to
`call *(%eax)', that emits a two-byte instruction, but it's annotated
with a relocation that enables the leal and the call to be moved apart
from each other, for better scheduling.  The following new relocations
types are used by the code above:

#define R_386_TLS_GOTDESC  38		/* GOT offset for TLS descriptor.  */
#define R_386_TLS_DESC     39		/* TLS descriptor containing
					   pointer to code and to
					   argument, returning the TLS
					   offset for the symbol.  */
#define R_386_TLS_DESC_CALL 40		/* Marker of call through TLS
					   descriptor for
					   relaxation.  */

TLS_DESC is the dynamic relocation that the linker emits in response
to the two other relocations.  All of the relocations accept addends;
being REL, the relocations have in-place addends.  In the TLS_DESC
case, that applies to a pair of words, the addend is stored in the
second word, to simplify some lazy relocation cases.  In order to
enable lazy relocation, %ebx MUST point to the GOT at the point of the
call instruction.

Instead of adding more relocations for the Local Dynamic case, we
propose the use a symbol _TLS_MODULE_BASE_, that the linker implicitly
defines, as a hidden symbol, to the base address of the TLS section of
a module.


When relaxing the sequence above to the Initial Exec model, we'd get
sequences such as:

	movl	variable@GOTNTPOFF(%ebx), %eax
	[...]
	movl	%eax, %eax	   # or any other two-byte nop

or:

	leal	variable@GOTNTPOFF(%ebx), %eax
	[...]
	movl	(%eax), %eax

or, in case there's a GOT entry holding the positive TPOFF:

	movl	variable@GOTTPOFF(%ebx), %eax
	[...]
	negl	%eax

I'm not sure which of these would be better to use, but in my current
implementation I've preferred the first alternative, unless a GOT
entry for the third is needed for other reasons, and one for the
first case isn't.


The Local Exec model uses code such as:

	.byte 0x65	       # %gs, or any harmless 1-byte prefix
	movl	$variable@NTPOFF, %eax
	[...]
	movl	%eax, %eax	# or any other two-byte nop

or:

	leal	variable@NTPOFF, %eax	# avoiding the need for the prefix
	[...]
	movl	%eax, %eax	# or any other two-byte nop


The function for the static TLS case can be as simple as:

_dl_tlsdesc_return:
	movl	4(%eax), %eax
	ret

whereas the other takes a bit more effort, requiring the following
data structure definitions:

struct tlsdesc
{
  ptrdiff_t __attribute__((regparm(1))) (*entry)(struct tlsdesc *);
  void *arg;
};

typedef struct dl_tls_index
{
  unsigned long int ti_module;
  unsigned long int ti_offset;
} tls_index;

struct tlsdesc_dynamic_arg
{
  tls_index tlsinfo;
  size_t gen_count;
};

The definition of the resolution function follows the logic depicted
below, except for the need to preserve all registers except %eax.

ptrdiff_t
__attribute__ ((__regparm__ (1)))
_dl_tlsdesc_dynamic (struct tlsdesc *tdp) 
{
  struct tlsdesc_dynamic_arg *td = tdp->arg; 
  dtv_t *dtv = *(dtv_t **)((char *)__thread_pointer + DTV_OFFSET);
  if (__builtin_expect (td->gen_count <= dtv[0].counter
			&& (dtv[td->tlsinfo.ti_module].pointer.val
			    != TLS_DTV_UNALLOCATED),
			1))
    return dtv[td->tlsinfo.ti_module].pointer.val + td->tlsinfo.ti_offset 
      - __thread_pointer;

  /* Preserve any call-clobbered registers not preserved because of
     the above across the call below.  */
  return ___tls_get_addr (&td->tlsinfo) - __thread_pointer;
}

The tlsdesc_dynamic_arg objects are allocated by the dynamic loader
when resolving a relocation, and stored in a hash table created for
the module in which the symbol is defined.  Note that this dynamic
allocation has no implications to prelinking, since prelinking is
only applicable to modules loaded before program start up, and thus
always uses the static TLS case, that does not need dynamic
allocation.


An alternate design in which the function called through the TLS
descriptor returns not the TP offset, but rather the address of the
variable of interest, could refrain from adding %gs:0 to the value
returned by the call to compute the address of a symbol, and from
using the %gs: prefix when accessing the variable, but it would
require the use of a longer call instruction to enable proper
relaxation.  The call instruction would have to be 7, instead of 2
bytes long, such that the linker could relax it to `addl %gs:0, %eax'.
This would make code that accesses the variable 4 bytes longer on
average (5 bytes minus one used by the %gs prefix), whereas code that
computes its address would be shorter by only two bytes.  It's not
clear such a change would be profitable.


Lazy relocation is not profitable in all cases.  Consider that we need
the argument in the TLS descriptor to hold both the addend and the
address of the relocation, if it's a REL relocation, and it probably
wouldn't make sense to depend on dynamic memory allocation for the
relocation.  So, in case we actually need both pieces of information,
we perform the relocation immediately.

In other cases, such as that of RELA relocations and REL relocations
with zero addends, we use the argument to hold a pointer to the
relocation.  Another special case we handle is that of REL relocations
with addends that reference the *ABS* section, i.e., that reference
the local TLS section.

All cases handled by lazy relocation start by grabbing a global
dynamic loader lock and checking that the pointer in the TLS
descriptor hasn't changed.  If it has, we return immediately into the
new pointer, after releasing the lock.  Otherwise, we set the function
pointer to a hold function that will get all other threads that
attempt to use this variable to wait until relocation is complete.

After diverting any other threads to the hold function, we can perform
the relocation, determining whether the referenced symbol is in Static
TLS or not, and deciding which of the two functions to use from that
point on and computing the argument to pass to it.  The argument is
stored first in the TLS descriptor, and the new entry point is stored
last.  The relocation functions finally release the lock and return
into the newly-computed function.

In the current implementation, the hold function attempts to grab the
lock, checks that the pointer hasn't changed and releases the lock,
such that, if any other such TLS descriptor lazy relocation is in
progress, it will wait until the lock is released.  When it obtains
the lock, it releases it immediately and returns into the function
newly-stored in the TLS descriptor.

An alternate implementation is envisioned that relies on condition
variables that hold functions would wait on, such that the relocation
functions wouldn't have to hold the lock throughout their execution,
rather waking up all hold functions upon completion of the relocation.


AMD64/EM64T
-----------

The design is very similar to that of IA32.  The main difference stems
from the fact that AMD64's IP-relative addressing modes have enabled
it to do away with the need for a register holding the GOT pointer,
which in turn required additional measures to enable lazy relocation
of TLS descriptors.

Where the existing ABI uses the following sequence:

        .byte   0x66
        leaq    variable@TLSGD(%rip), %rdi
        .word   0x6666
        rex64
        call    __tls_get_addr@PLT
	# use %eax as the address, or (%eax) to access the value

we propose this instead:

        leaq    tval@TLSDESC(%rip), %rax
	[...] # any other instructions that preserve %eax
        call    *tval@TLSCALL(%rax)
	# add %fs:0 to %eax to compute the address,
	# or use %fs:(%eax) to access the value

Note that, as in the IA32 case, the call instruction is a two-byte
instruction, the offset is completely discarded by the assembler.  The
following new relocations types are used by the code above:

#define R_X86_64_GOTPC32_TLSDESC 27	/* GOT offset for TLS descriptor.  */
#define R_X86_64_TLSDESC         28	/* TLS descriptor.  */
#define R_X86_64_TLSDESC_CALL    29	/* Marker for call through TLS
					   descriptor.  */

As on IA32, _TLS_MODULE_BASE_ is to be used to obtain the base address
for the Local Dynamic access model.


The alternate design proposed for IA32 that gets the TLS descriptor
call to compute not the offset, but the actual address of the
variable, would require a much longer call instruction to accommodate
the 9 bytes needed to add the TP to the address in case of
relaxation, so it is even less likely to be profitable.


When relaxing the sequence above to the Initial Exec model, we'd get
sequences such as:

        movq    variable@GOTTPOFF(%rip), %rax
	[...]
        rex64 nop	   # or any other two-byte nop

I'm not sure this last instruction is the most efficient 2-bytes-long
do-nothing instruction available on AMD64, but it's used for Local
Exec as well:

        movq    $variable@TPOFF, %rax
	[...]
        rex64 nop


The other data structures and TLS descriptor offset computation
functions are equivalent to those used on IA32, with one point worth
noting that the type of the `entry' member of struct tlsdesc cannot be
represented in C, and not even in GNU-extended C, since the function
takes its argument in %rax, rather than in %rdi.


Since AMD64 uses RELA dynamic relocations, all TLS descriptors are
suitable for lazy relocation; there's no need to worry about
preserving the addend, since it is held in the relocation table
itself.  Thus, there's only need for one lazy relocation function.

That said, this function needs some means to be told what module the
relocation at hand refers to.  On IA32, this is done by means of the
reserved register %ebx, that points to the Global Offset Table, but no
register holds this pointer on AMD64.  Although it would be possible
to use the relocation pointer to search the relocation ranges for all
loaded modules, this would be extremely inefficient, so we've
introduced two dynamic table entries to enable the dynamic loader to
communicate with modules that use lazy relocation:

#define DT_TLSDESC_GOT	0x6ffffff7	/* Location of GOT entry used
					   by TLS descriptor resolver
					   PLT entry.  */
#define DT_TLSDESC_PLT	0x6ffffff8	/* Location of PLT entry for
					   TLS descriptor resolver
					   calls.  */

A module that uses lazy TLSDESC relocations MUST define these two
entries.  The former indicates the address of a GOT entry to be filled
in by the dynamic loader with the address of the internal function to
be used for lazy relocation of TLS descriptors.  The latter must hold
the address of a PLT entry that pushes onto the stack the module's
link map address, located in the GOT portion reserved for the dynamic
loader to use, and then jumps to the lazy relocation function, using
the address stored in the TLSDESC_GOT entry.  The lazy relocation
function is responsible for releasing the stack slot taken up by the
PLT entry.


Future Improvements
===================

The use of TLS descriptors to access thread-local variables would
enable the compression of the DTV such that it contained only entries
for non-static modules.  Static ones could be given negative ids, such
that legacy relocations and direct calls to __tls_get_addr() could
still work correctly, but entries could be omitted from the DTV, and
the DTV entries would no longer need the boolean currently used to
denote entries that are in Static TLS.

The DTV could also be modified so as to contain TP offsets instead of
absolute addresses.  Some refactoring of _dl_tlsdesc_dynamic() and
__tls_get_addr() could avoid the need to subtract TP in the former,
using an alternate entry point that refrains from adding the TP to the
offsets in the new DTV.

It might make sense to combine the current design, in which TLSCALL
functions return the TP offsets, with the alternate design in which
they return the actual address, introducing different relocations for
each case, enabling code to use the former when accessing variables
directly using the TP segment register, and using the latter when
computing their addresses.  It's not obvious that this would have any
significant impact on performance, and although code size could
certainly be reduced for code that computes the address of
thread-local variables, it's not obvious that the need for additional
GOT entries and supporting code would not make up for it.


Conclusion
==========

The design described above is currently functional, using CVS
snapshots of GCC, binutils and glibc taken today, plus local changes
about to be contributed.

Even though the performance of the standard NPTL benchmark is improved
by only a negligible margin, that's understandable: glibc and nptl,
aware that the dynamic access models used to be so inefficient, take
advantage of the fact that libc is always loaded initially and
reference almost all thread-local variables using the more efficient
access models.

However, synthetic benchmarks designed to time functions that return
the value or the address of thread-local variables, have shown that
the performance of the proposed method is significantly better than
that of the currently-used method.  When the referenced variable is
found to be in static TLS at run time, the newly-proposed method makes
such functions about twice as fast as when using the method in wide
use today, bringing it close to the performance of the Initial Exec
model.  Even when the variable is in dynamic TLS, the speedup is still
over 20%.

As for code size, the new method tends to be a win for dynamic
libraries that access TLS variables through the segment registers more
often than they compute the address of such variables.  Libraries in
GNU libc such as libpthread, libmemusage, as well as most dynamic
libraries used in its testsuite, have experienced small reductions in
code size on both architectures.  The dynamic loader and libm
experienced size reductions only on AMD64, retaining the same size on
IA32.  Only libc itself grew in terms of code size, and only on AMD64,
by as little as 0.014%.


This is a draft of work in progress.  The ABI changes suggested above
are not to be taken as final.  In particular, the relocation numbers
and dynamic table entries have not been approved by the official
maintainers of the ABIs, the instruction selection is still subject to
change and even the calling conventions may be modified without
notice.

Copyright 2005 Alexandre Oliva.  Permission is granted to distribute
verbatim copies of this document.  Please contact the author at
aoliva@redhat.com or oliva@lsd.ic.unicamp.br to request additional
permissions.


Change Log
==========

0.9.2 - 2005-09-15: Rename main section to Overall Design.  Clarify
    how the dynamic TLS handler can often avoid the overhead of
    preserving registers.  Add rough figures of performance
    improvements.

0.9.1 - 2005-09-13: Fixed typos and thinkos.  Improved readability.
    Added section with Future Improvements.

0.9 - 2005-09-09: initial version.
-- 
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]