Bug 31372 - _dl_tlsdesc_dynamic doesn't preserve all caller-saved registers
Summary: _dl_tlsdesc_dynamic doesn't preserve all caller-saved registers
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: 2.40
: P2 normal
Target Milestone: 2.40
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-11 23:04 UTC by H.J. Lu
Modified: 2024-04-11 19:06 UTC (History)
6 users (show)

See Also:
Host:
Target: i386, x86-64,arm
Build:
Last reconfirmed:
fw: security-


Attachments
A patch (6.19 KB, patch)
2024-02-11 23:18 UTC, H.J. Lu
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description H.J. Lu 2024-02-11 23:04:41 UTC
_dl_tlsdesc_dynamic only preserves integer registers, but not vector
registers.
Comment 1 H.J. Lu 2024-02-11 23:18:28 UTC
Created attachment 15358 [details]
A patch

I am testing this.
Comment 2 Florian Weimer 2024-02-12 06:59:59 UTC
I think we need more discussion what the ABI should be like.

The trampoline does not need to save argument registers.
Comment 3 Florian Weimer 2024-02-12 10:46:58 UTC
(In reply to Florian Weimer from comment #2)
> I think we need more discussion what the ABI should be like.
> 
> The trampoline does not need to save argument registers.

Or maybe it does. We should have official ABI clarification.
Comment 4 H.J. Lu 2024-02-12 13:10:43 UTC
(In reply to Florian Weimer from comment #3)
> (In reply to Florian Weimer from comment #2)
> > I think we need more discussion what the ABI should be like.
> > 
> > The trampoline does not need to save argument registers.
> 
> Or maybe it does. We should have official ABI clarification.

It is explicitly documented:

https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt

---
The functions defined above use custom calling conventions that
require them to preserve any registers they modify.  This penalizes
the case that requires dynamic TLS, since it must preserve (*) all
call-clobbered registers before calling __tls_get_addr(), but it is
optimized for the most common case of static TLS, and also for the
case in which the code generated by the compiler can be relaxed by the
linker to a more efficient access model: being able to assume no
registers are clobbered by the call tends to improve register
allocation.  Also, the function that handles the dynamic TLS case will
most often be able to avoid calling __tls_get_addr(), thus potentially
avoiding the need for preserving registers.
---
Comment 5 H.J. Lu 2024-02-15 20:39:21 UTC
Arm has the similar issue:

https://github.com/zatrazz/glibc/commits/azanella/tls-descriptor-fixes-arm/
Comment 6 Sam James 2024-02-15 20:40:10 UTC
I feel like this emphasises that we need to look at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48868.
Comment 7 H.J. Lu 2024-02-15 20:44:49 UTC
(In reply to Sam James from comment #6)
> I feel like this emphasises that we need to look at
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48868.

GCC testsuite isn't set up to support run-time tests with extra library
dependencies.  For now, we can add GNU2 TLS run-time tests to glibc.
My glibc patch:

https://patchwork.sourceware.org/project/glibc/list/?series=30995

includes one.  So far, it has detected the GNU2 TLS bug on 3 arches.
Comment 8 Sam James 2024-02-15 21:07:48 UTC
(In reply to H.J. Lu from comment #7)
> (In reply to Sam James from comment #6)
> > I feel like this emphasises that we need to look at
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48868.
> 
> GCC testsuite isn't set up to support run-time tests with extra library
> dependencies.  For now, we can add GNU2 TLS run-time tests to glibc.

gcc.dg/torture/tls has some basic tests but only with a single TU. We could add  -mtls-dialect=gnu2 to tls.exp in DG_TORTURE_OPTIONS but it would not gain us much.

I hoped that there might be some for IPA or LTO but looks like no.

> My glibc patch:
> [...]
> includes one.  So far, it has detected the GNU2 TLS bug on 3 arches.

Ah, thanks, I hadn't looked at it. I guess this with nsz's previous tests should give us OK coverage for now.
Comment 9 H.J. Lu 2024-02-15 21:14:21 UTC
(In reply to Sam James from comment #8)
> > My glibc patch:
> > [...]
> > includes one.  So far, it has detected the GNU2 TLS bug on 3 arches.
> 
> Ah, thanks, I hadn't looked at it. I guess this with nsz's previous tests
> should give us OK coverage for now.

Each arch should define PREPARE_MALLOC to clobber more caller-saved
registers when calling malloc so that we can improve coverage on
each arch.
Comment 10 Sourceware Commits 2024-02-28 17:05:54 UTC
The master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0aac205a814a8511e98d02b91a8dc908f1c53cde

commit 0aac205a814a8511e98d02b91a8dc908f1c53cde
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 26 06:37:03 2024 -0800

    x86: Update _dl_tlsdesc_dynamic to preserve caller-saved registers
    
    Compiler generates the following instruction sequence for GNU2 dynamic
    TLS access:
    
            leaq    tls_var@TLSDESC(%rip), %rax
            call    *tls_var@TLSCALL(%rax)
    
    or
    
            leal    tls_var@TLSDESC(%ebx), %eax
            call    *tls_var@TLSCALL(%eax)
    
    CALL instruction is transparent to compiler which assumes all registers,
    except for EFLAGS and RAX/EAX, are unchanged after CALL.  When
    _dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
    path.  __tls_get_addr is a normal function which doesn't preserve any
    caller-saved registers.  _dl_tlsdesc_dynamic saved and restored integer
    caller-saved registers, but didn't preserve any other caller-saved
    registers.  Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
    XSAVE and XSAVEC to save and restore all caller-saved registers.  This
    fixes BZ #31372.
    
    Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
    to optimize elf_machine_runtime_setup.
    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Comment 11 Sourceware Commits 2024-02-29 12:31:43 UTC
The master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9b7091415af47082664717210ac49d51551456ab

commit 9b7091415af47082664717210ac49d51551456ab
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Feb 28 12:08:03 2024 -0800

    x86-64: Update _dl_tlsdesc_dynamic to preserve AMX registers
    
    _dl_tlsdesc_dynamic should also preserve AMX registers which are
    caller-saved.  Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID
    to x86-64 TLSDESC_CALL_STATE_SAVE_MASK.  Compute the AMX state size
    and save it in xsave_state_full_size which is only used by
    _dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec.  This fixes
    the AMX part of BZ #31372.  Tested on AMX processor.
    
    AMX test is enabled only for compilers with the fix for
    
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114098
    
    GCC 14 and GCC 11/12/13 branches have the bug fix.
    Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Comment 12 Carlos O'Donell 2024-04-02 13:16:36 UTC
These two other issues are needed and should be noted as needed:

commit fd7ee2e6c5eb49e4a630a9978b4d668bff6354ee
Author: Andreas Schwab <schwab@suse.de>
Date:   Tue Mar 19 13:49:50 2024 +0100

    Add tst-gnu2-tls2mod1 to test-internal-extras
    
    That allows sysdeps/x86_64/tst-gnu2-tls2mod1.S to use internal headers.
    
    Fixes: 717ebfa85c ("x86-64: Allocate state buffer space for RDI, RSI and RBX")

commit 717ebfa85c8240d32d0d19d86a484c31c55c9617
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 18 06:40:16 2024 -0700

    x86-64: Allocate state buffer space for RDI, RSI and RBX
    
    _dl_tlsdesc_dynamic preserves RDI, RSI and RBX before realigning stack.
    After realigning stack, it saves RCX, RDX, R8, R9, R10 and R11.  Define
    TLSDESC_CALL_REGISTER_SAVE_AREA to allocate space for RDI, RSI and RBX
    to avoid clobbering saved RDI, RSI and RBX values on stack by xsave to
    STATE_SAVE_OFFSET(%rsp).
    
       +==================+<- stack frame start aligned at 8 or 16 bytes
       |                  |<- RDI saved in the red zone
       |                  |<- RSI saved in the red zone
       |                  |<- RBX saved in the red zone
       |                  |<- paddings for stack realignment of 64 bytes
       |------------------|<- xsave buffer end aligned at 64 bytes
       |                  |<-
       |                  |<-
       |                  |<-
       |------------------|<- xsave buffer start at STATE_SAVE_OFFSET(%rsp)
       |                  |<- 8-byte padding for 64-byte alignment
       |                  |<- 8-byte padding for 64-byte alignment
       |                  |<- R11
       |                  |<- R10
       |                  |<- R9
       |                  |<- R8
       |                  |<- RDX
       |                  |<- RCX
       +==================+<- RSP aligned at 64 bytes
    
    Define TLSDESC_CALL_REGISTER_SAVE_AREA, the total register save area size
    for all integer registers by adding 24 to STATE_SAVE_OFFSET since RDI, RSI
    and RBX are saved onto stack without adjusting stack pointer first, using
    the red-zone.  This fixes BZ #31501.
    Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Comment 13 Sourceware Commits 2024-04-03 17:42:59 UTC
The release/2.39/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a364304718725a31ab141936322855c76c73e35e

commit a364304718725a31ab141936322855c76c73e35e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 26 06:37:03 2024 -0800

    x86: Update _dl_tlsdesc_dynamic to preserve caller-saved registers
    
    Compiler generates the following instruction sequence for GNU2 dynamic
    TLS access:
    
            leaq    tls_var@TLSDESC(%rip), %rax
            call    *tls_var@TLSCALL(%rax)
    
    or
    
            leal    tls_var@TLSDESC(%ebx), %eax
            call    *tls_var@TLSCALL(%eax)
    
    CALL instruction is transparent to compiler which assumes all registers,
    except for EFLAGS and RAX/EAX, are unchanged after CALL.  When
    _dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
    path.  __tls_get_addr is a normal function which doesn't preserve any
    caller-saved registers.  _dl_tlsdesc_dynamic saved and restored integer
    caller-saved registers, but didn't preserve any other caller-saved
    registers.  Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
    XSAVE and XSAVEC to save and restore all caller-saved registers.  This
    fixes BZ #31372.
    
    Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
    to optimize elf_machine_runtime_setup.
    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
    
    (cherry picked from commit 0aac205a814a8511e98d02b91a8dc908f1c53cde)
Comment 14 Sourceware Commits 2024-04-03 17:43:04 UTC
The release/2.39/master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=853e915fdd6ae6c5f1a7a68d2594ec8dbfef1286

commit 853e915fdd6ae6c5f1a7a68d2594ec8dbfef1286
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Feb 28 12:08:03 2024 -0800

    x86-64: Update _dl_tlsdesc_dynamic to preserve AMX registers
    
    _dl_tlsdesc_dynamic should also preserve AMX registers which are
    caller-saved.  Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID
    to x86-64 TLSDESC_CALL_STATE_SAVE_MASK.  Compute the AMX state size
    and save it in xsave_state_full_size which is only used by
    _dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec.  This fixes
    the AMX part of BZ #31372.  Tested on AMX processor.
    
    AMX test is enabled only for compilers with the fix for
    
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114098
    
    GCC 14 and GCC 11/12/13 branches have the bug fix.
    Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
    
    (cherry picked from commit 9b7091415af47082664717210ac49d51551456ab)
Comment 15 Adhemerval Zanella 2024-04-11 19:06:14 UTC
Fixed on 2.40.  The x86 fixes are 0aac205a814a8511e98d02b91a8dc908f1c53cde,  9b7091415af47082664717210ac49d51551456ab, and 717ebfa85c8240d32d0d19d86a484c31c55c9617; while arm32 is fixed by 64c7e344289ed085517c2227d8e3b06388242c13 (along with other commits for the test infrastructure).  The aarch64 implementation already handle the required caller-saved registers.