Bug 26817

Summary: x86-64: Support -mstack-protector-guard=global (define global __stack_chk_guard)
Product: glibc Reporter: Fangrui Song <i>
Component: libcAssignee: Not yet assigned to anyone <unassigned>
Status: UNCONFIRMED ---    
Severity: normal CC: carlos, drepper.fsp
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description Fangrui Song 2020-10-30 03:22:58 UTC
% cat a.c
int main() {}
% gcc -fuse-ld=bfd -mstack-protector-guard=global -fstack-protector-all a.c
...
/usr/bin/ld.bfd: a.c:(.text+0x21): undefined reference to `__stack_chk_guard'

(GCC source code let either libssp or libc to provide __stack_chk_guard. On many glibc platforms, TARGET_LIBC_PROVIDES_SSP is set.)

In glibc/csu/libc-start.c, either -mstack-protector-guard=global or
-mstack-protector-guard=tls is supported, but not both.  On x86-64, =tls is the
supported variant.  However, there is no fundamental reason that the two
variants cannot coexist.

    /* Set up the stack checker's canary.  */
    uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
  # ifdef THREAD_SET_STACK_GUARD
    THREAD_SET_STACK_GUARD (stack_chk_guard);
  # else
    __stack_chk_guard = stack_chk_guard;
  # endif

There are some arguments that -mstack-protector-guard=global may be better:

(1) Performance.
On x86-64 -fno-pic and -fpie, gcc emits R_X86_64_PC32 referencing the external
__stack_chk_guard. The code sequence is shorter than a TP relative access.
(-fpie not using GOT was an unfortunate GCC 5 change).
On -fpic =global requires a GOT indirection, which might be slower than a TP
relative access.

(2) Security.
For a new thread, nptl/allocatestack.c allocates a memory region, places the
static TLS block at the end, and uses the remaining space as the thread stack.
The stack pointer may be just a few hundred bytes below the canary address
(%fs:0x28 on x86-64) and a large out-of-bounds write can potentially override it.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *foo(void *_) {
  char a[0x100];
  for (int i = 0; i < LEN; i++)
    a[i] = 0x41;
  return 0;
}

int main() {
  pthread_t pid;
  pthread_create(&pid, 0, foo, 0);
  pthread_join(pid, 0);
}

% gcc -fstack-protector-all a.c -pthread -D LEN=0x200 && ./a.out
*** stack smashing detected ***: terminated
[1]    2672403 abort      ./a.out

# %fs:0x28 has been overwritten, so there is no 'stack smashing detected'.
% gcc -fstack-protector-all a.c -pthread -D LEN=0x1000 && ./a.out
[1]    2672478 segmentation fault  ./a.out

-mstack-protector-guard=global does not have the problem.

nsz mentioned: the kernel prefers tls because it can have different canary per
thread. in userspace that would not work because some code may context switch to
a different thread and check the wrong canary.
Comment 1 Fangrui Song 2020-10-30 03:27:35 UTC
Note, -mstack-protector-guard=tls can mitigate the problem by inserting a guard page  between the stack and the static TLS block, but that can come with a large overhead.
Comment 2 Carlos O'Donell 2020-11-24 21:41:20 UTC
(In reply to Fangrui Song from comment #0)
> (2) Security.
> For a new thread, nptl/allocatestack.c allocates a memory region, places the
> static TLS block at the end, and uses the remaining space as the thread
> stack.
> The stack pointer may be just a few hundred bytes below the canary address
> (%fs:0x28 on x86-64) and a large out-of-bounds write can potentially
> override it.

We need to fix:
Bug 11787 - Program with large TLS segments fail.
https://sourceware.org/bugzilla/show_bug.cgi?id=11787

The static TLS should not be part of the stack accounting and when we split them out there won't be a problem with the stack being beside static TLS.