Bug 16291 - feature request: provide simpler ways to compute stack and tls boundaries
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: libc
Version: 2.19
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Blocks: 16133
Reported: 2013-12-04 04:49 UTC by Kostya Serebryany
Modified: 2018-02-15 21:56 UTC (History)
12 users

fweimer: security-


Description Kostya Serebryany 2013-12-04 04:49:50 UTC
This is both a feature request and a request for a suggestion. 
We are developing dynamic testing tools like AddressSanitizer and ThreadSanitizer
which are now part of both LLVM and GCC distributions. 
The tools have a run-time library that needs to compute a thread's stack and 
TLS boundaries. Right now this is done with a series of ugly hacks.

For non-main threads we get the stack boundaries by querying
pthread_getattr_np and pthread_attr_getstack. This works, 
but is problematic because pthread_getattr_np calls malloc and we intercept malloc. 
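For reference, a minimal sketch of that query (real pthreads APIs, error
handling omitted; note that pthread_getattr_np itself may allocate, which is
exactly the problem):

#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>

/* Stack bounds of the calling (non-main) thread. */
static void get_stack_bounds(void **bottom, size_t *size) {
  pthread_attr_t attr;
  pthread_getattr_np(pthread_self(), &attr);  /* may call malloc internally */
  pthread_attr_getstack(&attr, bottom, size); /* lowest address + size */
  pthread_attr_destroy(&attr);
}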

For the main thread we get the stack boundaries by querying 
getrlimit, /proc/self/maps, and the address of a local variable, because at the
point when we need the info libpthread might not have initialized itself. 
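A hedged sketch of that fallback (it assumes a downward-growing stack and a
finite RLIMIT_STACK; the /proc/self/maps scan that locates the fixed end is
elided):

#include <stdint.h>
#include <sys/resource.h>

static void get_main_stack_bounds(uintptr_t *bottom, uintptr_t *top) {
  int local;                      /* lives somewhere inside the live stack */
  struct rlimit rl;
  getrlimit(RLIMIT_STACK, &rl);
  *top = (uintptr_t)&local;       /* stand-in for the fixed end found in maps */
  *bottom = *top - rl.rlim_cur;   /* crude lower bound via RLIMIT_STACK */
}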

Getting TLS boundaries is even more involved. 
Today we use the glibc's private symbol _dl_get_tls_static_info
and then subtract a magic constant (size of the glibc's thread descriptor).

Most of the code is here: 
http://llvm.org/viewvc/llvm-project/compiler-rt/trunk/lib/sanitizer_common/sanitizer_linux_libcdep.cc?revision=195441
(GetThreadStackTopAndBottom and InitTlsSize)

This works for us on a system we are testing (Ubuntu 12.04),
but this is very fragile and is often broken on older Linux distros. 

Is it possible to compute the stack and tls boundaries more reliably today?
If not, would it be possible to add some new functions to do that?

For every thread (including the main one) we need to know its stack and TLS
boundaries, and the functions that compute them for us must not call malloc
or any other public library functions. For the main thread, these functions
should work reliably very early (in pre_init functions).
Comment 1 Rich Felker 2013-12-04 09:19:26 UTC
Do you want to get the information for the calling thread, or for an arbitrary target thread? The latter is much harder.

For the main thread, determining its stack extents is difficult and libc really has nothing to do with it; the kernel is mostly responsible. One end is fixed and libc could easily record that and provide it, but the moving end is hard to measure except by the main thread itself where you can just look at the current value of the stack pointer. Of course one issue is whether you want the _current_ extents of the stack or the maximum possible extents. Getting the latter is difficult and may even be impossible under certain conditions where setrlimit has been used to adjust it during the program's lifetime.
Comment 2 Kostya Serebryany 2013-12-04 09:34:39 UTC
(In reply to Rich Felker from comment #1)
> Do you want to get the information for the calling thread, or for an
> arbitrary target thread? The latter is much harder.

Thanks for the clarification. We need only the simple case:
find the stack and tls boundaries for the calling thread. 

> 
> For the main thread, determining its stack extents is difficult and libc
> really has nothing to do with it; the kernel is mostly responsible. One end
> is fixed and libc could easily record that and provide it, 

Even getting one end would be helpful. 
We can continue to deduce the second end from /proc/self/maps and getrlimit.

> but the moving
> end is hard to measure except by the main thread itself where you can just
> look at the current value of the stack pointer. Of course one issue is
> whether you want the _current_ extents of the stack or the maximum possible
> extents. 

We need the maximum possible.
Of course, this is not very useful if the stack is unlimited.

> Getting the latter is difficult and may even be impossible under
> certain conditions where setrlimit has been used to adjust it during the
> program's lifetime.
Comment 3 Rich Felker 2013-12-04 09:56:27 UTC
The stack is never truly unlimited. There's a finite distance in address space between the stack and the next adjacent VMA which inherently limits stack growth. And the kernel only reserves a certain virtual address range in which new mmaps won't be placed. But interpreting a stack limit of RLIM_INFINITY does require making assumptions about how the kernel treats it, which might be invalidated by changes in the kernel. So it's not easy.

Can you clarify why you need to know the max size the stack might grow to, rather than the currently valid stack range? The latter would be much easier to obtain, and the former could (I think) become wrong if the application uses setrlimit between the time the stack size is obtained and the time your code needs to know it.
Comment 4 Kostya Serebryany 2013-12-04 10:24:31 UTC
> Can you clarify why you need to know the max size the stack might grow to,
> rather than the currently valid stack range? The latter would be much easier
> to obtain, and the former could (I think) become wrong if the application
> uses setrlimit between the time the stack size is obtained and the time your
> code needs to know it.

Our code computes the stack bounds at thread initialization and never computes them again, for performance reasons: the bounds are needed by the
frame-pointer-based unwinder,
and we call the unwinder on every malloc/free, so this is hot.
And we don't want to have special code like "if (in_main_thread)" in hot places.
We need the stack bounds in a few other, less performance-critical places too.

Anyway, the moving end of the main thread's stack is the least of our concerns;
TLS size and non-main-thread stack bounds are more painful.
Comment 5 Ondrej Bilka 2013-12-04 14:35:39 UTC
Adding a thread-local variable with the current bounds would be more appropriate. It is interesting for other reasons as well, for example deciding whether memory should be allocated by alloca or malloc.

As far as performance is concerned you cannot do much better; since malloc already uses TLS, you can avoid that overhead.

Also, how do you handle split stacks? It's a gcc feature that does more harm than good, so do you support it?
Comment 6 Kostya Serebryany 2013-12-04 14:49:46 UTC
(In reply to Ondrej Bilka from comment #5)
> Adding a thread-local variable with current bounds would be more

You mean, glibc can export thread-local variables for stack bounds and TLS bounds?
That will be very convenient for us, but will consume some amount of TLS,
so if I were a glibc developer I would think twice. 


> appropriate. It is interesting from other reasons, for example deciding if
> memory should be allocated by alloca or malloc.

Hmm. Interesting indeed, although a bit scary. 
Suppose you have a deep recursion, and before the recursion starts 
(i.e. while you still have plenty of stack) you decide to replace a large malloc with alloca.
One should use this nice hack with caution. 

> 
> As performance is concerned you cannot do much better, as malloc already
> uses tls you can avoid that overhead.
> 
> Also how do you handle split-stacks? Its gcc feature that does more harm
> than good so do you support that?
I don't think anyone has ever used split stacks with AddressSanitizer; 
they would hardly work together.
Comment 7 Rich Felker 2013-12-04 15:46:35 UTC
There's no reason to ask libc for the moving end of the stack; simply &local for any automatic variable "local" will do the job. Only the other end of the stack needs to be known by your code, and it never changes.
Comment 8 Kostya Serebryany 2013-12-04 17:20:17 UTC
(In reply to Rich Felker from comment #7)
> There's no reason to ask libc for the moving end of the stack; 

There are a few other reasons for us, at least for non-main threads:
- we need to know the stack boundaries to correctly report the thread id on 
stack buffer overflow
- we need to unpoison the shadow memory that corresponds to the thread's stack at thread shutdown
- and probably some other, less important ones


> simply &local
> for any automatic variable "local" will do the job. Only the other end of
> the stack needs to be known by your code, and it never changes.
Comment 9 Rich Felker 2013-12-04 17:33:31 UTC
The identity of the current thread does not depend on the current stack address. You can get it with pthread_self(), or perhaps even via inspecting the thread pointer register (or similar arch-specific mechanism) directly. Note that even the stack range provided by pthread_getattr_np or some proposed new mechanism is not sufficient to get all possible stack addresses the thread might run on, since sigaltstack may be used for signal handlers which run in the thread. (There's also GCC's split-stack nonsense but I'm assuming everybody's already realized that was a horrible idea and given up on using it...)
Comment 10 Kostya Serebryany 2013-12-05 03:54:15 UTC
(In reply to Rich Felker from comment #9)
> The identity of the current thread does not depend on the current stack
> address. You can get it with pthread_self(), or perhaps even via inspecting
> the thread pointer register (or similar arch-specific mechanism) directly.
> Note that even the stack range provided by pthread_getattr_np or some
> proposed new mechanism is not sufficient to get all possible stack addresses
> the thread might run on, since sigaltstack may be used for signal handlers
> which run in the thread. (There's also GCC's split-stack nonsense but I'm
> assuming everybody's already realized that was a horrible idea and given up
> on using it...)
We are interested only in the main stack: not split-stack, not sigaltstack's one.
Comment 11 Kostya Serebryany 2013-12-11 11:01:14 UTC
I accidentally found the global symbol __libc_stack_end.
It is not thread-local, so it helps only for the main thread stack.
Can we rely on it?
Comment 12 Rich Felker 2013-12-11 15:14:22 UTC
In general, unless specifically documented (such as a few functions from stdio_ext.h with historic precedent), any identifier prefixed with __ is not intended for application usage and not guaranteed to remain exposed or to retain its current semantics. It may be the case that this one has to, since on some platforms (powerpc at least), libgcc was abusing it as a (buggy) way to find the aux vector. But I would recommend looking for a cleaner solution.
Comment 13 Carlos O'Donell 2014-01-16 15:06:49 UTC
(In reply to Kostya Serebryany from comment #0)
> This is both a feature request and a request for a suggestion. 
> We are developing dynamic testing tools like AddressSanitizer and
> ThreadSanitizer
> which are now part of both LLVM and GCC distributions. 
> The tools have a run-time library that needs to compute thread's stack and 
> tls boundaries. Right now it is done with a series of ugly hacks.

This is something I think we can help with by providing implementation
symbols internally that ASAN and other tools can rely upon as part of
an agreement we have with those tools.
 
> For non-main threads we get the stack boundaries by querying
> pthread_getattr_np and pthread_attr_getstack. This works, 
> but is problematic because pthread_getattr_np calls malloc and we intercept
> malloc. 

OK.

> For the main thread we get the stack boundaries by querying 
> getrlimit, /proc/self/maps, and an address of a local variable because at
> the point when we need the info libpthread might not have initialized
> itself. 

OK.

> Getting TLS boundaries is even more involved. 
> Today we use the glibc's private symbol _dl_get_tls_static_info
> and then subtract a magic constant (size of the glibc's thread descriptor).

That's ugly, and we could do better there.

However, you're only talking about static TLS there.

There is dynamic TLS also that is not within the static tls block.

Do you need information about that? Such information is racy, prone to
change as libraries are loaded and unloaded etc.

> This works for us on a system we are testing (Ubuntu 12.04),
> but this is very fragile and is often broken on older Linux distros. 

Understood.

> Is it possible to compute the stack and tls boundaries more reliably today?

It is possible to compute at least one end of the main thread stack.

It is possible to compute the boundaries of the non-main thread stack easily.

It is also possible to know the boundaries of static tls for all threads.

It is harder to know the boundaries of dynamic tls for all threads, but
potentially possible. The information once you have it though could
immediately become stale (given the dynamic nature).

Perhaps we can start by addressing the first three in a reliable way.

> If not, would it be possible to add some new functions to do that?

Could you suggest an API?

> For every thread (including the main one) we need to know its stack and tls
> boundaries and the functions that compute it for us should not call malloc
> or any other public library functions. For the main thread, these functions
> should
> work reliably very early (in pre_init functions)

Understood.

As I said, if you designed an API for this, we could look at it and say
whether it's possible to fill it in from our end.

You also need to think about what security implications such an API
would have, if any.

Stability of the API, etc.
Comment 14 Kostya Serebryany 2014-01-23 15:45:31 UTC
Turns out we do need both static and dynamic TLS.
E.g. the new implementation breaks our (hackish) implementation of 
LeakSanitizer.

When leak detection begins, we need to include all TLS ranges of live threads
into the initial root set so that pointers stored there are treated as live. 
With the old glibc implementation, where dynamic TLS is allocated with 
__libc_memalign, we treat all calls to __libc_memalign that come from 
elf/dl-tls.c as those that should be stored in the root set. Ugly, but works.
Once the dynamic TLS is allocated via mmap, we lose all means of intercepting it (asan/lsan do not try to intercept mmap).

Here is a test case that demonstrates the false positive:

/* RUN:                                                                          
SYSROOT=/path/to/fresh/glibc/trunk/build                                         
GCC_ROOT=/path/to/fresh/gcc/trunk/build
# Build a.out                                                                    
$GCC_ROOT/bin/gcc -g -fsanitize=address  -static-libasan   -L${SYSROOT}/usr/lib64 \
  --sysroot=${SYSROOT}   -Wl,-rpath=${SYSROOT}/lib -I${SYSROOT}/include \        
  -Wl,--dynamic-linker=${SYSROOT}/lib/ld-2.18.90.so \                            
  use_tls_dynamic.c
# Build .so                                                                      
$GCC_ROOT/bin/gcc -fPIC use_tls_dynamic.c -shared \
 --sysroot=${SYSROOT}  -Wl,--dynamic-linker=${SYSROOT}/lib/ld-2.18.90.so \
 -I${SYSROOT}/include -DBUILD_SO -o use_tls_dynamic-so.so
# Run with LeakSanitizer enabled                                                 
                                                                                 
ASAN_OPTIONS=detect_leaks=1 ./a.out                                              

# Output will look like:
==14336==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 1337 byte(s) in 1 object(s) allocated from:
    #0 0x444ba7 in __interceptor_malloc
    #1 0x45a3b5 in main use_tls_dynamic.c:37
    #2 0x7faae3087b0c in __libc_start_main glibc/csu/libc-start.c:285

SUMMARY: AddressSanitizer: 1337 byte(s) leaked in 1 allocation(s).
*/
#ifndef BUILD_SO                                                                 
#include <assert.h>                                                              
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>                                                              

int main(int argc, char *argv[]) {
  const char *path = "./use_tls_dynamic-so.so";                                  
  
  void *handle = dlopen(path, RTLD_LAZY);
  if (!handle) fprintf(stderr, "%s\n", dlerror());                               
  assert(handle != 0);
  typedef void **(* store_t)(void *p);
  store_t StoreToTLS = (store_t)dlsym(handle, "StoreToTLS");                     
  assert(dlerror() == 0);
  
  void *p = malloc(1337);
  void **p_in_tls = StoreToTLS(p);                                               
  assert(*p_in_tls == p);
  fprintf(stderr, "Test alloc: %p.\n", p);
  return 0;
} 
#else  // BUILD_SO
__thread void *huge_thread_local_array[1 << 17];                                 
void **StoreToTLS(void *p) {
  huge_thread_local_array[0] = p;
  return &huge_thread_local_array[0];                                            
} 
#endif
Comment 15 Kostya Serebryany 2014-01-23 15:57:15 UTC
>> You also need to think about what security implications there are about
>> such an API, if any.
I am not a security expert, but it looks like getting this information
in a non-portable, hackish way is already possible today w/o any additional interfaces.


>> Could you suggest an API?
Let me try... 
Today we have a global variable __libc_stack_end.
For the stack and static TLS we may have thread-local vars that point 
to the stack/TLS bounds:
__libc_thread_stack_begin
__libc_thread_stack_end
__libc_thread_static_tls_begin
__libc_thread_static_tls_end
A function that returns these values for the current thread would 
be fine too, as long as that function does not call anything else
(god forbid, malloc).
For the main thread's stack it's OK to have only one of the stack bounds
meaningful.

Dynamic TLS is indeed harder.
Is it possible to register a callback which will be called from glibc whenever
it adds/removes a dynamic TLS slot for every thread?

void __libc_register_dynamic_tls_slot_update(
  void (*f)(void *beg, void *end, 
            int created /* 1 for creation, 0 for deletion */));
Comment 16 Kostya Serebryany 2014-01-24 08:46:20 UTC
FTR: the new TLS implementation (with __signal_safe_memalign instead of __libc_memalign) also causes a false positive with MemorySanitizer: 

1. Create a thread with dynamic TLS, put garbage there.
2. Join the thread.
3. Create a new thread; it will likely get the same DTLS address. The shadow of the DTLS is still poisoned, so we report a false positive.
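
A rough repro sketch of that sequence, reusing the use_tls_dynamic-so.so and
StoreToTLS from comment 14 (only the thread choreography matters; build
details as in that comment):

#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>

typedef void **(*store_t)(void *);
static store_t StoreToTLS;

static void *touch_dtls(void *arg) {
  /* The first TLS access from this thread allocates its DTLS block. */
  StoreToTLS(arg);
  return 0;
}

int main(void) {
  void *h = dlopen("./use_tls_dynamic-so.so", RTLD_LAZY);
  assert(h);
  StoreToTLS = (store_t)dlsym(h, "StoreToTLS");
  assert(StoreToTLS);
  pthread_t t;
  pthread_create(&t, 0, touch_dtls, (void *)1); /* poisons the DTLS shadow */
  pthread_join(t, 0);
  pthread_create(&t, 0, touch_dtls, (void *)2); /* likely reuses the same
                                                   address; shadow is stale */
  pthread_join(t, 0);
  return 0;
}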
Comment 17 Rich Felker 2014-01-27 01:28:44 UTC
Kostya, for what it's worth, I would very much prefer an API based on function calls rather than external-linkage static and thread-local objects. Using function calls gives the implementation a lot more freedom. In particular, as long as the API is reasonable, I'd like to support it in musl libc, but we don't require or plan to require a toolchain that supports TLS as a dependency for building libc. In general, I think going through accessor functions has a better chance of getting other FOSS systems to adopt the same or similar API, making it easier for tools/libraries that need to use this API.
Comment 18 Kostya Serebryany 2014-01-27 07:16:02 UTC
(In reply to Rich Felker from comment #17)
> Kostya, for what it's worth, I would very much prefer an API based on
> function calls 
Function calls would work equally well for us as long as these functions
don't call anything else (e.g. malloc, getrlimit, etc).

void __libc_get_stack_bounds(void **stack_beg, void **stack_end);
void __libc_get_static_tls_bounds(void **stls_beg, void **stls_end);
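
For illustration, a sketch of how a sanitizer runtime might use these two
proposed (hypothetical, not-yet-existing) entry points at thread start:

void __libc_get_stack_bounds(void **stack_beg, void **stack_end);    /* proposed */
void __libc_get_static_tls_bounds(void **stls_beg, void **stls_end); /* proposed */

static void on_thread_start(void) {
  void *stk_beg, *stk_end, *tls_beg, *tls_end;
  __libc_get_stack_bounds(&stk_beg, &stk_end);
  __libc_get_static_tls_bounds(&tls_beg, &tls_end);
  /* record all four pointers in the tool's per-thread registry */
}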
Comment 19 Kostya Serebryany 2014-01-29 10:00:29 UTC
I've implemented a very ugly hack to handle dynamic TLS in both <=2.18 and 2.19
(intercepting __tls_get_addr and making lots of assumptions:
http://llvm.org/viewvc/llvm-project?rev=200384&view=rev)

However, I kindly ask that this issue be resolved in the right way before 2.20,
so I am setting the target milestone accordingly. 

To summarize my current view of the required interface:

// Get the stack bounds for the current thread. 
// For main thread, the value of stack_beg is undefined.
void __libc_get_stack_bounds(void **stack_beg, void **stack_end);

// Get the bounds of static tls for the current thread.
void __libc_get_static_tls_bounds(void **stls_beg, void **stls_end);

// These functions are called by glibc when a dynamic TLS block is created/destroyed in 
// the current thread. By default the functions do nothing, and the user may 
// redefine them.
void __libc_on_dynamic_tls_create(void *beg, void *end);
void __libc_on_dynamic_tls_destroy(void *beg, void *end);
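
Under this proposal a tool would simply provide strong definitions of the
default-empty hooks, e.g. (hook names from the sketch above; the tool_*
helpers are hypothetical):

extern void tool_register_dtls_range(void *beg, void *end);
extern void tool_unregister_dtls_range(void *beg, void *end);

/* glibc would call these in the affected thread, possibly from deep inside
   the TLS machinery, so the bodies must be thread- and AS-safe. */
void __libc_on_dynamic_tls_create(void *beg, void *end) {
  tool_register_dtls_range(beg, end);
}
void __libc_on_dynamic_tls_destroy(void *beg, void *end) {
  tool_unregister_dtls_range(beg, end);
}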
Comment 20 Rich Felker 2014-01-29 10:25:55 UTC
I'm not convinced that __libc_on_dynamic_tls_create and __libc_on_dynamic_tls_destroy are a good API. Assuming they're called when the thread obtains new TLS storage, the implementations of these functions would need to be both thread-safe and AS-safe. This requirement would at least need to be documented.

A cleaner API alternative might be to provide an iterator for TLS memory ranges in the current thread, similar to how dl_iterate_phdr works for loaded DSOs. This would avoid placing complex safety requirements on the component using the API and would avoid having global state. (Interposing symbols over top of libc-internal symbols is a NASTY hidden global state, since it admits only one client to the API.)
Comment 21 Kostya Serebryany 2014-01-29 10:40:31 UTC
(In reply to Rich Felker from comment #20)
> I'm not convinced that __libc_on_dynamic_tls_create and
> __libc_on_dynamic_tls_destroy are a good API. Assuming they're called when
> the thread obtains new TLS storage, the implementations of these functions
> would need to be both thread-safe and AS-safe. This requirement would at
> least need to be documented.

Good point. A thread- and AS-safe implementation is clearly possible. 

> 
> A cleaner API alternative might be to provide an iterator for TLS memory
> ranges in the current thread, similar to how dl_iterate_phdr works for
> loaded DSOs. This would avoid placing complex safety requirements on the
> component using the API and would avoid having global state. (

I am not sure this will work for us. 
Consider LeakSanitizer, which stops all threads before starting the real work.
It needs to query all threads while they are stopped, so it cannot ask 
the threads to iterate over their TLS ranges; the information must already
be stored somewhere. 

> Interposing
> symbols over top of libc-internal symbols is a NASTY hidden global state,
> since it admits only one client to the API.)
Instead of interposing we can use callbacks.
Comment 22 Rich Felker 2014-01-29 14:25:42 UTC
Callbacks are still global state in this context, but they're not quite as bad since they do admit multiple clients to the API. How does LeakSanitizer "stop" threads? Even when gdb stops a process it's able to make the process perform function calls, but perhaps this is too much of a pain to implement.
Comment 23 Kostya Serebryany 2014-01-29 15:22:59 UTC
>> How does LeakSanitizer "stop" threads?
It creates a special thread (using raw clone) and then ptraces all other threads.

http://llvm.org/viewvc/llvm-project/compiler-rt/trunk/lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc?revision=192689&view=markup&pathrev=192689

>> but perhaps this is too much of a pain to implement.
Oh, yea. Even the current implementation is a bit too complex...
Comment 24 Rich Felker 2014-01-29 18:21:14 UTC
What about an enumeration API that takes a target pthread_t rather than operating on the calling thread? This is just an idea; I'm not sure I like it either, because it's inherently racy (the information returned may be outdated) unless you call it with the target thread stopped or at least blocked. I also think it may be unsafe to call it from a "thread" created with raw clone, but this really applies to ANY libc function (even the others we're proposing); all libc functions are free to assume they have a valid thread pointer, TCB, TLS, DTV, etc. which LeakSanitizer cannot provide without poking badly at glibc internals. Is there a reason you're not simply using pthread_create to create this thread? The POSIX threads API is intended to be such that applications should not need to know or care about the existence of threads produced internally by the implementation or by third-party library functions.
Comment 25 Kostya Serebryany 2014-01-29 18:35:51 UTC
(In reply to Rich Felker from comment #24)
> What about an enumeration API that takes a target pthread_t rather than
> operating on the calling thread? 

I dislike that equally. 
MemorySanitizer needs to be informed of TLS creation/destruction events
when they happen, to avoid false positives (comment #16).

> This is just an idea; I'm not sure I like
> it either, because it's inherently racy (the information returned may be
> outdated) unless you call it with the target thread stopped or at least
> blocked. I also think it may be unsafe to call it from a "thread" created
> with raw clone, but this really applies to ANY libc function (even the
> others we're proposing); all libc functions are free to assume they have a
> valid thread pointer, TCB, TLS, DTV, etc. which LeakSanitizer cannot provide
> without poking badly at glibc internals. 

> Is there a reason you're not simply
> using pthread_create to create this thread? The POSIX threads API is
> intended to be such that applications should not need to know or care about
> the existence of threads produced internally by the implementation or by
> third-party library functions.

Sergey, please reply here, you are better equipped.
Comment 26 Sergey Matveev 2014-01-29 18:50:40 UTC
>> Is there a reason you're not simply using pthread_create to create this thread?

The task must belong to a different thread group, because ptrace cannot attach to tasks within the same thread group. We also cannot use a full-blown separate process for this because we need direct access to the parent's address space.

Even if we could ptrace between threads, it still wouldn't be safe to call glibc functions while other threads are frozen at arbitrary points.

>> I also think it may be unsafe to call it from a "thread" created with raw clone, but this really applies to ANY libc function (even the others we're proposing); all libc functions are free to assume they have a valid thread pointer, TCB, TLS, DTV, etc. which LeakSanitizer cannot provide without poking badly at glibc internals.

Right, the difference is that for functions like __libc_get_stack_bounds(), we don't have to call them during the "stop-the-world" phase. We can query the stack bounds once (immediately after the thread is created) and store them in our own thread registry, which we can safely access from the cloned task later. We cannot reliably do this to DTLS with the "iterator" interface you proposed. If we queried this information at some point prior to freezing the threads, we would have no way of knowing that it's not gone out of date by the time we actually freeze them.
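
A sketch of that query-once-at-creation pattern (tool-side; the registry and
the pthread_create interception are assumed, not shown):

#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>

struct start_args { void *(*fn)(void *); void *arg; };  /* hypothetical */
extern void registry_record_stack(pthread_t t, void *bottom, size_t size); /* hypothetical */

/* Installed by a pthread_create interceptor as the real start routine. */
static void *trampoline(void *p) {
  struct start_args *a = p;
  pthread_attr_t attr;
  void *bottom; size_t size;
  pthread_getattr_np(pthread_self(), &attr);
  pthread_attr_getstack(&attr, &bottom, &size);
  pthread_attr_destroy(&attr);
  registry_record_stack(pthread_self(), bottom, size); /* read later, from the
                                                          cloned task */
  return a->fn(a->arg);
}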
Comment 27 Rich Felker 2014-01-30 05:10:32 UTC
I don't see how what you're doing can work at all, except by chance, unless you are completely refraining from using any libc functionality in the cloned "thread", which is neither a valid thread nor a valid process from libc's perspective. This is making me more and more doubtful that glibc or any other libc could provide a permanent API contract suited to your needs without severely constraining its own internals in ways that affect large parts of the implementation.

As for the need to do these hacks, both ptrace and process_vm_readv/process_vm_writev provide access to the traced process's vm space. Is the reason you need to share virtual address space with the traced process simply for the sake of performance?
Comment 28 Kostya Serebryany 2014-01-30 05:13:54 UTC
(In reply to Rich Felker from comment #27)
> I don't see how what you're doing can work at all except by chance unless
> you are completely refraining from using any libc functionality in the
> cloned "thread" which is neither a valid thread nor a valid process from
> libc's perspective. 

You are absolutely right, we are not using anything from libc in that special
"thread".

> This is making me more and more doubtful that glibc or
> any other libc could provide a suitable permanent API contract suitable to
> your needs without severely constraining its own internals in ways that
> affect large parts of the  implementation.

Note that the interface we are asking for is not to be called from the "special"
thread. We need to call it in regular threads, and not only for LeakSanitizer
which uses this special thread, but also for other tools
that don't do it.


> As for the need to do these hacks, both ptrace and
> process_vm_readv/process_vm_writev provide access to the traced process's vm
> space. Is the reason you need to share virtual address space with the traced
> process simply for the sake of performance?
Sergey?
Comment 29 Sergey Matveev 2014-01-30 19:06:29 UTC
(In reply to Rich Felker from comment #27)
> I don't see how what you're doing can work at all except by chance unless
> you are completely refraining from using any libc functionality in the
> cloned "thread" which is neither a valid thread nor a valid process from
> libc's perspective. This is making me more and more doubtful that glibc or
> any other libc could provide a suitable permanent API contract suitable to
> your needs without severely constraining its own internals in ways that
> affect large parts of the  implementation.

Ugly as our approach may be, I don't see how it affects us in the context of the present discussion. If we were to use a proper separate process, that would not make the task of discovering the DTLS ranges any easier.

Moreover, the reason why we avoid calling libc functions from the cloned task is not _just_ because it is not a proper thread. It's also because the other threads are frozen at arbitrary points while the cloned task runs, and some of them could be holding libc locks. So even if we took the opposite road of using a regular posix thread (which is possible - just broker out the ptrace code to a separate process, make it attach to every thread except the current one, and do the actual work in the current thread) we still wouldn't want to call libc functions from it.

So we don't need an interface that we could use from the cloned task. What we actually need is an interface that would give us the DTLS ranges at some point prior to launching the cloned task, and somehow ensure they're still valid when it runs.

Callbacks are a natural fit for this. If glibc cooperates by calling our callbacks immediately after a DTLS block allocation and immediately before deallocation, and the callbacks record those events atomically, then we can freeze the thread at an arbitrary point and be sure that our record of what the DTLS looks like is up-to-date.

If, on the other hand, we are only allowed to ask glibc about the current state of the DTLS, then we must query it immediately before launching the cloned task, while somehow preventing it from changing until the threads are actually frozen. I don't see a good way to do that, short of having glibc protect the DTLS code with a global lock.

> As for the need to do these hacks, both ptrace and
> process_vm_readv/process_vm_writev provide access to the traced process's vm
> space. Is the reason you need to share virtual address space with the traced
> process simply for the sake of performance?

Yes. Given that we need to scan most of the memory used by the process, using PTRACE_PEEKTEXT would be impractical. I wasn't aware of process_vm_readv before, so I don't have any benchmarks at hand. If we were to switch to this method, I can think of some ways to _maybe_ get reasonable performance, at the cost of a substantial increase in complexity.
Comment 30 Carlos O'Donell 2014-01-30 19:21:31 UTC
(In reply to Sergey Matveev from comment #29)
> Callbacks are a natural fit for this. If glibc cooperates by calling our
> callbacks immediately after a DTLS block allocation and immediately before
> deallocation, and the callbacks record those events atomically, then we can
> freeze the thread at an arbitrary point and be sure that our record of what
> the DTLS looks like is up-to-date.

Instead of a callback could we use a systemtap probe with the right arguments? We have precedent set for adding probes to libc and it would be easy to do with the current infrastructure and low-overhead.
Comment 31 Sergey Matveev 2014-01-30 19:43:34 UTC
(In reply to Carlos O'Donell from comment #30)

> Instead of a callback could we use a systemtap probe with the right
> arguments? We have precedent set for adding probes to libc and it would be
> easy to do with the current infrastructure and low-overhead.

Kostya investigated this possibility not long ago and rejected it. I'm going to defer this to him since I'm not familiar with systemtap.
Comment 32 Kostya Serebryany 2014-01-31 05:05:24 UTC
> Instead of a callback could we use a systemtap probe with the right
> arguments? We have precedent set for adding probes to libc and it would be
> easy to do with the current infrastructure and low-overhead.

I admit I have not investigated systemtap properly (I need to install a newer Linux distro first). However, from what I see, systemtap is not enabled by default 
in the glibc build and requires extra packages to be installed on the system.
This means that some lsan/msan users will be left behind 
(those who can't install extra packages) and we'll be forced to keep the hack
as the fallback, which defeats the purpose of having the right solution. 

Besides, there is a consistency reason (which is not necessarily relevant for you).
Sanitizers already heavily rely on interposition for many cases, 
including all cases where glibc has systemtap hooks, but not limited to that.
Even if we used systemtap for pthread events, at least one hook
is not usable for us (in pthread_create, we actually replace one of the parameters, which systemtap does not allow, afaict).
Comment 33 Carlos O'Donell 2014-01-31 05:31:01 UTC
(In reply to Kostya Serebryany from comment #32)
> > Instead of a callback could we use a systemtap probe with the right
> > arguments? We have precedent set for adding probes to libc and it would be
> > easy to do with the current infrastructure and low-overhead.
> 
> I admit I have not investigated systemtap properly (need to install a newer
> Linux distro first). However, from what I see systemtap is not enabled by
> default 
> in glibc build and requires extra packages to be installed on the system.
> Which means that some lsan/msan users will be left behind 
> (those who can't install extra packages) and we'll be forced to keep the hack
> for the fallback, which defeats the purpose of having the right solution. 

For RHEL and Fedora we always enable systemtap. It is a requirement that the user's kernel have kprobes enabled.

The reason I mention systemtap is that it gives you a richer and more flexible way to look at the internals of glibc in realtime.

> Besides there is a consistency reason (which is not necessary relevant for
> you).
> Sanitizers already heavily rely on interposition for many cases, 
> including all cases where glibc has systemtap hooks, but not limited to that.

That's fine, and you can continue to use interposition. I'm not saying you shouldn't. I'm just saying that for TLS and stack extents we may wish to create a probe point that we declare stable for tool usage.

> Even if we used systemtap for pthread events, at least one hook
> is not usable for us (in pthread_create, we actually replace one of the
> parameters, which systemtap does not allow, afaict).

systemtap allows modifying all parameters and having those modifications impact the running application. In fact I use systemtap all the time to inject failures into glibc code in realtime for testing.

I hope that I don't sound like I'm telling you to use systemtap. Far from it. I'm just suggesting an option, and trying to figure out whether this option is or is not a good match for what you want.

In summary:
- Needs kprobes.
- Deep introspection with internal glibc probe points.
- Ability to modify arguments.

The in-process function call is probably still faster...
Comment 34 Kostya Serebryany 2014-01-31 05:50:26 UTC
Can we do both? (interposable function call and a probe)
Comment 35 Carlos O'Donell 2014-01-31 06:08:51 UTC
(In reply to Kostya Serebryany from comment #34)
> Can we do both? (interposable function call and a probe)

In general? Sure, why not? I assume you wish to continue interposing malloc family functions.

As the API for determining TLS and stack boundaries? I'd prefer we had either a function/callback API or a probe-based API. Even a public AS-safe allocator API, when interposed, is just a hack for getting at what you really want, which is the TLS and stack boundary information.
Comment 36 Kostya Serebryany 2014-01-31 07:02:39 UTC
> As the API for determining TLS and stack boundaries? I'd prefer we had
> either a function/callback API or probe-based API. Even a public AS-safe
> allocator API, when interposed, is just a hack for getting at what you
> really want which is the TLS and stack boundary information.

I'd clearly prefer an API based on callbacks or function interposition over a
probe-based API, simply because I know the former will always work, 
and I am not sure the latter will.
The risks with systemtap are:
 - other (non-RedHat) distros will not have it by default or will require extra packages
 - systemtap may not work with one of a dozen sandboxes our users like to mix with *san
 - something yet unknown

When is 2.20 feature freeze? 
If we have enough time, we can still experiment with both. 
I'd add a macro that expands into either a function call or a probe, depending on 
a configure-time setting, and give both variants some extended testing.
Comment 37 Carlos O'Donell 2014-01-31 08:57:07 UTC
(In reply to Kostya Serebryany from comment #36)
> > As the API for determining TLS and stack boundaries? I'd prefer we had
> > either a function/callback API or probe-based API. Even a public AS-safe
> > allocator API, when interposed, is just a hack for getting at what you
> > really want which is the TLS and stack boundary information.
> 
> I'd clearly prefer an API based on callbacks or function interposition over
> probe-based API, simply because I know the former will always work, 
> and not sure if the latter will.
> The risks with systemtap are 
>  - other (non-RedHat) distros will not have it by default or will require
> extra packages
>  - systemtap may not work with one of a dozen sandboxes our users like to
> mix with *san
>  - something yet unknown
> 
> When is 2.20 feature freeze? 
> If we have enough time, we can still experiment with both. 
> I'd add a macro that extends into either a function call or a probe
> depending on 
> a configure-time setting and give both variants some extended testing.

It's a 6-month cycle. We will release 2.20 in roughly 6 months.

Your comments are all valid. We'll look at an API.
Comment 38 Carlos O'Donell 2014-01-31 19:00:13 UTC
I just had a sit-down with Josh Stone and Frank Eigler at Red Hat to talk about using systemtap to solve some of the API issues. After the discussion we all agreed that there were solutions, but they would be too ad hoc to work in this case.

I think we're looking at strictly an in-process API that ASAN can use to get the information it needs.

I don't particularly like callbacks, but anything more complicated would require libpthread, and I don't like that either. So a callback should be the lesser of two evils.

Kostya, can you please start a design page for the API on the glibc wiki?
https://sourceware.org/glibc/wiki/

If you can't edit the wiki you can go here and ask one of these people to add your account to the editor group:
https://sourceware.org/glibc/wiki/EditorGroup
Comment 39 Rich Felker 2014-01-31 19:40:48 UTC
I don't like hook-style callbacks either. My preference remains a query-based API, preferably documented as being AS-safe and possibly other properties that would make ASAN and similar tools happy (for example having bounded execution time even if other threads are stopped indefinitely such as via a signal handler interrupting them while holding a lock). I realize this makes a bit more work for tools like ASAN which would rather have asynchronous notification (note: they could still get it by intercepting the functions that could change the state then calling the query function during the intercept), but such tools are already doing something that's fairly hard and invasive.

I've been looking at this tracker thread so far with the hope of being able to provide the same or close-to-the-same API in musl, but if it looks like that's not likely to happen, or that it's distracting from the goal of getting something done with respect to just glibc, I can back off somewhat from this thread, focusing any further contribution I make to it on just the technical aspects.
Comment 40 Carlos O'Donell 2014-01-31 20:43:19 UTC
(In reply to Rich Felker from comment #39)
> I don't like hook-style callbacks either. My preference remains a
> query-based API, preferably documented as being AS-safe and possibly other
> properties that would make ASAN and similar tools happy (for example having
> bounded execution time even if other threads are stopped indefinitely such
> as via a signal handler interrupting them while holding a lock). I realize
> this makes a bit more work for tools like ASAN which would rather have
> asynchronous notification (note: they could still get it by intercepting the
> functions that could change the state then calling the query function during
> the intercept), but such tools are already doing something that's fairly
> hard and invasive.

I agree with you. I'd rather have us discuss a non-callback API first and see if we can come up with something.
 
> So far I've been looking at this tracker thread so far with the hopes of
> being able to provide the same or close-to-the-same API in musl, but if it
> looks like that's not likely to happen or it's distracting from the goal of
> getting something done with respect to just glibc, I can back off somewhat
> from this thread, focusing any further contribution I make to it on just
> technical aspects.

That's fine. We have other libcs to consider as well, including bionic, uclibc, etc., which might need to implement this API.
Comment 41 Sergey Matveev 2014-01-31 22:33:05 UTC
(In reply to Rich Felker from comment #39)
> I realize
> this makes a bit more work for tools like ASAN which would rather have
> asynchronous notification (note: they could still get it by intercepting the
> functions that could change the state then calling the query function during
> the intercept)

Doesn't this basically translate to __dl_tls_getaddr(), as far as allocation of new DTLS blocks is concerned? It's the only function that we can intercept, and it's a hot one. If we take this approach, we'll need a fast way to check whether our record needs updating before performing a full query. Something as simple as a function which returns the current DTV's generation counter should work, I think?
Comment 42 Kostya Serebryany 2014-02-03 08:13:58 UTC
> Kostya, can you please start a design page for the API on the glibc wiki?
> https://sourceware.org/glibc/wiki/
Will do as soon as the site stops throwing me out with timeouts. :( 


(In reply to Rich Felker from comment #39)
> I don't like hook-style callbacks either. My preference remains a
> query-based API, preferably documented as being AS-safe and possibly other
> properties that would make ASAN and similar tools happy (for example having
> bounded execution time even if other threads are stopped indefinitely such
> as via a signal handler interrupting them while holding a lock). I realize
> this makes a bit more work for tools like ASAN which would rather have
> asynchronous notification (note: they could still get it by intercepting the
> functions that could change the state then calling the query function during
> the intercept), but such tools are already doing something that's fairly
> hard and invasive.

We need to be notified about both creation and destruction of dynamic TLS.
It is reasonably simple to get notification of DTLS creation:
  intercept __tls_get_addr and check whether we observe the DSO id for the first time. Still a bit hackish. Besides, if we still rely on the interceptor for __tls_get_addr to call the query function, it means that we introduce a noticeable slowdown for all DTLS accesses. 
But how do I get notified about DTLS destruction? 


(In reply to Sergey Matveev from comment #41)
> Doesn't this basically translate to __dl_tls_getaddr(), as far as allocation
> of new DTLS blocks is concerned? It's the only function that we can

Sergey, do you mean __tls_get_addr? 
Yes, intercepting all calls to __tls_get_addr just to catch the first moment
of DTLS creation is costly, and it does not solve the problem of intercepting 
DTLS destruction.

> intercept, and it's a hot one. If we take this approach, we'll need a fast
> way to check whether our record needs updating before performing a full
> query. Something as simple as a function which returns the current DTV's
> generation counter should work, I think?
There is a simpler way: store a fixed-size bit vector in static TLS, 
the i-th bit indicating whether __tls_get_addr has been called with the 
i-th DSO.
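
A sketch of that bit vector in an interposed __tls_get_addr (the tls_index
layout mirrors glibc's internal one; real_tls_get_addr stands in for however
the tool reaches the original symbol):

typedef struct { unsigned long ti_module, ti_offset; } tls_index;
extern void *real_tls_get_addr(tls_index *ti);   /* hypothetical */

#define NBITS (8 * sizeof(unsigned long))
/* __thread in the main tool image lands in static TLS (initial-exec),
   so touching it cannot recurse back into __tls_get_addr. */
static __thread unsigned long seen_dso[1024 / NBITS];

void *__tls_get_addr(tls_index *ti) {            /* interposed */
  unsigned long m = ti->ti_module;
  if (m < 1024 && !(seen_dso[m / NBITS] & (1UL << (m % NBITS)))) {
    seen_dso[m / NBITS] |= 1UL << (m % NBITS);
    /* first access to this module's TLS in this thread:
       take the slow path and (re)query the DTLS ranges here */
  }
  return real_tls_get_addr(ti);
}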
Comment 43 Carlos O'Donell 2014-02-03 14:35:37 UTC
(In reply to Kostya Serebryany from comment #42)
> > Kostya, can you please start a design page for the API on the glibc wiki?
> > https://sourceware.org/glibc/wiki/
> Will do as soon as the site stops throwing me out with timeouts. :( 

For the record I have reached out to overseers about this and I will update libc-alpha with information as soon as I get it. I'm also seeing gateway timeouts.
Comment 44 Rich Felker 2014-02-03 15:59:35 UTC
Kostya, there are only two ways dynamic TLS can be destroyed: either a thread exiting or a call to dlclose. Simply hooking these two events will allow you to keep track of all destruction even if you're limited to a probe-based (rather than hook-based) API.
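
For the dlclose half, the usual interposition only needs real dl APIs (the
requery helper is hypothetical and would use whatever query API comes out of
this discussion):

#define _GNU_SOURCE
#include <dlfcn.h>

extern void tool_requery_dtls_ranges(void);      /* hypothetical */

int dlclose(void *handle) {                      /* interposed over libc's */
  int (*real)(void *) = (int (*)(void *))dlsym(RTLD_NEXT, "dlclose");
  int rc = real(handle);
  tool_requery_dtls_ranges();                    /* results are trustworthy
                                                    again once dlclose returns */
  return rc;
}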
Comment 45 Kostya Serebryany 2014-02-04 14:18:14 UTC
I've created https://sourceware.org/glibc/wiki/ThreadPropertiesAPI;
it has more questions than answers; please comment.

FTR, editing the wiki remains painful due to frequent timeouts. :( 


(In reply to Rich Felker from comment #44)
> Kostya, there are only two ways dynamic TLS can be destroyed: either a
> thread exiting 

Properly catching thread exit is a challenge by itself. 
Today we are using yet another hack to catch thread exit; I wonder
if you could suggest a better approach. 
I added a relevant section to the wiki page above.

> or a call to dlclose. 

When dlclose happens in one thread, we need to do something with the DTLS in 
all threads, which is tricky, if at all possible, w/o knowing 
how exactly glibc itself handles this case. 
A hook-based approach will not have this problem.

> Simply hooking these two events will
> allow you to keep track of all destruction even if you're limited to a
> probe-based (rather than hook-based) API.
Comment 46 joseph@codesourcery.com 2014-02-04 14:30:49 UTC
On Tue, 4 Feb 2014, konstantin.s.serebryany at gmail dot com wrote:

> FTR, Editing the wiki remains painful due to frequent timeouts. :( 

In my experience, the timeouts occur *after* the page has saved 
successfully - the save itself happens very quickly. It just doesn't tell 
you it's saved because of some linear-time process checking thousands of 
wiki accounts for whether they are watching that page and should be 
notified of the change, and it's that linear-time notification process 
that times out.  Rather than waiting for the timeout, after clicking to 
save the changes you should load a copy of the page in another window / 
tab and it will probably include your changes.
Comment 47 Rich Felker 2014-02-04 18:50:14 UTC
On Tue, Feb 04, 2014 at 02:18:14PM +0000, konstantin.s.serebryany at gmail dot com wrote:
> Properly catching thread exit is a challenge by itself, 
> Today we are using yet another hack to catch thread exit -- I wonder
> if you could suggest a better approach. 
> I added a relevant section to the wiki page above.

I don't see the lack of a hook-based approach for DTLS destruction as
a new problem, just another symptom of your lack of a good approach to
catching thread exit. Your usage case is harder than the usual one (which
can be achieved simply by wrapping pthread_create to use a special
start function that installs a cancellation handler then calls the
real start function) because you want to catch the point where the
thread is truly dead (all dtors having been called, DTLS freed, etc.),
which is, formally, supposed to be invisible to application code (i.e.
atomic with respect to thread exit).
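
A sketch of that usual wrapping (it observes normal return and cancellation,
but not the "truly dead" point described above; struct start_args is a
hypothetical holder for the real start routine):

#include <pthread.h>

struct start_args { void *(*fn)(void *); void *arg; };

static void on_thread_exit(void *arg) {
  /* Runs on cancellation/pthread_exit; dtors and DTLS may still be live. */
}

static void *start_trampoline(void *p) {
  struct start_args *a = p;
  void *ret;
  pthread_cleanup_push(on_thread_exit, a->arg);
  ret = a->fn(a->arg);
  pthread_cleanup_pop(1);  /* run the handler on normal return too */
  return ret;
}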

I'm not yet sure what the right approach to this is, but I'm skeptical
of providing a public API to allow applications to observe a state
that's not supposed to be observable.

> > or a call to dlclose. 
> 
> When dlclose happens in one thread, we need to do something with DTLS in 
> all threads, which is tricky, if at all possible, w/o knowing 
> how exactly glibc itself handles this case. 
> hook-based approach will not have this problem,

If the TLS query API works correctly, you should not care how glibc
implements it internally. You should just trust the results to be
correct after dlclose returns, so that wrapping dlclose to call the
real dlclose then re-query after it returns just works.

Of course this has a race window where the memory has already been
freed but you don't know about that. I'm not sure if you care about
that, but if you do, I think the right approach is just to be wrapping
mmap and malloc so that you can see if they allocate a range you
thought belonged to something else, and if so, patch up your records.
Comment 48 Kostya Serebryany 2014-02-04 19:07:58 UTC
(In reply to Rich Felker from comment #47)
> On Tue, Feb 04, 2014 at 02:18:14PM +0000, konstantin.s.serebryany at gmail
> dot com wrote:
> > Properly catching thread exit is a challenge by itself, 
> > Today we are using yet another hack to catch thread exit -- I wonder
> > if you could suggest a better approach. 
> > I added a relevant section to the wiki page above.
> 
> I don't see the lack of a hook-based approach for DTLS destruction as
> a new problem, just another symptom of your lack of a good approach to
> catching thread exit. Your usage case is harder than the usual (which
> can be achieved simply by wrapping pthread_create to use a special
> start function that installs a cancellation handler then calls the
> real start function) because you want to catch the point where the
> thread is truely dead (all dtors having been called, DTLS freed, etc.)
> which is, formally, supposed to be invisible to application code (i.e.
> atomic with respect to thread exit).

Correct. While we are at it, do you have a comment about our 
recursive pthread_setspecific hack? 
I know it's bad, but I don't know how bad.

> 
> I'm not yet sure what the right approach to this is, but I'm skeptical
> of providing a public API to allow applications to observe a state
> that's not supposed to be observable.
> 
> > > or a call to dlclose. 
> > 
> > When dlclose happens in one thread, we need to do something with DTLS in 
> > all threads, which is tricky, if at all possible, w/o knowing 
> > how exactly glibc itself handles this case. 
> > hook-based approach will not have this problem,
> 
> If the TLS query API works correctly, you should not care how glibc
> implements it internally. You should just trust the results to be
> correct after dlclose returns, so that wrapping dlclose to call the
> real dlclose then re-query after it returns just works.

I frankly can't imagine an interface that can be invoked
right before or right after dlclose that would iterate over the DTLS in other
threads, unless those threads are blocked somehow. 

Admittedly, this is not our only problem with dlclose, but if we are to implement a new and shiny interface I'd rather prefer to handle dlclose correctly.

> Of course this has a race window where the memory has already been
> freed but you don't know about that. I'm not sure if you care about

We do care. For msan, this race would mean a sporadic non-reproducible 
false positive or maybe a crash. 


> that, but if you do, I think the right approach is just to be wrapping
> mmap and malloc so that you can see if they allocate a range you
> thought belonged to something else, and if so, patch up your records.

We wrap malloc, and on glibc <= 2.18 we did handle DTLS somewhat satisfactorily,
because DTLS was allocated by __libc_memalign called via the PLT.
We can wrap the libc's mmap (and msan/tsan actually do this already),
but libc itself calls mmap bypassing the 
PLT, so we cannot observe those mmap calls. 
Unless, of course, we complicate things with one or another way of 
catching all syscalls (ptrace, etc.), but that would create more problems 
than it would solve (e.g. it will not work in sandboxed environments).
Comment 49 Kostya Serebryany 2014-04-04 08:50:48 UTC
Ping. We'd really like to resolve this before 2.20,
especially if the patch with AS-safe TLS
(reverted just before the 2.19 release) is going to be there.
Comment 50 Carlos O'Donell 2014-04-04 20:43:13 UTC
(In reply to Kostya Serebryany from comment #49)
> Ping. We'd really want to resolve this before 2.20.
> Especially so, if the patch with AS-safe TLS
> (reverted just before the 2.19 release) is going to be there.

Pinging this discussion won't move it forward. We need to start the discussion on libc-alpha and move it forward to gather consensus for some kind of API that provides what you're looking for based on your wiki design. Right now I'm busy until at least the end of May so I won't get to looking at this until later. However, if you've hashed out what the final glibc API would look like that would be forward progress. In summary I suggest taking your design to libc-alpha and asking directly what the final API should look like.

I would ask you to look over some notes Ben Woodard placed on the wiki about the next generation of a tools interface and see if any of that is applicable and needs to be discussed for this particular API addition:
https://sourceware.org/glibc/wiki/Tools%20Interface%20NG
Comment 51 Kostya Serebryany 2014-04-22 10:23:54 UTC
(In reply to Carlos O'Donell from comment #50)
> We need to start the discussion on libc-alpha..

Started: https://sourceware.org/ml/libc-alpha/2014-04/msg00313.html
Comment 52 Frank Ch. Eigler 2016-08-22 16:06:58 UTC
Adding cc: gbenson for consideration as a possible use of the infinity bytecode widgetry.
Comment 53 Carlos O'Donell 2018-02-15 21:56:42 UTC
I should note that Rust has similar requirements here; I was talking to Josh Stone about this because recent glibc changes made it harder for Rust to know the exact stack boundaries.

Rust wants to know the stack boundaries for every thread because it wants to handle the cases where the thread is getting close to running out of stack. 

In the past the guard size was part of the stack, and after the fixes for bug 22637 it is not, so all low-level code which needed to know this internal detail now needs to adjust for this (like calling __pthread_get_minstack, which Rust does).