Bug 11787

Summary: Programs with large TLS segments fail.
Product: glibc
Reporter: Paul Pluzhnikov <ppluzhnikov>
Component: dynamic-link
Assignee: Not yet assigned to anyone <unassigned>
Status: NEW
Severity: normal
CC: a3at.mail, andersk, aph, asharif.tools, bas, bugdal, carlos, david.abdurachmanov, davidtgoldblatt, devurandom, fweimer, glibc-bugs, hjl.tools, i, ian, jistone, john.wellesz, jsm-csl, kouvel, law, matt, mtk.manpages, sam, siddhesh, stevenschlansker, toolchain, vapier, wangliushuai
Priority: P2
Flags: carlos_odonell: review+, fweimer: security-
Version: 2.12
Target Milestone: ---
Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build: x86_64-unknown-linux-gnu
Last reconfirmed:
Bug Depends on:    
Bug Blocks: 22636    
Attachments:
  - test case from pr10643
  - Proposed fix.
  - Proposed fix that only adds to the stack size if it's less than 16 times __static_tls_size

Description Paul Pluzhnikov 2010-07-02 23:02:12 UTC
This is a continuation of issue 10643.

Using the test case from issue 10643 (will reattach shortly), built with
-DCRASH, results in:

do_aio_write: Invalid argument
do_aio_write: Invalid argument
failure: Invalid argument
failure: Invalid argument

AFAICT, this is happening because the size of the TLS and TCB is subtracted from
the stack size that the application requested (in the test case, aio_write
is trying to create a thread with a 16K stack, but if the user had requested a
thread with a small stack size, pthread_create would have failed just as well).

Since glibc is the only library that knows the TLS and TCB sizes, shouldn't
it *add* these sizes to the user-requested size, so the user gets exactly the
16K of stack they asked for, and not 16K minus "some value we won't
tell you"?
Comment 1 Paul Pluzhnikov 2010-07-02 23:03:01 UTC
Created attachment 4869 [details]
test case from pr10643
Comment 2 Paul Pluzhnikov 2012-02-24 04:34:48 UTC
Ping?

This hit us again today in a different context: Chrome uses 128K thread stacks, which is normally plenty.

But when instrumented for profiling, the counters resided in TLS and consumed most of the 128K stack, resulting in a crash that was hard to diagnose.

We then had to change the source before instrumented Chrome could run.

It would be nice if glibc just did the right thing, getting rid of the crash and the need to modify sources.
Comment 3 asharif.tools 2012-03-22 19:39:25 UTC
Created attachment 6297 [details]
Proposed fix.

It increases the stack size by the TLS size. The test case passes, as does Chrome built with -fprofile-generate.
Comment 4 Carlos O'Donell 2012-03-23 20:34:37 UTC
I've reviewed the patch sent to the mailing list and provided comments:
http://sourceware.org/ml/libc-alpha/2012-03/msg01002.html

I just noticed that what I recommended shouldn't be required since the code should already take this into account.

Nobody has given any hard numbers about the static TLS size, so I'm marking this issue as unconfirmed.

Please provide some real-world figures so we know how bad the problem is compared to the default stack size for x86, e.g. 2MB:
sysdeps/i386/pthreaddef.h:#define ARCH_STACK_DEFAULT_SIZE       (2 * 1024 * 1024)

For example, code in nptl-init.c tries to take things into account:
~~~ nptl/nptl-init.c:
414   /* Determine the default allowed stack size.  This is the size used
415      in case the user does not specify one.  */
416   struct rlimit limit;
417   if (getrlimit (RLIMIT_STACK, &limit) != 0
418       || limit.rlim_cur == RLIM_INFINITY)
419     /* The system limit is not usable.  Use an architecture-specific
420        default.  */
421     limit.rlim_cur = ARCH_STACK_DEFAULT_SIZE;
422   else if (limit.rlim_cur < PTHREAD_STACK_MIN)
423     /* The system limit is unusably small.
424        Use the minimal size acceptable.  */
425     limit.rlim_cur = PTHREAD_STACK_MIN;
426
427   /* Make sure it meets the minimum size that allocate_stack
428      (allocatestack.c) will demand, which depends on the page size.  */
429   const uintptr_t pagesz = GLRO(dl_pagesize);
430   const size_t minstack = pagesz + __static_tls_size + MINIMAL_REST_STACK;
431   if (limit.rlim_cur < minstack)
432     limit.rlim_cur = minstack;
433
434   /* Round the resource limit up to page size.  */
435   limit.rlim_cur = (limit.rlim_cur + pagesz - 1) & -pagesz;
436   __default_stacksize = limit.rlim_cur;
~~~

Note that we check to see that __default_stacksize takes into account __static_tls_size and a minimal rest size (2K on x86).

Is the problem that the computed minimum is *still* not enough?
Comment 5 Carlos O'Donell 2012-03-23 20:37:19 UTC
If the minimum values computed by the code are not enough, then the only real solution is to fix the programs with large per-thread memory requirements, e.g. set RLIMIT_STACK to a reasonable value.

We do not want to penalize all of the other programs that don't need the extra stack space.
Comment 6 Paul Pluzhnikov 2012-03-23 21:32:30 UTC
(In reply to comment #5)

Carlos,

I don't believe you've understood the problem.

Default stack sizes are fine.

But aio_write creates a small (16K) stack, and chrome creates 128K stacks.

Normally this is also just fine; and all works.

But then the application creates larger-than-usual TLS (either by allocating
4096 thread-local ints, as in the test case here, or by instrumenting for
profiling), and suddenly things start crashing in a hard-to-diagnose fashion.

> We do not want to penalize all of the other programs that don't need the extra
> stack space.

You aren't penalizing them much if they aren't using TLS, and if they are
using large TLS, then you are making them work instead of crashing.

From "man pthread_attr_setstacksize":


    The pthread_attr_setstacksize() function sets the stack size attribute of
    the thread attributes object referred to by attr to the value specified
    in stacksize.

It doesn't say "to the value specified in stacksize minus the size of TLS".

The fact that GLIBC chops off space for TLS from the top of the stack is an
implementation detail, and (IMHO) should *not* reduce the stack size the
application actually gets!
Comment 7 asharif.tools 2012-03-23 22:06:50 UTC
I can provide some information about Chrome built with -fprofile-generate:

p __static_tls_size
$1 = 114880

Now let's look at how Chrome allocates thread stacks:

http://code.google.com/p/chromium/source/search?q=kShutdownDetectorThreadStackSize&origq=kShutdownDetectorThreadStackSize&btnG=Search+Trunk

http://code.google.com/p/chromium/source/search?q=kWorkerThreadStackSize&origq=kWorkerThreadStackSize&btnG=Search+Trunk

are two examples. There are other stack sizes too. From what I have seen, if
the stack size is less than __static_tls_size, it just fails to allocate the
stack. If it is something like 128k, it gets allocated, but the thread soon
runs out of stack into the guard page, where it causes a segfault in a random
function (as it allocates locals for the function).
Comment 8 Carlos O'Donell 2012-03-23 22:46:37 UTC
(In reply to comment #6)
> (In reply to comment #5)
> 
> Carlos,
> 
> I don't believe you've understood the problem.
> 
> Default stack sizes are fine.
> 
> But aio_write creates a small (16K) stack, and chrome creates 128K stacks.
> 
> Normally this is also just fine; and all works.
> 
> But then the application creates larger-than-usual TLS (either by allocating
> 4096 thread-local ints, as in the test case here, or by instrumenting for
> profiling), and suddenly things start crashing in a hard-to-diagnose fashion.

That does make the problem clearer.

Please note that the aio helper thread *should* be using a default 2MB stack on x86, not 16K, I don't see anywhere that sets the helper threads stack to 16K.

> > We do not want to penalize all of the other programs that don't need the extra
> > stack space.
> 
> You aren't penalizing them much if they aren't using TLS, and if they are
> using large TLS, then you are making them work instead of crashing.

You are also increasing the memory footprint of all applications that use TLS that worked fine before.

Before making any changes we need to hear from the distros (RH, SuSE, Debian, Gentoo, Ubuntu etc) about the kind of impact this might have e.g. a quick find / readelf -a / grep to determine on average the memory increase we are looking at.
 
> From "man pthread_attr_setstacksize":
> 
> 
>     The pthread_attr_setstacksize() function sets the stack size attribute of
>     the thread attributes object referred to by attr to the value specified
>     in stacksize.
> 
> It doesn't say "to the value specified in stacksize minus the size of TLS".
> 
> The fact that GLIBC chops off space for TLS from the top of stack is an
> implementation detail, and (IMHO) should *not* reduce the stack size
> application actually gets!

The "stack" is purposely ambiguous and once handed over to the implementation the implementation has complete control and can do what it wishes including use some of it for TLS which is what we do.

http://pubs.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_09.html#tag_02_09_08

Having said that, it is *bad* for us to crash in the *DEFAULT* case because TLS data takes up the entire default stack.

However, you are arguing for variable-sized thread stacks based on TLS data, which is a fine suggestion but needs serious consideration.

There are multiple cases here.

(a) App. provides memory (pthread_attr_setstackaddr, pthread_attr_setstack).

(b) App. requests minimum size (pthread_attr_setstacksize).

(c) App. makes no request, and RLIMIT_STACK is set and >= PTHREAD_STACK_MIN.

(d) App. makes no request, and RLIMIT_STACK is set and < PTHREAD_STACK_MIN.

(e) App. makes no request, and RLIMIT_STACK is 0/inf.

What do you suggest for each of these cases?

Are there any other cases I might have missed?

In the current implementation we do the following:

For (a) we use the user memory for everything we need.

For (b) we allocate the minimum and use that for everything we need.

For (c) we allocate the value of RLIMIT_STACK only if it's >= minstack, otherwise minstack (nptl-init.c) and use that for everything we need.

For (d) we allocate the value of PTHREAD_STACK_MIN only if it's >= minstack, otherwise minstack (nptl-init.c) and use that for everything we need.

For (e) we allocate the value of ARCH_STACK_DEFAULT_SIZE only if it's >= minstack, otherwise minstack (nptl-init.c) and use that for everything we need.

You appear to be suggesting the following:

For (a) the behaviour remains the same.

For (b) we adjust upward by the size of the static TLS data.

For (c) "

For (d) "

For (e) "

In which case the patch is probably to nptl-init.c to change the computation of the default size.
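
A minimal sketch of that adjustment, under the assumption that the requested
size is simply grown by the static TLS requirement and rounded to whole pages
(illustrative names only, not glibc internals or an actual patch):

~~~
#include <stddef.h>

/* Illustrative sketch only: for cases (b) through (e), grow the requested
   size by the static TLS requirement and round the result up to whole
   pages, much like nptl-init.c already does for the default size.  */
static size_t
adjusted_stack_size (size_t requested, size_t static_tls_size,
                     size_t pagesize)
{
  size_t size = requested + static_tls_size;
  return (size + pagesize - 1) & ~(pagesize - 1);
}
~~~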
Comment 9 Carlos O'Donell 2012-03-23 22:50:12 UTC
(In reply to comment #7)
> I can provide some information about Chrome built with -fprofile-generate:
> 
> p __static_tls_size
> $1 = 114880
> 
> Now let's look at how Chrome allocates thread stacks:
> 
> http://code.google.com/p/chromium/source/search?q=kShutdownDetectorThreadStackSize&origq=kShutdownDetectorThreadStackSize&btnG=Search+Trunk
> 
> http://code.google.com/p/chromium/source/search?q=kWorkerThreadStackSize&origq=kWorkerThreadStackSize&btnG=Search+Trunk
> 
> are two examples. There are other stack sizes too. From what I have seen, if
> the stack size is less than __static_tls_size, it just fails to allocate the
> stack. If it is something like 128k, it gets allocated, but the thread soon
> runs out of stack into the guard page, where it causes a segfault in a random
> function (as it allocates locals for the function).

That's brutal. You have a 128k stack with 114k of thread local data, a guard page, the thread descriptor, and you've barely got anything left.

Thank you for some real-world data, that helps put things into perspective.
Comment 10 Carlos O'Donell 2012-03-23 23:01:14 UTC
Paul et al.,

The next steps for this issue are:

(a) Rewrite the patch to adjust the default stack size in nptl/nptl-init.c to account for large TLS data as *additional* space.

(b) Find people from each of the major distributions to comment on this issue. Asking them the pointed question "Would you mind if the memory footprint of all applications that use TLS goes up by the size of the static TLS in order to provide assurances that your application will have some reasonable amount of thread stack not consumed by the TLS data?" I suggest: Jeff Law (Red Hat), Aurelien Jarno (Debian), Andreas Jaeger (SuSE), Ubuntu (??), Mike Frysinger (Gentoo), ... 

I'm here to guide you, but this kind of work needs to come from the ground up and from the people most interested in seeing this fixed. We need volunteers like you on the product to help fix these kinds of issues, but it does require some work to reach consensus :-)
Comment 11 Carlos O'Donell 2012-03-23 23:02:38 UTC
Joseph,

Any comments on this issue, specifically with respect to the POSIX meaning of "stack" and whether it should or should not include implementation-dependent details like where static TLS data is stored?
Comment 12 Paul Pluzhnikov 2012-03-24 02:21:06 UTC
(In reply to comment #8)

> Please note that the aio helper thread *should* be using a default 2MB stack on
> x86, not 16K, I don't see anywhere that sets the helper threads stack to 16K.

The helper thread stack size *is* set by __pthread_get_minstack.

The test case here no longer crashes using the current git trunk, but only
because the aio thread stack is now increased to accommodate static TLS:

  // nptl/nptl-init.c
  size_t
  __pthread_get_minstack (const pthread_attr_t *attr)
  {
    struct pthread_attr *iattr = (struct pthread_attr *) attr;

    return (GLRO(dl_pagesize) + __static_tls_size + PTHREAD_STACK_MIN
	    + iattr->guardsize);
  }

This was done here:

2011-12-22  Ulrich Drepper  <drepper@gmail.com>

       [BZ #13088]
       * sysdeps/unix/sysv/linux/timer_routines.c: Get minimum stack size
       through __pthread_get_minstack.
       * nptl-init.c (__pthread_initialize_minimal_internal): Get page size
       directly from _rtld_global_ro.
       (__pthread_get_minstack): New function.
       * pthreadP.h: Declare __pthread_get_minstack.
       * Versions (libpthread) [GLIBC_PRIVATE]: Add __pthread_get_minstack.

and is *precisely* the change we are asking for for the other threads.

I see that Rich Felker is in exact agreement with me:
http://sourceware.org/bugzilla/show_bug.cgi?id=13088#c1

> You are also increasing the memory footprint of all applications that use TLS
> that worked fine before.

It depends on what you call "memory footprint". Yes, we'll increase memory
we *reserve* for stacks (VM size). But we will not *use* any more memory
than what's already used (RSS).

I think the only application that would notice this is one that was almost
running out of VM space. An answer for such an app would be to decrease its
default stack sizes, use a smaller number of threads, or switch to 64 bits.

> Before making any changes we need to hear from the distros (RH, SuSE, Debian,
> Gentoo, Ubuntu etc) about the kind of impact this might have e.g. a quick find
> / readelf -a / grep to determine on average the memory increase we are looking
> at.

Would above still apply if it's just VM size we are talking about?
I don't see how readelf will reveal anything.

> There are multiple cases here.

> You appear to be suggesting the following:
> 
> For (a) the behaviour remains the same.
> 
> For (b) we adjust upward by the size of the static TLS data.
> 
> For (c) "
> 
> For (d) "
> 
> For (e) "

Yes, I believe that's what we are proposing.
Comment 13 Carlos O'Donell 2012-03-24 16:05:19 UTC
(In reply to comment #12)
> (In reply to comment #8)
> 
> > Please note that the aio helper thread *should* be using a default 2MB stack on
> > x86, not 16K, I don't see anywhere that sets the helper threads stack to 16K.
> 
> The helper thread stack size *is* set by __pthread_get_minstack.

Thanks, I just spotted that in nptl/sysdeps/unix/sysv/linux/aio_misc.h.

> This was done here:
> 
> 2011-12-22  Ulrich Drepper  <drepper@gmail.com>
> 
>        [BZ #13088]
>        * sysdeps/unix/sysv/linux/timer_routines.c: Get minimum stack size
>        through __pthread_get_minstack.
>        * nptl-init.c (__pthread_initialize_minimal_internal): Get page size
>        directly from _rtld_global_ro.
>        (__pthread_get_minstack): New function.
>        * pthreadP.h: Declare __pthread_get_minstack.
>        * Versions (libpthread) [GLIBC_PRIVATE]: Add __pthread_get_minstack.
> 
> and is *precisely* the change we are asking for for the other threads.

Thanks for pointing this out.

> I see that Rich Felker is in exact agreement with me:
> http://sourceware.org/bugzilla/show_bug.cgi?id=13088#c1
> 
> > You are also increasing the memory footprint of all applications that use TLS
> > that worked fine before.
> 
> It depends on what you call "memory footprint". Yes, we'll increase memory
> we *reserve* for stacks (VM size). But we will not *use* any more memory
> than what's already used (RSS).

Reservations still consume VM space and limit the maximum number of threads. Reservation is still a serious problem for kernel VM management and layout. We should not increase the reserved VM space without due consideration.

> I think the only application that would notice this is one that was almost
> running out of VM space. An answer for such an app would be to decrease its
> default stack sizes, use a smaller number of threads, or switch to 64 bits.

Not true. What about applications with *lots* of threads, each thread running computational kernels (little real stack usage), and lots of thread-local data? In those cases you could be doubling the reservation per thread and causing the application to be unable to spawn a larger number of threads.

> > Before making any changes we need to hear from the distros (RH, SuSE, Debian,
> > Gentoo, Ubuntu etc) about the kind of impact this might have e.g. a quick find
> > / readelf -a / grep to determine on average the memory increase we are looking
> > at.
> 
> Would above still apply if it's just VM size we are talking about?
> I don't see how readelf will reveal anything.

Yes. We need to get input from the distros. We should not make a change like this without talking to them.

You use readelf to find the TLS segment and see how much bigger the per-thread memory reservation will be and collect this data for as many applications as possible to see the expected real-world impact.

> > There are multiple cases here.
> 
> > You appear to be suggesting the following:
> > 
> > For (a) the behaviour remains the same.
> > 
> > For (b) we adjust upward by the size of the static TLS data.
> > 
> > For (c) "
> > 
> > For (d) "
> > 
> > For (e) "
> 
> Yes, I believe that's what we are proposing.

I'm glad to see that I understand the problem.

Cheers,
Carlos.
Comment 14 Mike Frysinger 2012-03-24 17:08:10 UTC
the increased size should only matter if the new size pushes past a page boundary right ?  and even then, it'd only be 4KiB per thread ?

i don't see that being a problem.
Comment 15 Paul Pluzhnikov 2012-03-24 17:39:41 UTC
(In reply to comment #14)
> the increased size should only matter if the new size pushes past a page
> boundary right ?  and even then, it'd only be 4KiB per thread ?

We really have to be careful when talking about "size".
The actual used memory (RSS) is unchanged; only the reserved
size (VM) is changing.

To re-state Carlos' example, consider a 32-bit application that creates
1000s of threads, and has large TLS (1000s of __thread variables).

Each thread is created with a 64K stack (just barely sufficient to
accommodate all of the TLS variables) and otherwise does not use much
stack (e.g. a non-recursive computation).

This application works fine currently.

Under proposed change, the actual stack usage (RSS) is unchanged,
but the reserved (VM) stack size for each thread nearly doubles
to 64K+__static_tls_size.

So the application would only be able to create half as many
threads as it used to.



I assert that this is a contrived example, and has a trivial fix:
reduce 64K to 4K (or whatever *actual* stack the threads use).

But I will go ahead and contact distribution maintainers,
as Carlos suggested in comment#10.
Comment 16 Carlos O'Donell 2012-03-24 18:29:24 UTC
(In reply to comment #15)
> But I will go ahead and contact distribution maintainers,
> as Carlos suggested in comment#10.

Paul,

Many thanks for spearheading the effort. I think the idea is a good enhancement, but we need to communicate with the distros about this class of change and get consensus. By getting input from the distros we might find out that there are other problems with our solution. I don't see any, but it doesn't hurt to be careful.

I suggest an email to libc-alpha summarizing the issue, current state of glibc trunk, and CC the glibc distro contacts and see what they have to say. After a comment period we'll be ready to implement a solution. If you feel like being thorough you can provide the reworked patch based on changes to nptl/nptl-init.c along with that summarizing email.

Thank you again for working through this process.
Comment 17 Carlos O'Donell 2012-03-24 20:12:51 UTC
(In reply to comment #14)
> the increased size should only matter if the new size pushes past a page
> boundary right ?  and even then, it'd only be 4KiB per thread ?
> 
> i don't see that being a problem.

No, the new size is going to be the full size of the PT_TLS segment, e.g. static .tdata/.tbss, which is normally allocated out of the thread stack.

Could you do a canvass of a Gentoo install to see how big those TLS segments are in executables?

e.g.
Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
...
  TLS            0x185194 0x00186194 0x00186194 0x00008 0x00040 R   0x4
...

The reservations grow by MemSiz.
Comment 18 Paul Pluzhnikov 2012-03-24 20:45:11 UTC
On my desktop (based on Ubuntu Lucid), in /usr/bin, there are 430 executables that link to libpthread.so, but only 4 of them have TLS:

lftp
luatex
mono
Xvfb

Of these, the largest TLS MemSize is in mono:

 readelf -l mono | grep -A1 TLS
  TLS            0x00000000002685b0 0x00000000008685b0 0x00000000008685b0
                 0x0000000000000000 0x0000000000000048  R      8

a whopping 72 bytes. The other three have 16-byte TLS.

This isn't the whole story, of course, as shared libraries contribute to the static TLS size as well.

In /lib, the following libraries have a TLS segment:

libc-2.11.1.so
libcom_err.so.2.1
libmemusage.so
libselinux.so.1
libuuid.so.1.3.0

The largest one is in libselinux.so.1 at 160 bytes:

  TLS            0x000000000001bd50 0x000000000021bd50 0x000000000021bd50
                 0x0000000000000000 0x00000000000000a0  R      8

followed by glibc itself, at 104 bytes:

  TLS            0x0000000000179738 0x0000000000379738 0x0000000000379738
                 0x0000000000000010 0x0000000000000068  R      8

In /usr/lib, the following have TLS:

libcap-ng.so.0.0.0
libdw-0.143.so
libelf-0.143.so
libexempi.so.3.2.1
libgomp.so.1.0.0
libpulsecore-0.9.21.so
libstdc++.so.6.0.13

The largest one is libcap-ng.so.0.0.0 at 48 bytes:

  TLS            0x0000000000003db4 0x0000000000203db4 0x0000000000203db4
                 0x0000000000000030 0x0000000000000030  R      4


I rest my case ;-)
Comment 19 Carlos O'Donell 2012-03-24 20:56:57 UTC
(In reply to comment #18)
> On my desktop (based on Ubuntu Lucid), in /usr/bin, there are 430 executables
> that link to libpthread.so, but only 4 of them have TLS:
[At worst 160 bytes] 
> I rest my case ;-)

This is excellent data and is the kind of information you need to make the case for this kind of change. Gathering more data from the other distro installations makes it clearer that the change won't severely impact VM reservation sizes.

I still want to hear from our contacts with the distros.
Comment 20 Mike Frysinger 2012-03-25 05:28:43 UTC
on my Gentoo/x86_64 system with about 2000 packages installed that covers quite a range of categories of software, here's my stats (using scanelf to count non-symlinked ELFs in various dirs):

$ scanelf -yBRF '%F' /bin/ /sbin/ /usr/bin/ /usr/sbin/ | wc -l
3596
$ scanelf -yBRF '%F' {/usr,}/lib64/ | grep -v -e '\.ko$' -e '/debug/' | wc -l
7806

of those ELFs, 297 have TLS regions.  although i've got quite a number of cross-compilers installed, so if i filter out all the gcc libraries (since they're more or less just double counting), i get 85 ELFs.

of those, all but 2 are easily under 0x100.  the only ones over are libpixman, which is 0x180, and then x11vnc/libvncserver, which is 0x28a0.  in this last case, it's almost entirely made up of a structure declaring the vnc color palette.  and i don't think libvncserver really needs many threads ... it seems to launch short-lived threads on the fly for processing input/output requests.  i doubt it'll be a problem for these at all.
Comment 21 jsm-csl@polyomino.org.uk 2012-03-26 11:02:14 UTC
I think it will least surprise users if "stack" means the stack actually 
available to the thread, not including implementation details such as TLS, 
so they can work with tools that determine the amount of stack space used 
by individual functions to work out how large a stack may be needed.
Comment 22 Carlos O'Donell 2012-03-26 14:28:05 UTC
(In reply to comment #21)
> I think it will least surprise users if "stack" means the stack actually 
> available to the thread, not including implementation details such as TLS, 
> so they can work with tools that determine the amount of stack space used 
> by individual functions to work out how large a stack may be needed.

That sounds like a reasonable position to take. Do we know if other OSs interpret POSIX `stack' in this way?
Comment 23 Carlos O'Donell 2012-03-27 19:08:13 UTC
I'm assigning this issue to myself since it seems that nptl/allocatestack.c and nptl/nptl-init.c need cleanup regarding the definition of stack.

The fact that POSIX has a separate set of functions for getting and setting the guard size indicates that the guard is *not* considered a part of the stack. Not to mention that the POSIX wording "the implementation allocates extra memory" means that the current GLIBC implementation isn't correct (but one or two pages doesn't help here). The guard size should be removed from the stack computation.

Given that the guard size shouldn't be part of the stack size, there is even more doubt about the current practice of placing the static TLS and the pthread descriptor into the stack.

Yet another corner case is that using PTHREAD_STACK_MIN as a minimal stack size should get you a thread that starts, but on a system where a variable-sized TLS data region is included you are not guaranteed this at all. In fact, using PTHREAD_STACK_MIN as a size might or might not work, which is in my opinion another failure to meet the principle of "least surprise."

I'm asking for some interpretation help from the Austin Group regarding the meaning of stack.
Comment 24 law 2012-03-30 16:19:32 UTC
I think the real question here isn't the change in the size of the stack but the increase in the # of pages and potential wastage.

If the TLS is small won't we end up burning an entire page for this small amount of data?  For apps that use small stacks and lots of threads, that extra page could turn out to be significant.

I'm not sure I really buy the argument that if a thread asks for 16k that they get exactly 16k usable.

However, obviously, if the TLS is taking up a significant portion of the thread's requested stack, then something needs to be done.

There's no perfect solution here.  I wonder if we could come up with a good heuristic based on the relative sizes of the TLS and thread requested stack.  If the TLS is sufficiently small relative to the size of the requested stack, then just fold the TLS into the requested stack like we've always done.  Otherwise, add the size of the TLS to the size of the requested stack (rounding to a page of course).  Harder to document and implement, but seems like it'd strike a better balance.

I don't know where the cutoff should be, 1%, 10%, 50%? Some experimentation may guide us.
Comment 25 Paul Pluzhnikov 2012-03-30 16:44:43 UTC
(In reply to comment #24)
> I think the real question here isn't the change in the size of the stack but
> the increase in the # of pages and potential wastage.
> 
> If the TLS is small won't we end up burning an entire page for this small
> amount of data?  For apps that use small stacks and lots of threads, that extra
> page could turn out to be significant.

Yes, as I said in comment#12, an application that may notice this is one
that uses lots of threads, and is *almost* running out of VM space.

> I'm not sure I really buy the argument that if a thread asks for 16k that they
> get exactly 16k usable.

Let's also make "malloc(16K)" return 16K or less of usable memory ;-)

> There's no perfect solution here.  I wonder if we could come up with a good
> heuristic based on the relative sizes of the TLS and thread requested stack. 

Great idea!

> If the TLS is sufficiently small relative to the size of the requested stack,
> then just fold the TLS into the requested stack like we've always done. 
> Otherwise, add the size of the TLS to the size of the requested stack (rounding
> to a page of course).  Harder to document and implement, but seems like it'd
> strike a better balance.
> 
> I don't know where the cutoff should be, 1%, 10%, 50%? Some experimentation may
> guide us.

Given the data Mike Frysinger and I collected (most binaries using <512
bytes of TLS), I say:

  if stack_size_requested >= 16 * __static_tls_size
    use current calculation
  else
    increment size request by roundup(__static_tls_size, pagesize())

Most applications today request at least 64K stack (most actually default
to *much* more than that), which would allow up to 4K of static TLS.

But if that same application is instrumented for profiling and suddenly
requires 128K of TLS, it would still work.
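
An illustrative C version of the heuristic above (hypothetical names, not a
proposed patch): keep the current calculation when static TLS is a small
fraction of the request, otherwise grow the request by the page-rounded TLS
size.

~~~
#include <stddef.h>

/* Sketch of the 16x heuristic; pagesize is assumed to be a power of two.  */
static size_t
heuristic_stack_size (size_t requested, size_t static_tls_size,
                      size_t pagesize)
{
  if (requested >= 16 * static_tls_size)
    return requested;          /* current behaviour: TLS fits comfortably */
  size_t tls_rounded = (static_tls_size + pagesize - 1) & ~(pagesize - 1);
  return requested + tls_rounded;
}
~~~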
Comment 26 Carlos O'Donell 2012-04-01 19:46:32 UTC
(In reply to comment #25)
> There's no perfect solution here.  I wonder if we could come up with a good
> heuristic based on the relative sizes of the TLS and thread requested stack. 
> If the TLS is sufficiently small relative to the size of the requested stack,
> then just fold the TLS into the requested stack like we've always done. 
> Otherwise, add the size of the TLS to the size of the requested stack (rounding
> to a page of course).  Harder to document and implement, but seems like it'd
> strike a better balance.
> 
> I don't know where the cutoff should be, 1%, 10%, 50%? Some experimentation may
> guide us.

Jeff,

Thank you for your feedback. I'll consider this as feedback from Red Hat :-)

I will work on a fix for this issue.

I'm going to consider slightly more than just TLS size in the heuristic though.

In truth all of the following contribute to the implementation-defined space that is "stolen" from the stack:

(a) alignment requirements

(b) static tls

(c) guard page

(d) pthread descriptor

(e) minimal rest stack

(f) coloring increment

(g) TLS TCB/DTV size

On top of this I have "stack grows up" patches from HPPA that need applying to pthread_create.c and allocatestack.c which I'll work into the fixup.

I'll post incremental patches to libc-alpha showing each step of the cleanup.

I also noticed there are some weird pieces of code like:

      /* Adjust the stack size for alignment.  */
      size &= ~__static_tls_align_m1;
      assert (size != 0);

Which makes *no* sense to me. It should make size larger and avoid the assert (this is the case where we allocate our own stack).

Which brings me to yet another issue I'm going to fix *before* I start work on this issue: We need generic macros for aligning up and aligning down correctly.
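
Something along these lines, a sketch of the kind of generic helpers meant
here (assuming power-of-two alignments; not the eventual glibc macros):

~~~
/* Align down/up to a power-of-two alignment; illustrative only.  */
#define ALIGN_DOWN(x, align) ((x) & ~((align) - 1))
#define ALIGN_UP(x, align)   ALIGN_DOWN ((x) + (align) - 1, (align))
~~~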
Comment 27 asharif.tools 2012-04-24 22:02:55 UTC
Created attachment 6363 [details]
Proposed fix that only adds to the stack size if it's less than 16 times __static_tls_size

This patch checks if the stack size is less than 16 times the __static_tls_size, and adds to it in that case.

Carlos, I know you're working on this for trunk, but we'd like a fix for ChromeOS.

What are your thoughts about this patch, Paul/Carlos/Mike?
Comment 28 Carlos O'Donell 2012-04-24 22:35:39 UTC
(In reply to comment #27)
> Created attachment 6363 [details]
> Proposed fix that only adds to the stack size if it's less than 16 times
> __static_tls_size
> 
> This patch checks if the stack size is less than 16 times the
> __static_tls_size, and adds to it in that case.
> 
> Carlos, I know you're working on this for trunk, but we'd like a fix for
> ChromeOS.
> 
> What are your thoughts about this patch, Paul/Carlos/Mike?

Your proposed patch seems like a fine temporary solution for the ChromeOS threads.

There is no ABI here so you can feel free to patch locally and use that to fix the problem.

I would like to more carefully evaluate the heuristic and the point at which we go from old to new behaviour and make that deterministic instead of dependent on the size of a random variable (in the statistical sense).

To be clear I would not accept your patch for trunk.
Comment 29 Mike Frysinger 2012-04-25 02:54:54 UTC
maybe i'm reading it wrong, but if your goal is to round up to the next page, wouldn't you want to add __static_tls_size to size and then do the round up ?

size = roundup(size + __static_tls_size, __getpagesize());
Comment 30 asharif.tools 2012-04-25 18:24:01 UTC
Perhaps I should round it up to the next page size after doing the increment. In #25 Paul suggested I do the roundup for the increment before adding.

Or perhaps the safest is to round up both the TLS size and the total after the add.
Comment 31 Carlos O'Donell 2012-04-25 20:09:56 UTC
(In reply to comment #30)
> Perhaps I should round it up to the next page size after doing the increment.
> In #25 Paul suggested I do the roundup for the increment before adding.
> 
> Or perhaps the safest is to do the roundup of the TLS size and after the add.

You need to have a reason for doing the rounding; otherwise you are wasting space. What use the space will be put to, and what alignment that use requires, should guide your decision.

The size is going to be aligned up to __static_tls_align_m1 in the next statement, which is the larger of the stack alignment and the maximum alignment of any given TLS slot (slot minimum alignment is generally 16 bytes).

If you are going to use the extra space for stack, then you're fine; the alignment in the next line ensures that you get the right size given the alignment restrictions.
Comment 32 Siddhesh Poyarekar 2012-05-22 07:05:05 UTC
Paul pointed me to this bz when I submitted a patch to fix bug 6973 and bug 6976 today:

http://sourceware.org/ml/libc-alpha/2012-05/msg01544.html

If we're looking to fix all stack size and guard related issues in this bz, does it make sense to mark the other two bzs as duplicates of this one? I understand Carlos is coming up with a much more comprehensive fix for this.
Comment 33 Carlos O'Donell 2012-05-23 16:41:15 UTC
(In reply to comment #32)
> Paul pointed me to this bz when I submitted a patch to fix bug 6973 and bug
> 6976 today:
> 
> http://sourceware.org/ml/libc-alpha/2012-05/msg01544.html
> 
> If we're looking to fix all stack size and guard related issues in this bz,
> does it make sense to mark the other two bzs as duplicates of this one? I
> understand Carlos is coming up with a much more comprehensive fix for this.

Yes, please mark them as duplicates of this BZ.
Comment 34 Siddhesh Poyarekar 2012-05-23 17:07:56 UTC
*** Bug 6973 has been marked as a duplicate of this bug. ***
Comment 35 Siddhesh Poyarekar 2012-05-23 17:08:07 UTC
*** Bug 6976 has been marked as a duplicate of this bug. ***
Comment 36 Paul Pluzhnikov 2013-04-26 18:06:47 UTC
Carlos, any update on this?

We've just run into this problem again: java loading JNI code instrumented for profiling created thread stack blocks that were too small.

It would be *really* nice for glibc to just do the right thing and account for __static_tls without users having to explicitly work around it (by adding __static_tls to their requested stack size).
Comment 37 Rich Felker 2013-04-27 02:02:42 UTC
For what it's worth, the strategy we're using in musl seems to work well:

1. For implementation-allocated stacks, the requested size is always available to the application. TLS and guard size are added onto that for the allocation.

2. For application-provided stacks, if the size of TLS is greater than 2k or greater than 1/8 of the provided stack space, additional space for the TCB/TLS/etc. is allocated and the provided stack is used only for actual stack. This ensures that application expectations are met: automatic variables in the thread have addresses in the specified range, and the amount of stack space available is "close enough" to what the application requested that, if it overflows, it's reasonable to say it's the application's fault for not leaving a better margin of error.

I'm not sure how easy it would be to make glibc/NPTL use separate regions for the stack and TLS though...
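
A sketch of the musl-style decision rule described above (thresholds taken
from this comment, not from glibc or musl source; names are illustrative):

~~~
#include <stddef.h>

/* Static TLS goes into the caller-provided stack only when it is small
   relative to that stack; otherwise TLS/TCB get their own allocation.  */
static int
tls_fits_in_user_stack (size_t user_stack_size, size_t tls_size)
{
  return tls_size <= 2048 && tls_size <= user_stack_size / 8;
}
~~~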
Comment 38 Carlos O'Donell 2013-05-22 18:40:25 UTC
(In reply to comment #36)
> Carlos, any update on this?
> 
> We've just run into this problem again: java loading JNI code instrumented for
> profiling created thread stack blocks that were too small.
> 
> It would be *really* nice for glibc to just do the right thing and account for
> __static_tls without users having to explicitly work around it (by adding
> __static_tls to their requested stack size).

I agree. No update on this yet. I'll see what I can do in the next week or so.
Comment 39 David Abdurachmanov 2013-11-23 09:20:11 UTC
The same test case fails on glibc 2.5 (RHEL 5.10), but that is expected. I have an application which dlopens plugins. It happens that dlopen fails with new plugins: "cannot allocate memory in static TLS block".

I found some big TLS segments, e.g., 1296, 498, and 3068448 bytes. It happens that the plugins consume 3+MB of TLS.

Looking at the glibc 2.5 code, the TLS size seems to be 1664+ bytes. I used the private symbol _dl_get_tls_static_info to get some information about TLS from the plugin manager, and it always returned 3760 as the TLS size. I did not manage to find a way to get _dl_tls_static_used.

I do assume "cannot allocate memory in static TLS block" is related to the huge TLS segment size. It most likely doesn't fit into the TLS space. Is there a quick way to confirm this, or do I need to debug the dynamic loader?

glibc Release/2.18 wiki: https://sourceware.org/glibc/wiki/Release/2.18
In "1.2. Desirable this release?":

BZ#11787 - Programs with large TLS segment fail (Carlos)
A workaround here is going to be to use Siddhesh's new LIBC_PTHREAD_DEFAULT_STACK_SIZE env var to bump up default stack sizes.

The fix is not in 2.18, as I understand it; any progress for 2.19?
Comment 40 Carlos O'Donell 2013-11-24 15:58:42 UTC
(In reply to David Abdurachmanov from comment #39)
> I do assume "cannot allocate memory in static TLS block" is related to the huge
> TLS segment size. It most likely doesn't fit into the TLS space. Is there a
> quick way to confirm this, or do I need to debug the dynamic loader?
> 
> The fix is not in 2.18 as I understood, any progress for 2.19?

The problem you are seeing can't be fixed. The error "cannot allocate memory in static TLS block" is not the same problem as this bug. Your error is that you have shared objects compiled with static TLS that *require* static TLS space. The runtime only has a limited amount of static TLS space, and it is not possible for the runtime to pre-allocate static TLS space for dynamically loaded modules because it doesn't know what you will dlopen. The only fix you have is to find the offending module and ask the author to recompile it without static TLS. Please file another bug to investigate your issue if you think that you have shared objects that were *not* compiled with static TLS and yet still see this problem.
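
For illustration, this is the kind of (hypothetical) plugin code that ends up
requiring static TLS space at dlopen time:

~~~
/* Hypothetical plugin source, for illustration only.  The initial-exec
   TLS model (e.g. gcc -fPIC -shared -ftls-model=initial-exec, or the
   attribute below) forces this variable into the static TLS block, so
   dlopen of such an object can fail with "cannot allocate memory in
   static TLS block" once the surplus static TLS space is exhausted.  */
static __thread long counter __attribute__ ((tls_model ("initial-exec")));

long
plugin_bump (void)
{
  return ++counter;
}
~~~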
Comment 41 Jackie Rosen 2014-02-16 19:42:06 UTC Comment hidden (spam)
Comment 42 Anders Kaseorg 2015-03-23 02:50:38 UTC
FYI, the Rust compiler is linking every threaded Rust application against the glibc private symbol __pthread_get_minstack in order to work around this bug:

https://github.com/rust-lang/rust/issues/6233
https://github.com/rust-lang/rust/pull/11284
https://github.com/rust-lang/rust/pull/11331
https://github.com/rust-lang/rust/pull/11885
https://github.com/rust-lang/rust/issues/23628

I’m sure you’ll agree that it would be nice if we could avoid making this a permanent arrangement.
Comment 43 Steven Schlansker 2015-12-16 19:40:36 UTC
The JVM is having trouble too, see the following core-libs-dev discussion:

http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/037557.html

If you link the JVM as a library and use thread local storage, it is very hard for JVM internal threads to ensure that their stack is usable.

From the Rust discussion, they link that the Go runtime had the exact same problem.

How many languages are going to have to work around this broken behavior before there is a glibc fix? :(
Comment 44 Carlos O'Donell 2015-12-16 20:22:41 UTC
(In reply to Steven Schlansker from comment #43)
> The JVM is having trouble too, see the following core-libs-dev discussion:
> 
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/037557.
> html
> 
> If you link the JVM as a library and use thread local storage, it is very
> hard for JVM internal threads to ensure that their stack is usable.
> 
> From the Rust discussion, they link that the Go runtime had the exact same
> problem.
> 
> How many languages are going to have to work around this broken behavior
> before there is a glibc fix? :(

I have not been working on a fix for this because of other priorities. Sorry about that. We need to see submissions like this to raise the priority of a given issue beyond just that of chrome having problems in certain scenarios.

Seeing that Rust and Java are both having problems means we should do something about this. It has been 5 years since we discovered the issue, but like all things it's not trivial and we need to rewrite the NPTL stack layout code.

I'm moving this back to NEW for now since I'm not going to work on it, but I'll see what we can do internally at Red Hat to prioritize this.

As always "Patches welcome", but I understand that the code in question that needs to be changed is very very hairy, and requires a senior community developer to make or review the changes.

Thanks again for raising the visibility.
Comment 45 David Goldblatt 2017-09-01 16:02:15 UTC
Two more data points: This also causes trouble for the Ruby runtime when configured to link against jemalloc. The full details are in https://github.com/jemalloc/jemalloc/issues/1006 . We've also hit this in some vendor-supplied code at Facebook.
Comment 46 Josh Stone 2018-01-29 23:27:02 UTC
Following on comment 42, we should note that Anders then switched Rust to use a dlsym lookup instead of weak linkage:
https://github.com/rust-lang/rust/pull/23631

That went in before Rust 1.0.0-beta.  The fallback is to just use PTHREAD_STACK_MIN, so I think it will work fine when the symbol is removed.
(assuming TLS is also moved elsewhere, I guess)
Comment 47 Florian Weimer 2018-04-05 06:57:56 UTC
*** Bug 23018 has been marked as a duplicate of this bug. ***
Comment 48 Wang Liushuai 2020-02-24 02:39:24 UTC
Hi guys,

In ByteDance, we hit this Glibc bug twice in real-world production and adopted the fix based on the OpenJDK solution. However, these temporary fixes are not extensible. In the end, it needs to be fixed on the Glibc side.

Cheers,
Wang Liushuai.
Comment 49 Carlos O'Donell 2020-02-26 02:09:50 UTC
(In reply to Wang Liushuai from comment #48)
> Hi guys,
> 
> In ByteDance, we hit this Glibc bug twice in real-world production and
> adopted the fix based on the OpenJDK solution. However, these temporary
> fixes are not extensible. In the end, it needs to be fixed on the Glibc side.

Thanks for that data point. It helps me prioritize this.
Comment 50 Fangrui Song 2022-11-06 20:12:35 UTC
The PTHREAD_STACK_MIN/RLIMIT_STACK/minstack computation is now in sysdeps/nptl/pthread_early_init.h. I think there has not been much change since 2012.


I agree that the current behavior is not ideal. musl has a heuristic to address the problem (https://git.musl-libc.org/cgit/musl/commit/?id=d5142642b8e6c45449158efdb8f8e87af4dafde8):

> the new strategy pthread_create now uses is to only put TLS on the
> application-provided stack if TLS is smaller than 1/8 of the stack
> size or 2k, whichever is smaller. this ensures that the application
> always has "close enough" to what it requested, and the threshold is
> chosen heuristically to make sure "sane" amounts of TLS still end up
> in the application-provided stack.
> 
> if TLS does not fit the above criteria, pthread_create uses mmap to
> obtain space for TLS, but still uses the application-provided stack
> for actual call frame stack. this is to avoid wasting memory, and for
> the sake of supporting ugly hacks like garbage collection based on
> assumptions that the implementation will use the provided stack range.

Here is a one-liner for people to investigate large p_memsz(PT_TLS)

   for i in /usr/bin/*(.) /usr/lib/x86_64-linux-gnu/*; do readelf -Wl $i | awk '$1=="TLS" && strtonum($6) > 64 {printf "%d\t", strtonum($6); exit(1)}' || echo $i; done

(Adjust paths and `> 64` as needed.) 

It seems that Chrome has a smaller PT_TLS now. On my machine, these DSOs have a large p_memsz(PT_TLS):

libtsan.so.2.0.0 has a 785760 byte p_memsz(PT_TLS) due to a large vector clock (TSAN runtime v3 has a much smaller one).
liblsan.so.0.0.0 (56240)
libglog.so.0.6.0 (30457)
libmetis.so.5.1.0 (28920)
...