Summary: | Programs with large TLS segment fail. | ||
---|---|---|---|
Product: | glibc | Reporter: | Paul Pluzhnikov <ppluzhnikov> |
Component: | dynamic-link | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | normal | CC: | a3at.mail, andersk, aph, asharif.tools, bas, bugdal, carlos, david.abdurachmanov, davidtgoldblatt, devurandom, fweimer, glibc-bugs, hjl.tools, i, ian, jistone, john.wellesz, jsm-csl, kouvel, law, matt, mtk.manpages, sam, siddhesh, stevenschlansker, toolchain, vapier, wangliushuai |
Priority: | P2 | Flags: | carlos_odonell: review+, fweimer: security- |
Version: | 2.12 | ||
Target Milestone: | --- | ||
Host: | x86_64-unknown-linux-gnu | Target: | x86_64-unknown-linux-gnu |
Build: | x86_64-unknown-linux-gnu | Last reconfirmed: | |
Bug Depends on: | |||
Bug Blocks: | 22636 | ||
Attachments: |
test case from pr10643
Proposed fix.
Proposed fix that only adds to the stack size if it's less than 16 times __static_tls_size
Description
Paul Pluzhnikov
2010-07-02 23:02:12 UTC
Created attachment 4869 [details]
test case from pr10643

Ping? This hit us again today in a different context: Chrome uses 128K thread stacks, which is normally plenty. But when instrumented for profiling, the counters reside in TLS and consumed most of the 128K stack, resulting in a crash that was hard to diagnose. We then had to change the source before instrumented Chrome could run.

It would be nice if glibc just did the right thing, getting rid of the crash and the need to modify sources.

Created attachment 6297 [details]
Proposed fix.

It increases the stack size by the TLS size. The test case passes, as does Chrome with -fprofile-generate.
I've reviewed the patch sent to the mailing list and provided comments: http://sourceware.org/ml/libc-alpha/2012-03/msg01002.html

I just noticed that what I recommended shouldn't be required, since the code should already take this into account. Nobody has given any hard numbers about the static TLS size, so I'm marking this issue as unconfirmed. Please provide some real-world figures so we know how bad the problem is compared to the default stack size for x86, e.g. 2MB:

~~~
sysdeps/i386/pthreaddef.h:#define ARCH_STACK_DEFAULT_SIZE (2 * 1024 * 1024)
~~~

For example, the code in nptl/nptl-init.c tries to take things into account:

~~~
/* Determine the default allowed stack size.  This is the size used
   in case the user does not specify one.  */
struct rlimit limit;
if (getrlimit (RLIMIT_STACK, &limit) != 0
    || limit.rlim_cur == RLIM_INFINITY)
  /* The system limit is not usable.  Use an architecture-specific
     default.  */
  limit.rlim_cur = ARCH_STACK_DEFAULT_SIZE;
else if (limit.rlim_cur < PTHREAD_STACK_MIN)
  /* The system limit is unusably small.
     Use the minimal size acceptable.  */
  limit.rlim_cur = PTHREAD_STACK_MIN;

/* Make sure it meets the minimum size that allocate_stack
   (allocatestack.c) will demand, which depends on the page size.  */
const uintptr_t pagesz = GLRO(dl_pagesize);
const size_t minstack = pagesz + __static_tls_size + MINIMAL_REST_STACK;
if (limit.rlim_cur < minstack)
  limit.rlim_cur = minstack;

/* Round the resource limit up to page size.  */
limit.rlim_cur = (limit.rlim_cur + pagesz - 1) & -pagesz;
__default_stacksize = limit.rlim_cur;
~~~

Note that we check that __default_stacksize takes into account __static_tls_size and a minimal rest size (2K on x86). Is the problem that the minimum computed is *still* not enough?
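For reference, the quoted logic can be lifted into a standalone helper. This is only a sketch with illustrative x86 constants (the real values come from pthreaddef.h and the runtime; the `_SK` suffix avoids colliding with the system's own `PTHREAD_STACK_MIN`):

```c
#include <stddef.h>

/* Illustrative stand-ins for the real glibc/arch constants.  */
#define ARCH_STACK_DEFAULT_SIZE_SK (2 * 1024 * 1024)
#define PTHREAD_STACK_MIN_SK       16384
#define MINIMAL_REST_STACK_SK      2048

/* Sketch of the default-stack-size computation quoted from
   nptl-init.c, parameterized so it can be exercised directly.  */
size_t
default_stacksize (size_t rlim_cur, int rlim_is_infinity,
                   size_t pagesz, size_t static_tls_size)
{
  if (rlim_is_infinity)
    rlim_cur = ARCH_STACK_DEFAULT_SIZE_SK;   /* system limit unusable */
  else if (rlim_cur < PTHREAD_STACK_MIN_SK)
    rlim_cur = PTHREAD_STACK_MIN_SK;         /* unusably small */

  /* Minimum that allocate_stack will demand.  */
  size_t minstack = pagesz + static_tls_size + MINIMAL_REST_STACK_SK;
  if (rlim_cur < minstack)
    rlim_cur = minstack;

  /* Round up to page size.  */
  return (rlim_cur + pagesz - 1) & ~(pagesz - 1);
}
```

With the 114880-byte static TLS reported later in this bug, even an 8K RLIMIT_STACK gets bumped to minstack (4096 + 114880 + 2048, page-rounded), which is why the *default* path survives while explicitly requested small stacks do not.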
If the minimal values computed by the code are not enough, then the only real solution is to fix the programs with large per-thread memory requirements, e.g. set RLIMIT_STACK to a reasonable value. We do not want to penalize all of the other programs that don't need the extra stack space.

(In reply to comment #5)

Carlos,

I don't believe you've understood the problem. Default stack sizes are fine. But aio_write creates a small (16K) stack, and Chrome creates 128K stacks. Normally this is also just fine, and all works. But then the application creates a larger-than-usual TLS (either by allocating 4096 thread-local ints as in the test case here, or by instrumenting for profiling), and suddenly things start crashing in hard-to-diagnose fashion.

> We do not want to penalize all of the other programs that don't need the extra
> stack space.

You aren't penalizing them much if they aren't using TLS, and if they are using large TLS, then you are making them work instead of crashing.

From "man pthread_attr_setstacksize":

~~~
The pthread_attr_setstacksize() function sets the stack size attribute of
the thread attributes object referred to by attr to the value specified
in stacksize.
~~~

It doesn't say "to the value specified in stacksize minus the size of TLS". The fact that glibc chops off space for TLS from the top of the stack is an implementation detail, and (IMHO) should *not* reduce the stack size the application actually gets!

I can provide some information about Chrome built with -fprofile-generate:

~~~
p __static_tls_size
$1 = 114880
~~~

Now let's look at how Chrome allocates thread stacks:

http://code.google.com/p/chromium/source/search?q=kShutdownDetectorThreadStackSize&origq=kShutdownDetectorThreadStackSize&btnG=Search+Trunk

http://code.google.com/p/chromium/source/search?q=kWorkerThreadStackSize&origq=kWorkerThreadStackSize&btnG=Search+Trunk

are two examples. There are other stack sizes too.
From what I have seen, if the stack size is less than __static_tls_size, it just fails to allocate the stack. If it is something like 128k, it gets allocated, but the thread soon runs out of stack into the guard page, where it causes a segfault in a random function (as it allocates locals for the function).

(In reply to comment #6)
> Carlos,
>
> I don't believe you've understood the problem.
>
> Default stack sizes are fine. But aio_write creates a small (16K) stack, and
> Chrome creates 128K stacks. Normally this is also just fine, and all works.
> But then the application creates a larger-than-usual TLS (either by allocating
> 4096 thread-local ints as in the test case here, or by instrumenting for
> profiling), and suddenly things start crashing in hard-to-diagnose fashion.

That does make the problem clearer. Please note that the aio helper thread *should* be using a default 2MB stack on x86, not 16K; I don't see anywhere that sets the helper thread's stack to 16K.

> > We do not want to penalize all of the other programs that don't need the extra
> > stack space.
>
> You aren't penalizing them much if they aren't using TLS, and if they are
> using large TLS, then you are making them work instead of crashing.

You are also increasing the memory footprint of all applications that use TLS that worked fine before. Before making any changes we need to hear from the distros (RH, SuSE, Debian, Gentoo, Ubuntu etc.) about the kind of impact this might have, e.g. a quick find / readelf -a / grep to determine on average the memory increase we are looking at.

> From "man pthread_attr_setstacksize":
>
>     The pthread_attr_setstacksize() function sets the stack size attribute of
>     the thread attributes object referred to by attr to the value specified
>     in stacksize.
>
> It doesn't say "to the value specified in stacksize minus the size of TLS".
> The fact that GLIBC chops off space for TLS from the top of stack is an
> implementation detail, and (IMHO) should *not* reduce the stack size
> application actually gets!

The "stack" is purposely ambiguous, and once handed over to the implementation, the implementation has complete control and can do what it wishes, including using some of it for TLS, which is what we do.

http://pubs.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_09.html#tag_02_09_08

Having said that, it is *bad* for us to crash in the *DEFAULT* case because TLS data takes up the entire default stack. However, you are arguing for variable-sized thread stacks based on TLS data, which is a fine suggestion but needs serious consideration.

There are multiple cases here:

(a) App. provides memory (pthread_attr_setstackaddr, pthread_attr_setstack).
(b) App. requests minimum size (pthread_attr_setstacksize).
(c) App. makes no request, and RLIMIT_STACK is set and >= PTHREAD_STACK_MIN.
(d) App. makes no request, and RLIMIT_STACK is set and < PTHREAD_STACK_MIN.
(e) App. makes no request, and RLIMIT_STACK is 0/inf.

What do you suggest for each of these cases? Are there any other cases I might have missed?

In the current implementation we do the following:

For (a) we use the user memory for everything we need.
For (b) we allocate the minimum and use that for everything we need.
For (c) we allocate the value of RLIMIT_STACK only if it's >= minstack, otherwise minstack (nptl-init.c), and use that for everything we need.
For (d) we allocate the value of PTHREAD_STACK_MIN only if it's >= minstack, otherwise minstack (nptl-init.c), and use that for everything we need.
For (e) we allocate the value of ARCH_STACK_DEFAULT_SIZE only if it's >= minstack, otherwise minstack (nptl-init.c), and use that for everything we need.

You appear to be suggesting the following:

For (a) the behaviour remains the same.
For (b) we adjust upward by the size of the static TLS data.
For (c) "
For (d) "
For (e) "

In which case the patch is probably to nptl-init.c, to change the computation of the default size.

(In reply to comment #7)
> I can provide some information about Chrome built with -fprofile-generate:
>
> p __static_tls_size
> $1 = 114880
>
> Now let's look at how Chrome allocates thread stacks:
>
> http://code.google.com/p/chromium/source/search?q=kShutdownDetectorThreadStackSize&origq=kShutdownDetectorThreadStackSize&btnG=Search+Trunk
>
> http://code.google.com/p/chromium/source/search?q=kWorkerThreadStackSize&origq=kWorkerThreadStackSize&btnG=Search+Trunk
>
> are two examples. There are other stack sizes too. From what I have seen, if
> the stack size is less than __static_tls_size, it just fails to allocate the
> stack. If it is something like 128k, it gets allocated, but soon runs out of
> stack into the guard page where it causes a segfault on a random function (as
> it allocating locals for the function).

That's brutal. You have a 128k stack with 114k of thread-local data, a guard page, the thread descriptor, and you've barely got anything left. Thank you for some real-world data; that helps put things into perspective.

Paul et al.,

The next steps for this issue are:

(a) Rewrite the patch to adjust the default stack size in nptl/nptl-init.c to account for large TLS data as *additional* space.

(b) Find people from each of the major distributions to comment on this issue, asking them the pointed question: "Would you mind if the memory footprint of all applications that use TLS goes up by the size of the static TLS, in order to provide assurances that your application will have some reasonable amount of thread stack not consumed by the TLS data?" I suggest: Jeff Law (Red Hat), Aurelien Jarno (Debian), Andreas Jaeger (SuSE), Ubuntu (??), Mike Frysinger (Gentoo), ...

I'm here to guide you, but this kind of work needs to come from the ground up and from the people most interested in seeing this fixed.
We need volunteers like you on the product to help fix these kinds of issues, but it does require some work to reach consensus :-)

Joseph,

Any comments on this issue, specifically with respect to the POSIX meaning of "stack" and whether it should or should not include implementation-dependent details like where static TLS data is stored?

(In reply to comment #8)
> Please note that the aio helper thread *should* be using a default 2MB stack on
> x86, not 16K, I don't see anywhere that sets the helper threads stack to 16K.

The helper thread stack size *is* set by __pthread_get_minstack. The test case here no longer crashes using the current git trunk, but only because the aio thread stack is now increased to accommodate static TLS:

~~~
// nptl/nptl-init.c
size_t
__pthread_get_minstack (const pthread_attr_t *attr)
{
  struct pthread_attr *iattr = (struct pthread_attr *) attr;

  return (GLRO(dl_pagesize) + __static_tls_size + PTHREAD_STACK_MIN
          + iattr->guardsize);
}
~~~

This was done here:

~~~
2011-12-22  Ulrich Drepper  <drepper@gmail.com>

	[BZ #13088]
	* sysdeps/unix/sysv/linux/timer_routines.c: Get minimum stack size
	through __pthread_get_minstack.
	* nptl-init.c (__pthread_initialize_minimal_internal): Get page size
	directly from _rtld_global_ro.
	(__pthread_get_minstack): New function.
	* pthreadP.h: Declare __pthread_get_minstack.
	* Versions (libpthread) [GLIBC_PRIVATE]: Add __pthread_get_minstack.
~~~

and is *precisely* the change we are asking for for the other threads. I see that Rich Felker is in exact agreement with me: http://sourceware.org/bugzilla/show_bug.cgi?id=13088#c1

> You are also increasing the memory footprint of all applications that use TLS
> that worked fine before.

It depends on what you call "memory footprint". Yes, we'll increase the memory we *reserve* for stacks (VM size). But we will not *use* any more memory than what's already used (RSS). I think the only application that would notice this is one that was almost running out of VM space.
An answer for such an app would be to decrease its default stack sizes, use a smaller number of threads, or switch to 64 bits.

> Before making any changes we need to hear from the distros (RH, SuSE, Debian,
> Gentoo, Ubuntu etc) about the kind of impact this might have e.g. a quick find
> / readelf -a / grep to determine on average the memory increase we are looking
> at.

Would the above still apply if it's just VM size we are talking about? I don't see how readelf will reveal anything.

> There are multiple cases here.
> You appear to be suggesting the following:
> For (a) the behaviour remains the same.
> For (b) we adjust upward by the size of the static TLS data.
> For (c) "
> For (d) "
> For (e) "

Yes, I believe that's what we are proposing.

(In reply to comment #12)
> > Please note that the aio helper thread *should* be using a default 2MB stack on
> > x86, not 16K, I don't see anywhere that sets the helper threads stack to 16K.
>
> The helper thread stack size *is* set by __pthread_get_minstack.

Thanks, I just spotted that in nptl/sysdeps/unix/sysv/linux/aio_misc.h.

> This was done here:
>
> 2011-12-22  Ulrich Drepper  <drepper@gmail.com>
>
> 	[BZ #13088]
> 	* sysdeps/unix/sysv/linux/timer_routines.c: Get minimum stack size
> 	through __pthread_get_minstack.
> 	* nptl-init.c (__pthread_initialize_minimal_internal): Get page size
> 	directly from _rtld_global_ro.
> 	(__pthread_get_minstack): New function.
> 	* pthreadP.h: Declare __pthread_get_minstack.
> 	* Versions (libpthread) [GLIBC_PRIVATE]: Add __pthread_get_minstack.
>
> and is *precisely* the change we are asking for for the other threads.

Thanks for pointing this out.

> I see that Rich Felker is in exact agreement with me:
> http://sourceware.org/bugzilla/show_bug.cgi?id=13088#c1
>
> > You are also increasing the memory footprint of all applications that use TLS
> > that worked fine before.
>
> It depends on what you call "memory footprint". Yes, we'll increase memory
> we *reserve* for stacks (VM size). But we will not *use* any more memory
> than what's already used (RSS).

Reservations still consume VM space and limit the maximum number of threads. Reservation is still a serious problem for kernel VM management and layout. We should not increase the reserved VM space without due consideration.

> I think the only application that would notice this is one that was almost
> running out of VM space. An answer for such app would be to decrease its
> default stack sizes, use smaller number of threads, or switch to 64 bits.

Not true. What about applications with *lots* of threads, each thread running computational kernels (little real stack usage), and lots of thread-local data? In those cases you could be doubling the reservation per thread and causing the application to be unable to spawn a larger number of threads.

> > Before making any changes we need to hear from the distros (RH, SuSE, Debian,
> > Gentoo, Ubuntu etc) about the kind of impact this might have e.g. a quick find
> > / readelf -a / grep to determine on average the memory increase we are looking
> > at.
>
> Would above still apply if it's just VM size we are talking about?
> I don't see how readelf will reveal anything.

Yes. We need to get input from the distros. We should not make a change like this without talking to them. You use readelf to find the TLS segment and see how much bigger the per-thread memory reservation will be, and collect this data for as many applications as possible to see the expected real-world impact.

> > There are multiple cases here.
> > You appear to be suggesting the following:
> > For (a) the behaviour remains the same.
> > For (b) we adjust upward by the size of the static TLS data.
> > For (c) "
> > For (d) "
> > For (e) "
>
> Yes, I believe that's what we are proposing.

I'm glad to see that I understand the problem.

Cheers,
Carlos.
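The VM-reservation concern can be made concrete with back-of-envelope arithmetic. The helper below is hypothetical, and the 3 GiB usable-address-space figure and per-thread sizes in the usage are illustrative for a 32-bit process, not measured values:

```c
#include <stddef.h>

/* Rough upper bound on thread count given a per-thread VM reservation.
   Ignores mmap fragmentation and other mappings; illustration only.  */
size_t
max_threads (size_t usable_vm, size_t stack_size, size_t static_tls,
             size_t guard, int tls_added_on_top)
{
  size_t per_thread = stack_size + guard;
  if (tls_added_on_top)
    per_thread += static_tls;   /* proposed behaviour: TLS on top */
  return usable_vm / per_thread;
}
```

For example, with 3 GiB usable, 64K stacks, a 4K guard, and ~60K of static TLS, adding the TLS on top roughly halves the achievable thread count, which is exactly the trade-off being debated here.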
the increased size should only matter if the new size pushes past a page boundary right ? and even then, it'd only be 4KiB per thread ?

i don't see that being a problem.

(In reply to comment #14)
> the increased size should only matter if the new size pushes past a page
> boundary right ? and even then, it'd only be 4KiB per thread ?

We really have to be careful when talking about "size". The actual used memory (RSS) is unchanged; only the reserved size (VM) is changing.

To re-state Carlos' example, consider a 32-bit application that creates 1000s of threads and has large TLS (1000s of __thread variables). Each thread is created with a 64K stack (just barely sufficient to accommodate all of the TLS variables) and otherwise does not use much stack (e.g. a non-recursive computation). This application works fine currently. Under the proposed change, the actual stack usage (RSS) is unchanged, but the reserved (VM) stack size for each thread nearly doubles to 64K+__static_tls_size. So the application would only be able to create half as many threads as it used to.

I assert that this is a contrived example, and it has a trivial fix: reduce 64K to 4K (or whatever *actual* stack the threads use). But I will go ahead and contact distribution maintainers, as Carlos suggested in comment#10.

(In reply to comment #15)
> But I will go ahead and contact distribution maintainers,
> as Carlos suggested in comment#10.

Paul,

Many thanks for spearheading the effort. I think the idea is a good enhancement, but we need to communicate with the distros about this class of change and get consensus. By getting input from the distros we might find out that there are other problems with our solution. I don't see any, but it doesn't hurt to be careful. I suggest an email to libc-alpha summarizing the issue and the current state of glibc trunk, CCing the glibc distro contacts to see what they have to say. After a comment period we'll be ready to implement a solution.
If you feel like being thorough you can provide the reworked patch, based on changes to nptl/nptl-init.c, along with that summarizing email. Thank you again for working through this process.

(In reply to comment #14)
> the increased size should only matter if the new size pushes past a page
> boundary right ? and even then, it'd only be 4KiB per thread ?
>
> i don't see that being a problem.

No, the new size is going to be the full size of the PT_TLS segment, e.g. static .tdata/.tbss, which is normally allocated out of the thread stack. Could you do a canvas of a Gentoo install to see how big those TLS segments are in executables? e.g.

~~~
Program Headers:
  Type  Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  ...
  TLS   0x185194 0x00186194 0x00186194 0x00008 0x00040 R   0x4
  ...
~~~

The reservations grow by MemSiz.

On my desktop (based on Ubuntu Lucid), in /usr/bin, there are 430 executables that link to libpthread.so, but only 4 of them have TLS:

lftp
luatex
mono
Xvfb

Of these, the largest TLS MemSiz is in mono:

~~~
readelf -l mono | grep -A1 TLS
  TLS  0x00000000002685b0 0x00000000008685b0 0x00000000008685b0
       0x0000000000000000 0x0000000000000048 R      8
~~~

a whopping 72 bytes. The other three have 16-byte TLS. This isn't of course the whole story, as shared libraries contribute to static TLS size as well.
In /lib, the following libraries have a TLS segment:

libc-2.11.1.so
libcom_err.so.2.1
libmemusage.so
libselinux.so.1
libuuid.so.1.3.0

The largest one is in libselinux.so.1, at 160 bytes:

~~~
TLS  0x000000000001bd50 0x000000000021bd50 0x000000000021bd50
     0x0000000000000000 0x00000000000000a0 R      8
~~~

followed by glibc itself, at 104 bytes:

~~~
TLS  0x0000000000179738 0x0000000000379738 0x0000000000379738
     0x0000000000000010 0x0000000000000068 R      8
~~~

In /usr/lib, the following have TLS:

libcap-ng.so.0.0.0
libdw-0.143.so
libelf-0.143.so
libexempi.so.3.2.1
libgomp.so.1.0.0
libpulsecore-0.9.21.so
libstdc++.so.6.0.13

The largest one is libcap-ng.so.0.0.0, at 48 bytes:

~~~
TLS  0x0000000000003db4 0x0000000000203db4 0x0000000000203db4
     0x0000000000000030 0x0000000000000030 R      4
~~~

I rest my case ;-)

(In reply to comment #18)
> On my desktop (based on Ubuntu Lucid), in /usr/bin, there are 430 executables
> that link to libpthread.so, but only 4 of them have TLS: [At worst 160 bytes]
> I rest my case ;-)

This is excellent data and is the kind of information you need to make the case for this kind of change. Gathering more data from the other distro installations makes it clearer that the change won't severely impact VM reservation sizes. I still want to hear from our contacts with the distros.

on my Gentoo/x86_64 system with about 2000 packages installed, covering quite a range of categories of software, here are my stats (using scanelf to count non-symlinked ELFs in various dirs):

~~~
$ scanelf -yBRF '%F' /bin/ /sbin/ /usr/bin/ /usr/sbin/ | wc -l
3596
$ scanelf -yBRF '%F' {/usr,}/lib64/ | grep -v -e '\.ko$' -e '/debug/' | wc -l
7806
~~~

of those ELFs, 297 have TLS regions, although i've got quite a number of cross-compilers installed, so if i filter out all the gcc libraries (since they're more or less just double counting), i get 85 ELFs. of those, all but 2 are easily under 0x100. the only ones over are libpixman, which is 0x180, and then x11vnc/libvncserver, which is 0x28a0.
in this last case, it's almost entirely made up of a structure declaring the vnc color palette, and i don't think libvncserver really needs many threads ... it seems to launch short-lived threads on the fly for processing input/output requests. i doubt it'll be a problem for these at all.

I think it will least surprise users if "stack" means the stack actually available to the thread, not including implementation details such as TLS, so they can work with tools that determine the amount of stack space used by individual functions to work out how large a stack may be needed.

(In reply to comment #21)
> I think it will least surprise users if "stack" means the stack actually
> available to the thread, not including implementation details such as TLS,
> so they can work with tools that determine the amount of stack space used
> by individual functions to work out how large a stack may be needed.

That sounds like a reasonable position to take. Do we know if other OSs interpret POSIX `stack' in this way?

I'm assigning this issue to myself, since it seems that nptl/allocatestack.c and nptl/nptl-init.c need cleanup regarding the definition of stack.

The fact that POSIX has a separate set of functions for getting and setting the guard size indicates that the guard is *not* considered a part of the stack. Not to mention that the POSIX wording "the implementation allocates extra memory" means that the current glibc implementation isn't correct (though one or two pages doesn't help here). The guard size should be removed from the stack computation. That the guard size shouldn't be part of the stack size casts even more doubt on the current practice of placing the static TLS and pthread descriptor into the stack.

Yet another corner case is that using PTHREAD_STACK_MIN as a minimal stack should get you a thread that starts, but under a system where a variable-sized TLS data region is included you are not guaranteed this at all.
In fact using PTHREAD_STACK_MIN as a size might or might not work which is in my opinion another failure to meet the principle of "least surprise." I'm asking for some interpretation help from the Austin Group regarding the meaning of stack. I think the real question here isn't the change in the size of the stack but the increase in the # of pages and potential wastage. If the TLS is small won't we end up burning an entire page for this small amount of data? For apps that use small stacks and lots of threads, that extra page could turn out to be significant. I'm not sure I really buy the argument that if a thread asks for 16k that they get exactly 16k usable. However, obviously, if the TLS is taking up a significant portion of the thread's requested stack, then something needs to be done. There's no perfect solution here. I wonder if we could come up with a good heuristic based on the relative sizes of the TLS and thread requested stack. If the TLS is sufficiently small relative to the size of the requested stack, then just fold the TLS into the requested stack like we've always done. Otherwise, add the size of the TLS to the size of the requested stack (rounding to a page of course). Harder to document and implement, but seems like it'd strike a better balance. I don't know where the cutoff should be, 1%, 10%, 50%? Some experimentation may guide us. (In reply to comment #24) > I think the real question here isn't the change in the size of the stack but > the increase in the # of pages and potential wastage. > > If the TLS is small won't we end up burning an entire page for this small > amount of data? For apps that use small stacks and lots of threads, that extra > page could turn out to be significant. Yes, as I said in comment#12, an application that may notice this is one that uses lots of threads, and is *almost* running out of VM space. > I'm not sure I really buy the argument that if a thread asks for 16k that they > get exactly 16k usable. 
Let's also make "malloc(16K)" return 16K or less of usable memory ;-)

> There's no perfect solution here. I wonder if we could come up with a good
> heuristic based on the relative sizes of the TLS and thread requested stack.

Great idea!

> If the TLS is sufficiently small relative to the size of the requested stack,
> then just fold the TLS into the requested stack like we've always done.
> Otherwise, add the size of the TLS to the size of the requested stack (rounding
> to a page of course). Harder to document and implement, but seems like it'd
> strike a better balance.
>
> I don't know where the cutoff should be, 1%, 10%, 50%? Some experimentation may
> guide us.

Given the data Mike Frysinger and I collected (most binaries using <512 bytes of TLS), I say:

~~~
if stack_size_requested >= 16 * __static_tls_size
  use current calculation
else
  increment size request by roundup(__static_tls_size, pagesize())
~~~

Most applications today request at least 64K of stack (most actually default to *much* more than that), which would allow up to 4K of static TLS. But if that same application is instrumented for profiling and suddenly requires 128K of TLS, it would still work.

(In reply to comment #25)
> There's no perfect solution here. I wonder if we could come up with a good
> heuristic based on the relative sizes of the TLS and thread requested stack.
> If the TLS is sufficiently small relative to the size of the requested stack,
> then just fold the TLS into the requested stack like we've always done.
> Otherwise, add the size of the TLS to the size of the requested stack (rounding
> to a page of course). Harder to document and implement, but seems like it'd
> strike a better balance.
>
> I don't know where the cutoff should be, 1%, 10%, 50%? Some experimentation may
> guide us.

Jeff,

Thank you for your feedback. I'll consider this as feedback from Red Hat :-) I will work on a fix for this issue. I'm going to consider slightly more than just TLS size in the heuristic though.
In truth, all of the following contribute to the implementation-defined space that is "stolen" from the stack:

(a) alignment requirements
(b) static TLS
(c) guard page
(d) pthread descriptor
(e) minimal rest stack
(f) coloring increment
(g) TLS TCB/DTV size

On top of this I have "stack grows up" patches from HPPA that need applying to pthread_create.c and allocatestack.c, which I'll work into the fixup. I'll post incremental patches to libc-alpha showing each step of the cleanup.

I also noticed there are some weird pieces of code like:

~~~
/* Adjust the stack size for alignment.  */
size &= ~__static_tls_align_m1;
assert (size != 0);
~~~

which makes *no* sense to me. It should make size larger and avoid the assert (this is the case where we allocate our own stack). Which brings me to yet another issue I'm going to fix *before* I start work on this one: we need generic macros for aligning up and aligning down correctly.

Created attachment 6363 [details]
Proposed fix that only adds to the stack size if it's less than 16 times __static_tls_size
This patch checks if the stack size is less than 16 times the __static_tls_size, and adds to it in that case.
Carlos, I know you're working on this for trunk, but we'd like a fix for ChromeOS.
What are your thoughts about this patch, Paul/Carlos/Mike?
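As I read the patch description above, the heuristic amounts to the following sketch. The function name, parameter order, and the placement of the page round-up are my own guesses for illustration, not the literal patch:

```c
#include <stddef.h>

/* Sketch of the attachment-6363 heuristic: if the requested stack is
   at least 16x the static TLS size, keep the old behaviour (TLS is
   folded into the request); otherwise grow the request by the TLS
   size rounded up to a whole page.  */
size_t
adjusted_stacksize (size_t requested, size_t static_tls, size_t pagesz)
{
  if (requested >= 16 * static_tls)
    return requested;                        /* old behaviour */

  size_t tls_rounded = (static_tls + pagesz - 1) & ~(pagesz - 1);
  return requested + tls_rounded;
}
```

With the numbers from this bug, a 2MB default stack and a few hundred bytes of TLS are left untouched, while a 128K Chrome stack facing 114880 bytes of instrumented TLS gets grown by 29 pages.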
(In reply to comment #27)
> Created attachment 6363 [details]
> Proposed fix that only adds to the stack size if it's less than 16 times
> __static_tls_size
>
> This patch checks if the stack size is less than 16 times the
> __static_tls_size, and adds to it in that case.
>
> Carlos, I know you're working on this for trunk, but we'd like a fix for
> ChromeOS.
>
> What are your thoughts about this patch, Paul/Carlos/Mike?

Your proposed patch seems like a fine temporary solution for the ChromeOS threads. There is no ABI here, so you can feel free to patch locally and use that to fix the problem. I would like to more carefully evaluate the heuristic, and the point at which we go from old to new behaviour, and make that deterministic instead of dependent on the size of a random variable (in the statistical sense). To be clear, I would not accept your patch for trunk.

maybe i'm reading it wrong, but if your goal is to round up to the next page, wouldn't you want to add __static_tls_size to size and then do the round up ?

~~~
size = roundup(size + __static_tls_size, __getpagesize());
~~~

Perhaps I should round it up to the next page size after doing the increment. In #25 Paul suggested I do the roundup for the increment before adding. Or perhaps the safest is to do the roundup of the TLS size and after the add.

(In reply to comment #30)
> Perhaps I should round it up to the next page size after doing the increment.
> In #25 Paul suggested I do the roundup for the increment before adding.
>
> Or perhaps the safest is to do the roundup of the TLS size and after the add.

You need to have a reason for doing the rounding, otherwise you are wasting space. What use and what alignment is required by that use should guide your decision. The size is going to be aligned up to __static_tls_align_m1 in the next statement, which is the larger of the stack alignment or the maximum alignment for any given TLS slot (slot minimum alignment is generally 16 bytes).
If you are going to use the extra space for stack then you're fine, the alignment in the next line has ensured that you get the right size given the alignment restrictions. Paul pointed me to this bz when I submitted a patch to fix bug 6973 and bug 6976 today: http://sourceware.org/ml/libc-alpha/2012-05/msg01544.html If we're looking to fix all stack size and guard related issues in this bz, does it make sense to mark the other two bzs as duplicates of this one? I understand Carlos is coming up with a much more comprehensive fix for this. (In reply to comment #32) > Paul pointed me to this bz when I submitted a patch to fix bug 6973 and bug > 6976 today: > > http://sourceware.org/ml/libc-alpha/2012-05/msg01544.html > > If we're looking to fix all stack size and guard related issues in this bz, > does it make sense to mark the other two bzs as duplicates of this one? I > understand Carlos is coming up with a much more comprehensive fix for this. Yes, please mark them as duplicates of this BZ. *** Bug 6973 has been marked as a duplicate of this bug. *** *** Bug 6976 has been marked as a duplicate of this bug. *** Carlos, any update on this? We've just run into this problem again: java loading JNI code instrumented for profiling created thread stack blocks that were too small. It would be *really* nice for glibc to just do the right thing and account for __static_tls without users having to explicitly work around it (by adding __static_tls to their requested stack size). For what it's worth, the strategy we're using in musl seems to work well: 1. For implementation-allocated stacks, the requested size is always available to the application. TLS and guard size are added onto that for the allocation. 2. For application-provided stacks, if the size of TLS is greater than 2k or greater than 1/8 of the provided stack space, additional space for the TCB/TLS/etc. is allocated and the provided stack is used only for actual stack. 
This ensures that application expectations are met: automatic variables in the thread have addresses in the specified range, and the amount of stack space available is "close enough" to what the application requested that, if it overflows, it's reasonable to say it's the application's fault for not leaving a better margin of error.

I'm not sure how easy it would be to make glibc/NPTL use separate regions for the stack and TLS though...

(In reply to comment #36)
> Carlos, any update on this?
>
> We've just run into this problem again: java loading JNI code instrumented for
> profiling created thread stack blocks that were too small.
>
> It would be *really* nice for glibc to just do the right thing and account for
> __static_tls without users having to explicitly work around it (by adding
> __static_tls to their requested stack size).

I agree. No update on this yet. I'll see what I can do in the next week or so.

The same test case fails on glibc 2.5 (RHEL 5.10), but that is expected.

I have an application which dlopens plugins. It happens that dlopen fails with new plugins: "cannot allocate memory in static TLS block". I found some big TLS segments, e.g., 1296, 498, and 3068448 bytes; the plugins consume 3+ MB of TLS.

Looking at glibc 2.5 code, the static TLS size seems to be 1664+ bytes. I used a private symbol, _dl_get_tls_static_info, to get some information about TLS from the plugin manager, and it always returned 3760 as the TLS size. I did not manage to find a way to get _dl_tls_static_used.

I do assume "cannot allocate memory in static TLS block" is related to the huge TLS segment sizes. They most likely don't fit into the static TLS space. Is there a quick way to confirm this, or do I need to debug the dynamic loader?

glibc Release/2.18 wiki: https://sourceware.org/glibc/wiki/Release/2.18

In "1.2.
Desirable this release?":

BZ#11787 - Programs with large TLS segment fail (Carlos)

A workaround here is going to be to use Siddhesh's new LIBC_PTHREAD_DEFAULT_STACK_SIZE env var to bump up default stack sizes.

The fix is not in 2.18 as I understood; any progress for 2.19?

(In reply to David Abdurachmanov from comment #39)
> I do assume "cannot allocate memory in static TLS block" is related to huge
> TLS segments size. It most likely doesn't fit into TLS space. Is there a
> quick way to confirm this, or I need to debug dynamic loader?
>
> The fix is not in 2.18 as I understood, any progress for 2.19?

The problem you are seeing can't be fixed. The error "cannot allocate memory in static TLS block" is not the same problem as this bug. Your error is that you have shared objects compiled with static TLS that *require* static TLS space. The runtime only has a limited amount of static TLS space. It is not possible (it's an impossibility actually) for the runtime to pre-allocate static TLS space for dynamically loaded modules because it doesn't know what you will dlopen. The only fix you have is to find the offending module and request the author recompile it without static TLS.

Please file another bug to investigate your issue if you think that you have shared objects that were *not* compiled with static TLS and yet still see this problem.

*** Bug 260998 has been marked as a duplicate of this bug. ***
FYI, the Rust compiler is linking every threaded Rust application against the glibc private symbol __pthread_get_minstack in order to work around this bug:

https://github.com/rust-lang/rust/issues/6233
https://github.com/rust-lang/rust/pull/11284
https://github.com/rust-lang/rust/pull/11331
https://github.com/rust-lang/rust/pull/11885
https://github.com/rust-lang/rust/issues/23628

I’m sure you’ll agree that it would be nice if we could avoid making this a permanent arrangement.

The JVM is having trouble too; see the following core-libs-dev discussion:

http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/037557.html

If you link the JVM as a library and use thread local storage, it is very hard for JVM internal threads to ensure that their stack is usable.

From the Rust discussion, they link that the Go runtime had the exact same problem.

How many languages are going to have to work around this broken behavior before there is a glibc fix? :(

(In reply to Steven Schlansker from comment #43)
> The JVM is having trouble too, see the following core-libs-dev discussion:
>
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/037557.html
>
> If you link the JVM as a library and use thread local storage, it is very
> hard for JVM internal threads to ensure that their stack is usable.
>
> From the Rust discussion, they link that the Go runtime had the exact same
> problem.
>
> How many languages are going to have to work around this broken behavior
> before there is a glibc fix? :(

I have not been working on a fix for this because of other priorities. Sorry about that. We need to see submissions like this to raise the priority of a given issue beyond just that of chrome having problems in certain scenarios. Seeing that Rust and Java are both having problems means we should do something about this. It has been 5 years since we discovered the issue, but like all things it's not trivial, and we need to rewrite the NPTL stack layout code.
I'm moving this back to NEW for now since I'm not going to work on it, but I'll see what we can do internally at Red Hat to prioritize this. As always "Patches welcome", but I understand that the code in question that needs to be changed is very very hairy, and requires a senior community developer to make or review the changes.

Thanks again for raising the visibility.

Two more data points: this also causes trouble for the Ruby runtime when configured to link against jemalloc. The full details are in https://github.com/jemalloc/jemalloc/issues/1006 . We've also hit this in some vendor-supplied code at Facebook.

Following on comment 42, we should note that Anders then switched Rust to use a dlsym lookup instead of weak linkage:

https://github.com/rust-lang/rust/pull/23631

That went in before Rust 1.0.0-beta. The fallback is to just use PTHREAD_STACK_MIN, so I think it will work fine when the symbol is removed. (assuming TLS is also moved elsewhere, I guess)

*** Bug 23018 has been marked as a duplicate of this bug. ***

Hi guys,

In ByteDance, we hit this Glibc bug twice in real-world production and adopted the fix based on the OpenJDK solution. However, these temporary fixes are not extensible. In the end, it needs to be fixed on the Glibc side.

Cheers,
Wang Liushuai.

(In reply to Wang Liushuai from comment #48)
> Hi guys,
>
> In ByteDance, we hit this Glibc bug twice in real-world production and
> adopted the fix based on the OpenJDK solution. However, these temporary
> fixes are not extensible. In the end, it needs to be fixed on the Glibc side.

Thanks for that data point. It helps me prioritize this.

The PTHREAD_STACK_MIN/RLIMIT_STACK/minstack computation is now in sysdeps/nptl/pthread_early_init.h. I think there is not much change from 2012. I agree that the current behavior is not ideal.
musl has a heuristic to address the problem (https://git.musl-libc.org/cgit/musl/commit/?id=d5142642b8e6c45449158efdb8f8e87af4dafde8):

> the new strategy pthread_create now uses is to only put TLS on the
> application-provided stack if TLS is smaller than 1/8 of the stack
> size or 2k, whichever is smaller. this ensures that the application
> always has "close enough" to what it requested, and the threshold is
> chosen heuristically to make sure "sane" amounts of TLS still end up
> in the application-provided stack.
>
> if TLS does not fit the above criteria, pthread_create uses mmap to
> obtain space for TLS, but still uses the application-provided stack
> for actual call frame stack. this is to avoid wasting memory, and for
> the sake of supporting ugly hacks like garbage collection based on
> assumptions that the implementation will use the provided stack range.

Here is a one-liner (zsh; the `*(.)` glob qualifier matches regular files) for people to investigate large p_memsz(PT_TLS):

    for i in /usr/bin/*(.) /usr/lib/x86_64-linux-gnu/*; do readelf -Wl $i 2>/dev/null | awk '$1=="TLS" && strtonum($6) > 64 {printf "%d\t", strtonum($6); exit(1)}' || echo $i; done

(Adjust paths and `> 64` as needed.) It seems that chrome has smaller PT_TLS now. On my machine there are DSOs with large p_memsz(PT_TLS):

libtsan.so.2.0.0 has a 785760 byte p_memsz(PT_TLS) due to a large vector clock (TSAN runtime v3 has a much smaller one).
liblsan.so.0.0.0 (56240)
libglog.so.0.6.0 (30457)
libmetis.so.5.1.0 (28920)
...