Bug 22111 - malloc: per thread cache is not returned when thread exits
Summary: malloc: per thread cache is not returned when thread exits
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: malloc (show other bugs)
Version: 2.26
: P2 normal
Target Milestone: 2.27
Assignee: Carlos O'Donell
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-07 22:26 UTC by Hal Gentz
Modified: 2021-06-30 07:54 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2017-09-10 00:00:00
fweimer: security-


Attachments
Makes 1000 batches of 10 threads then ends them. Then prints out memory consumption before exiting. (876 bytes, text/x-csrc)
2017-09-07 22:26 UTC, Hal Gentz
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Hal Gentz 2017-09-07 22:26:51 UTC
Created attachment 10403 [details]
Makes 1000 batches of 10 threads then ends them. Then prints out memory consumption before exiting.

Description:
glibc 2.26 now enables a per-thread cache to malloc by default. This cache consumes around 1.23 kB of memory per thread created.

This cache however is not released when the thread exits. This cache can't be detected with malloc_heap() causing many/most memory profiling tools (memcheck, massif, heapcheck, etc) unable to either detect a) its existence or b) its source.

For applications which create and remove lots of threads, this can quickly consume gigabytes of memory, with the source of the consumption appearing to be unknown.

This issue was originally reported in 2010 at https://bugzilla.redhat.com/show_bug.cgi?id=640286 and https://bugzilla.redhat.com/show_bug.cgi?id=598498

I rereported this issue 3 days ago here: https://bugs.archlinux.org/task/55532

Additional info:
Happens in: glibc 2.26-1 and glibc 2.26-2


Steps to reproduce:
1) Compile attached file.
2) Run it and wait for ~25 seconds
3) Look at VmRSS of ~127300 kB
Comment 1 Carlos O'Donell 2017-09-10 08:11:08 UTC
I can confirm this issue exists. I believe I see the problem. We are leaking the tcache structure on each thread exit. I'm testing a fix. Thanks for the report.
Comment 2 Carlos O'Donell 2017-09-10 08:18:26 UTC
Note that this issue has nothing to do with the originally referenced issues in comment #0. The issue here is that we don't set tcache_shutting_down early enough, and therefore it causes a reinitialization of the tcache (to a new tcache) and the about-to-be-freed old tcache structure chunk is placed back into the new tcache.

After the fix:
[carlos@koi swbz22111]$ ./libc_leakNEW 
/proc/self/status at exit

VmPeak:	  883040 kB
VmSize:	  694568 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	    1588 kB
VmRSS:	    1544 kB
VmData:	   43256 kB
VmStk:	     132 kB
VmExe:	       4 kB
VmLib:	    1944 kB
VmPTE:	      96 kB
VmPMD:	      16 kB
VmSwap:	       0 kB

The VmRSS is now back down to 1.5MB which is expected.

The fix itself is relatively straight forward:

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 1c2a0b0..7c5e357 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2953,8 +2953,12 @@ tcache_thread_freeres (void)
   if (!tcache)
     return;
 
+  /* Disable the tcache and prevent it from being reinitialized.  */
   tcache = NULL;
+  tcache_shutting_down = 1;
 
+  /* Free all of the entries and the tcache itself back to the arena
+     heap for coalescing.  */
   for (i = 0; i < TCACHE_MAX_BINS; ++i)
     {
       while (tcache_tmp->entries[i])
@@ -2967,7 +2971,6 @@ tcache_thread_freeres (void)
 
   __libc_free (tcache_tmp);
 
-  tcache_shutting_down = 1;
 }
 text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
---
Comment 3 Hal Gentz 2017-09-10 19:53:57 UTC
Sorry, the referenced issues gave exactly so I thought that was it.

Anyways, I've compiled glibc with those two lines changed and its fixed, YAY!

Should I mark this issue as resovled, or should I wait tell it gets merged or something?
Comment 4 Florian Weimer 2017-09-11 06:59:24 UTC
(In reply to Hal Gentz from comment #3)
> Sorry, the referenced issues gave exactly so I thought that was it.
> 
> Anyways, I've compiled glibc with those two lines changed and its fixed, YAY!
> 
> Should I mark this issue as resovled, or should I wait tell it gets merged
> or something?

Thanks for testing.  The patch will have to be posted to the mailing list, committed, and then we can close the bug.  Backporting to the 2.26 release branch is also needed.
Comment 5 Carlos O'Donell 2017-09-27 04:17:36 UTC
(In reply to Florian Weimer from comment #4)
> (In reply to Hal Gentz from comment #3)
> > Sorry, the referenced issues gave exactly so I thought that was it.
> > 
> > Anyways, I've compiled glibc with those two lines changed and its fixed, YAY!
> > 
> > Should I mark this issue as resovled, or should I wait tell it gets merged
> > or something?
> 
> Thanks for testing.  The patch will have to be posted to the mailing list,
> committed, and then we can close the bug.  Backporting to the 2.26 release
> branch is also needed.

I'm preparing the patch for this with a test that validates the leak.
Comment 6 Carlos O'Donell 2017-09-27 05:34:40 UTC
(In reply to Carlos O'Donell from comment #5)
> I'm preparing the patch for this with a test that validates the leak.

cat ~/build/glibc-work/malloc/tst-malloc-tcache-leak.out 
INFO: 688 (bytes) are in use before starting threads.
INFO: 10864 (bytes) are in use after all threads joined.

OK, I got the test done somewhat portably using mallinfo and some heuristics.

If you see greater than 1000x size difference from the start to the end, then you've got a leak.
Comment 7 Carlos O'Donell 2017-09-27 05:46:09 UTC
Posted for review:
https://www.sourceware.org/ml/libc-alpha/2017-09/msg01008.html
Comment 8 Sourceware Commits 2017-10-06 17:31:28 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  1e26d35193efbb29239c710a4c46a64708643320 (commit)
      from  d13867625894fda6c6a5034dadfa8ff86983ea12 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1e26d35193efbb29239c710a4c46a64708643320

commit 1e26d35193efbb29239c710a4c46a64708643320
Author: Carlos O'Donell <carlos@systemhalted.org>
Date:   Thu Sep 28 11:05:18 2017 -0600

    malloc: Fix tcache leak after thread destruction [BZ #22111]
    
    The malloc tcache added in 2.26 will leak all of the elements remaining
    in the cache and the cache structure itself when a thread exits. The
    defect is that we do not set tcache_shutting_down early enough, and the
    thread simply recreates the tcache and places the elements back onto a
    new tcache which is subsequently lost as the thread exits (unfreed
    memory). The fix is relatively simple, move the setting of
    tcache_shutting_down earlier in tcache_thread_freeres. We add a test
    case which uses mallinfo and some heuristics to look for unaccounted for
    memory usage between the start and end of a thread start/join loop. It
    is very reliable at detecting that there is a leak given the number of
    iterations.  Without the fix the test will consume 122MiB of leaked
    memory.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                       |    9 +++
 malloc/Makefile                 |    3 +
 malloc/malloc.c                 |    8 ++-
 malloc/tst-malloc-tcache-leak.c |  112 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 129 insertions(+), 3 deletions(-)
 create mode 100644 malloc/tst-malloc-tcache-leak.c
Comment 9 Carlos O'Donell 2017-10-06 17:33:07 UTC
Fixed for 2.27.
Comment 10 Sourceware Commits 2017-10-06 20:18:30 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, release/2.26/master has been updated
       via  6e1ea21501eac981204c3cc8212d45998f74983c (commit)
      from  dd3a7239fddff81ac31373d69978d7aa1902c65f (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6e1ea21501eac981204c3cc8212d45998f74983c

commit 6e1ea21501eac981204c3cc8212d45998f74983c
Author: Carlos O'Donell <carlos@systemhalted.org>
Date:   Thu Sep 28 11:05:18 2017 -0600

    malloc: Fix tcache leak after thread destruction [BZ #22111]
    
    The malloc tcache added in 2.26 will leak all of the elements remaining
    in the cache and the cache structure itself when a thread exits. The
    defect is that we do not set tcache_shutting_down early enough, and the
    thread simply recreates the tcache and places the elements back onto a
    new tcache which is subsequently lost as the thread exits (unfreed
    memory). The fix is relatively simple, move the setting of
    tcache_shutting_down earlier in tcache_thread_freeres. We add a test
    case which uses mallinfo and some heuristics to look for unaccounted for
    memory usage between the start and end of a thread start/join loop. It
    is very reliable at detecting that there is a leak given the number of
    iterations.  Without the fix the test will consume 122MiB of leaked
    memory.
    
    (cherry picked from commit 1e26d35193efbb29239c710a4c46a64708643320)

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                       |    9 +++
 malloc/Makefile                 |    3 +
 malloc/malloc.c                 |    8 ++-
 malloc/tst-malloc-tcache-leak.c |  112 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 129 insertions(+), 3 deletions(-)
 create mode 100644 malloc/tst-malloc-tcache-leak.c
Comment 11 Sourceware Commits 2017-10-06 20:32:00 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, release/2.26/master has been updated
       via  d5c6dea2d5b4b5c64625c5386f6baec7bf2d89b3 (commit)
      from  6e1ea21501eac981204c3cc8212d45998f74983c (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d5c6dea2d5b4b5c64625c5386f6baec7bf2d89b3

commit d5c6dea2d5b4b5c64625c5386f6baec7bf2d89b3
Author: Carlos O'Donell <carlos@systemhalted.org>
Date:   Fri Oct 6 13:31:05 2017 -0700

    Update NEWS for bug 22111.

-----------------------------------------------------------------------

Summary of changes:
 NEWS |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
Comment 12 JeffyChen 2020-10-10 07:23:03 UTC
Hi guys,

It seems this this issue is still exist in some cases, for example i can repro it with this test in glibc 2.29:

#include <pthread.h>
void *noop() { pthread_detach(pthread_self()); return NULL; }
void main() {
        pthread_t id;
        while(1) pthread_create(&id, NULL, noop, 0);
}

There's also another report at:
https://sourceware.org/ml/glibc-bugs/2018-01/msg00171.html

So it looks like the tcache_shutting_down is still not early enough, when we detach a thread with no tcache allocated(like the above test), the tcache_shutting_down would still be false:
static void
tcache_thread_shutdown (void)
{
  int i;
  tcache_perthread_struct *tcache_tmp = tcache;

  if (!tcache)
    return; <--- returned here

  /* Disable the tcache and prevent it from being reinitialized.  */
  tcache = NULL;
  tcache_shutting_down = true; <--- unreachable


And the tcache might be reinitialized later(breakpoint at later MAYBE_INIT_TCACHE):
Thread 2 (Thread 0x7f993c61e0 (LWP 8920)):
#0  0x0000007f9d444f7c in free () from /lib/libc.so.6
#1  0x0000007f9d57a3e4 in _dl_deallocate_tls () from /lib/ld-linux-aarch64.so.1
#2  0x0000007f9d541490 in free_stacks () from /lib/libpthread.so.0
#3  0x0000007f9d541a4c in __deallocate_stack () from /lib/libpthread.so.0
#4  0x0000007f9d542984 in start_thread () from /lib/libpthread.so.0
#5  0x0000007f9d49b1fc in thread_start () from /lib/libc.so.6



This works:
+++ b/malloc/malloc.c
@@ -2958,13 +2958,14 @@ tcache_thread_shutdown (void)
 {
   int i;
   tcache_perthread_struct *tcache_tmp = tcache;

+  tcache_shutting_down = true;

   if (!tcache)
     return;

   /* Disable the tcache and prevent it from being reinitialized.  */
   tcache = NULL;
-  tcache_shutting_down = true;
Comment 13 Florian Weimer 2021-06-30 07:54:00 UTC
(In reply to JeffyChen from comment #12)
> Hi guys,
> 
> It seems this this issue is still exist in some cases, for example i can
> repro it with this test in glibc 2.29:

Thanks for the report, this is now bug 28028.