Bug 20425

Summary: unbalanced and poor utilization of memory in glibc arenas may cause memory bloat and subsequent OOM
Product: glibc
Component: malloc
Version: 2.12
Status: RESOLVED WORKSFORME
Severity: normal
Priority: P2
Target Milestone: ---
Reporter: Sumeet Keswani <sumeet.keswani>
Assignee: Florian Weimer <fweimer>
CC: drepper.fsp, fweimer
Flags: fweimer: security-
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1359877
Last reconfirmed: 2016-08-01 00:00:00
Attachments: Reproducer
             Corrected Reproducer with tcl script for post-processing
             reproducer done on glibc-2.22
             glibc-issue

Description Sumeet Keswani 2016-07-30 01:39:23 UTC
Created attachment 9411 [details]
Reproducer

Summary:

The [g]libc allocator is not doing a good job reusing memory across arenas.

Furthermore, arena usage is unbalanced: some arenas see 4 times the median number of allocations while others see very few.

Also, a significant amount of memory goes unused, resulting in bloat.

We believe this bug was exposed by the fix for
https://sourceware.org/bugzilla/show_bug.cgi?id=19048

Our reproducer uses glibc-2.12.1-192 (built by Red Hat, which includes the fix for Bug 19048) to demonstrate the issue, but we strongly suspect the issue also exists in the latest version of glibc.
(Note that glibc-2.12.1-166, built by Red Hat, does not show this issue because it lacks the fix for Bug 19048.)


Reproducer:

Attached is the reproducer that contains the files needed to reproduce the issue...

pthread_arena.exe.out is a run with stats.
pthread_arena.exe.out.tcl.xls is those stats post-processed.

It's a 24-core machine with 96 GiB of memory,
so there are 192 arenas (the 64-bit default of 8 arenas per core, 8*24).
The median size of an arena is about 16M.

If you look at iteration 1 (called set 1), you'll see that all arenas are right around the median,
i.e., the 500 threads are spread fairly evenly across the 192 arenas.

But on the 2nd and subsequent iterations, several arenas are 4+ times the median (and a few have dropped to almost nothing).
This says that after the initial population of arenas, the algorithm for choosing an arena when one is needed is very poor,
over-subscribing many arenas and under-subscribing a few.


Impact on the application:
Most database applications account for the memory they use.
The application _is_ freeing memory.  The [g]libc allocator is not doing a good job of reusing it.

Consequently, the amount of memory _used_ by the application far exceeds what the application accounts for
(e.g. the application believes it uses 3G but the RSS is actually 7G, due to the poor utilization caused by this bug). This can result in an OOM error/exception when the application tries to allocate more memory and none is available on the machine.


If arena subscription could be balanced, that might solve the problem.
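
For readers without access to the attachment, the following is only a minimal sketch of this kind of workload; it is not the attached pthread_arena.c, and the thread count, block count, and block sizes are illustrative assumptions:

#include <malloc.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 500   /* assumption: mirrors -threads 500 */
#define NBLOCKS  100   /* arbitrary number of blocks per thread */

static void *worker(void *arg)
{
    void *blocks[NBLOCKS];
    int i;
    (void)arg;
    /* Allocate blocks of varying size (here up to ~200 KiB), then free them all. */
    for (i = 0; i < NBLOCKS; i++)
        blocks[i] = malloc(((size_t)(rand() % 200) + 1) * 1024);
    for (i = 0; i < NBLOCKS; i++)
        free(blocks[i]);
    return NULL;
}

int main(void)
{
    int iter, i;
    for (iter = 1; iter <= 10; iter++) {
        pthread_t tids[NTHREADS];
        for (i = 0; i < NTHREADS; i++)
            if (pthread_create(&tids[i], NULL, worker, NULL) != 0)
                abort();
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        printf("=== set %d ===\n", iter);
        malloc_stats();   /* per-arena statistics, printed to stderr */
    }
    return 0;
}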
Comment 1 Sumeet Keswani 2016-07-30 01:40:12 UTC
Here is someone else who ran into it...
http://codearcana.com/posts/2016/07/11/arena-leak-in-glibc.html
Comment 2 Florian Weimer 2016-08-01 12:43:47 UTC
(In reply to Sumeet Keswani from comment #0)
> Attached is the reproducer that contains the files needed to reproduce the
> issue...

The reproducer does not match the included .out file, and it is not clear how to reproduce the data you collected.
Comment 3 Sumeet Keswani 2016-08-01 14:32:58 UTC
The package contains the following files...

Makefile
pthread_arena.c
pthread_arena.exe
pthread_arena.exe.out
pthread_arena.exe.out.tcl.xls
glibc-2.12.1.166 (dir)
glibc-2.12.1.192 (dir)

To compile
cc -Wall -g -pthread pthread_arena.c -o pthread_arena.exe


To run,
set LD_LIBRARY_PATH to the glibc you want and run with

e.g.
LD_LIBRARY_PATH=./glibc-2.12.1-192 ./pthread_arena.exe -iters 10 -stats -threads 500 -maxSize 200

This will run the reproducer and write its output to stdout (captured in pthread_arena.exe.out).

The output generated in pthread_arena.exe.out is tabulated into
pthread_arena.exe.out.tcl.xls.
There it shows that some arenas get 4 times the median number of allocations and others get much less.

(In reply to Florian Weimer from comment #2)
> (In reply to Sumeet Keswani from comment #0)
> > Attached is the reproducer that contains the files needed to reproduce the
> > issue...
> 
> The reproducer does not match the included .out file, and it is not clear
> how to reproduce the data you collected.
Comment 4 Sumeet Keswani 2016-08-01 14:43:05 UTC
(In reply to Florian Weimer from comment #2)
> (In reply to Sumeet Keswani from comment #0)
> > Attached is the reproducer that contains the files needed to reproduce the
> > issue...
> 
> The reproducer does not match the included .out file, and it is not clear
> how to reproduce the data you collected.

I am going to attach a tcl script to post process the stdout and create the tabulation (xls).
Comment 5 Florian Weimer 2016-08-01 14:57:20 UTC
(In reply to Sumeet Keswani from comment #4)
> (In reply to Florian Weimer from comment #2)
> > (In reply to Sumeet Keswani from comment #0)
> > > Attached is the reproducer that contains the files needed to reproduce the
> > > issue...
> > 
> > The reproducer does not match the included .out file, and it is not clear
> > how to reproduce the data you collected.
> 
> I am going to attach a tcl script to post process the stdout and create the
> tabulation (xls).

The .c file also seems to lack the malloc_stats calls.
Comment 6 Sumeet Keswani 2016-08-01 15:51:20 UTC
Created attachment 9417 [details]
Corrected Reproducer with tcl script for post-processing

Attaching the corrected tarball...

The package contains the following files...

Makefile
pthread_arena.c
pthread_arena.exe
pthread_arena.exe.out
pthread_arena.exe.out.tcl.xls
malloc_stats.01.tcl
glibc-2.12.1.166 (dir)
glibc-2.12.1.192 (dir)

To compile
cc -Wall -g -pthread pthread_arena.c -o pthread_arena.exe


To run,
set LD_LIBRARY_PATH to the glibc you want and run with

e.g.
LD_LIBRARY_PATH=./glibc-2.12.1-192 ./pthread_arena.exe -iters 10 -stats -threads 500 -maxSize 200 > pthread_arena.exe.out


To process the output:
tclsh malloc_stats.01.tcl pthread_arena.exe.out > pthread_arena.exe.out.tcl.xls
Comment 7 Florian Weimer 2016-08-02 09:50:39 UTC
I cannot reproduce this with the -192 glibc.  Can you provide lscpu output from your test machine?

I tried your reproducer on:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1200.000
BogoMIPS:              4788.56
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23


And:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                240
On-line CPU(s) list:   0-239
Thread(s) per core:    2
Core(s) per socket:    15
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E7-8890 v2 @ 2.80GHz
Stepping:              7
CPU MHz:               1200.000
BogoMIPS:              5585.84
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              38400K
NUMA node0 CPU(s):     0-14,120-134
NUMA node1 CPU(s):     15-29,135-149
NUMA node2 CPU(s):     30-44,150-164
NUMA node3 CPU(s):     45-59,165-179
NUMA node4 CPU(s):     60-74,180-194
NUMA node5 CPU(s):     75-89,195-209
NUMA node6 CPU(s):     90-104,210-224
NUMA node7 CPU(s):     105-119,225-239

The latter limited to 192 arenas with MALLOC_ARENA_MAX=192.
Comment 8 Florian Weimer 2016-08-02 09:54:23 UTC
Please also run the test script from https://sourceware.org/bugzilla/show_bug.cgi?id=19048#c12; it has to be invoked towards the end of the test run.
Comment 9 Sumeet Keswani 2016-08-02 17:06:23 UTC
Here is the lscpu info:

[13:05:55][condor@eng094:~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               2933.000
BogoMIPS:              5866.55
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
NUMA node1 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
Comment 10 Florian Weimer 2016-08-09 13:24:19 UTC
(In reply to Sumeet Keswani from comment #9)
> here is the lscpu info

Thanks, this is fairly close to what I used.  Have you been able to run the script referenced in comment 8?

Can you reproduce the issue with more recent glibc versions than what's in Red Hat Enterprise Linux 6?
Comment 11 Sumeet Keswani 2016-08-09 13:43:40 UTC
(In reply to Florian Weimer from comment #10)
> (In reply to Sumeet Keswani from comment #9)
> > here is the lscpu info
> 
> Thanks, this is fairly close to what I used.  Have you been able to run the
> script referenced in comment 8?
> 
> Can you reproduce the issue with more recent glibc versions than what's in
> Red Hat Enterprise Linux 6?

That script requires debug symbols to be loaded. I tried to run it with symbols from the corresponding debug package, but unfortunately the script then gave no output. I suspect there is a problem with the symbols or the way they were loaded.

Hence I now have to build glibc from source (with symbols included) and start over with that, which is why this has been delayed.
Comment 12 Sumeet Keswani 2016-08-09 13:47:27 UTC
(In reply to Sumeet Keswani from comment #11)
> (In reply to Florian Weimer from comment #10)
> > (In reply to Sumeet Keswani from comment #9)
> > > here is the lscpu info
> > 
> > Thanks, this is fairly close to what I used.  Have you been able to run the
> > script referenced in comment 8?
> > 
> > Can you reproduce the issue with more recent glibc versions than what's in
> > Red Hat Enterprise Linux 6?
> 
> That script requires symbols to be loaded. I tried to run that with symbols
> from the corresponding dbg package. unfortunately when run with that the
> script gave no output. i suspect there is a problem with the symbols or the
> way they were loaded.
> 
> Hence i now have to build glibc from sources which include symbols and start
> with that. Hence it gotten delayed.


"Can you reproduce the issue with more recent glibc versions than what's in
 Red Hat Enterprise Linux 6?" 
- Yes, we are able to reproduce this with glibc-2.12.1-192, which is the most recent glibc in RHEL 6.x.
Comment 13 Florian Weimer 2016-08-09 13:51:53 UTC
(In reply to Sumeet Keswani from comment #12)

> "Can you reproduce the issue with more recent glibc versions than what's in
>  Red Hat Enterprise Linux 6?" 
> - yes we are able to reproduce this with glibc-2.12.1-192 which is the most
> recent glibc in RHEL 6.x

That doesn't really answer my question.  I'm interested in the behavior of the current upstream sources (master branch or any of the recent release branches would be fine).

If you don't want to work with upstream sources, we should perhaps move the conversation to the bug in Red Hat Bugzilla because glibc 2.12 is not terribly relevant to the glibc upstream community at this point (that release happened more than five years ago).
Comment 14 Sumeet Keswani 2016-08-09 14:34:07 UTC
(In reply to Florian Weimer from comment #13)
> (In reply to Sumeet Keswani from comment #12)
> 
> > "Can you reproduce the issue with more recent glibc versions than what's in
> >  Red Hat Enterprise Linux 6?" 
> > - yes we are able to reproduce this with glibc-2.12.1-192 which is the most
> > recent glibc in RHEL 6.x
> 
> That doesn't really answer my question.  I'm interested in the behavior of
> the current upstream sources (master branch or any of the recent release
> branches would be fine).
> 
> If you don't want to work with upstream sources, we should perhaps move the
> conversation to the bug in Red Hat Bugzilla because glibc 2.12 is not
> terribly relevant to the glibc upstream community at this point (that
> release happened more than five years ago).

Sorry, I misunderstood.

Yes, I can try it with more recent sources (although I haven't thus far).
We definitely have an interest in getting to the bottom of this on the more recent versions of glibc that ship with RHEL 7/Ubuntu etc., as we have users on those versions too.
Comment 15 Sumeet Keswani 2016-08-17 16:55:07 UTC
Created attachment 9446 [details]
reproducer done on glibc-2.22

The attached reproducer run was done on glibc-2.22.
It demonstrates the increase in RSS memory.

There are some output files in the tarball that show a local run, using the code provided, on the following machine configuration:


[centos@ip-10-147-171-138 ~]$ uname -a
Linux  3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[centos@ip-10-147-171-138 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Stepping:              4
CPU MHz:               2500.074
BogoMIPS:              5263.77
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
Comment 16 Florian Weimer 2016-08-18 18:21:37 UTC
(In reply to Sumeet Keswani from comment #15)
> Created attachment 9446 [details]
> reproducer done on glibc-2.22
> 
> Attached reproducer was done on glibc-2.22
> It demonstrates the increase in RSS memory.

Can you clarify what your concern is?

The bug started with an observed unevenness in the arena selection.  Do you still see this as a problem, or are you solely concerned about RSS usage?

> There are some output files in the tarball that show a local run using the
> code provided on the following machine config

Thanks, I have found a remarkably similar machine (but naturally, it's not running under Xen).
Comment 17 Sumeet Keswani 2016-08-18 18:54:33 UTC
(In reply to Florian Weimer from comment #16)
> (In reply to Sumeet Keswani from comment #15)
> > Created attachment 9446 [details]
> > reproducer done on glibc-2.22
> > 
> > Attached reproducer was done on glibc-2.22
> > It demonstrates the increase in RSS memory.
> 
> Can you clarify what your concern is?
> 
> The bug started with an observed unevenness in the arena selection.  Do you
> still see this as a problem, or are you solely concerned about RSS usage?
> 
> > There are some output files in the tarball that show a local run using the
> > code provided on the following machine config
> 
> Thanks, I have found a remarkably similar machine (but naturally, it's not
> running under Xen).

Sorry for the long response...

Our concern is the perpetually increasing RSS, which ultimately causes an OOM kill. This is the bottom line. We see this problem with versions/builds of glibc which have the fix for Bug 19048, so we believe this problem has been exposed by, or is related to, the fix for Bug 19048.

Between glibc 2.12.1-166 and 2.12.1-192 we can see "unbalanced and poor utilization" of the arenas, which causes significant memory bloat and an OOM kill.

Subsequently we ran the reproducer on glibc-2.22. Here we don't see the "unbalanced and poor utilization", but the arenas grow (leak) with each iteration, which will ultimately result in an OOM.


So yes, I am happy to fork this bug into two:
one for the 2.12 stream, where the unbalanced use of arenas is apparent,
and another for 2.22, where the arenas grow steadily.

Both result in OOM.
Comment 18 Florian Weimer 2016-08-18 19:35:13 UTC
(In reply to Sumeet Keswani from comment #17)
> Subsequently we ran the reproducer on glibc-2.22. Here we don’t see this
> "unbalanced and poor utilization", however arenas grows (leak) with each
> iteration which will ultimately result in an OOM.

When you write glibc-2.22, do you mean a build from the release tarball, or from the release/2.22/master branch in Git?  Thanks for clarifying.
Comment 19 Sumeet Keswani 2016-08-18 19:42:34 UTC
(In reply to Florian Weimer from comment #18)
> (In reply to Sumeet Keswani from comment #17)
> > Subsequently we ran the reproducer on glibc-2.22. Here we don’t see this
> > "unbalanced and poor utilization", however arenas grows (leak) with each
> > iteration which will ultimately result in an OOM.
> 
> When you write glibc-2.22, do you mean a build from the release tarball, or
> from the release/2.22/master branch in Git?  Thanks for clarifying.

I mean build from the release tarball.
Comment 20 Florian Weimer 2016-08-20 19:29:38 UTC
(In reply to Sumeet Keswani from comment #17)
> Subsequently we ran the reproducer on glibc-2.22. Here we don’t see this
> "unbalanced and poor utilization", however arenas grows (leak) with each
> iteration which will ultimately result in an OOM.

I ran the reproducer with:

-iters 20 -threads 512 -maxSize 200 -secs 60 -stats

against a glibc 2.22 release version.  For me, memory utilization stabilizes at around 4.5 GiB after the first few iterations.  I don't know why the ramp-up is slower in your pthread_arena.exe.out.2.22 log file.  It could be a kernel difference (I'm on kernel-3.10.0-327.28.2.el7.x86_64).

If you run more iterations, does RSS keep increasing?  It seems to have plateaued for you as well.
Comment 21 Florian Weimer 2016-09-05 09:45:00 UTC
I noticed that the test case is invalid because it does not check the result of pthread_create, and this makes the subsequent pthread_join invalid.

I still do not see a glibc bug here.  RSS usage could be lower, but I don't see evidence of an actual memory leak in the glibc allocator.
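
A minimal sketch of the missing check (variable and function names are illustrative, not taken from the test case): record which threads were actually created and join only those.

pthread_t tids[NTHREADS];
int created[NTHREADS];
int i;

for (i = 0; i < NTHREADS; i++)
    created[i] = (pthread_create(&tids[i], NULL, worker, NULL) == 0);

for (i = 0; i < NTHREADS; i++)
    if (created[i])
        pthread_join(tids[i], NULL);   /* join only successfully created threads */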
Comment 22 Sumeet Keswani 2016-09-08 21:53:13 UTC
(In reply to Florian Weimer from comment #21)
> I noticed that the test case is invalid because it does not check the result
> of pthread_create, and this makes the subsequent pthread_join invalid.
> 
> I still do not see a glibc bug here.  RSS usage could be lower, but I don't
> see evidence of an actual memory leak in the glibc allocator.

Sorry for the delay in response, I was out for a while.
I am working on getting you additional data to demonstrate the problem.
I will update this BZ shortly.
Comment 23 Florian Weimer 2016-10-11 17:31:09 UTC
(In reply to Sumeet Keswani from comment #22)

> Sorry for the delay in response, I was out for a while.
> I am working on getting you additional data to demonstrate the problem.
> I will update this BZ shortly.

Sumeet, do you have any additional data you can share?  Thanks.
Comment 24 Sumeet Keswani 2016-10-11 20:48:25 UTC
No, we have not had the chance to debug it further.

We believe it's a case of bug 11261 on steroids, in some sense. Each arena actually holds on to 5x to 50x the memory that is given to the application. After Bug 19048 was fixed, all arenas started to demonstrate this bloat. Consequently the application quickly OOMs.

We have been trying to manage it using MALLOC_TRIM_THRESHOLD_ and MALLOC_ARENA_MAX in some cases. In other cases we have had success using jemalloc.
The dynamic mmap threshold does not help, since we make fairly large allocations from time to time, which causes a lot of bloat. An example of this kind of tuning is sketched below.
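
For illustration, this kind of tuning is typically applied through the malloc environment variables; the values and the application name below are placeholders, not recommendations:

export MALLOC_ARENA_MAX=8              # cap the number of arenas
export MALLOC_TRIM_THRESHOLD_=131072   # return free heap top to the OS more aggressively (bytes)
export MALLOC_MMAP_THRESHOLD_=131072   # setting this also disables the dynamic mmap threshold
./your_application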

It's still on my plate to run periodic malloc_stats to demonstrate the long-term algorithmic problem; I have not gotten to it yet.

You may mark this as Need Info.
Comment 25 Sumeet Keswani 2016-10-12 14:13:10 UTC
Created attachment 9559 [details]
glibc-issue


It's been hard to build an accurate reproducer, since our workloads are complicated, but this (attached) is typical of what we see.
(This is the RSz memory of three processes running on three different machines.)

Over longer periods of time, the arenas take much more memory than the process actually uses. The memory is almost never returned, and the process OOMs in a few days.

Note how calling malloc_trim() returns the memory to the OS.
This tells us the memory is totally wasted in the arenas, because by calling malloc_trim() we can recover it. Hence there is something algorithmically wrong with the way the arenas are managed: they either do not reuse the free memory they have, and/or never return the top of the heap. A sketch of such a periodic trim is given below.
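
As an illustration only (this is not code from our product), a periodic trim of the kind described might look like the following; the 60-second interval and the dedicated thread are arbitrary assumptions:

#include <malloc.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical background thread that periodically asks glibc to hand
   free arena memory back to the OS. */
static void *trimmer(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(60);        /* arbitrary interval */
        malloc_trim(0);   /* release free memory held by the allocator */
    }
    return NULL;
}

/* Started once at process startup, e.g.:
   pthread_t t; pthread_create(&t, NULL, trimmer, NULL); */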

With the cyclic arena fix (yellow) this problem actually became worse; the yellow line will OOM sooner than the blue and purple ones.
Comment 26 Florian Weimer 2017-01-02 16:05:46 UTC
We cannot fix this problem without a proper reproducer.