| Summary: | unbalanced and poor utilization of memory in glibc arenas may cause memory bloat and subsequent OOM |
|---|---|
| Product: | glibc |
| Reporter: | Sumeet Keswani <sumeet.keswani> |
| Component: | malloc |
| Assignee: | Florian Weimer <fweimer> |
| Status: | RESOLVED WORKSFORME |
| Severity: | normal |
| CC: | drepper.fsp, fweimer |
| Priority: | P2 |
| Flags: | fweimer: security- |
| Version: | 2.12 |
| Target Milestone: | --- |
| See Also: | https://bugzilla.redhat.com/show_bug.cgi?id=1359877 |
| Host: | |
| Target: | |
| Build: | |
| Last reconfirmed: | 2016-08-01 00:00:00 |

Attachments:
- Reproducer
- Corrected Reproducer with tcl script for post-processing
- reproducer done on glibc-2.22
- glibc-issue
Description
Sumeet Keswani, 2016-07-30 01:39:23 UTC
Sumeet Keswani:
Here is someone else who ran into it: http://codearcana.com/posts/2016/07/11/arena-leak-in-glibc.html

Florian Weimer, in reply to comment #0:
> Attached is the reproducer that contains the files needed to reproduce the issue...

The reproducer does not match the included .out file, and it is not clear how to reproduce the data you collected.

Sumeet Keswani:
The package contains the following files:

Makefile
pthread_arena.c
pthread_arena.exe
pthread_arena.exe.out
pthread_arena.exex.tcl.xls
glibc-2.12.1.166 (dir)
glibc-2.12.1.192 (dir)

To compile:
cc -Wall -g -pthread pthread_arena.c -o pthread_arena.exe

To run, set LD_LIBRARY_PATH to the glibc you want and run with, e.g.:
LD_LIBRARY_PATH=./glibc-2.12.1-192 ./pthread_arena.exe -iters 10 -stats -threads 500 -maxSize 200

This will run the reproducer and write info to stdout (pthread_arena.exe.out). The output generated in pthread_arena.exe.out is tabulated into pthread_arena.exex.tcl.xls. It shows that some arenas get 4 times the median number of allocations and others get much less.

Sumeet Keswani, in reply to comment #2:
> The reproducer does not match the included .out file, and it is not clear how to reproduce the data you collected.

I am going to attach a tcl script to post-process the stdout and create the tabulation (xls).

Florian Weimer:
The .c file also seems to lack the malloc_stats calls.

Created attachment 9417 [details]
Corrected Reproducer with tcl script for post-processing
attaching the corrected tarball....
The package contains the following files...
Makefile
pthread_arena.c
pthread_arena.exe
pthread_arena.exe.out
pthread_arena.exe.out.tcl.xls
malloc_stats.01.tcl
glibc-2.12.1.166 (dir)
glibc-2.12.1.192 (dir)
To compile:
cc -Wall -g -pthread pthread_arena.c -o pthread_arena.exe
To run, set LD_LIBRARY_PATH to the glibc you want and run with, e.g.:
LD_LIBRARY_PATH=./glibc-2.12.1-192 ./pthread_arena.exe -iters 10 -stats -threads 500 -maxSize 200 > pthread_arena.exe.out
To process the output:
tclsh malloc_stats.01.tcl pthread_arena.exe.out > pthread_arena.exe.out.tcl.xls
Florian Weimer:
I cannot reproduce this with the -192 glibc. Can you provide lscpu output from your test machine? I tried your reproducer on:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 1200.000
BogoMIPS: 4788.56
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

And:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 240
On-line CPU(s) list: 0-239
Thread(s) per core: 2
Core(s) per socket: 15
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E7-8890 v2 @ 2.80GHz
Stepping: 7
CPU MHz: 1200.000
BogoMIPS: 5585.84
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 38400K
NUMA node0 CPU(s): 0-14,120-134
NUMA node1 CPU(s): 15-29,135-149
NUMA node2 CPU(s): 30-44,150-164
NUMA node3 CPU(s): 45-59,165-179
NUMA node4 CPU(s): 60-74,180-194
NUMA node5 CPU(s): 75-89,195-209
NUMA node6 CPU(s): 90-104,210-224
NUMA node7 CPU(s): 105-119,225-239

The latter was limited to 192 arenas with MALLOC_ARENA_MAX=192.

Please also run the test script from https://sourceware.org/bugzilla/show_bug.cgi?id=19048#c12 towards the end of the test run.
Sumeet Keswani:
Here is the lscpu info:

[13:05:55][condor@eng094:~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2933.000
BogoMIPS: 5866.55
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
NUMA node1 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22

Florian Weimer:
Thanks, this is fairly close to what I used. Have you been able to run the script referenced in comment 8? Can you reproduce the issue with more recent glibc versions than what's in Red Hat Enterprise Linux 6?

Sumeet Keswani:
That script requires symbols to be loaded. I tried to run it with symbols from the corresponding dbg package; unfortunately, the script then gave no output. I suspect there is a problem with the symbols or the way they were loaded. I now have to build glibc from sources that include symbols and start over with that, so this has been delayed.

Sumeet Keswani:
"Can you reproduce the issue with more recent glibc versions than what's in Red Hat Enterprise Linux 6?"
Yes, we are able to reproduce this with glibc-2.12.1-192, which is the most recent glibc in RHEL 6.x.

Florian Weimer:
That doesn't really answer my question. I'm interested in the behavior of the current upstream sources (the master branch or any of the recent release branches would be fine). If you don't want to work with upstream sources, we should perhaps move the conversation to the bug in Red Hat Bugzilla, because glibc 2.12 is not terribly relevant to the glibc upstream community at this point (that release happened more than five years ago).

Sumeet Keswani:
Sorry, I misunderstood. Yes, I can try it with more recent sources (albeit I haven't thus far). We definitely have an interest in getting to the bottom of this on the more recent versions of glibc that ship with RHEL 7, Ubuntu, etc., as we have users on those versions too.

Created attachment 9446 [details]
reproducer done on glibc-2.22
Attached reproducer was done on glibc-2.22
It demonstrates the increase in RSS memory.
There are some output files in the tarball that show a local run, using the code provided, on the following machine config:
[centos@ip-10-147-171-138 ~]$ uname -a
Linux 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[centos@ip-10-147-171-138 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Stepping: 4
CPU MHz: 2500.074
BogoMIPS: 5263.77
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Florian Weimer, in reply to comment #15:
> Attached reproducer was done on glibc-2.22
> It demonstrates the increase in RSS memory.

Can you clarify what your concern is? The bug started with an observed unevenness in the arena selection. Do you still see this as a problem, or are you solely concerned about RSS usage?

> There are some output files in the tarball that show a local run using the code provided on the following machine config

Thanks, I have found a remarkably similar machine (but naturally, it's not running under Xen).

Sumeet Keswani:
Sorry for the slow response. Our concern is the perpetually increasing RSS, which ultimately causes an OOM kill. That is the bottom line. We see this problem with versions/builds of glibc that have the fix for bug 19048; consequently, we believe this problem has been exposed by, or is related to, the fix for 19048. Between glibc 2.12.1-166 and 2.12.1-192 we can see "unbalanced and poor utilization" of the arenas, which causes significant memory bloat and an OOM kill. Subsequently we ran the reproducer on glibc-2.22. There we don't see this "unbalanced and poor utilization"; however, the arenas grow (leak) with each iteration, which will ultimately result in an OOM.

So yes, I am happy to fork this bug into two: one for the 2.12 stream, where the unbalanced use of arenas is apparent, and another for 2.22, where the arenas grow steadily. Both result in OOM.

Florian Weimer:
When you write glibc-2.22, do you mean a build from the release tarball, or from the release/2.22/master branch in Git? Thanks for clarifying.

Sumeet Keswani:
I mean a build from the release tarball.

Florian Weimer:
I ran the reproducer with:

-iters 20 -threads 512 -maxSize 200 -secs 60 -stats

against a glibc 2.22 release version. For me, memory utilization stabilizes at around 4.5 GiB after the first few iterations. I don't know why the ramp-up is slower in your pthread_arena.exe.out.2.22 log file. It could be a kernel difference (I'm on kernel-3.10.0-327.28.2.el7.x86_64). If you run more iterations, does RSS keep increasing? It seems to have plateaued for you as well.

Florian Weimer:
I noticed that the test case is invalid because it does not check the result of pthread_create, and this makes the subsequent pthread_join invalid. I still do not see a glibc bug here.
RSS usage could be lower, but I don't see evidence of an actual memory leak in the glibc allocator.

Sumeet Keswani:
Sorry for the delay in responding; I was out for a while. I am working on getting you additional data to demonstrate the problem. I will update this BZ shortly.

Florian Weimer:
Sumeet, do you have any additional data you can share? Thanks.

Sumeet Keswani:
No, we have not had the chance to debug it further. We believe it is, in some sense, a case of bug 11261 on steroids. Each arena actually holds on to 5x to 50x of the memory that is given to the application. After bug 19048 was fixed, all arenas started to demonstrate this bloat; consequently, the application quickly OOMs. We have been trying to manage it using MALLOC_TRIM_THRESHOLD_ and MALLOC_ARENA_MAX in some cases; in other cases we have had success using jemalloc. The dynamic mmap threshold does not help, since we do pretty large allocations from time to time, causing a lot of bloat. It is still on my plate to run a periodic malloc_stats to demonstrate the long-term algorithmic problem; I have not gotten to it yet. You may mark this as Need Info.

Created attachment 9559 [details]
glibc-issue
It has been hard to build an accurate reproducer, since our workloads are complicated. But this (attached) is typical of what we see.
(This is the RSS memory of three processes running on three different machines.)
The arenas hold far more memory than the process actually uses over long periods of time. The memory is almost never returned, and the process OOMs within a few days.
Note how calling malloc_trim() returns the memory back to the OS.
Hence we know that this memory is wasted in the arenas, because by calling malloc_trim() we can recover it. Something is algorithmically wrong with the way arenas are managed: they either do not use the free memory they have, or never return the top of the heap, or both.
With the cyclic arena fix (yellow), this problem actually became worse: the yellow line will OOM sooner than the blue and purple ones.
We have not been able to capture this problem in a proper reproducer. |
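For reference, the mitigations mentioned in comment 24 are configured through environment variables that glibc's malloc reads at process startup. A minimal sketch of setting them is below; the values are illustrative only, not recommendations, and "your_app" is a placeholder for the affected binary.

```shell
# Cap the number of malloc arenas and lower the trim threshold before
# starting the affected application.  Both are standard glibc malloc
# tunables (see mallopt(3)).
export MALLOC_ARENA_MAX=4             # instead of the default 8 * ncores on 64-bit
export MALLOC_TRIM_THRESHOLD_=131072  # return heap top to the OS past 128 KiB free
echo "MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX"
# ./your_app ...                      # placeholder for the real workload
```

Because these are read once at startup, they must be exported before the process is launched; they cannot be changed for an already-running process.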