This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug malloc/23416] Some allocation patterns lead to a runaway use of RSS


https://sourceware.org/bugzilla/show_bug.cgi?id=23416

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Mike Hommey from comment #0)
> This is the second time (previously: https://glandium.org/blog/?p=3698) that
> glibc's memory usage gets out of hand, except this time my program was even
> shot by the OOM killer. It is apparently caused by the allocation patterns,
> with small and increasingly large allocations being intertwined. Unlike the
> first time, this is now happening in a purely C process (no Python
> involved), and this time I went ahead and grabbed a full allocation trace
> using the tooling described in
> http://www.linuxplumbersconf.net/2016/ocw//system/presentations/3921/original/LPC%202016%20-%20linux%20and%20glibc_%20The%204.5TiB%20malloc%20API%20trace.pdf
> (Sadly, two years later this is still not in glibc master, and I had to find
> the right two-year-old branch.)
> 
> The trace is 1.8G, and by the time I stopped the process, it had consumed
> 10GB of RSS, while the actual memory requested is clearly less than 1.5G.
> Running the block_size_rss tool from
> https://pagure.io/glibc-malloc-trace-utils says the RSS should be about
> 1.2G. 10GB is clearly out of hand.
> 
> Compressed, the trace is much smaller (25M), but it's still too large for a
> bugzilla attachment. It can be downloaded from
> https://glandium.org/mtrace.out.21544.xz

Thanks for the trace.

My apologies that the tracer is not yet integrated upstream. It is on our list
of things to do, since the tracer is very useful when looking at these kinds of
issues. Even more useful would be a heap dumper that could show the state of
the heap, but that state can be reconstructed from the full trace.

I have downloaded the full trace you provided, unpacked it and run it through
some of my comparison scripts (https://pagure.io/glibc-malloc-trace-utils).

There is one very large allocation of 1073741844 bytes (roughly 1 GiB) at the
730th allocation. That is only about a tenth of the 10GB RSS you report, and it
happens only once across the full trace, so on its own it is not a big
contributor.

One problem I see:
00005428     free -------------- 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
00005428     free -------------- 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
00005428     free -------------- 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
00005428     free -------------- 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
00005428     free -------------- 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
00005428     free -------------- 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000

Is your application clearing a pointer before freeing it, and thus losing the
memory? From a quick grep I see over 600,000 such records, and we have not seen
any recording errors where a free would fail to record the data it saw.

You have a lot of free(NULL) calls; is this expected?
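
For illustration, a hypothetical pattern like the following (not taken from
your code, just a sketch of the suspected bug) would produce exactly these
records: the pointer is cleared before it reaches free, so the block leaks and
the allocator only ever sees free(NULL):

#include <stdlib.h>

struct cache { void *buf; };

/* Hypothetical sketch of the suspected anti-pattern: the pointer is
   cleared before it is handed to free, so the allocation leaks and the
   allocator (and therefore the tracer) only ever records free(NULL).  */
static void
release (struct cache *c)
{
  c->buf = NULL;   /* Pointer cleared first ...                    */
  free (c->buf);   /* ... so this frees NULL and the block leaks.  */
  /* Correct order: free (c->buf); c->buf = NULL;  */
}

int
main (void)
{
  struct cache c = { malloc (4096) };
  release (&c);    /* Trace records free(NULL); 4096 bytes are lost.  */
  return 0;
}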

At this point I would turn on all my instrumentation in the trace_run.c
simulator to do the following:

* Get a step-by-step tally of demand e.g. what the API requested.
* Get a step-by-step view of process RSS e.g. what the process was using.

If those two curves diverge, then there is a problem in the allocator;
otherwise it is simply increased RSS driven by increased demand. Unfortunately
this requires my verbose trace patches for the simulator, which I don't have
readily at hand.
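
For reference, here is a rough, self-contained sketch of what that
instrumentation measures (this is not the actual trace_run.c patch, just an
illustration): demand is the running total of bytes requested through the API,
and RSS is sampled from /proc/self/statm after each request.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static size_t demand;   /* Running total of bytes requested via the API.  */

/* Sample the resident set size in bytes from /proc/self/statm (Linux);
   the second field is the number of resident pages.  */
static size_t
rss_bytes (void)
{
  long pages = 0;
  FILE *f = fopen ("/proc/self/statm", "r");
  if (f != NULL)
    {
      if (fscanf (f, "%*ld %ld", &pages) != 1)
        pages = 0;
      fclose (f);
    }
  return (size_t) pages * (size_t) sysconf (_SC_PAGESIZE);
}

/* What the simulator would do around each malloc call: record the demand
   and print it next to the current RSS.  A full version would also
   subtract sizes on free to track live demand.  */
static void *
traced_malloc (size_t size)
{
  void *p = malloc (size);
  if (p != NULL)
    demand += size;
  printf ("demand=%zu rss=%zu\n", demand, rss_bytes ());
  return p;
}

int
main (void)
{
  for (size_t i = 1; i <= 8; i++)
    traced_malloc (i * 256 * 1024);   /* Increasingly large requests.  */
  return 0;
}

If the two columns track each other, the allocator is simply following demand;
if RSS keeps climbing while demand stays flat, the allocator is retaining
memory it should be returning.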

We do plan to come back to this infrastructure this year to start looking at
RSS issues. Keep in mind that our tcache work was performance-focused, and now
we need to come back and look at RSS.

