| Summary: | memalign allocations are often not reused after free | | |
|---|---|---|---|
| Product: | glibc | Reporter: | Kirill Korotaev <dev> |
| Component: | malloc | Assignee: | dj <dj> |
| Status: | ASSIGNED | | |
| Severity: | enhancement | CC: | acoplan, bugdal, carlos, fweimer, mail, mika.fischer, mirh, nsz, ondra |
| Priority: | P2 | Flags: | fweimer: security- |
| Version: | 2.15 | | |
| Target Milestone: | --- | | |
| See Also: | https://bugzilla.redhat.com/show_bug.cgi?id=843478, https://sourceware.org/bugzilla/show_bug.cgi?id=27103 | | |
| Host: | | Target: | |
| Build: | | Last reconfirmed: | |
| Attachments: | improved test that reports final heap; Example case for image processing use-case | | |
Description
Kirill Korotaev, 2012-09-14 09:41:12 UTC
Is this supposed to be a bug report or a trick question for CS students? :) You claimed "there is no heap fragmentation", but the allocation/free pattern you're performing seems like a classic fragmentation stress test pattern. There is no general-purpose allocation strategy that can avoid all instances of pathological fragmentation, and this pattern is one that would be especially hard to avoid without having special-cased it (and probably making malloc behavior much worse for common real-world cases). Did you run into this issue in a real-world application, or did you construct this test as a worst case?

By the way, the pathological fragmentation observed here does not seem specific to glibc. It seems like it will occur with any allocator utilizing the basic dlmalloc strategy. I just confirmed that the same issue happens with ours in musl libc.

> Is this supposed to be a bug report or a trick question for CS students? :) You
> claimed "there is no heap fragmentation", but the allocation/free pattern
> you're performing seems like a classic fragmentation stress test pattern.

I guess you haven't looked closely at the code and the result... :(

1. Just print the object addresses and the allocated VMAs (using /proc/self/maps) and you will find HUGE holes of unused RAM.
2. Do you realize that the test NEVER has more than 100MB allocated, while RSS grows without bound?

> There is no general-purpose allocation strategy that can avoid all instances of
> pathological fragmentation, and this pattern is one that would be especially
> hard to avoid without having special-cased it (and probably making malloc
> behavior much worse for common real-world cases).

No comment; see above.

> Did you run into this issue in a real-world application, or did you construct this test as a worst case?

Yes, we hit this in a real-life application. A very simple one: an application processing external requests from the network. When a message arrives, you allocate 556 bytes for the message content to read from the socket. Then you need 68KB to process the request. And then you free both. See the result.

> By the way, the pathological fragmentation observed here does not seem specific
> to glibc. It seems like it will occur with any allocator utilizing the basic
> dlmalloc strategy. I just confirmed that the same issue happens with ours in musl libc.

This pattern doesn't lead to unbounded usage with TCMalloc, a buddy allocator, or the original ptmalloc. See my next comment. The bug is in memalign.

One more finding: the problem is not reproducible if memalign is replaced with malloc() (even if you add padding as memalign does). Comment from ANK on this:

Interesting. memalign() in glibc is very simple: it is just malloc(size + alignment + some_small_number), with the head and tail then returned to the pool. Probably this is the culprit. Long shot: when the result of memalign is freed, malloc does not merge the fragments, so the hole consists of three free parts (head, body, tail) and is unusable for a future memalign.

Very interesting. I'll have to look at the issue in more detail. It doesn't seem to be the memalign fragment-merging issue alone causing this, because it also happens on at least one dlmalloc-style allocator that never allows adjacent non-merged chunks to exist.

Created attachment 6634 [details]
improved test that reports final heap
This version of the test semi-portably reports the state of the heap after the fragmentation loop. Most of the gaps are too small to ever be used (natural fragmentation), but there are a number of larger gaps with sizes in the megabyte range. I believe they're also just an artifact of normal fragmentation, but I haven't investigated enough yet to be confident in that claim.
Running the program multiple times with ASLR disabled and different iteration counts would allow one to diff the output between any iterations N and N+K to get a clearer idea of what's happening.
I'm still not convinced that memory growth is unbounded, just that fragmentation is very bad...
Could you explain what you mean by "if you print virtual addresses of allocated objects and kernel VMAs, i.e. you will find a huge unused memory extents which are never reused by glibc"? I'm not aware of any way the kernel VMA data could inform you about heap utilization, which is entirely under userspace control.

I did a simulation of your test loop with much smaller sizes using pen and paper, and with SSIZE=2, ALIGN=8, LSIZE=5, NS=[many], NL=4:

3z: SS LLLLLSS LLLLLSS LLLLLSS LLLLL
4a: SS SS LLLLLSS LLLLLSS LLLLL
4z: SSSS SS LLLLLSS LLLLLSS LLLLLLLLLL
5a: SSSS SS SS LLLLLSS LLLLLLLLLL
5z: SSSSSS SS LLLLLSS LLLLLSS LLLLLLLLLL
6a: SSSSSS SS LLLLLSS SS LLLLLLLLLL
6z: SSSSSSSSSS LLLLLSS LLLLLSS LLLLLLLLLL
7a: SSSSSSSSSS LLLLLSS LLLLLSS LLLLL
7z: SSSSSSSSSS LLLLLSS LLLLLSSSS LLLLLLLLLL
...

where 3z means "at the end of iteration 3" and 4a means "after the free steps of iteration 4", etc. I might have gotten some details wrong, but it seems this pattern necessarily enforces fragmentation by destroying the alignment potential of each LSIZE-sized free range. Obviously there are some allocation strategies that avoid the issue, but I don't see it being avoidable in dlmalloc-type schemes in general. If you have an idea for how to avoid it without destroying the good properties of the allocator strategy, please share.

(In reply to comment #6)
> Could you explain what you mean by "if you print virtual addresses of allocated
> objects and kernel VMAs, i.e. you will find a huge unused memory extents which
> are never reused by glibc"? I'm not aware of any way the kernel VMA data could
> inform you about heap utilization, which is entirely under userspace control.

It's very simple. There is RSS and VSZ in /proc/pid/status. RSS tells you how much physical memory was really allocated by the kernel. If you add a memset() of the objects after they are allocated, you will find that it's really 700MB, which corresponds to VSZ as well, i.e. this memory is committed.

> I did a simulation of your test loop with much smaller sizes using pen and
> paper, and with SSIZE=2, ALIGN=8, LSIZE=5, NS=[many], NL=4:
>
> 3z: SS LLLLLSS LLLLLSS LLLLLSS LLLLL
> ...
>
> where 3z means "at the end of iteration 3" and 4a means "after the free steps
> of iteration 4", etc. I might have gotten some details wrong, but it seems this
> pattern necessarily enforces fragmentation by destroying the alignment
> potential of each LSIZE-sized free range.

The first 500 iterations are not that interesting, since they do not free any previously allocated objects. Have you noticed that the array index wraps after NL and NS iterations have passed, and then the most interesting part begins?

> Obviously there are some allocation strategies that avoid the issue, but I
> don't see it being avoidable in dlmalloc-type schemes in general. If you have
> an idea for how to avoid it without destroying the good properties of the
> allocator strategy, please share.

Looks like I'm starting to realize what you mean... Actually, theoretically, any allocator should never allocate more than 2*allocated_size of physical RAM due to fragmentation on a pattern like this, right? (It's simple: if you have allocated more than 2x, this means you have unused holes bigger than any single object and could allocate from them...) In our case we see about a 10x ratio... And there are many allocators which behave like that: TCMalloc, buddy, etc. What is not natural in this test is that replacing memalign with malloc() fixes the problem...

> It's very simple. There is RSS and VSZ in /proc/pid/status. RSS tells you how
> much physical memory was really allocated by the kernel. If you add a memset()
> of the objects after they are allocated, you will find that it's really 700MB,
> which corresponds to VSZ as well, i.e. this memory is committed.

Of course, but it can't show you the gaps in the heap, only the total size of the heap.

> The first 500 iterations are not that interesting, since they do not free any
> previously allocated objects. Have you noticed that the array index wraps after
> NL and NS iterations have passed, and then the most interesting part begins?

That's why my experiment on paper had NL=4: to see quickly what happens after the index wraps.

> Actually, theoretically, any allocator should never allocate more than
> 2*allocated_size of physical RAM due to fragmentation on a pattern like this,
> right?

No, the theoretical limit is many orders of magnitude worse, especially with alignment constraints. Picture wanting to allocate an object of size K, but with N objects of size 1 spaced evenly every K-1 units. In this case you have N*(K-1) units of memory "free", but unable to accommodate the size-K object, thus requiring new heap space for it. This fragmentation can grow unboundedly; N can be made arbitrarily large. Also, the size-1 objects can be spaced even farther apart and still block out the allocation of a size-K object if the latter has alignment requirements. I think your test case is one situation where the alignment issue matters.

This is not a bug; the allocator is working as designed and does not claim to be immune to fragmentation. If there are better strategies to reduce fragmentation, then patches to implement them are welcome on libc-alpha. It is a valid corner case, so it will stay open until resolved.

Rich,
The allocation happens to be 65536 bytes with 4096 alignment. This is not a pathological case: if you allocate 65536+4096 bytes and pick an aligned address, a modified version of the bound mentioned should hold. The bound says that we must not allocate more than twice the usable memory, but usable memory is taken as the maximum over program history.
With alignment, the factor in this example should be less than 2*(65536+4096)/65536.

The Ruby allocator calls posix_memalign (16384, 16344) extensively, so we need to fix this. One way to fix this would be to attribute the alignment gap to the allocation and not return it to the allocator for independent reuse. This way, a freed aligned block will have a sufficiently large size that it can be split again for the next aligned allocation.

Since last looking at this issue, I've been thinking about allocator designs to avoid fragmentation, and Florian's latest idea in comment 11 matches my current view on what's likely to be best. However, it does seem to waste up to half of the memory consumed (i.e. up to 100% overhead) for the duration that it's allocated. This is probably still the right tradeoff when the alternative is permanent-ish fragmentation for the process lifetime, but maybe there are ways to make it better/less costly.

(In reply to Rich Felker from comment #12)
> Since last looking at this issue, I've been thinking about allocator designs
> to avoid fragmentation, and Florian's latest idea in comment 11 matches my
> current view on what's likely to be best. However, it does seem to waste up
> to half of the memory consumed (i.e. up to 100% overhead) for the duration
> that it's allocated.

Right. I thought it would be possible to avoid this, but it doesn't look like it is.

> This is probably still the right tradeoff when the alternative is
> permanent-ish fragmentation for the process lifetime, but maybe there are
> ways to make it better/less costly.

Other approaches would be to search for a smaller, sufficiently aligned allocation. This might not be so bad with the current glibc allocator and its unsorted bins. Or we could use a separate allocator for aligned allocations, with separate freelists for different alignments and freelist bins sorted by size and alignment. However, this will prevent us from reusing memory which has been used for aligned allocations for unaligned ones.

(In reply to Florian Weimer from comment #13)
> Other approaches would be to search for a smaller, sufficiently aligned
> allocation. This might not be so bad with the current glibc allocator and
> its unsorted bins.

We should test this approach, but I suspect keeping the pre-gap, allocation, and post-gap together in some way will yield the best fragmentation-reducing results. This is just a hunch, though; we should continue to use the trace, simulation, and other tools we have to derive a data-driven approach to making changes in malloc. Then we should validate the changes to show that the model we have predicts real behaviour in applications, e.g. re-run fluentd + ruby with the changes and see if we managed to make an RSS difference (whether fluentd suffers from this problem, we don't know).

> Or we could use a separate allocator for aligned allocations, with separate
> freelists for different alignments and freelist bins sorted by size and
> alignment. However, this will prevent us from reusing memory which has been
> used for aligned allocations for unaligned ones.

This is definitely out-of-the-box thinking, and takes us down the path of heaps with specific properties. I was thinking about this the other day, but need to re-review the literature on this topic. I think the best approach in terms of value for the design is going to be grouping the aligned allocations together.

Created attachment 11763 [details]
Example case for image processing use-case
We hit this issue (or at least I assume it is this issue) in production.
Our use-case is that we have a loop with:
- two large aligned allocations (large temporary images)
- followed by a small unaligned allocation (the result)
- the large allocations are freed, the small result is kept
For some sizes (see attached program) this leads to glibc never freeing any memory! The attached program demonstrates the issue. It runs out of memory in seconds...
(In reply to Mika Fischer from comment #15)
> Created attachment 11763 [details]
> Example case for image processing use-case
>
> We hit this issue (or at least I assume it is this issue) in production.
>
> Our use-case is that we have a loop with:
> - two large aligned allocations (large temporary images)
> - followed by a small unaligned allocation (the result)
> - the large allocations are freed, the small result is kept
>
> For some sizes (see attached program) this leads to glibc never freeing any
> memory! The attached program demonstrates the issue. It runs out of memory
> in seconds...

Yes, it's the same case. In your example the 16MiB aligned block needs, worst case, 16MiB + alignment + MIN_SIZE, but memalign, after getting back that block, will split the leader and trailer from the block and reuse them. If a long-lived allocation takes one of the leader/remainder blocks, then we can no longer find enough space on a subsequent memalign to reallocate the 16MiB + alignment + MIN_SIZE block, and we end up having to extend the heap again by 16MiB + alignment + MIN_SIZE. This continues every iteration, growing the heap by sizes which are just shy of the needed space.

Your only workaround is the usual trick:

GLIBC_TUNABLES=glibc.malloc.mmap_threshold=10485760 strace -ff -ttt -o loop2.log ./memalign-loop

which avoids the dynamic thresholding that is moving the allocation from mmap to the brk heap.

Looking over the code again, I think the problem is really this:

4698 /* Call malloc with worst case padding to hit alignment. */
4699
4700 m = (char *) (_int_malloc (av, nb + alignment + MINSIZE));
4701

We need to avoid this pessimistic allocation and scan the bins ourselves, looking for a suitably aligned allocation. It's there, and ready to be used, but not if we lazily use _int_malloc instead of scanning the bins ourselves first. I think such a scan will yield the best results overall and be straightforward to implement.
The Fedora glibc team is now testing a patch in Fedora Rawhide that adds aligned-chunk scanning to memalign to reduce external fragmentation:
https://lists.fedoraproject.org/archives/list/glibc@lists.fedoraproject.org/thread/2PCHP5UWONIOAEUG34YBAQQYD7JL5JJ4/
If the existing chunks are scanned to look for a sufficiently aligned chunk, this should go a long way toward reusing those chunks after free for similarly sized and similarly aligned requests.

The Fedora patch was backed out hours after the last comment here, but there's now new steam behind it:
https://patchwork.sourceware.org/project/glibc/patch/xn4jz19fts.fsf@greed.delorie.com/

(In reply to mirh from comment #18)
> The Fedora patch was backed out hours after the last comment here, but
> there's now new steam behind it.
> https://patchwork.sourceware.org/project/glibc/patch/xn4jz19fts.fsf@greed.delorie.com/

Yes, the original Fedora patch was incomplete. I reverted it and set it aside as we worked on other things. We're coming back to this work again to improve the reuse of aligned allocations.

Just a note that Ruby, which was a driving use case for fixing aligned allocation reuse, has entirely switched to mmap to allocate the aligned pages it needs for internal uses. While we still want to fix this issue (it's really bad external fragmentation), Ruby is no longer a specific problem. If you get a chance, please feel free to try out DJ's latest patch and comment.

I think I ran into this or a similar issue, but the proposed patch "[v4,1/1] memalign: Support scanning for aligned chunks." does not seem to help. Example: 100 iterations where each iteration leaks 16+64 bytes and everything else is freed, yet every iteration seems to increase the heap by at least 3231744 bytes. With GLIBC_TUNABLES=glibc.malloc.mmap_threshold=600000 there is no issue (large allocations turn into mmap/munmap pairs).
#include <stdlib.h>
#include <string.h>

static void *volatile q;
static void use(void *p) { q = p; }

int main(int argc, char *argv[])
{
    void *p1;
    void *p2;
    void *p3;
    for (int i = 0; i < 100; i++) {
        p1 = aligned_alloc(512, 1176064);
        memset(p1, 0, 1176064);
        p2 = malloc(16);
        p3 = malloc(64);
        use(p1); use(p2); use(p3);
        free(p1);
        p1 = aligned_alloc(256, 646656);
        p2 = aligned_alloc(1024, 3232768);
        memset(p2, 0, 3232768);
        p3 = malloc(128);
        use(p1); use(p2); use(p3);
        free(p2);
        free(p3);
        free(p1);
    }
}

(In reply to Szabolcs Nagy from comment #20)
> I think I ran into this or a similar issue, but the proposed patch "[v4,1/1]
> memalign: Support scanning for aligned chunks." does not seem to help.

The patch in question is here:
https://patchwork.sourceware.org/project/glibc/patch/xn4jz19fts.fsf@greed.delorie.com/
I pinged DJ to look at the test case. Thanks for putting something together.

(In reply to Szabolcs Nagy from comment #20)
> Example: 100 iterations where each iteration leaks 16+64 bytes and everything
> else is freed, yet every iteration seems to increase the heap by at least
> 3231744 bytes. With GLIBC_TUNABLES=glibc.malloc.mmap_threshold=600000 there
> is no issue (large allocations turn into mmap/munmap pairs).

I did a bunch of testing with this example, but didn't see continued growth. Typically, after a few loops, malloc learned to use one of the large chunks for the 16/64 allocations and to reuse all the others. I tried with and without tcache and fastbins in case those made a difference. Note: this is *without* my memalign patch.
I tried on both aarch64 and x86_64 upstream master, and if I add vmstat() at the very end of main, with

#include <stdio.h>

static void vmstat(void)
{
    char buf[1000];
    FILE *f = fopen("/proc/self/status", "r");
    while (fgets(buf, sizeof buf, f))
        if (buf[0] == 'V')
            fputs(buf, stdout);
    fclose(f);
}

then it prints

VmPeak:   316888 kB
VmSize:   316888 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    314636 kB
VmRSS:    314636 kB
VmData:   314652 kB
VmStk:       132 kB
VmExe:         8 kB
VmLib:      1480 kB
VmPTE:       660 kB
VmSwap:        0 kB

i.e. >300MB is used (if compiled with -O1 then the memset is avoided, so RSS is small, but VmPeak is still big). I also see >100 brk calls in strace. I don't know what might be different in my environment.

(In reply to dj@redhat.com from comment #22)
> I did a bunch of testing with this example, but didn't see continued growth.
> Typically, after a few loops, malloc learned to use one of the large chunks

OK, after some tries it seems I don't get large VM usage on x86_64 with static linking, but if I add

use(malloc(12345));

at the beginning of main, then I do. So it seems the initial heap state matters (which can be different between static/dynamic linking, or depend on the environment).

Hmmm... so it seems this is reproducible if tcache+fastbins are enabled, and partially reproducible if only fastbins are. If both are disabled via tunables, it shows no growth. I think this is a general side effect of things in tcache/fastbins/unsorted not allowing coalescing.
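For anyone wanting to repeat the last experiment, disabling tcache and fastbins can be attempted via tunables. This is a sketch: the binary name ./repro is a placeholder, and the tunable names are taken from the glibc manual (tcache_count requires glibc >= 2.26, mxfast a later release), so check your version's documentation.

```shell
# tcache_count=0 disables the per-thread cache; mxfast=0 disables fastbins,
# so freed chunks become eligible for coalescing immediately.
GLIBC_TUNABLES=glibc.malloc.tcache_count=0:glibc.malloc.mxfast=0 ./repro
```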