| Summary: | memalign allocations are often not reused after free | | |
|---|---|---|---|
| Product: | glibc | Reporter: | Kirill Korotaev <dev> |
| Component: | malloc | Assignee: | dj <dj> |
| Status: | ASSIGNED | | |
| Severity: | enhancement | CC: | acoplan, bugdal, carlos, fweimer, mail, mika.fischer, mirh, nsz, ondra |
| Priority: | P2 | Flags: | fweimer: security- |
| Version: | 2.15 | | |
| Target Milestone: | --- | | |
| See Also: | https://bugzilla.redhat.com/show_bug.cgi?id=843478, https://sourceware.org/bugzilla/show_bug.cgi?id=27103 | | |
| Host: | | Target: | |
| Build: | | Last reconfirmed: | |
| Attachments: | improved test that reports final heap; Example case for image processing use-case | | |
Description
Kirill Korotaev, 2012-09-14 09:41:12 UTC
Is this supposed to be a bug report or a trick question for CS students? :) You claimed "there is no heap fragmentation", but the allocation/free pattern you're performing seems like a classic fragmentation stress test pattern. There is no general-purpose allocation strategy that can avoid all instances of pathological fragmentation, and this pattern is one that would be especially hard to avoid without having special-cased it (and probably making malloc behavior much worse for common real-world cases). Did you run into this issue in a real-world application, or did you construct this test as a worst case?

By the way, the pathological fragmentation observed here does not seem specific to glibc. It seems like it will occur with any allocator utilizing the basic dlmalloc strategy. I just confirmed that the same issue happens with ours in musl libc.

> Is this supposed to be a bug report or a trick question for CS students? :) You
> claimed "there is no heap fragmentation", but the allocation/free pattern
> you're performing seems like a classic fragmentation stress test pattern.

I guess you haven't looked closely at the code and the result... :(

1. Just print the object addresses and the allocated VMAs (using /proc/self/maps) and you will find HUGE holes of unused RAM.
2. Do you realize that the test NEVER has more than 100MB allocated, while RSS grows without bound?

> There is no general-purpose allocation strategy that can avoid all instances of
> pathological fragmentation, and this pattern is one that would be especially
> hard to avoid without having special-cased it (and probably making malloc
> behavior much worse for common real-world cases).

No comment; see above.

> Did you run into this issue in a real-world application, or did you construct this test as a worst case?

Yes, we hit this in a real-life application. A very simple one: an application processing external requests from the network. When a message arrives, you allocate 556 bytes for the message content to read from the socket. Then you need 68KB to process the request. And then you free both. See the result.

> By the way, the pathological fragmentation observed here does not seem specific
> to glibc. It seems like it will occur with any allocator utilizing the basic
> dlmalloc strategy. I just confirmed that the same issue happens with ours in musl libc.

This pattern doesn't lead to unbounded usage with TCMalloc, a buddy allocator, or the original ptmalloc. See my next comment. The bug is in memalign.

One more finding: the problem is not reproducible if memalign is replaced with malloc() (even if you add padding as memalign does). Comment from ANK on this:

Interesting. memalign() in glibc is very simple: it is just malloc(size + alignment + some_small_number), with the head and tail then returned to the pool. Probably this is the culprit. Long shot: when the result of memalign is freed, malloc does not merge the fragments, so the hole consists of three free parts (head, body, tail) and is unusable for a future memalign.

Very interesting. I'll have to look at the issue in more detail. It doesn't seem to be the memalign fragment-merging issue alone causing this, because it also happens on at least one dlmalloc-style allocator that never allows adjacent non-merged chunks to exist.

Created attachment 6634 [details]
improved test that reports final heap
This version of the test semi-portably reports the state of the heap after the fragmentation loop. Most of the gaps are too small to ever be used (natural fragmentation), but there are a number of larger gaps with sizes in the megabyte range. I believe they're also just an artifact of normal fragmentation, but I haven't investigated enough yet to be confident in that claim.
Running the program multiple times with ASLR disabled and different iteration counts would allow one to diff the output between any iterations N and N+K to get a clearer idea of what's happening.
I'm still not convinced that memory growth is unbounded, just that fragmentation is very bad...
Could you explain what you mean by "if you print virtual addresses of allocated objects and kernel VMAs, i.e. you will find a huge unused memory extents which are never reused by glibc"? I'm not aware of any way the kernel VMA data could inform you about heap utilization, which is entirely under userspace control.

I did a simulation of your test loop with much smaller sizes using pen and paper, and with SSIZE=2, ALIGN=8, LSIZE=5, NS=[many], NL=4:

3z: SS LLLLLSS LLLLLSS LLLLLSS LLLLL
4a: SS SS LLLLLSS LLLLLSS LLLLL
4z: SSSS SS LLLLLSS LLLLLSS LLLLLLLLLL
5a: SSSS SS SS LLLLLSS LLLLLLLLLL
5z: SSSSSS SS LLLLLSS LLLLLSS LLLLLLLLLL
6a: SSSSSS SS LLLLLSS SS LLLLLLLLLL
6z: SSSSSSSSSS LLLLLSS LLLLLSS LLLLLLLLLL
7a: SSSSSSSSSS LLLLLSS LLLLLSS LLLLL
7z: SSSSSSSSSS LLLLLSS LLLLLSSSS LLLLLLLLLL
...

where 3z means "at the end of iteration 3" and 4a means "after the free steps of iteration 4", etc. I might have gotten some details wrong, but it seems this pattern necessarily enforces fragmentation by destroying the alignment potential of each LSIZE-sized free range. Obviously there are some allocation strategies that avoid the issue, but I don't see it being avoidable in dlmalloc-type schemes in general. If you have an idea for how to avoid it without destroying the good properties of the allocator strategy, please share.

(In reply to comment #6)
> Could you explain what you mean by "if you print virtual addresses of allocated
> objects and kernel VMAs, i.e. you will find a huge unused memory extents which
> are never reused by glibc"? I'm not aware of any way the kernel VMA data could
> inform you about heap utilization, which is entirely under userspace control.

It's very simple. There is RSS and VSZ in /proc/pid/status. RSS tells you how much physical memory was really allocated by the kernel. If you add a memset() of the objects after they are allocated, you will find that it's really 700MB, which corresponds to VSZ as well, i.e. this memory is committed.

> I did a simulation of your test loop with much smaller sizes using pen and
> paper, and with SSIZE=2, ALIGN=8, LSIZE=5, NS=[many], NL=4:
>
> 3z: SS LLLLLSS LLLLLSS LLLLLSS LLLLL
> ...
>
> where 3z means "at the end of iteration 3" and 4a means "after the free steps
> of iteration 4", etc. I might have gotten some details wrong, but it seems this
> pattern necessarily enforces fragmentation by destroying the alignment
> potential of each LSIZE-sized free range.

The first 500 iterations are not that interesting, since they do not free any previously allocated objects. Have you noticed that the array index wraps after NL and NS iterations have passed, and then the most interesting part begins?

> Obviously there are some allocation strategies that avoid the issue, but I
> don't see it being avoidable in dlmalloc-type schemes in general. If you have
> an idea for how to avoid it without destroying the good properties of the
> allocator strategy, please share.

Looks like I'm starting to realize what you mean... Actually, theoretically, any allocator should never allocate more than 2*allocated_size of physical RAM due to fragmentation on a pattern like this, right? (It's simple: if you have allocated more than 2x, this means you have unused holes bigger than any single object and could allocate from them...) In our case we see about a 10x ratio... And there are many allocators which behave like that: TCMalloc, buddy, etc. What is not natural in this test is that replacing memalign with malloc() fixes the problem...

> It's very simple. There is RSS and VSZ in /proc/pid/status. RSS tells you how
> much physical memory was really allocated by the kernel. If you add a memset()
> of the objects after they are allocated, you will find that it's really 700MB,
> which corresponds to VSZ as well, i.e. this memory is committed.

Of course, but it can't show you the gaps in the heap, only the total size of the heap.

> The first 500 iterations are not that interesting, since they do not free any
> previously allocated objects. Have you noticed that the array index wraps after
> NL and NS iterations have passed, and then the most interesting part begins?

That's why my experiment on paper had NL=4: to see quickly what happens after the index wraps.

> Actually, theoretically, any allocator should never allocate more than
> 2*allocated_size of physical RAM due to fragmentation on a pattern like this,
> right?

No, the theoretical limit is many orders of magnitude worse, especially with alignment constraints. Picture wanting to allocate an object of size K, but with N objects of size 1 spaced evenly every K-1 units. In this case you have N*(K-1) units of memory "free", but unable to accommodate the size-K object, thus requiring new heap space for it. This fragmentation can grow unboundedly; N can be made arbitrarily large. Also, the size-1 objects can be spaced even farther apart and still block out the allocation of a size-K object if the latter has alignment requirements. I think your test case is one situation where the alignment issue matters.

This is not a bug; the allocator is working as designed and does not claim to be immune to fragmentation. If there are better strategies to reduce fragmentation, then patches to implement them are welcome on libc-alpha. It is a valid corner case, so it will stay open until resolved.

Rich,
The allocation happens to be 65536 bytes with 4096 alignment. This is not a pathological case: if you allocate 65536+4096 bytes and pick an aligned address, a modified version of the bound mentioned should hold. The bound says that we must not allocate more than twice the usable memory, but usable memory is taken as the maximum over program history.
With alignment, the factor in this example should be less than 2*(65536+4096)/65536.

The Ruby allocator calls posix_memalign (16384, 16344) extensively, so we need to fix this. One way to fix this would be to attribute the alignment gap to the allocation and not return it to the allocator for independent reuse. This way, a freed aligned block will have a sufficiently large size that it can be split again for the next aligned allocation.

Since last looking at this issue, I've been thinking about allocator designs to avoid fragmentation, and Florian's latest idea in comment 11 matches my current view on what's likely to be best. However, it does seem to waste up to half of the memory consumed (i.e. up to 100% overhead) for the duration that it's allocated. This is probably still the right tradeoff when the alternative is permanent-ish fragmentation for the process lifetime, but maybe there are ways to make it better/less costly.

(In reply to Rich Felker from comment #12)
> Since last looking at this issue, I've been thinking about allocator designs
> to avoid fragmentation, and Florian's latest idea in comment 11 matches my
> current view on what's likely to be best. However, it does seem to waste up
> to half of the memory consumed (i.e. up to 100% overhead) for the duration
> that it's allocated.

Right. I thought it would be possible to avoid this, but it doesn't look like it is.

> This is probably still the right tradeoff when the alternative is
> permanent-ish fragmentation for the process lifetime, but maybe there are
> ways to make it better/less costly.

Other approaches would be to search for a smaller, sufficiently aligned allocation. This might not be so bad with the current glibc allocator and its unsorted bins. Or we could use a separate allocator for aligned allocations, with separate freelists for different alignments and freelist bins sorted by size and alignment. However, this will prevent us from reusing memory which has been used for aligned allocations for unaligned ones.

(In reply to Florian Weimer from comment #13)
> Other approaches would be to search for a smaller, sufficiently aligned
> allocation. This might not be so bad with the current glibc allocator and
> its unsorted bins.

We should test this approach, but I suspect keeping the pre-gap, allocation, and post-gap together in some way will yield the best fragmentation-reducing results. This is just a hunch, though; we should continue to use the trace, simulation, and other tools we have to derive a data-driven approach to making changes in malloc. Then we should validate the changes to show that the model we have predicts real behaviour in applications, e.g. re-run fluentd + ruby with the changes and see if we managed to make an RSS difference (whether fluentd suffers from this problem, we don't know).

> Or we could use a separate allocator for aligned allocations, with separate
> freelists for different alignments and freelist bins sorted by size and
> alignment. However, this will prevent us from reusing memory which has been
> used for aligned allocations for unaligned ones.

This is definitely out-of-the-box thinking, and takes us down the path of heaps with specific properties. I was thinking about this the other day, but need to re-review the literature on this topic. I think the best approach in terms of value for the design is going to be grouping the aligned allocations together.

Created attachment 11763 [details]
Example case for image processing use-case
We hit this issue (or at least I assume it is this issue) in production.
Our use-case is that we have a loop with:
- two large aligned allocations (large temporary images)
- followed by a small unaligned allocation (the result)
- the large allocations are freed, the small result is kept
For some sizes (see attached program) this leads to glibc never freeing any memory! The attached program demonstrates the issue. It runs out of memory in seconds...
(In reply to Mika Fischer from comment #15)
> Created attachment 11763 [details]
> Example case for image processing use-case
>
> We hit this issue (or at least I assume it is this issue) in production.
>
> Our use-case is that we have a loop with:
> - two large aligned allocations (large temporary images)
> - followed by a small unaligned allocation (the result)
> - the large allocations are freed, the small result is kept
>
> For some sizes (see attached program) this leads to glibc never freeing any
> memory! The attached program demonstrates the issue. It runs out of memory
> in seconds...

Yes, it's the same case. In your example the 16MiB aligned block needs, worst case, 16MiB + alignment + MIN_SIZE, but memalign, after getting back that block, will split the leader and trailer from the block and reuse them. If a long-lived allocation takes one of the leader/remainder blocks, then we can no longer find enough space on a subsequent memalign to reallocate the 16MiB + alignment + MIN_SIZE block, and we end up having to extend the heap again by 16MiB + alignment + MIN_SIZE. This continues every iteration, growing the heap by sizes which are just shy of the needed space.

Your only workaround is the usual trick:

GLIBC_TUNABLES=glibc.malloc.mmap_threshold=10485760 strace -ff -ttt -o loop2.log ./memalign-loop

which avoids the dynamic thresholding that is moving the allocation from mmap to the brk heap.

Looking over the code again, I think the problem is really this:

4698 /* Call malloc with worst case padding to hit alignment. */
4699
4700 m = (char *) (_int_malloc (av, nb + alignment + MINSIZE));
4701

We need to avoid this pessimistic allocation and scan the bins ourselves, looking for a suitably aligned allocation. It's there, and ready to be used, but not if we lazily use _int_malloc instead of scanning the bins ourselves first. I think such a scan will yield the best results overall and be straightforward to implement.
The Fedora glibc team is now testing a patch in Fedora Rawhide that adds aligned-chunk scanning to memalign to reduce external fragmentation:
https://lists.fedoraproject.org/archives/list/glibc@lists.fedoraproject.org/thread/2PCHP5UWONIOAEUG34YBAQQYD7JL5JJ4/
If the existing chunks are scanned to look for a sufficiently aligned chunk, this should go a long way toward reusing those chunks after free for similarly sized and similarly aligned requests.

The Fedora patch was backed out hours after the last comment here, but there's now new steam behind it:
https://patchwork.sourceware.org/project/glibc/patch/xn4jz19fts.fsf@greed.delorie.com/

(In reply to mirh from comment #18)
> The Fedora patch was backed out hours after the last comment here, but
> there's now new steam behind it.
> https://patchwork.sourceware.org/project/glibc/patch/xn4jz19fts.fsf@greed.delorie.com/

Yes, the original Fedora patch was incomplete. I reverted it and set it aside as we worked on other things. We're coming back to this work again to improve the reuse of aligned allocations.

Just a note that Ruby, which was a driving use case for fixing aligned allocation reuse, has entirely switched to mmap to allocate the aligned pages it needs for internal uses. While we still want to fix this issue (it's really bad external fragmentation), Ruby is no longer a specific problem. If you get a chance, please feel free to try out DJ's latest patch and comment.

I think I ran into this or a similar issue, but the proposed patch "[v4,1/1] memalign: Support scanning for aligned chunks." does not seem to help. Example: 100 iterations where each iteration leaks 16+64 bytes and everything else is freed, yet every iteration seems to increase the heap by at least 3231744 bytes. With GLIBC_TUNABLES=glibc.malloc.mmap_threshold=600000 there is no issue (large allocations turn into mmap/munmap pairs).
#include <stdlib.h>
#include <string.h>

static void *volatile q;
static void use(void *p) { q = p; }

int main(int argc, char *argv[])
{
    void *p1;
    void *p2;
    void *p3;
    for (int i = 0; i < 100; i++) {
        p1 = aligned_alloc(512, 1176064);
        memset(p1, 0, 1176064);
        p2 = malloc(16);
        p3 = malloc(64);
        use(p1); use(p2); use(p3);
        free(p1);
        p1 = aligned_alloc(256, 646656);
        p2 = aligned_alloc(1024, 3232768);
        memset(p2, 0, 3232768);
        p3 = malloc(128);
        use(p1); use(p2); use(p3);
        free(p2);
        free(p3);
        free(p1);
    }
}

(In reply to Szabolcs Nagy from comment #20)
> I think I ran into this or a similar issue, but the proposed patch "[v4,1/1]
> memalign: Support scanning for aligned chunks." does not seem to help.

The patch in question is here:
https://patchwork.sourceware.org/project/glibc/patch/xn4jz19fts.fsf@greed.delorie.com/
I pinged DJ to look at the test case. Thanks for putting something together.

(In reply to Szabolcs Nagy from comment #20)
> Example: 100 iterations where each iteration leaks 16+64 bytes and everything
> else is freed, yet every iteration seems to increase the heap by at least
> 3231744 bytes. With GLIBC_TUNABLES=glibc.malloc.mmap_threshold=600000 there
> is no issue (large allocations turn into mmap/munmap pairs).

I did a bunch of testing with this example, but didn't see continued growth. Typically, after a few loops, malloc learned to use one of the large chunks for the 16/64 allocations and to reuse all the others. I tried with and without tcache and fastbins in case those made a difference. Note: this is *without* my memalign patch.
I tried on both aarch64 and x86_64 upstream master, and if I add vmstat() at the very end of main, with

#include <stdio.h>

static void vmstat(void)
{
    char buf[1000];
    FILE *f = fopen("/proc/self/status", "r");
    while (fgets(buf, sizeof buf, f))
        if (buf[0] == 'V')
            fputs(buf, stdout);
    fclose(f);
}

then it prints

VmPeak:   316888 kB
VmSize:   316888 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    314636 kB
VmRSS:    314636 kB
VmData:   314652 kB
VmStk:       132 kB
VmExe:         8 kB
VmLib:      1480 kB
VmPTE:       660 kB
VmSwap:        0 kB

i.e. >300MB is used (if compiled with -O1 then the memset is avoided, so RSS is small, but VmPeak is still big). I also see >100 brk calls in strace. I don't know what might be different in my environment.

(In reply to dj@redhat.com from comment #22)
> I did a bunch of testing with this example, but didn't see continued growth.
> Typically, after a few loops, malloc learned to use one of the large chunks

OK, after some tries it seems I don't get large VM usage on x86_64 with static linking, but if I add

use(malloc(12345));

at the beginning of main, then I do. So it seems the initial heap state matters (which can be different between static/dynamic linking, or depend on the environment).

Hmmm... so it seems this is reproducible if tcache+fastbins are enabled, and partially reproducible if only fastbins are. If both are disabled via tunables, it shows no growth. I think this is a general side effect of things in tcache/fastbins/unsorted not allowing coalescing.
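For anyone wanting to repeat the last experiment, disabling tcache and fastbins can be attempted via tunables. This is a sketch: the binary name ./repro is a placeholder, and the tunable names are taken from the glibc manual (tcache_count requires glibc >= 2.26, mxfast a later release), so check your version's documentation.

```shell
# tcache_count=0 disables the per-thread cache; mxfast=0 disables fastbins,
# so freed chunks become eligible for coalescing immediately.
GLIBC_TUNABLES=glibc.malloc.tcache_count=0:glibc.malloc.mxfast=0 ./repro
```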