posix_memalign performance regression in 2.38?
Florian Weimer
fweimer@redhat.com
Wed Aug 9 16:59:00 GMT 2023
* Florian Weimer:
> * Xi Ruoyao via Libc-alpha:
>
>> On Mon, 2023-08-07 at 23:38 -0400, DJ Delorie wrote:
>>>
>>> Reproduced.
>>>
>>> In the case where I reproduced it, the most common problematic case was
>>> an allocation of 64-byte aligned chunks of 472 bytes, where 30 smallbin
>>> chunks were tested without finding a match.
>>>
>>> The most common non-problematic case was a 64-byte-aligned request for
>>> 24 bytes.
>>>
>>> There were a LOT of other size requests. The smallest I saw was TWO
>>> bytes. WHY? I'm tempted to not fix this, to teach developers to not
>>> use posix_memalign() unless they REALLY need it ;-)
>>
>>
>> Have you tested this?
>>
>> $ cat t.c
>> #include <stdlib.h>
>> int main()
>> {
>>   void *buf;
>>   for (int i = 0; i < (1 << 16); i++)
>>     posix_memalign(&buf, 64, 64);
>> }
>>
>> To me this is quite reasonable (if we just want many blocks, each
>> fitting into a cache line), but it costs 17.7 seconds on my system.
>> Do you think people should just avoid this? If so, we at least need
>> to document the issue in the manual.
>
> This code doesn't work well for glibc malloc (and other dlmalloc-style
> mallocs), and never has. Even with glibc 2.37, it produces a heap
> layout like this:
>
> v: 64-byte allocation boundary (each character is 8 bytes wide)
> U: available user data
> u: unused userdata tail
> m: glibc metadata
> -: data available for allocation
>
> v v v v v v v v
> UUUUUUUUum--------------UUUUUUUUum--------------UUUUUUUUum
>
> This can be seen from the 192-byte increments in the returned
> pointers. The gaps are not wide enough for reuse, so that part is
> expected.
>
> However, we should not produce these gaps: with a clean heap, we
> split allocations from the remainder, so we should get this more
> compact layout instead:
>
> v v v v v v v v
> UUUUUUUUum------UUUUUUUUum------UUUUUUUUum------UUUUUUUUum
>
> It seems to me that this doesn't happen because we call _int_free to
> give back the unused memory, and _int_free puts it into the tcache or
> fastbins, so it does not make the memory available for consolidation.
> Eventually this memory is flushed to the low-level allocator, but by
> then it is too late because we already have another allocation 112
> bytes further on that blocks further consolidation. And of course
> none of these 112-byte chunks is suitably aligned for reuse.
>
> (Even the compact layout wastes 50% of memory, but at least it's better
> than what any glibc version produces today.)
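A minimal sketch to observe this stride from user space (illustrative
only; the exact numbers assume 64-bit glibc with default settings):

  #include <stdio.h>
  #include <stdlib.h>

  int
  main (void)
  {
    void *prev = NULL;
    for (int i = 0; i < 8; i++)
      {
        void *buf;
        if (posix_memalign (&buf, 64, 64) != 0)
          return 1;
        if (prev != NULL)
          /* On an affected heap this should print 192; the compact
             layout would give 128.  */
          printf ("stride: %td bytes\n", (char *) buf - (char *) prev);
        prev = buf;
      }
    return 0;
  }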
There's a second issue that makes this loop really sensitive to initial
heap alignment. In _int_memalign, we have this:
  /* Also give back spare room at the end */
  if (!chunk_is_mmapped (p))
    {
      size = chunksize (p);
      if ((unsigned long) (size) > (unsigned long) (nb + MINSIZE))
        {
          remainder_size = size - nb;
          remainder = chunk_at_offset (p, nb);
          set_head (remainder, remainder_size | PREV_INUSE |
                    (av != &main_arena ? NON_MAIN_ARENA : 0));
          set_head_size (p, nb);
          _int_free (av, remainder, 1);
        }
    }
The MINSIZE slack is necessary to avoid creating chunks of less than
MINSIZE bytes, which the allocator cannot deal with. But it also
prevents merging the tail with a following chunk that is unused
(including the top chunk).
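To make the arithmetic concrete, here is a toy model of that condition
(a sketch only, assuming 64-bit parameters: MINSIZE of 32, and an
internal chunk size nb of 80 for a 64-byte request):

  #include <stdio.h>

  /* Mirrors the split condition quoted above; not glibc code.  */
  enum { MY_MINSIZE = 32 };   /* assumed 64-bit MINSIZE */

  static void
  check (unsigned long size, unsigned long nb)
  {
    if (size > nb + MY_MINSIZE)
      printf ("chunk of %lu: split off %lu-byte remainder\n",
              size, size - nb);
    else
      printf ("chunk of %lu: keep %lu-byte tail attached\n",
              size, size - nb);
  }

  int
  main (void)
  {
    unsigned long nb = 80;   /* assumed chunk size for a 64-byte request */
    check (112, nb);         /* tail <= MINSIZE: no split possible */
    check (128, nb);         /* remainder split off, handed to _int_free */
    return 0;
  }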
Could someone who can reproduce this with a non-synthetic program please
file a bug in Bugzilla? I'm going to post a draft patch, too.
Thanks,
Florian