posix_memalign performance regression in 2.38?
Florian Weimer
fweimer@redhat.com
Wed Aug 9 16:59:00 GMT 2023
* Florian Weimer:
> * Xi Ruoyao via Libc-alpha:
>
>> On Mon, 2023-08-07 at 23:38 -0400, DJ Delorie wrote:
>>>
>>> Reproduced.
>>>
>>> In the case where I reproduced it, the most common problematic case was
>>> an allocation of 64-byte aligned chunks of 472 bytes, where 30 smallbin
>>> chunks were tested without finding a match.
>>>
>>> The most common non-problematic case was a 64-byte-aligned request for
>>> 24 bytes.
>>>
>>> There were a LOT of other size requests. The smallest I saw was TWO
>>> bytes. WHY? I'm tempted to not fix this, to teach developers to not
>>> use posix_memalign() unless they REALLY need it ;-)
>>
>>
>> Have you tested this?
>>
>> $ cat t.c
>> #include <stdlib.h>
>> int main()
>> {
>>   void *buf;
>>   for (int i = 0; i < (1 << 16); i++)
>>     posix_memalign(&buf, 64, 64);
>> }
>>
>> To me this is quite reasonable (if we just want many blocks, each
>> fitting into a cache line), but it costs 17.7 seconds on my system.
>> Do you think people should just avoid this? If so, we at least need
>> to document the issue in the manual.
>
> This code doesn't work well for glibc malloc (and other dlmalloc-style
> mallocs), and never has. Even with glibc 2.37, it produces a heap
> layout like this:
>
> v: 64-byte allocation boundary (each character is 8 bytes wide)
> U: available user data
> u: unused userdata tail
> m: glibc metadata
> -: data available for allocation
>
> v v v v v v v v
> UUUUUUUUum--------------UUUUUUUUum--------------UUUUUUUUum
>
> This can be seen from the 192-byte increments in the returned
> pointers. The gaps are not wide enough for reuse, so that part is
> expected.
>
> However, we should not produce these gaps: with a clean heap, we
> split allocations from the remainder, so we should get this more
> compact layout instead:
>
> v v v v v v v v
> UUUUUUUUum------UUUUUUUUum------UUUUUUUUum------UUUUUUUUum
>
> It seems to me that this doesn't happen because we call _int_free to
> give back the unused memory, and _int_free puts it into the tcache or
> fastbins, so it does not make the memory available for consolidation.
> Eventually this memory is flushed to the low-level allocator, but by
> then it is too late because we already have another allocation 112
> bytes further on that blocks further consolidation. And of course
> none of these 112-byte chunks is suitably aligned for reuse.
>
> (Even the compact layout wastes 50% of memory, but at least it's better
> than what any glibc version produces today.)
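A minimal sketch to observe this stride from user space (illustrative
only; the exact numbers assume 64-bit glibc with default settings):

  #include <stdio.h>
  #include <stdlib.h>

  int
  main (void)
  {
    void *prev = NULL;
    for (int i = 0; i < 8; i++)
      {
        void *buf;
        if (posix_memalign (&buf, 64, 64) != 0)
          return 1;
        if (prev != NULL)
          /* On an affected heap this should print 192; the compact
             layout would give 128.  */
          printf ("stride: %td bytes\n", (char *) buf - (char *) prev);
        prev = buf;
      }
    return 0;
  }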
There's a second issue that makes this loop really sensitive to initial
heap alignment. In _int_memalign, we have this:
  /* Also give back spare room at the end */
  if (!chunk_is_mmapped (p))
    {
      size = chunksize (p);
      if ((unsigned long) (size) > (unsigned long) (nb + MINSIZE))
        {
          remainder_size = size - nb;
          remainder = chunk_at_offset (p, nb);
          set_head (remainder, remainder_size | PREV_INUSE |
                    (av != &main_arena ? NON_MAIN_ARENA : 0));
          set_head_size (p, nb);
          _int_free (av, remainder, 1);
        }
    }
The MINSIZE slack is necessary to avoid creating chunks of less than
MINSIZE bytes, which the allocator cannot deal with. But it also
prevents merging the tail with a following chunk that is unused
(including the top chunk).
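To make the arithmetic concrete, here is a toy model of that condition
(a sketch only, assuming 64-bit parameters: MINSIZE of 32, and an
internal chunk size nb of 80 for a 64-byte request):

  #include <stdio.h>

  /* Mirrors the split condition quoted above; not glibc code.  */
  enum { MY_MINSIZE = 32 };   /* assumed 64-bit MINSIZE */

  static void
  check (unsigned long size, unsigned long nb)
  {
    if (size > nb + MY_MINSIZE)
      printf ("chunk of %lu: split off %lu-byte remainder\n",
              size, size - nb);
    else
      printf ("chunk of %lu: keep %lu-byte tail attached\n",
              size, size - nb);
  }

  int
  main (void)
  {
    unsigned long nb = 80;   /* assumed chunk size for a 64-byte request */
    check (112, nb);         /* tail <= MINSIZE: no split possible */
    check (128, nb);         /* remainder split off, handed to _int_free */
    return 0;
  }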
Could someone who can reproduce this with a non-synthetic program please
file a bug in Bugzilla? I'm going to post a draft patch, too.
Thanks,
Florian