[PATCH v3] malloc: Optimize small memory zeroing for calloc

H.J. Lu hjl.tools@gmail.com
Thu Nov 28 22:02:40 GMT 2024


On Fri, Nov 29, 2024 at 12:24 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi H.J.,
>
> +static __always_inline void
> +clear_small_memory (INTERNAL_SIZE_T *mem, unsigned long nclears)
> +{
> +  /* Since x86-64 has fast unaligned 16-byte vector stores, arrange the
> +     codes to help compiler vectorize stores with overlapping unaligned
> +     vector stores with 1 branch, instead of up to 3, and up to 5 stores,
> +     instead of 9.  */
> +  *(mem + 0) = 0;
> +  *(mem + 1) = 0;
> +  *(mem + 2) = 0;
> +  *(mem + nclears - 2) = 0;
> +  *(mem + nclears - 2 + 1) = 0;
> +  if (nclears > 6)
> +    {
> +      *(mem + 3) = 0;
> +      *(mem + 3 + 1) = 0;
> +      *(mem + nclears - 4) = 0;
> +      *(mem + nclears - 4 + 1) = 0;
> +    }
> +}
>
> Actually this is generic code that works on all targets - this gives ~7% gain
> with 16 threads on Neoverse V2. If I remove all this and just call memset, the
> gain is 8.5%! So the gain has nothing to do with wide vector stores, but is due
> to avoiding unpredictable branches in this benchmark.
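
For reference, the overlapping-store pattern quoted above can be checked in
isolation. The following is only a sketch, not the glibc code: word_t is a
hypothetical stand-in for INTERNAL_SIZE_T, the helper names are made up, and it
assumes nclears is odd and at most 9 (3, 5, 7 or 9), which is the case the
pattern appears to be written for. With an even count such as 6, word 3 would
not be covered by these stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for glibc's INTERNAL_SIZE_T.  */
typedef size_t word_t;

/* Zero NCLEARS words at MEM using the overlapping-store pattern from the
   quoted patch: one size branch, 5 or 9 word stores.
   Assumption: NCLEARS is 3, 5, 7 or 9.  */
static inline void
clear_small_words (word_t *mem, size_t nclears)
{
  assert (nclears >= 3 && nclears <= 9 && (nclears & 1) == 1);
  mem[0] = 0;
  mem[1] = 0;
  mem[2] = 0;
  mem[nclears - 2] = 0;
  mem[nclears - 1] = 0;
  if (nclears > 6)
    {
      mem[3] = 0;
      mem[4] = 0;
      mem[nclears - 4] = 0;
      mem[nclears - 3] = 0;
    }
}

/* Return 1 if the pattern zeroes words [0, NCLEARS) of a poisoned buffer
   and leaves the word just past the end untouched.  */
static int
pattern_clears_all (size_t nclears)
{
  word_t buf[10];
  memset (buf, 0xff, sizeof buf);
  clear_small_words (buf, nclears);
  for (size_t i = 0; i < nclears; i++)
    if (buf[i] != 0)
      return 0;
  return buf[nclears] == (word_t) -1;
}
```

Calling pattern_clears_all for 3, 5, 7 and 9 exercises every size the sketch
accepts. Nothing in it depends on x86-64, which is consistent with the point
that the pattern is generic.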

Wangang, can you try the memset-only version on Xeon, as in the attached p.diff?

> So there is a high risk that we're training for this particular benchmark. The
> sizes are so random that it prefers removing all branches based on size...
> I ran bench-malloc-simple using calloc (which uses predictable sizes) - this
> ends up slower both with the above code and with always calling memset.
>
> Perhaps the generic code could be like:
>
>   if (nclears > 5)
>     return memset (d, 0, clearsize);
>
>   *(d + 0) = 0;
>   *(d + 1) = 0;
>   *(d + 2) = 0;
>   *(d + nclears - 2) = 0;
>   *(d + nclears - 1) = 0;

This doesn't help 32-bit targets.

>   return mem;
> }
>
> This still gets the 7% gain, removes another branch and avoids trying to inline too
> much of memset. Note the assert (nclears >= 3) costs about 1%, I wonder whether
> we could replace it with a static assert?
>
> Cheers,
> Wilco
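
The generic variant suggested above could be written out as a standalone
function along these lines. This is a sketch under assumptions, not the
proposed patch: word_t stands in for INTERNAL_SIZE_T, the function and helper
names are hypothetical, and the byte count passed to memset is derived from
nclears here instead of using calloc's clearsize.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for glibc's INTERNAL_SIZE_T.  */
typedef size_t word_t;

/* Zero NCLEARS words at D.  Anything above 5 words goes to memset; 3 to 5
   words are handled by five stores that may overlap, so the size is
   tested only once.  */
static inline void *
clear_words_or_memset (word_t *d, size_t nclears)
{
  assert (nclears >= 3);
  if (nclears > 5)
    return memset (d, 0, nclears * sizeof (word_t));

  d[0] = 0;
  d[1] = 0;
  d[2] = 0;
  d[nclears - 2] = 0;
  d[nclears - 1] = 0;
  return d;
}

/* Return 1 if words [0, NCLEARS) of a poisoned buffer end up zero and the
   word just past the end is left untouched.  NCLEARS must be below 16.  */
static int
variant_clears_all (size_t nclears)
{
  word_t buf[16];
  memset (buf, 0xff, sizeof buf);
  clear_words_or_memset (buf, nclears);
  for (size_t i = 0; i < nclears; i++)
    if (buf[i] != 0)
      return 0;
  return buf[nclears] == (word_t) -1;
}
```

With 8-byte words the inline path covers at most 40 bytes; with 4-byte words it
covers at most 20 bytes, so more sizes fall through to memset.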



-- 
H.J.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: p.diff
Type: text/x-patch
Size: 1358 bytes
Desc: not available
URL: <https://sourceware.org/pipermail/libc-alpha/attachments/20241129/e57c0c54/attachment.bin>