[PATCH v3] malloc: Optimize small memory zeroing for calloc
H.J. Lu
hjl.tools@gmail.com
Thu Nov 28 22:02:40 GMT 2024
On Fri, Nov 29, 2024 at 12:24 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi H.J.,
>
> +static __always_inline void
> +clear_small_memory (INTERNAL_SIZE_T *mem, unsigned long nclears)
> +{
> + /* Since x86-64 has fast unaligned 16-byte vector stores, arrange the
> + codes to help compiler vectorize stores with overlapping unaligned
> + vector stores with 1 branch, instead of up to 3, and up to 5 stores,
> + instead of 9. */
> + *(mem + 0) = 0;
> + *(mem + 1) = 0;
> + *(mem + 2) = 0;
> + *(mem + nclears - 2) = 0;
> + *(mem + nclears - 2 + 1) = 0;
> + if (nclears > 6)
> + {
> + *(mem + 3) = 0;
> + *(mem + 3 + 1) = 0;
> + *(mem + nclears - 4) = 0;
> + *(mem + nclears - 4 + 1) = 0;
> + }
> +}
>
> Actually this is generic code that works on all targets - this gives ~7% gain
> with 16 threads on Neoverse V2. If I remove all this and just call memset, the
> gain is 8.5%! So the gain has nothing to do with wide vector stores, but is due
> to avoiding unpredictable branches in this benchmark.
Wangang, can you try the memset-only variant on Xeon, as in the attached patch?
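
For anyone following along, here is a minimal standalone sketch of the overlapping-store pattern quoted above, so the coverage can be checked outside of malloc.c. The names clear_small_words and check_small_clear are hypothetical, and size_t stands in for INTERNAL_SIZE_T. It assumes nclears is odd and between 3 and 9, which is my understanding of what the existing unrolled path in calloc guarantees. An even count such as 6 would leave word 3 untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the patch's clear_small_memory.
   Assumption: nclears is odd and in the range 3..9.  */
static inline void
clear_small_words (size_t *mem, size_t nclears)
{
  /* Head and tail stores; they overlap when nclears is 3.  */
  mem[0] = 0;
  mem[1] = 0;
  mem[2] = 0;
  mem[nclears - 2] = 0;
  mem[nclears - 1] = 0;
  if (nclears > 6)
    {
      /* Middle stores; they overlap when nclears is 7.  */
      mem[3] = 0;
      mem[4] = 0;
      mem[nclears - 4] = 0;
      mem[nclears - 3] = 0;
    }
}

/* Return 1 if exactly words [0, nclears) were zeroed and the guard
   words on both sides are untouched, 0 otherwise.  */
static int
check_small_clear (size_t nclears)
{
  size_t buf[11];   /* 1 guard + up to 9 words + 1 guard.  */
  memset (buf, 0xff, sizeof buf);
  clear_small_words (buf + 1, nclears);

  if (buf[0] != (size_t) -1)
    return 0;
  for (size_t i = 0; i < nclears; i++)
    if (buf[1 + i] != 0)
      return 0;
  for (size_t i = nclears + 1; i < 11; i++)
    if (buf[i] != (size_t) -1)
      return 0;
  return 1;
}
```

Calling check_small_clear with 3, 5, 7 and 9 exercises every size the small path is expected to see. The single branch on nclears > 6 is the "1 branch instead of up to 3" from the comment in the patch.
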
> So there is a high risk that we're training for this particular benchmark. The
> sizes are so random that it prefers removing all branches based on size...
> I ran bench-malloc-simple using calloc (which uses predictable sizes) - this
> ends up slower both with the above code and with always calling memset.
>
> Perhaps the generic code could be like:
>
> if (nclears > 5)
> return memset (d, 0, clearsize);
>
> *(d + 0) = 0;
> *(d + 1) = 0;
> *(d + 2) = 0;
> *(d + nclears - 2) = 0;
> *(d + nclears - 1) = 0;
This doesn't help 32-bit targets.
> return mem;
> }
>
> This still gets the 7% gain, removes another branch and avoids trying to inline too
> much of memset. Note the assert (nclears >= 3) costs about 1%, I wonder whether
> we could replace it with a static assert?
>
> Cheers,
> Wilco
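
To make the comparison concrete, the generic variant suggested above can be sketched as a standalone function. This is only an illustration under assumptions: clear_words_generic and check_generic_clear are hypothetical names, size_t replaces INTERNAL_SIZE_T, the byte count is derived from nclears rather than passed in as clearsize, and nclears is assumed to be at least 3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the suggested generic path: memset for
   anything above 5 words, otherwise five stores with no further
   branches.  Assumption: nclears >= 3.  */
static void *
clear_words_generic (void *mem, size_t nclears)
{
  size_t *d = mem;

  if (nclears > 5)
    return memset (d, 0, nclears * sizeof (size_t));

  /* Covers 3, 4 and 5 words; the head and tail stores overlap
     for the smaller counts.  */
  d[0] = 0;
  d[1] = 0;
  d[2] = 0;
  d[nclears - 2] = 0;
  d[nclears - 1] = 0;
  return mem;
}

/* Return 1 if exactly words [0, nclears) were zeroed, the guard words
   are intact, and the original pointer was returned.  */
static int
check_generic_clear (size_t nclears)
{
  size_t buf[18];   /* 1 guard + up to 16 words + 1 guard.  */
  memset (buf, 0xff, sizeof buf);

  if (clear_words_generic (buf + 1, nclears) != (void *) (buf + 1))
    return 0;
  if (buf[0] != (size_t) -1)
    return 0;
  for (size_t i = 0; i < nclears; i++)
    if (buf[1 + i] != 0)
      return 0;
  for (size_t i = nclears + 1; i < 18; i++)
    if (buf[i] != (size_t) -1)
      return 0;
  return 1;
}
```

With 8-byte words the inline path here covers at most 40 bytes, and everything larger goes to memset.
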
--
H.J.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: p.diff
Type: text/x-patch
Size: 1358 bytes
Desc: not available
URL: <https://sourceware.org/pipermail/libc-alpha/attachments/20241129/e57c0c54/attachment.bin>