[PATCH] powerpc64: Add optimized strcpy and stpcpy for POWER10
Paul E Murphy
murphyp@linux.ibm.com
Tue Jun 18 16:04:06 GMT 2024
On 6/16/24 6:49 AM, bmahi496@linux.ibm.com wrote:
> From: MAHESH BODAPATI <bmahi496@linux.ibm.com>
>
> Improvements compared to POWER9 version:
>
> Use simple comparisons for the first ~512 bytes
> The main loop is good for long strings, but comparing 16B each time is better
> for shorter strings. After aligning the address to 16 bytes, we unroll
> the loop four times, checking 128 bytes each time. There may be some overlap
> with the main loop for unaligned strings, but it is better for shorter strings.
>
> Use new P10 instructions
> lxvp is used to load 32B with a single instruction, reducing contention in
> the load queue.
>
> The degradations for smaller strings are not consistent and the overall
> performance numbers are good.
It is very helpful to include in the commit message some or all of the
benchmark results that change. That helps reviewers understand the
tradeoffs of a new implementation more quickly. Can you share some of
the benchmark results for the V1 patch?
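
As a rough illustration of the kind of per-length numbers that make the
tradeoffs visible, here is a minimal standalone timing sketch. It is not
glibc's own benchtests harness, and the helper names (bench_strcpy_ns,
report_strcpy_timings), the string lengths, and the iteration count are
all assumptions chosen for the example. It times whichever strcpy the
program is linked against.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: average nanoseconds per strcpy call for a string
   of LEN characters, measured over ITERS calls.  Returns -1.0 if ITERS
   is zero, allocation fails, or the copy does not match the source.  */
double
bench_strcpy_ns (size_t len, size_t iters)
{
  if (iters == 0)
    return -1.0;

  char *src = malloc (len + 1);
  char *dst = malloc (len + 1);
  if (src == NULL || dst == NULL)
    {
      free (src);
      free (dst);
      return -1.0;
    }
  memset (src, 'a', len);
  src[len] = '\0';

  /* Call through a volatile function pointer so the compiler cannot
     inline the copy or drop the repeated calls.  */
  char *(*volatile copy) (char *, const char *) = strcpy;

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  for (size_t i = 0; i < iters; i++)
    copy (dst, src);
  clock_gettime (CLOCK_MONOTONIC, &end);

  double ns = (double) (end.tv_sec - start.tv_sec) * 1e9
              + (double) (end.tv_nsec - start.tv_nsec);
  int ok = strcmp (dst, src) == 0;

  free (src);
  free (dst);
  return ok ? ns / (double) iters : -1.0;
}

/* Hypothetical helper: print one line per string length.  The lengths
   are assumptions picked to straddle the ~512 byte boundary mentioned
   in the patch description.  */
void
report_strcpy_timings (void)
{
  static const size_t lengths[] = { 8, 16, 32, 64, 128, 256, 512, 1024, 4096 };
  for (size_t i = 0; i < sizeof lengths / sizeof lengths[0]; i++)
    printf ("len=%zu avg_ns=%.2f\n", lengths[i],
            bench_strcpy_ns (lengths[i], 100000));
}
```

Calling report_strcpy_timings () from main with the old and the new
implementation in turn gives a small before/after table. Real numbers
for the commit message should still come from the project's benchmark
suite, since a loop like this ignores alignment and cache effects.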