[PATCH] powerpc64: Add optimized strcpy and stpcpy for POWER10

Paul E Murphy murphyp@linux.ibm.com
Tue Jun 18 16:04:06 GMT 2024



On 6/16/24 6:49 AM, bmahi496@linux.ibm.com wrote:
> From: MAHESH BODAPATI <bmahi496@linux.ibm.com>
> 
> Improvements compared to POWER9 version:
> 
> Use simple comparisons for the first ~512 bytes
>    The main loop is good for long strings, but comparing 16B each time is better
>    for shorter strings. After aligning the address to 16 bytes, we unroll
>    the loop four times, checking 128 bytes each time. There may be some overlap
>    with the main loop for unaligned strings, but it is better for shorter strings.
> 
> Use new P10 instructions
>    lxvp is used to load 32B with a single instruction, reducing contention in
>    the load queue.
> 
> The degradations for smaller strings are not consistent and the overall
> performance numbers are good.

It is very helpful to include in the commit message some or all of the
benchmark results that change.  It helps reviewers understand the
tradeoffs of a new implementation more quickly.  Can you share some of
the benchmark results for the V1 patch?
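
For anyone following the thread without the patch open, the "compare 16B
each time" strategy from the quoted commit message can be sketched roughly
in portable C as below. This is only an illustration, not the code from the
patch: the names sketch_stpcpy and first_nul are hypothetical, the vector
compare is modelled by a scalar scan, the unaligned head is handled one byte
at a time (the real assembly may do this differently), and the four-way
unrolling and the lxvp-based main loop are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 16 };  /* block size named in the commit message */

/* Return the index of the first NUL within the BLOCK bytes at p, or BLOCK
   if there is none.  Scalar stand-in for a vector compare; it stops at the
   NUL, so it never reads past the terminator.  */
static size_t first_nul(const char *p)
{
    size_t i = 0;
    while (i < BLOCK && p[i] != '\0')
        i++;
    return i;
}

/* Hypothetical sketch: copy src to dst including the terminating NUL and
   return a pointer to that NUL in dst, as stpcpy does.  */
char *sketch_stpcpy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);

    /* Head: copy byte by byte until src is 16-byte aligned.  */
    while (((uintptr_t)src % BLOCK) != 0) {
        if ((*dst = *src) == '\0')
            return dst;
        dst++;
        src++;
    }

    /* Aligned blocks: check 16 bytes for a NUL, then store them.  */
    for (;;) {
        size_t n = first_nul(src);
        if (n < BLOCK) {
            memcpy(dst, src, n + 1);  /* tail, including the NUL */
            return dst + n;
        }
        memcpy(dst, src, BLOCK);
        dst += BLOCK;
        src += BLOCK;
    }
}
```

A strcpy built on this would simply call sketch_stpcpy and return the
original dst instead of the end pointer.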

