[PATCH 2/4] riscv: vectorized mem* functions

Yun Hsiang <yun.hsiang@sifive.com>
Wed May 3 06:56:03 GMT 2023


On Thu, Mar 02, 2023 at 10:37:04AM +0000, Sergei Lewis wrote:
> 
> I note this memcpy implementation performs a vsetvli, load, store, three
> ALU ops and a branch per loop.
> 
> If, instead, you start by copying just enough bytes to align the
> destination pointer to vlen*lmul, you can pull the vsetvli out of the loop,
> since you know the number of bytes remaining is a multiple of the operation
> size - essentially, you've done the tail handling up front. Then, since you
> no longer need to track the number of bytes remaining in order to feed it
> to vsetvli, you can also calculate the expected buffer end outside the
> loop, have the conditional branch compare current destination with that
> instead of comparing number of bytes remaining with 0, and therefore also
> lose an ALU op from the loop.
> 
> You can see this pattern in the vectorised mem* patchset I posted a while
> back. In my testing I found this can improve performance by 10-15% compared
> to the naive approach (depending on chosen LMUL, operation size and L1
> size).
I've run the glibc benchtests and RTL simulation to compare
https://patchwork.sourceware.org/project/glibc/patch/20230201095232.15942-2-slewis@rivosinc.com/
with this implementation. In my tests, the implementations in this thread have
lower instruction and cycle counts in most test cases.
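
For reference, the loop structure being discussed looks roughly like the sketch
below, written in plain C with the RVV intrinsics rather than the assembly in
either patchset. The function name, the e8/m8 configuration, and handling the
remainder up front (instead of aligning the destination, as the earlier
patchset does) are all just choices made for the illustration. Because vl is
the same on every iteration, the vsetvli stays outside the main loop, and the
loop branch compares the destination pointer against a precomputed end instead
of decrementing a byte count:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: shows the hoisted-vsetvli structure, not either patch's
   actual code.  The e8/m8 choice and the head handling are assumptions.  */
void *
memcpy_rvv_sketch (void *dest, const void *src, size_t n)
{
  uint8_t *d = dest;
  const uint8_t *s = src;

  /* Bytes moved per iteration of the main loop: (VLEN/8) * LMUL.  */
  size_t step = __riscv_vsetvlmax_e8m8 ();

  /* Head: copy n % step bytes so the remaining length is an exact
     multiple of step -- the tail handling done up front.  */
  size_t head = n % step;
  if (head != 0)
    {
      size_t vl = __riscv_vsetvl_e8m8 (head);
      vuint8m8_t v = __riscv_vle8_v_u8m8 (s, vl);
      __riscv_vse8_v_u8m8 (d, v, vl);
      d += head;
      s += head;
    }

  /* Main loop: vl is loop-invariant (always step), so a single vsetvli
     suffices outside the loop, and the exit test compares d against a
     precomputed end pointer instead of a remaining-byte counter.  */
  uint8_t *end = d + (n - head);
  while (d < end)
    {
      vuint8m8_t v = __riscv_vle8_v_u8m8 (s, step);
      __riscv_vse8_v_u8m8 (d, v, step);
      d += step;
      s += step;
    }

  return dest;
}

Handling the remainder first keeps the main loop free of any vl bookkeeping,
which is where the hoisted vsetvli and the saved ALU op come from.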
> 
> The other mem* functions can also similarly benefit from this pattern.

