[PATCH 0/2] LoongArch: Add optimized functions.

Adhemerval Zanella Netto adhemerval.zanella@linaro.org
Thu Sep 22 18:05:24 GMT 2022

On 20/09/22 06:54, Xi Ruoyao wrote:
> On Mon, 2022-09-19 at 17:16 -0300, Adhemerval Zanella Netto via Libc-
> alpha wrote:
>> Do you have any breakdown if either loop unrolling or missing string-fzi.h/
>> string-fza.h is what is making difference in string routines? 
> It looks like there are some difficulties... LoongArch does not have a
> dedicated instruction for finding a zero byte among the 8 bytes in a
> register (I guess the LoongArch SIMD eXtension will provide such an
> instruction, but the full LSX manual is not published yet and some
> LoongArch processors may lack LSX).  So the assembly code submitted by
> dengjianbo relies on a register to cache the bit pattern
> 0x0101010101010101.  We can't just rematerialize it (with 3
> instructions) in has_zero or has_eq etc. or the performance will be
> likely horribly bad.  

The 0x0101010101010101 constant is already created in find_zero_low (lsb), so
creating it again in another static inline function should give the compiler
enough information to materialize it only once.  So maybe adding a
LoongArch-specific index_first_zero_eq should suffice.
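
As a rough sketch of that point (this is not the actual string-fza.h or
string-fzi.h code; it assumes a 64-bit little-endian op_t and the GCC/Clang
builtin __builtin_ctzll, and the helper bodies are simplified for
illustration), each helper can spell out the constants itself.  Once
everything is inlined into the caller, the compiler sees the same constants
and is free to keep them in registers instead of rebuilding them:

```c
#include <stdint.h>

/* Assumption for this sketch: a 64-bit word on a little-endian target.  */
typedef uint64_t op_t;

/* Replicate byte C into every byte of a word.  */
static inline op_t
repeat_byte (unsigned char c)
{
  return (op_t) c * 0x0101010101010101ULL;
}

/* Nonzero iff X contains a zero byte.  The lowest set bit marks the
   first zero byte; bits above it may be false positives.  */
static inline op_t
find_zero_low (op_t x)
{
  const op_t lsb = 0x0101010101010101ULL;
  const op_t msb = 0x8080808080808080ULL;
  return (x - lsb) & ~x & msb;
}

/* True if X1 has a zero byte or a byte equal to the matching byte of X2.  */
static inline int
has_zero_eq (op_t x1, op_t x2)
{
  return (find_zero_low (x1) | find_zero_low (x1 ^ x2)) != 0;
}

/* Index of the first byte of X1 that is zero or equal to the matching
   byte of X2.  Only valid when has_zero_eq (x1, x2) is true, because
   __builtin_ctzll is undefined for a zero argument.  */
static inline unsigned int
index_first_zero_eq (op_t x1, op_t x2)
{
  op_t mask = find_zero_low (x1) | find_zero_low (x1 ^ x2);
  return (unsigned int) __builtin_ctzll (mask) / 8;
}
```

Both has_zero_eq and index_first_zero_eq go through find_zero_low, so after
inlining there is one lsb/msb pair per caller for the compiler to hoist out
of the loop.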

Maybe we can parametrize strchr with an extra function to do what the final
step does:

    op_t found = index_first_zero_eq (word, repeated_c);
    if (extractbyte (word, found) == c)
      return (char *) (word_ptr) + found;
    return NULL;

So LoongArch can reimplement it with a better strategy as well.
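
A minimal sketch of how that split could look, with the final step factored
into its own function.  The names strchr_final, zero_eq_mask and
sketch_strchr are hypothetical, and the code assumes a 64-bit little-endian
op_t and the GCC/Clang builtin __builtin_ctzll.  Like the generic code, it
reads whole aligned words, so it may load bytes past the terminator within
the same aligned word:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: a 64-bit word on a little-endian target.  */
typedef uint64_t op_t;

static inline op_t
repeat_byte (unsigned char c)
{
  return (op_t) c * 0x0101010101010101ULL;
}

/* Nonzero iff X contains a zero byte; the lowest set bit is exact.  */
static inline op_t
find_zero_low (op_t x)
{
  return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

/* Nonzero iff WORD has a zero byte or a byte matching REPEATED_C.  */
static inline op_t
zero_eq_mask (op_t word, op_t repeated_c)
{
  return find_zero_low (word) | find_zero_low (word ^ repeated_c);
}

/* Hypothetical hook: the final step of strchr.  An architecture could
   provide its own version of this function.  WORD must contain a zero
   byte or a byte equal to C.  */
static inline char *
strchr_final (const char *word_ptr, op_t word, op_t repeated_c,
              unsigned char c)
{
  unsigned int found = __builtin_ctzll (zero_eq_mask (word, repeated_c)) / 8;
  if (((word >> (8 * found)) & 0xff) == c)
    return (char *) word_ptr + found;
  return NULL;
}

char *
sketch_strchr (const char *s, int c_in)
{
  unsigned char c = (unsigned char) c_in;

  /* Go byte by byte until S is word aligned.  */
  for (; (uintptr_t) s % sizeof (op_t) != 0; s++)
    {
      if ((unsigned char) *s == c)
        return (char *) s;
      if (*s == '\0')
        return NULL;
    }

  op_t repeated_c = repeat_byte (c);
  op_t word;
  for (;;)
    {
      /* Aligned word load; memcpy avoids aliasing problems.  */
      memcpy (&word, s, sizeof word);
      if (zero_eq_mask (word, repeated_c) != 0)
        break;
      s += sizeof word;
    }
  return strchr_final (s, word, repeated_c, c);
}
```

The loop only has to decide whether the word is interesting; everything
about locating and checking the byte lives in strchr_final.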

The whole point of this generic implementation is to find the spots where C
code cannot produce the best instruction sequence, and to parametrize them in
a way that allows each architecture to reimplement them in the best way.

>> Checking on last iteration [1], it seems that strchr is issuing 2 loads
>> on each loop iteration and using bit-manipulation instruction that I am
>> not sure compiler could emit with generic code. Maybe we can tune the
>> generic implementation to get similar performance, as Richard has done
>> for alpha, hppa, sh, and powerpc?
>> I am asking because from the brief description of the algorithm, the
>> general idea is essentially what my generic code aims to do (mask-off
>> initial bytes, use word-aligned load and vectorized compares, extract
>> final bytes), and I am hoping that architecture would provide 
>> string-fz{i,a}.h to get better code generation instead of pushing
>> for more and more hand-write assembly routines.
