[PATCH v11 10/29] string: Improve generic stpcpy

Xi Ruoyao xry111@xry111.site
Wed Feb 1 17:29:36 GMT 2023


On Wed, 2023-02-01 at 14:03 -0300, Adhemerval Zanella wrote:
> +static __always_inline char *
> +stpcpy_unaligned_loop (op_t *restrict dst, const op_t *restrict src,
> +                      uintptr_t ofs)
> +{
> +  op_t w2a = *src++;
> +  uintptr_t sh_1 = ofs * CHAR_BIT;
> +  uintptr_t sh_2 = OPSIZ * CHAR_BIT - sh_1;

Hmm, on 64-bit LoongArch, if we "clone" this function 7 times into
stpcpy_unaligned_loop_{1..7} and dispatch to them with a
switch (ofs) { ... } construction, ofs becomes a compile-time constant
in each clone, so MERGE can be done with a single bytepick.d
instruction, saving 2 instructions per iteration.  But maybe this is
going too far, and I'm not sure whether this "optimization" applies to
other architectures.
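
The cloning idea could look roughly like the sketch below.  It is not
the patch's code: realign_loop_N and realign are hypothetical names, the
loop only realigns a fixed number of words instead of scanning for the
terminating NUL, op_t is assumed to be 64 bits wide, and MERGE is
written in its little-endian form.  The only point is that each clone
sees its shift counts as constants, which is what would give the
compiler a chance to pick an immediate-operand instruction such as
bytepick.d.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t op_t;   /* Assumption: 64-bit word, as on LoongArch64.  */
#define OPSIZ (sizeof (op_t))

/* Little-endian form of MERGE; only valid for 0 < sh_1 < 64.  */
#define MERGE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2)))

/* Emit one clone of the loop per offset.  OFS is a literal here, so
   both shift counts are compile-time constants inside each clone.
   Each clone reads N + 1 words from SRC and writes N words to DST.  */
#define DEFINE_LOOP_CLONE(OFS)                                      \
  static void                                                       \
  realign_loop_##OFS (op_t *dst, const op_t *src, size_t n)         \
  {                                                                 \
    op_t w0 = *src++;                                               \
    for (size_t i = 0; i < n; i++)                                  \
      {                                                             \
        op_t w1 = *src++;                                           \
        dst[i] = MERGE (w0, (OFS) * CHAR_BIT,                       \
                        w1, (OPSIZ - (OFS)) * CHAR_BIT);            \
        w0 = w1;                                                    \
      }                                                             \
  }

DEFINE_LOOP_CLONE (1)
DEFINE_LOOP_CLONE (2)
DEFINE_LOOP_CLONE (3)
DEFINE_LOOP_CLONE (4)
DEFINE_LOOP_CLONE (5)
DEFINE_LOOP_CLONE (6)
DEFINE_LOOP_CLONE (7)

/* Dispatch once, outside the loop, on the run-time offset.  */
void
realign (op_t *dst, const op_t *src, size_t n, uintptr_t ofs)
{
  switch (ofs)
    {
    case 1: realign_loop_1 (dst, src, n); break;
    case 2: realign_loop_2 (dst, src, n); break;
    case 3: realign_loop_3 (dst, src, n); break;
    case 4: realign_loop_4 (dst, src, n); break;
    case 5: realign_loop_5 (dst, src, n); break;
    case 6: realign_loop_6 (dst, src, n); break;
    case 7: realign_loop_7 (dst, src, n); break;
    default:
      /* ofs == 0: already aligned, and a shift by 64 would be
         undefined, so copy the words unchanged.  */
      for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
      break;
    }
}
```

The cost is seven copies of the loop body in the text section, so
whether the saved instructions are worth it would have to be measured.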

> +  op_t w2 = MERGE (w2a, sh_1, (op_t)-1, sh_2);
> +  if (!has_zero (w2))
> +    {
> +      op_t w2b;
> +
> +      /* Unaligned loop.  The invariant is that W2B, which is "ahead" of W1,
> +        does not contain end-of-string.  Therefore it is safe (and necessary)
> +        to read another word from each while we do not have a difference.  */
> +      while (1)
> +       {
> +         w2b = *src++;
> +         w2 = MERGE (w2a, sh_1, w2b, sh_2);
> +         /* Check if there is zero on w2a.  */
> +         if (has_zero (w2))
> +           goto out;
> +         *dst++ = w2;
> +         if (has_zero (w2b))
> +           break;
> +         w2a = w2b;
> +       }
> +
> +      /* Align the final partial of P2.  */
> +      w2 = MERGE (w2b, sh_1, 0, sh_2);
> +    }
> +
> +out:
> +  return write_byte_from_word (dst, w2);
> +}
> +

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University

