[PATCH v11 10/29] string: Improve generic stpcpy
Xi Ruoyao
xry111@xry111.site
Wed Feb 1 17:29:36 GMT 2023
On Wed, 2023-02-01 at 14:03 -0300, Adhemerval Zanella wrote:
> +static __always_inline char *
> +stpcpy_unaligned_loop (op_t *restrict dst, const op_t *restrict src,
> + uintptr_t ofs)
> +{
> + op_t w2a = *src++;
> + uintptr_t sh_1 = ofs * CHAR_BIT;
> + uintptr_t sh_2 = OPSIZ * CHAR_BIT - sh_1;
Hmm, on 64-bit LoongArch, if we "clone" the function 7 times into
stpcpy_unaligned_loop_{1..7} and dispatch to them with a switch (ofs) { ... }
construction, each clone would see a constant shift and we'd be able to use
the bytepick.d instruction for MERGE, saving 2 instructions per iteration.
But maybe this is going too far. I'm not sure whether this "optimization"
applies to other architectures.
> + op_t w2 = MERGE (w2a, sh_1, (op_t)-1, sh_2);
> + if (!has_zero (w2))
> + {
> + op_t w2b;
> +
> + /* Unaligned loop. The invariant is that W2B, which is "ahead" of W1,
> + does not contain end-of-string. Therefore it is safe (and necessary)
> + to read another word from each while we do not have a difference. */
> + while (1)
> + {
> + w2b = *src++;
> + w2 = MERGE (w2a, sh_1, w2b, sh_2);
> + /* Check if there is zero on w2a. */
> + if (has_zero (w2))
> + goto out;
> + *dst++ = w2;
> + if (has_zero (w2b))
> + break;
> + w2a = w2b;
> + }
> +
> + /* Align the final partial of P2. */
> + w2 = MERGE (w2b, sh_1, 0, sh_2);
> + }
> +
> +out:
> + return write_byte_from_word (dst, w2);
> +}
> +
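To make the cloning idea above a bit more concrete, here is a rough,
untested-on-hardware sketch. It is not the patch's code: merge_loop_N,
merge_words and MERGE_CONST are hypothetical names, op_t is assumed to be a
64-bit word, and only the little-endian form of MERGE is shown. The loop body
is reduced to the merge itself (no has_zero checks), just to show that the
switch runs once and every clone sees constant shift amounts. Whether the
compiler then actually selects bytepick.d for that pattern is an assumption
that would need checking in the generated assembly.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t op_t;              /* Assumption: 64-bit word (LoongArch64).  */
#define OPSIZ (sizeof (op_t))

/* Little-endian merge with a compile-time-constant byte offset OFS (1..7):
   the upper OPSIZ - OFS bytes of W0 followed by the lower OFS bytes of W1.  */
#define MERGE_CONST(w0, w1, ofs) \
  (((w0) >> ((ofs) * CHAR_BIT)) | ((w1) << ((OPSIZ - (ofs)) * CHAR_BIT)))

/* One clone of the (simplified) unaligned loop per offset.  Each clone
   reads NWORDS + 1 source words and writes NWORDS merged words.  */
#define DEFINE_LOOP_CLONE(n)                                              \
  static void                                                             \
  merge_loop_##n (op_t *dst, const op_t *src, size_t nwords)              \
  {                                                                       \
    op_t a = *src++;                                                      \
    for (size_t i = 0; i < nwords; i++)                                   \
      {                                                                   \
        op_t b = *src++;                                                  \
        dst[i] = MERGE_CONST (a, b, n);                                   \
        a = b;                                                            \
      }                                                                   \
  }

DEFINE_LOOP_CLONE (1)
DEFINE_LOOP_CLONE (2)
DEFINE_LOOP_CLONE (3)
DEFINE_LOOP_CLONE (4)
DEFINE_LOOP_CLONE (5)
DEFINE_LOOP_CLONE (6)
DEFINE_LOOP_CLONE (7)

/* Dispatch once, outside the loop, on the runtime misalignment.  */
static void
merge_words (op_t *dst, const op_t *src, size_t nwords, uintptr_t ofs)
{
  assert (ofs >= 1 && ofs < OPSIZ);
  switch (ofs)
    {
    case 1: merge_loop_1 (dst, src, nwords); break;
    case 2: merge_loop_2 (dst, src, nwords); break;
    case 3: merge_loop_3 (dst, src, nwords); break;
    case 4: merge_loop_4 (dst, src, nwords); break;
    case 5: merge_loop_5 (dst, src, nwords); break;
    case 6: merge_loop_6 (dst, src, nwords); break;
    case 7: merge_loop_7 (dst, src, nwords); break;
    }
}
```

The cost is seven copies of the loop in the text section, which is the part
I suspect is "going too far" for generic code.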
--
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University