This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 10/27] S390: Optimize stpcpy and wcpcpy.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Stefan Liebler <stli at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Fri, 3 Jul 2015 22:11:05 +0200
- Subject: Re: [PATCH 10/27] S390: Optimize stpcpy and wcpcpy.
- Authentication-results: sourceware.org; auth=none
- References: <1435930721-27922-1-git-send-email-stli at linux dot vnet dot ibm dot com> <1435930721-27922-11-git-send-email-stli at linux dot vnet dot ibm dot com>
On Fri, Jul 03, 2015 at 03:38:24PM +0200, Stefan Liebler wrote:
> This patch provides optimized versions of stpcpy and wcpcpy with the z13
> vector instructions.
> +
> + /* Find zero in 16byte aligned loop. */
> +.Lloop2:
> + vst %v18,0(%r5,%r2) /* Store previous part without zero to dst. */
> + aghi %r5,16
> +.Lloop1:
> + vl %v16,0(%r5,%r3) /* Load s. */
> + vfenezbs %v17,%v16,%v16 /* Find element not equal with zero search. */
> + je .Lfound_v16 /* Jump away if zero was found. */
> + vl %v18,16(%r5,%r3) /* Load next part of s. */
> + vst %v16,0(%r5,%r2) /* Store previous part without zero to dst. */
> + aghi %r5,16
> + vfenezbs %v17,%v18,%v18
> + je .Lfound_v18
> + vl %v16,16(%r5,%r3)
> + vst %v18,0(%r5,%r2)
> + aghi %r5,16
> + vfenezbs %v17,%v16,%v16
> + je .Lfound_v16
> + vl %v18,16(%r5,%r3)
> + vst %v16,0(%r5,%r2)
> + aghi %r5,16
> + vfenezbs %v17,%v18,%v18
> + jo .Lloop2 /* No zero found -> loop. */
> +
Here you could improve performance by giving each check its own exit
path, so that src and dest only need to be increased once, by 64, at the
end of each iteration. That allows simpler addressing, which may help
with out-of-order execution. If code size is a concern, then the
following pattern looks promising:
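while (1)
  {
    if (has_zero (src))
      goto add0;
    if (has_zero (src + 16))
      goto add16;
    if (has_zero (src + 32))
      goto add32;
    if (has_zero (src + 48))
      goto add48;
    /* No zero in these 64 bytes, advance both pointers once. */
    src += 64;
    dest += 64;
  }
/* The labels fall through, so add48 adds 48 in total, add32 adds 32
   and add16 adds 16.  */
add48:
  src += 16;
  dest += 16;
add32:
  src += 16;
  dest += 16;
add16:
  src += 16;
  dest += 16;
add0:
  /* src and dest now point at the 16-byte block containing the zero. */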
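
To make the control flow above concrete, here is a minimal portable C
sketch of the same pattern. It is only an illustration under stated
assumptions, not the proposed S390 code: has_zero and stpcpy_sketch are
hypothetical names, has_zero is written with memchr instead of a vector
compare, and the copying of zero-free blocks (which the pattern above
leaves out) is done with memcpy. It relies on memchr stopping at the
first match, as C11 specifies, so no byte past the terminating zero is
inspected; real vector code would need aligned loads or a similar
guarantee instead.

```c
#include <stddef.h>
#include <string.h>

enum { BLOCK = 16 };  /* Block size in bytes, matching one vector register. */

/* Hypothetical helper: nonzero if the 16-byte block at p contains a zero.
   Stand-in for a vector compare; memchr stops at the first match.  */
static int has_zero(const char *p)
{
    return memchr(p, '\0', BLOCK) != NULL;
}

/* Sketch of stpcpy: copy src to dest, return a pointer to the zero in dest. */
char *stpcpy_sketch(char *dest, const char *src)
{
    size_t tail;

    for (;;) {
        if (has_zero(src))
            goto add0;
        if (has_zero(src + BLOCK))
            goto add16;
        if (has_zero(src + 2 * BLOCK))
            goto add32;
        if (has_zero(src + 3 * BLOCK))
            goto add48;
        /* No zero in these 64 bytes: copy them and advance once. */
        memcpy(dest, src, 4 * BLOCK);
        src += 4 * BLOCK;
        dest += 4 * BLOCK;
    }

    /* Each label copies one zero-free block and falls through to the next,
       so add48 handles three blocks, add32 two, and add16 one.  */
add48:
    memcpy(dest, src, BLOCK);
    src += BLOCK;
    dest += BLOCK;
add32:
    memcpy(dest, src, BLOCK);
    src += BLOCK;
    dest += BLOCK;
add16:
    memcpy(dest, src, BLOCK);
    src += BLOCK;
    dest += BLOCK;
add0:
    /* src now points at the block that contains the zero. */
    tail = strlen(src);
    memcpy(dest, src, tail + 1);
    return dest + tail;
}
```

The fall-through tail keeps the code small because the three 16-byte
adjustments are shared between the exits; the price is up to three
dependent additions on the exit path, which only runs once per call.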
> +.Lfound_v18:
> + vlr %v16,%v18
> +.Lfound_v16:
> + la %r3,0(%r5,%r2)
> + vlgvb %r1,%v17,7 /* Load byte index of zero. */
> + vstl %v16,%r1,0(%r3) /* Copy characters including zero. */
> + algr %r5,%r1
> + la %r2,0(%r5,%r2) /* Return pointer to zero. */
> + br %r14