This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [Patch, MIPS] Modify memset.S for mips32r6/mips64r6
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Steve Ellcey <sellcey at imgtec dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Sat, 20 Dec 2014 10:09:33 +0100
- Subject: Re: [Patch, MIPS] Modify memset.S for mips32r6/mips64r6
- Authentication-results: sourceware.org; auth=none
- References: <2923c970-026c-4e00-be7a-0650e82421b5 at BAMAIL02 dot ba dot imgtec dot org>
On Fri, Dec 19, 2014 at 03:26:44PM -0800, Steve Ellcey wrote:
> Here is the last of my patches for mips32r6/mips64r6 support. It updates
> memset to use byte copies instead of stl or str to align the destination
> because those instructions are not supported in mips32r6 or mips64r6.
> It also avoids using the 'prepare for store' prefetch hint because that
> is not supported on mips32r6 or mips64r6 either.
>
> Tested with the mips32r6/mips64r6 GCC, binutils and qemu simulator.
>
> OK to checkin?
>
> Steve Ellcey
> sellcey@imgtec.com
>
>
> PTR_ADDU a0,a0,t2
> +#else /* R6_CODE */
> + andi t2,a0,7
> + lapc t9,L(atable)
> + PTR_LSA t9,t2,t9,2
> + jrc t9
> +L(atable):
> + bc L(aligned)
> + bc L(lb7)
That could be a performance regression; test whether it is faster than the
existing loop on unpredictable branches [B].
Also try whether plain branches are better, as in the following C code [A].
The table lookup could be even slower in real workloads, since it adds latency
when the table is not in cache.
From a practical standpoint the realignment code looks like dead code: on x64,
83% of calls are 16-byte aligned, and I cannot find an application
that makes calls unaligned to 8 bytes.
You will get a better speedup by adding a check for whether the destination is
already aligned and moving the realignment code to the bottom of the file to
improve instruction cache usage.
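To make the last suggestion concrete, here is a minimal C sketch of an aligned fast path with the realignment moved out of line. The names `align_dest` and `realign_slow` are hypothetical, and the `cold`/`noinline` attributes and `__builtin_expect` are GCC/Clang extensions standing in for what the assembly version would do by physically placing the code at the bottom of the file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cold path: byte stores until dst reaches an 8-byte boundary
   (or n bytes have been written).  Marking it cold asks GCC/Clang to keep it
   away from the hot code.  Returns the number of bytes written.  */
__attribute__ ((noinline, cold))
static size_t
realign_slow (unsigned char *dst, int c, size_t n)
{
  size_t head = (size_t) (-(uintptr_t) dst & 7);
  if (head > n)
    head = n;
  for (size_t i = 0; i < head; i++)
    dst[i] = (unsigned char) c;
  return head;
}

/* Hypothetical hot path: a single test for the common aligned case.
   Returns 0 when dst is already 8-byte aligned.  */
static inline size_t
align_dest (unsigned char *dst, int c, size_t n)
{
  if (__builtin_expect (((uintptr_t) dst & 7) == 0, 1))
    return 0;
  return realign_slow (dst, c, n);
}
```

The caller would advance the destination by the returned count and continue with the aligned main loop; whether this beats the jump table is something only a measurement on the target can tell.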
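[A]
/* x is the destination (unsigned char *), mask is the fill byte replicated
   across a word; assumes at least 7 bytes remain to be set.  */
if (((uintptr_t) x) & 1) {
  *x = mask;
  x += 1;
}
if (((uintptr_t) x) & 2) {
  *((uint16_t *) x) = mask;
  x += 2;
}
if (((uintptr_t) x) & 4) {
  *((uint32_t *) x) = mask;
  x += 4;
}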
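For anyone who wants to try the branch-only idea in [A] outside of memset.S, a self-contained sketch might look like the following. The function name `fill_head` is hypothetical, the pointer is advanced after each store so that it ends up 8-byte aligned, and the wider stores go through fixed-size `memcpy` calls (which compilers normally lower to a single store) to stay clear of aliasing problems in portable C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of glibc: fills the unaligned head of
   [dst, dst + n) with byte c using at most three branches and returns the
   first 8-byte-aligned address.  Assumes n >= 7 so the head stores cannot
   run past the buffer.  */
static unsigned char *
fill_head (unsigned char *dst, int c, size_t n)
{
  assert (n >= 7);

  /* Replicate the fill byte into all four bytes of a word.  */
  uint32_t mask = (unsigned char) c * UINT32_C (0x01010101);

  if ((uintptr_t) dst & 1)
    {
      *dst = (unsigned char) mask;
      dst += 1;
    }
  if ((uintptr_t) dst & 2)
    {
      uint16_t half = (uint16_t) mask;
      memcpy (dst, &half, sizeof half);   /* 2-byte store, dst is 2-aligned */
      dst += 2;
    }
  if ((uintptr_t) dst & 4)
    {
      memcpy (dst, &mask, sizeof mask);   /* 4-byte store, dst is 4-aligned */
      dst += 4;
    }
  return dst;
}
```

An already aligned destination falls through all three tests without storing anything, so no separate table or loop is needed for that case.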
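[B]
#include <string.h>
int main (void)
{
  char foo[100];
  int i;
  /* Vary the destination offset (0..15) and the length on every call so
     the alignment branches in memset are exercised.  */
  for (i = 0; i < 100000000; i++)
    memset (foo + (i % 16), 1, 32 - (i % 16));
  /* Use a byte that every call writes so the stores stay observable.  */
  return foo[17];
}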
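The offsets in [B] repeat with period 16, which some branch predictors may learn. As a possible variant, and purely as an assumption about what makes the branches harder to predict, the offsets could come from a small xorshift generator. The names `next_random` and `run_random_offsets` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit xorshift step; the state must be nonzero.  */
static uint32_t
next_random (uint32_t *state)
{
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  *state = x;
  return x;
}

/* Hypothetical variant of [B]: calls memset `iterations` times with a
   pseudo-random offset in [0, 15] and a matching length, and returns a
   checksum over a byte that every call writes so the calls stay live.  */
unsigned long
run_random_offsets (unsigned long iterations, uint32_t seed)
{
  char foo[100];
  unsigned long sum = 0;
  uint32_t state = seed ? seed : 1;

  memset (foo, 0, sizeof foo);
  for (unsigned long i = 0; i < iterations; i++)
    {
      size_t off = next_random (&state) % 16;
      memset (foo + off, 1, 32 - off);
      sum += (unsigned char) foo[17];
    }
  return sum;
}
```

Calling `run_random_offsets (100000000, 1)` from `main` and timing the run against [B] would show whether the jump table and the branch sequence behave differently once the offset pattern is irregular. Note that the generator itself adds a few cycles per iteration, so only comparisons between memset variants under the same harness are meaningful.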