This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [Patch, MIPS] Modify memcpy.S for mips32r6/mips64r6
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Richard Henderson <rth at twiddle dot net>
- Cc: sellcey at imgtec dot com, Joseph Myers <joseph at codesourcery dot com>, libc-alpha at sourceware dot org
- Date: Tue, 23 Dec 2014 21:30:46 +0100
- Subject: Re: [Patch, MIPS] Modify memcpy.S for mips32r6/mips64r6
- Authentication-results: sourceware.org; auth=none
- References: <7ec2bf7e-fc1e-428b-ac0a-747f2a3ab3e6 at BAMAIL02 dot ba dot imgtec dot org> <alpine dot DEB dot 2 dot 10 dot 1412221758190 dot 5278 at digraph dot polyomino dot org dot uk> <1419354526 dot 27606 dot 73 dot camel at ubuntu-sellcey> <5499ABF8 dot 3060307 at twiddle dot net>
On Tue, Dec 23, 2014 at 09:52:56AM -0800, Richard Henderson wrote:
> On 12/23/2014 09:08 AM, Steve Ellcey wrote:
> > + andi t8,a0,7
> > + lapc t9,L(atable)
> > + PTR_LSA t9,t8,t9,2
> > + jrc t9
> > +L(atable):
> > + bc L(lb0)
> > + bc L(lb7)
> > + bc L(lb6)
> > + bc L(lb5)
> > + bc L(lb4)
> > + bc L(lb3)
> > + bc L(lb2)
> > + bc L(lb1)
> > +L(lb7):
> > + lb a3, 6(a1)
> > + sb a3, 6(a0)
> > +L(lb6):
> > + lb a3, 5(a1)
> > + sb a3, 5(a0)
> > +L(lb5):
> > + lb a3, 4(a1)
> > + sb a3, 4(a0)
> > +L(lb4):
> > + lb a3, 3(a1)
> > + sb a3, 3(a0)
> > +L(lb3):
> > + lb a3, 2(a1)
> > + sb a3, 2(a0)
> > +L(lb2):
> > + lb a3, 1(a1)
> > + sb a3, 1(a0)
> > +L(lb1):
> > + lb a3, 0(a1)
> > + sb a3, 0(a0)
> L(lbx):
> > +
> > + li t9,8
> > + subu t8,t9,t8
> > + PTR_SUBU a2,a2,t8
> > + PTR_ADDU a0,a0,t8
> > + PTR_ADDU a1,a1,t8
> > +L(lb0):
>
> This table is regular enough that I wonder if it wouldn't be better to do some
> arithmetic instead of a branch-to-branch. E.g.
>
> andi t7,a0,7
> li t8,L(lb0)-L(lbx)
> lsa t8,t7,t8,8
> lapc t9,L(lb0)
> selnez t8,t8,t7
> PTR_SUBU t9,t9,t8
> jrc t9
>
> Which is certainly smaller than your 12 insns, unlikely to be slower on any
> conceivable hardware, but probably faster on most.
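[Both variants under discussion implement the same idea: jump into a run of byte copies based on the destination's misalignment, so that exactly 8 - (dst & 7) bytes are copied before the pointer is 8-byte aligned. A hypothetical C sketch of that fallthrough structure (names and the simple return-value interface are illustrative, not from the patch):]

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of what the quoted assembly does: copy the first
   8 - (dst & 7) bytes one at a time via a computed jump
   (here expressed as switch fallthrough) so dst ends up
   8-byte aligned.  Returns the number of bytes copied. */
static size_t copy_head(unsigned char *dst, const unsigned char *src,
                        size_t n)
{
    size_t misalign = (uintptr_t)dst & 7;
    if (misalign == 0 || n < 8)
        return 0;                  /* the L(lb0) case: nothing to do */
    size_t head = 8 - misalign;    /* bytes needed to reach alignment */
    switch (head) {                /* mirrors the bc L(lbN) table */
    case 7: dst[6] = src[6]; /* FALLTHROUGH */
    case 6: dst[5] = src[5]; /* FALLTHROUGH */
    case 5: dst[4] = src[4]; /* FALLTHROUGH */
    case 4: dst[3] = src[3]; /* FALLTHROUGH */
    case 3: dst[2] = src[2]; /* FALLTHROUGH */
    case 2: dst[1] = src[1]; /* FALLTHROUGH */
    case 1: dst[0] = src[0];
    }
    return head;
}
```

[Richard's suggestion replaces the table of branches with arithmetic on the label addresses, since each L(lbN) block is the same fixed size; the C switch above is just the portable picture of either dispatch.]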
>
Do you have that hardware? I already objected to the table but don't
have data. I wouldn't be surprised if it's slower than a byte-by-byte
copy with a conditional after each byte, or than just copying 8 bytes
unconditionally, though I am not sure how the hardware handles the
overlapping stores. The difference will be bigger in practice: in
profiling, around 50% of calls are 8-byte aligned, and you save the
address-calculation cost on those.
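[The unconditional-copy alternative mentioned above is the usual overlapping-store trick: always copy 8 bytes, then restart from the aligned boundary, re-storing a few bytes. A hypothetical C sketch (memcpy stands in for an unaligned load/store pair, which compilers lower to plain moves; the byte loop stands in for the aligned word loop a real memcpy would use):]

```c
#include <string.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the "copy 8 bytes without condition" idea: no dispatch on
   alignment at all.  The first 8 bytes are copied unconditionally;
   the main loop then starts at the aligned boundary, so up to 7 bytes
   may be stored twice (the overlapping stores the mail worries
   about).  Assumes n >= 8. */
static void copy_overlap(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    memcpy(dst, src, 8);                     /* unconditional head copy */
    size_t head = 8 - ((uintptr_t)dst & 7);  /* offset of aligned boundary */
    for (size_t i = head; i < n; i++)        /* illustrative tail loop */
        dst[i] = src[i];
}
```

[Whether the redundant stores cost anything depends on the store buffer and write-combining behavior of the core, which is exactly the open question in the thread.]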