This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RE: [Patch, MIPS] Modify memcpy.S for mips32r6/mips64r6


Ondřej Bílka <neleai@seznam.cz> writes:
> On Tue, Dec 23, 2014 at 09:52:56AM -0800, Richard Henderson wrote:
> > On 12/23/2014 09:08 AM, Steve Ellcey wrote:
> > > +	andi	t8,a0,7
> > > +	lapc	t9,L(atable)
> > > +	PTR_LSA	t9,t8,t9,2
> > > +	jrc	t9
> > > +L(atable):
> > > +	bc	L(lb0)
> > > +	bc	L(lb7)
> > > +	bc	L(lb6)
> > > +	bc	L(lb5)
> > > +	bc	L(lb4)
> > > +	bc	L(lb3)
> > > +	bc	L(lb2)
> > > +	bc	L(lb1)
> > > +L(lb7):
> > > +	lb	a3, 6(a1)
> > > +	sb	a3, 6(a0)
> > > +L(lb6):
> > > +	lb	a3, 5(a1)
> > > +	sb	a3, 5(a0)
> > > +L(lb5):
> > > +	lb	a3, 4(a1)
> > > +	sb	a3, 4(a0)
> > > +L(lb4):
> > > +	lb	a3, 3(a1)
> > > +	sb	a3, 3(a0)
> > > +L(lb3):
> > > +	lb	a3, 2(a1)
> > > +	sb	a3, 2(a0)
> > > +L(lb2):
> > > +	lb	a3, 1(a1)
> > > +	sb	a3, 1(a0)
> > > +L(lb1):
> > > +	lb	a3, 0(a1)
> > > +	sb	a3, 0(a0)
> > L(lbx):
> > > +
> > > +	li	t9,8
> > > +	subu	t8,t9,t8
> > > +	PTR_SUBU a2,a2,t8
> > > +	PTR_ADDU a0,a0,t8
> > > +	PTR_ADDU a1,a1,t8
> > > +L(lb0):
> >
> > This table is regular enough that I wonder if it wouldn't be better to
> > do some arithmetic instead of a branch-to-branch.  E.g.
> >
> > 	andi	t7,a0,7
> > 	li	t8,L(lb0)-L(lbx)
> > 	lsa	t8,t7,t8,8
> > 	lapc	t9,L(lb0)
> > 	selnez	t8,t8,t7
> > 	PTR_SUBU t9,t9,t8
> > 	jrc	t9
> >
> > Which is certainly smaller than your 12 insns, unlikely to be slower
> > on any conceivable hardware, but probably faster on most.
> >
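For reference, the table-based dispatch boils down to: pick a stub that copies (8 - (dst & 7)) & 7 leading bytes, then fix up the pointers by that amount. Since every lb/sb stub is the same size, Richard's point is that the entry point can be computed instead of looked up through a branch table. A rough C model of that logic (all names here are illustrative, not from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of leading bytes copied by the stub the table selects:
   0 when dst is already 8-byte aligned, otherwise 8 - (dst & 7).
   This mirrors atable entry k -> L(lb(8-k)).                      */
static size_t lead_bytes(uintptr_t dst)
{
    size_t mis = dst & 7;        /* t8 = andi a0,7 in the patch */
    return (8 - mis) & 7;        /* 0..7 bytes to reach alignment */
}

/* Model of the L(lb7)..L(lb1) stubs plus the L(lbx) fixup: copy the
   leading bytes and report how far the pointers should advance.    */
static size_t copy_prologue(unsigned char *d, const unsigned char *s,
                            size_t n)
{
    size_t lead = lead_bytes((uintptr_t)d);
    if (lead > n)
        lead = n;
    for (size_t i = 0; i < lead; i++)   /* the byte-copy stubs */
        d[i] = s[i];
    return lead;                        /* L(lbx): advance a0/a1, shrink a2 */
}
```

Because `lead_bytes` is pure arithmetic on the misalignment, the assembly can derive the stub address from it directly, which is the substance of the computed-branch suggestion above.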
> Do you have that hardware? I already objected to the table but do not
> have data. I wouldn't be surprised if it's slower than a byte-by-byte
> copy with a test after each byte, or than just copying 8 bytes
> unconditionally, though I am not sure how the hardware handles
> overlapping stores. The difference will be bigger in practice: in
> profiling, around 50% of calls are 8-byte aligned, and you save the
> address-calculation cost on those.

I think Richard's idea is good, but I agree with Steve that the tried and
tested code should go in first and be optimised afterwards. There is a lot
of exploration to do with MIPSR6, and there are many new ways to optimise.
If we don't have R6 support in glibc 2.21 then there is a definite
performance regression on R6, as the R5/R2 code will trap and be emulated
on an R6 core, making any non-trapping code several orders of magnitude
faster.

Overall we are trying to hit as many package release dates as possible to
provide everyone with initial R6 support for experimentation. For glibc
that not only includes all the R6-specific patches from Steve but also
requires the .MIPS.abiflags (FPXX/FP64 ABI) patch from me.

Thanks,
Matthew

