Optimized arm string routines
Schwarz, Konrad
konrad.schwarz@siemens.com
Fri Jan 23 08:15:00 GMT 2009
I'm sorry, I got confused. Eric Blake contributed a sophisticated
strstr() using the Two Way algorithm, see
http://sourceware.org/ml/newlib/2008/msg00029.html. I was wondering how
this coexists with machine-specific implementations.
Regards,
Konrad Schwarz
> -----Original Message-----
> From: Richard Earnshaw [mailto:rearnsha@arm.com]
> Sent: Thursday, January 22, 2009 19:35
> To: Jeff Johnston
> Cc: Schwarz, Konrad; newlib@sourceware.org
> Subject: Re: Optimized arm string routines
>
> On Thu, 2009-01-22 at 13:05 -0500, Jeff Johnston wrote:
> > Richard Earnshaw wrote:
> > > On Thu, 2009-01-22 at 08:59 +0100, Schwarz, Konrad wrote:
> > >
> > >>> Subject: Re: Optimized arm string routines
> > >>>
> > >> How do these fit in with the optimized strcmp() provided
> by Eric Blake?
> > >>
> > >>
> > >
> > > I've not seen that. Do you have a link?
> > >
> > > R.
> > >
> > >
> > I believe Konrad is referring to the work Eric did for handling
> > unaligned data to avoid performance penalties. A similar
> optimization
> > might make sense in the ASM version.
> >
>
> The strcmp I posted handles all cases (except where termination occurs
> very early on during the alignment phase) by using word loads (and
> shifts if needed).
>
> > Perhaps if you were to publish some performance numbers (aligned,
> > unaliged data) of the new code vs the generic, that would probably
> > help. Do you happen to have any of these numbers handy?
> >
>
> Not at the moment. I'll see if I can find time to knock out the
> numbers. However, looking at the strcmp code from the C variant when
> compiled for arm we have:
>
> 4c: e3530000 cmp r3, #0 ; 0x0
> 50: 1a00001e bne d0 <strcmp+0xd0>
> 54: e5b02004 ldr r2, [r0, #4]!
> 58: e28234ff add r3, r2, #-16777216 ; 0xff000000
> 5c: e2433801 sub r3, r3, #65536 ; 0x10000
> 60: e2433c01 sub r3, r3, #256 ; 0x100
> 64: e2433001 sub r3, r3, #1 ; 0x1
> 68: e1c33002 bic r3, r3, r2
> 6c: e3c3347f bic r3, r3, #2130706432 ; 0x7f000000
> 70: e5b1c004 ldr ip, [r1, #4]!
> 74: e3c3387f bic r3, r3, #8323072 ; 0x7f0000
> 78: e3c33c7f bic r3, r3, #32512 ; 0x7f00
> 7c: e152000c cmp r2, ip
> 80: e3c3307f bic r3, r3, #127 ; 0x7f
> 84: 0afffff0 beq 4c <strcmp+0x4c>
>
> for the aligned case. That comes to 2 loads 12 data insns
> and 1 branch
> (treating the not-taken branch as a data op) for each pair of 4 bytes.
> Also, one of the loads is used in the immediately following insn,
> creating a stall on many cpus. For the unaligned case the c code
> compiles to
> a4: e5d12000 ldrb r2, [r1]
> a8: e1530002 cmp r3, r2
> ac: e2800001 add r0, r0, #1 ; 0x1
> b0: 1a000004 bne c8 <strcmp+0xc8>
> b4: e5d03001 ldrb r3, [r0, #1]
> b8: e3530000 cmp r3, #0 ; 0x0
> bc: e2811001 add r1, r1, #1 ; 0x1
> c0: 1afffff7 bne a4 <strcmp+0xa4>
>
> which gives 2 loads, 5 data ops and 1 branch for each pair of BYTES
> compared. Additionally both loads are used in the following insn,
> leading to stalls (though both are easily rectifiable in this case by
> scheduling for a different CPU).
>
> The equivalent to the above in the new code is:
>
> 50: e04c2004 sub r2, ip, r4
> 54: e15c0003 cmp ip, r3
> 58: 0022200c eoreq r2, r2, ip
> 5c: 01120384 tsteq r2, r4, lsl #7
> 60: 0490c004 ldreq ip, [r0], #4
> 64: 04913004 ldreq r3, [r1], #4
> 68: 0afffff8 beq 50 <strcmp+0x50>
>
> which comes to 2 loads, 4 data insns and one branch for every 4 bytes
> compared.
>
> and does the following when the strings do not have a mutual
> alignment:
>
> e4: e3c4c4ff bic ip, r4, #-16777216 ; 0xff000000
> e8: e15c0425 cmp ip, r5, lsr #8
> ec: e0443002 sub r3, r4, r2
> f0: e0233004 eor r3, r3, r4
> f4: 1a000007 bne 118 <strcmp_unaligned+0x88>
> f8: e0133382 ands r3, r3, r2, lsl #7
> fc: 04915004 ldreq r5, [r1], #4
> 100: 1a000006 bne 120 <strcmp_unaligned+0x90>
> 104: e02cc004 eor ip, ip, r4
> 108: e15c0c05 cmp ip, r5, lsl #24
> 10c: 1a000008 bne 134 <strcmp_unaligned+0xa4>
> 110: e4904004 ldr r4, [r0], #4
> 114: eafffff2 b e4 <strcmp_unaligned+0x54>
>
> Note there are three possible cases for unaligned comparisons, each of
> which has it's own variant loop (to avoid the need for expensive
> variable shifts).
>
>
> R.
> --
> Richard Earnshaw Email: Richard.Earnshaw@arm.com
> ARM Ltd Phone: +44 1223 400569 (Direct +
> VoiceMail)
> 110 Fulbourn Road Switchboard: +44 1223 400400
> Cherry Hinton Fax: +44 1223 400410
> Cambridge CB1 9NJ Web: http://www.arm.com/
> UK
>
>
More information about the Newlib
mailing list