Optimized arm string routines
Richard Earnshaw
rearnsha@arm.com
Thu Jan 22 18:35:00 GMT 2009
On Thu, 2009-01-22 at 13:05 -0500, Jeff Johnston wrote:
> Richard Earnshaw wrote:
> > On Thu, 2009-01-22 at 08:59 +0100, Schwarz, Konrad wrote:
> >
> >>> Subject: Re: Optimized arm string routines
> >>>
> >> How do these fit in with the optimized strcmp() provided by Eric Blake?
> >>
> >>
> >
> > I've not seen that. Do you have a link?
> >
> > R.
> >
> >
> I believe Konrad is referring to the work Eric did for handling
> unaligned data to avoid performance penalties. A similar optimization
> might make sense in the ASM version.
>
The strcmp I posted handles all cases (except where termination occurs
very early on during the alignment phase) by using word loads (and
shifts if needed).
> Perhaps if you were to publish some performance numbers (aligned,
> unaligned data) of the new code vs the generic, that would probably
> help. Do you happen to have any of these numbers handy?
>
Not at the moment. I'll see if I can find time to knock out the
numbers. However, looking at the strcmp code from the C variant when
compiled for ARM, we have:
4c: e3530000 cmp r3, #0 ; 0x0
50: 1a00001e bne d0 <strcmp+0xd0>
54: e5b02004 ldr r2, [r0, #4]!
58: e28234ff add r3, r2, #-16777216 ; 0xff000000
5c: e2433801 sub r3, r3, #65536 ; 0x10000
60: e2433c01 sub r3, r3, #256 ; 0x100
64: e2433001 sub r3, r3, #1 ; 0x1
68: e1c33002 bic r3, r3, r2
6c: e3c3347f bic r3, r3, #2130706432 ; 0x7f000000
70: e5b1c004 ldr ip, [r1, #4]!
74: e3c3387f bic r3, r3, #8323072 ; 0x7f0000
78: e3c33c7f bic r3, r3, #32512 ; 0x7f00
7c: e152000c cmp r2, ip
80: e3c3307f bic r3, r3, #127 ; 0x7f
84: 0afffff0 beq 4c <strcmp+0x4c>
for the aligned case. That comes to 2 loads, 12 data insns and 1 branch
(treating the not-taken branch as a data op) for every 4 bytes compared.
Also, one of the loads is used in the immediately following insn,
creating a stall on many CPUs. For the unaligned case the C code
compiles to:
a4: e5d12000 ldrb r2, [r1]
a8: e1530002 cmp r3, r2
ac: e2800001 add r0, r0, #1 ; 0x1
b0: 1a000004 bne c8 <strcmp+0xc8>
b4: e5d03001 ldrb r3, [r0, #1]
b8: e3530000 cmp r3, #0 ; 0x0
bc: e2811001 add r1, r1, #1 ; 0x1
c0: 1afffff7 bne a4 <strcmp+0xa4>
which gives 2 loads, 5 data ops and 1 branch for each pair of BYTES
compared. Additionally, both loads are used in the following insn,
leading to stalls (though both are easily rectifiable in this case by
scheduling for a different CPU).
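For readers who have not seen the C variant, the aligned loop above amounts to a word-at-a-time comparison with a zero-byte test: the add/sub sequence subtracts 0x01010101 from the loaded word, and the bic insns then mask with the inverted word and with 0x80808080. The following is only a rough sketch of that idea, not the actual newlib source; the names has_zero_byte and strcmp_words are made up for illustration, and it assumes both buffers are padded to a multiple of 4 bytes so that whole-word reads stay in bounds.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes of w is zero.
   Hypothetical helper name; the test is
   (w - 0x01010101) & ~w & 0x80808080. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Sketch of a word-at-a-time strcmp.
   ASSUMPTION: both buffers are padded to a multiple of 4 bytes,
   so reading a whole word never runs past the end of the buffer. */
static int strcmp_words(const char *s1, const char *s2)
{
    uint32_t a, b;

    for (;;) {
        /* memcpy avoids alignment and aliasing problems in portable C. */
        memcpy(&a, s1, 4);
        memcpy(&b, s2, 4);
        /* Stop on the first differing word or the first word
           that contains the terminator. */
        if (a != b || has_zero_byte(a))
            break;
        s1 += 4;
        s2 += 4;
    }

    /* Finish the last word a byte at a time. */
    while (*s1 != '\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    return (unsigned char)*s1 - (unsigned char)*s2;
}
```

The per-word work in the sketch is two loads, one compare, and the zero-byte test, which is what the instruction counts above are measuring.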
The equivalent to the above in the new code is:
50: e04c2004 sub r2, ip, r4
54: e15c0003 cmp ip, r3
58: 0022200c eoreq r2, r2, ip
5c: 01120384 tsteq r2, r4, lsl #7
60: 0490c004 ldreq ip, [r0], #4
64: 04913004 ldreq r3, [r1], #4
68: 0afffff8 beq 50 <strcmp+0x50>
which comes to 2 loads, 4 data insns and one branch for every 4 bytes
compared. When the strings do not have a mutual alignment, the new code
does the following:
e4: e3c4c4ff bic ip, r4, #-16777216 ; 0xff000000
e8: e15c0425 cmp ip, r5, lsr #8
ec: e0443002 sub r3, r4, r2
f0: e0233004 eor r3, r3, r4
f4: 1a000007 bne 118 <strcmp_unaligned+0x88>
f8: e0133382 ands r3, r3, r2, lsl #7
fc: 04915004 ldreq r5, [r1], #4
100: 1a000006 bne 120 <strcmp_unaligned+0x90>
104: e02cc004 eor ip, ip, r4
108: e15c0c05 cmp ip, r5, lsl #24
10c: 1a000008 bne 134 <strcmp_unaligned+0xa4>
110: e4904004 ldr r4, [r0], #4
114: eafffff2 b e4 <strcmp_unaligned+0x54>
Note there are three possible cases for unaligned comparisons, each of
which has its own variant loop (to avoid the need for expensive
variable shifts).
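To illustrate why a separate loop per relative offset keeps the shifts constant, here is a small C sketch. It is not the posted code: the loop shown above compares the two pieces of each word separately (cmp with lsr #8, then cmp with lsl #24) rather than merging them first. The helper names are made up, and the sketch assumes little-endian byte order, which le_word models explicitly so the example does not depend on the host.

```c
#include <stdint.h>

/* Build a 32-bit word from four bytes the way a little-endian
   word load would (hypothetical helper, for illustration only). */
static uint32_t le_word(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Given two consecutive aligned words, produce the word that starts
   1, 2 or 3 bytes into the first one.  One function per offset, so
   every shift amount is a compile-time constant. */
static uint32_t merge_off1(uint32_t cur, uint32_t next)
{
    return (cur >> 8) | (next << 24);
}

static uint32_t merge_off2(uint32_t cur, uint32_t next)
{
    return (cur >> 16) | (next << 16);
}

static uint32_t merge_off3(uint32_t cur, uint32_t next)
{
    return (cur >> 24) | (next << 8);
}
```

A single generic routine would have to take the offset as a run-time value and shift by a register amount; splitting into three variants trades code size for fixed shifts in each loop.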
R.
--
Richard Earnshaw Email: Richard.Earnshaw@arm.com
ARM Ltd Phone: +44 1223 400569 (Direct + VoiceMail)
110 Fulbourn Road Switchboard: +44 1223 400400
Cherry Hinton Fax: +44 1223 400410
Cambridge CB1 9NJ Web: http://www.arm.com/
UK