Optimized arm string routines

Thu Jan 22 18:35:00 GMT 2009

On Thu, 2009-01-22 at 13:05 -0500, Jeff Johnston wrote:
> Richard Earnshaw wrote:
> > On Thu, 2009-01-22 at 08:59 +0100, Schwarz, Konrad wrote:
> >   
> >>> Subject: Re: Optimized arm string routines
> >>>       
> >> How do these fit in with the optimized strcmp() provided by Eric Blake?
> >>
> >>     
> >
> > I've not seen that.  Do you have a link?
> >
> > R.
> >
> >   
> I believe Konrad is referring to the work Eric did for handling 
> unaligned data to avoid performance penalties.  A similar optimization 
> might make sense in the ASM version.
> 

The strcmp I posted handles all cases (except where termination occurs
very early on during the alignment phase) by using word loads (and
shifts if needed).

> Perhaps if you were to publish some performance numbers (aligned, 
> unaliged data) of the new code vs the generic, that would probably 
> help.  Do you happen to have any of these numbers handy?
> 

Not at the moment.  I'll see if I can find time to knock out the
numbers.  However, looking at the strcmp code from the C variant when
compiled for arm we have:

  4c:   e3530000        cmp     r3, #0  ; 0x0
  50:   1a00001e        bne     d0 <strcmp+0xd0>
  54:   e5b02004        ldr     r2, [r0, #4]!
  58:   e28234ff        add     r3, r2, #-16777216      ; 0xff000000
  5c:   e2433801        sub     r3, r3, #65536  ; 0x10000
  60:   e2433c01        sub     r3, r3, #256    ; 0x100
  64:   e2433001        sub     r3, r3, #1      ; 0x1
  68:   e1c33002        bic     r3, r3, r2
  6c:   e3c3347f        bic     r3, r3, #2130706432     ; 0x7f000000
  70:   e5b1c004        ldr     ip, [r1, #4]!
  74:   e3c3387f        bic     r3, r3, #8323072        ; 0x7f0000
  78:   e3c33c7f        bic     r3, r3, #32512  ; 0x7f00
  7c:   e152000c        cmp     r2, ip
  80:   e3c3307f        bic     r3, r3, #127    ; 0x7f
  84:   0afffff0        beq     4c <strcmp+0x4c>

for the aligned case.   That comes to 2 loads 12 data insns and 1 branch
(treating the not-taken branch as a data op) for each pair of 4 bytes.
Also, one of the loads is used in the immediately following insn,
creating a stall on many cpus.  For the unaligned case the c code
compiles to 
  a4:   e5d12000        ldrb    r2, [r1]
  a8:   e1530002        cmp     r3, r2
  ac:   e2800001        add     r0, r0, #1      ; 0x1
  b0:   1a000004        bne     c8 <strcmp+0xc8>
  b4:   e5d03001        ldrb    r3, [r0, #1]
  b8:   e3530000        cmp     r3, #0  ; 0x0
  bc:   e2811001        add     r1, r1, #1      ; 0x1
  c0:   1afffff7        bne     a4 <strcmp+0xa4>

which gives 2 loads, 5 data ops and 1 branch for each pair of BYTES
compared.  Additionally both loads are used in the following insn,
leading to stalls (though both are easily rectifiable in this case by
scheduling for a different CPU).

The equivalent to the above in the new code is:

  50:   e04c2004        sub     r2, ip, r4
  54:   e15c0003        cmp     ip, r3
  58:   0022200c        eoreq   r2, r2, ip
  5c:   01120384        tsteq   r2, r4, lsl #7
  60:   0490c004        ldreq   ip, [r0], #4
  64:   04913004        ldreq   r3, [r1], #4
  68:   0afffff8        beq     50 <strcmp+0x50>

which comes to 2 loads, 4 data insns and one branch for every 4 bytes
compared.

and does the following when the strings do not have a mutual alignment:

  e4:   e3c4c4ff        bic     ip, r4, #-16777216      ; 0xff000000
  e8:   e15c0425        cmp     ip, r5, lsr #8
  ec:   e0443002        sub     r3, r4, r2
  f0:   e0233004        eor     r3, r3, r4
  f4:   1a000007        bne     118 <strcmp_unaligned+0x88>
  f8:   e0133382        ands    r3, r3, r2, lsl #7
  fc:   04915004        ldreq   r5, [r1], #4
 100:   1a000006        bne     120 <strcmp_unaligned+0x90>
 104:   e02cc004        eor     ip, ip, r4
 108:   e15c0c05        cmp     ip, r5, lsl #24
 10c:   1a000008        bne     134 <strcmp_unaligned+0xa4>
 110:   e4904004        ldr     r4, [r0], #4
 114:   eafffff2        b       e4 <strcmp_unaligned+0x54>

Note there are three possible cases for unaligned comparisons, each of
which has it's own variant loop (to avoid the need for expensive
variable shifts).

R.
-- 
Richard Earnshaw             Email: Richard.Earnshaw@arm.com
ARM Ltd                      Phone: +44 1223 400569 (Direct + VoiceMail)
110 Fulbourn Road            Switchboard: +44 1223 400400
Cherry Hinton                Fax: +44 1223 400410
Cambridge CB1 9NJ            Web: http://www.arm.com/
UK