This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RE: Optimized arm string routines


I'm sorry, I got confused.  Eric Blake contributed a sophisticated
strstr() using the Two Way algorithm, see
http://sourceware.org/ml/newlib/2008/msg00029.html.  I was wondering how
this coexists with machine-specific implementations.

Regards,

Konrad Schwarz

> -----Original Message-----
> From: Richard Earnshaw [mailto:rearnsha@arm.com] 
> Sent: Thursday, January 22, 2009 19:35
> To: Jeff Johnston
> Cc: Schwarz, Konrad; newlib@sourceware.org
> Subject: Re: Optimized arm string routines
> 
> On Thu, 2009-01-22 at 13:05 -0500, Jeff Johnston wrote:
> > Richard Earnshaw wrote:
> > > On Thu, 2009-01-22 at 08:59 +0100, Schwarz, Konrad wrote:
> > >   
> > >>> Subject: Re: Optimized arm string routines
> > >>>       
> > >> How do these fit in with the optimized strcmp() provided 
> by Eric Blake?
> > >>
> > >>     
> > >
> > > I've not seen that.  Do you have a link?
> > >
> > > R.
> > >
> > >   
> > I believe Konrad is referring to the work Eric did for handling 
> > unaligned data to avoid performance penalties.  A similar 
> optimization 
> > might make sense in the ASM version.
> > 
> 
> The strcmp I posted handles all cases (except where termination occurs
> very early on during the alignment phase) by using word loads (and
> shifts if needed).
> 
> > Perhaps if you were to publish some performance numbers (aligned, 
> > unaliged data) of the new code vs the generic, that would probably 
> > help.  Do you happen to have any of these numbers handy?
> > 
> 
> Not at the moment.  I'll see if I can find time to knock out the
> numbers.  However, looking at the strcmp code from the C variant when
> compiled for arm we have:
> 
>   4c:   e3530000        cmp     r3, #0  ; 0x0
>   50:   1a00001e        bne     d0 <strcmp+0xd0>
>   54:   e5b02004        ldr     r2, [r0, #4]!
>   58:   e28234ff        add     r3, r2, #-16777216      ; 0xff000000
>   5c:   e2433801        sub     r3, r3, #65536  ; 0x10000
>   60:   e2433c01        sub     r3, r3, #256    ; 0x100
>   64:   e2433001        sub     r3, r3, #1      ; 0x1
>   68:   e1c33002        bic     r3, r3, r2
>   6c:   e3c3347f        bic     r3, r3, #2130706432     ; 0x7f000000
>   70:   e5b1c004        ldr     ip, [r1, #4]!
>   74:   e3c3387f        bic     r3, r3, #8323072        ; 0x7f0000
>   78:   e3c33c7f        bic     r3, r3, #32512  ; 0x7f00
>   7c:   e152000c        cmp     r2, ip
>   80:   e3c3307f        bic     r3, r3, #127    ; 0x7f
>   84:   0afffff0        beq     4c <strcmp+0x4c>
> 
> for the aligned case.   That comes to 2 loads 12 data insns 
> and 1 branch
> (treating the not-taken branch as a data op) for each pair of 4 bytes.
> Also, one of the loads is used in the immediately following insn,
> creating a stall on many cpus.  For the unaligned case the c code
> compiles to 
>   a4:   e5d12000        ldrb    r2, [r1]
>   a8:   e1530002        cmp     r3, r2
>   ac:   e2800001        add     r0, r0, #1      ; 0x1
>   b0:   1a000004        bne     c8 <strcmp+0xc8>
>   b4:   e5d03001        ldrb    r3, [r0, #1]
>   b8:   e3530000        cmp     r3, #0  ; 0x0
>   bc:   e2811001        add     r1, r1, #1      ; 0x1
>   c0:   1afffff7        bne     a4 <strcmp+0xa4>
> 
> which gives 2 loads, 5 data ops and 1 branch for each pair of BYTES
> compared.  Additionally both loads are used in the following insn,
> leading to stalls (though both are easily rectifiable in this case by
> scheduling for a different CPU).
> 
> The equivalent to the above in the new code is:
> 
>   50:   e04c2004        sub     r2, ip, r4
>   54:   e15c0003        cmp     ip, r3
>   58:   0022200c        eoreq   r2, r2, ip
>   5c:   01120384        tsteq   r2, r4, lsl #7
>   60:   0490c004        ldreq   ip, [r0], #4
>   64:   04913004        ldreq   r3, [r1], #4
>   68:   0afffff8        beq     50 <strcmp+0x50>
> 
> which comes to 2 loads, 4 data insns and one branch for every 4 bytes
> compared.
> 
> and does the following when the strings do not have a mutual 
> alignment:
> 
>   e4:   e3c4c4ff        bic     ip, r4, #-16777216      ; 0xff000000
>   e8:   e15c0425        cmp     ip, r5, lsr #8
>   ec:   e0443002        sub     r3, r4, r2
>   f0:   e0233004        eor     r3, r3, r4
>   f4:   1a000007        bne     118 <strcmp_unaligned+0x88>
>   f8:   e0133382        ands    r3, r3, r2, lsl #7
>   fc:   04915004        ldreq   r5, [r1], #4
>  100:   1a000006        bne     120 <strcmp_unaligned+0x90>
>  104:   e02cc004        eor     ip, ip, r4
>  108:   e15c0c05        cmp     ip, r5, lsl #24
>  10c:   1a000008        bne     134 <strcmp_unaligned+0xa4>
>  110:   e4904004        ldr     r4, [r0], #4
>  114:   eafffff2        b       e4 <strcmp_unaligned+0x54>
> 
> Note there are three possible cases for unaligned comparisons, each of
> which has it's own variant loop (to avoid the need for expensive
> variable shifts).
> 
> 
> R.
> -- 
> Richard Earnshaw             Email: Richard.Earnshaw@arm.com
> ARM Ltd                      Phone: +44 1223 400569 (Direct + 
> VoiceMail)
> 110 Fulbourn Road            Switchboard: +44 1223 400400
> Cherry Hinton                Fax: +44 1223 400410
> Cambridge CB1 9NJ            Web: http://www.arm.com/
> UK
> 
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]