This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v2] aarch64: Optimized implementation of memcmp


Hi Xuelei,

> The loop body is expanded from a 16-byte comparison to a 64-byte
> comparison, and the ldp post-index addressing mode is replaced by
> the base plus offset mode. Hence memcmp is around 18% faster for
> sizes above 128 bytes.

This looks quite good - I can reproduce significant gains for large sizes
on various microarchitectures. It seems there are some regressions in
the 8-16 byte range, presumably due to handling these sizes differently.

A few comments inline:

+       /* Compare data bytes and set return value to 0, -1 or 1.  */
+L(return64):
+       cmp     data1, data2
         bne     L(return)
+L(return_pre):
         mov     data1, data1h
         mov     data2, data2h
-       cmp     data1, data2
 L(return):

The label return_pre is unused. So why not use 2xCSEL rather than a branch across 
the moves? That's going to be faster since the branch will be hard to predict.
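
Something along these lines should work (untested sketch, reusing the
patch's register names):

        cmp     data1, data2
        /* If the low 8 bytes were equal, select the high words;
           otherwise keep the differing low words.  */
        csel    data1, data1h, data1, eq
        csel    data2, data2h, data2, eq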

 L(less8):
         adds    limit, limit, 4
         b.lo    L(less4)
-       ldr     data1w, [src1], 4
-       ldr     data2w, [src2], 4
-       cmp     data1w, data2w
+       ldr     data1w, [src1]
+       ldr     data2w, [src2]
+       ccmp    data1, data2, 0, ne

Using data1w and data2w would be better here.
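
That is, something like (untested):

        ccmp    data1w, data2w, 0, ne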
 
         b.eq    L(byte_loop)
-       sub     result, data1w, data2w
+       sub         result, data1w, data2w

The formatting has gone wrong...

+       ret
+L(ret_0):
+       mov     result, 0
         ret
