This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: PATCH: Optimize memcmp for ia32
On Tue, Feb 10, 2004 at 04:20:01PM +0100, Jakub Jelinek wrote:
> On Tue, Feb 10, 2004 at 09:18:30AM -0800, H. J. Lu wrote:
> > On Tue, Feb 10, 2004 at 03:48:19PM +0100, Jakub Jelinek wrote:
> > > On Wed, Feb 04, 2004 at 04:11:26PM -0800, H. J. Lu wrote:
> > > > This patch optimizes memcmp for ia32. I got an average speedup of
> > > > around 400%.
> > >
> > > If nothing else, you should certainly handle PIC vs. !PIC differently
> > > (for !PIC you don't need to call the thunk etc.).
> >
> > I can change it.
> >
It will require 2 jump tables since the current one is PIC. We can't
use it for !PIC code. I can change it after we decide if we want to
change the jump table.
> > > Also, why do you need to use %ebx register when for example %eax is always
> > > available?
> >
> > I will take a look.
I used ebx for __i686.get_pc_thunk.bx. I can use __i686.get_pc_thunk.ax
for eax if we really don't need ebx.
> >
> > > Why do you need 4 separate L(Nbytes) sequences, the only difference between
> > > them is in the last few instructions? The bigger the routine is, the more
> > > other instructions will be kicked out of the caches (especially for a
> > > routine which is not the topmost in the benchmarks).
> > > I'd say avoiding the table_32bytes table altogether, using just one of the
> > > 4 sequences (with adjusted start) and computing the jump destination in
> > > registers shouldn't slow things down.
> >
> > The adjustment may cause a slowdown. With the jump table, we don't
> > need to adjust anything at all for memory blocks smaller than 32 bytes.
> > That is where the speedup comes from.
>
> I meant instead of
> addl %ecx, %edx
> addl %ecx, %esi
> do:
> andl $-4, %ecx
> addl %ecx, %edx
> addl %ecx, %esi
> or something like that (then you'd just start with -28(%esi) -> for 4
> cases). The %ecx & 3 previous value would need to be preserved till
> the end, e.g. in the %ebx register which could be replaced with %eax
> and you could hardcode that it jumps to L(28bytes) + 14 * (INDEX / 4).
>
You may be trading speed for space. It will save some bytes, but the
code may be slower since it has to do more.
BTW, I can't hard code L(28bytes) + 14 * (INDEX / 4) since it won't
work with PIC.
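
For readers following the trade-off, here is a rough C sketch of the single-sequence scheme suggested above. It is not the code from the patch: the names tail_memcmp and cmp_bytes are made up for illustration, and it assumes the remaining length is below 32. The length is split into a multiple of 4 (the andl $-4 step) and a remainder of 0 to 3 bytes that has to be kept until the end. The switch stands in for the computed jump into one unrolled sequence.

```c
#include <stddef.h>

/* Hypothetical helper: compare k bytes in address order, memcmp-style. */
static int cmp_bytes(const unsigned char *a, const unsigned char *b, size_t k)
{
    for (size_t i = 0; i < k; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}

/* Sketch only: tail compare for n < 32 using one unrolled sequence. */
static int tail_memcmp(const void *s1, const void *s2, size_t n)
{
    size_t words = n & ~(size_t)3;  /* like: andl $-4, %ecx */
    size_t rest = n & 3;            /* kept until the final step */
    /* Point both cursors just past the 4-byte-aligned part of the length. */
    const unsigned char *a = (const unsigned char *)s1 + words;
    const unsigned char *b = (const unsigned char *)s2 + words;
    int r;

    if (n >= 32)
        return cmp_bytes(s1, s2, n);  /* outside the sketch's range */

    /* Enter the unrolled sequence at the entry for words / 4 groups. */
    switch (words / 4) {
    case 7: if ((r = cmp_bytes(a - 28, b - 28, 4)) != 0) return r; /* fall through */
    case 6: if ((r = cmp_bytes(a - 24, b - 24, 4)) != 0) return r; /* fall through */
    case 5: if ((r = cmp_bytes(a - 20, b - 20, 4)) != 0) return r; /* fall through */
    case 4: if ((r = cmp_bytes(a - 16, b - 16, 4)) != 0) return r; /* fall through */
    case 3: if ((r = cmp_bytes(a - 12, b - 12, 4)) != 0) return r; /* fall through */
    case 2: if ((r = cmp_bytes(a - 8, b - 8, 4)) != 0) return r;   /* fall through */
    case 1: if ((r = cmp_bytes(a - 4, b - 4, 4)) != 0) return r;   /* fall through */
    default: break;
    }
    /* The 0..3 leftover bytes are the extra work the table version avoids. */
    return cmp_bytes(a, b, rest);
}
```

The mask, the preserved remainder, and the final short compare are the additional steps being weighed against the space saved by dropping three of the four sequences.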
Here is the proposed patch for the other changes. Any comments?
H.J.
---
--- memcmp.S.p4 2004-02-09 10:06:19.000000000 -0800
+++ memcmp.S 2004-02-23 10:25:53.000000000 -0800
@@ -351,6 +351,7 @@ L(set):
popl %esi
RETURN
+ .section .rodata
ALIGN (2)
L(table_32bytes) :
.long L(0bytes) - . + 0x0