This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: PATCH: Optimize memcmp for ia32


On Tue, Feb 10, 2004 at 04:20:01PM +0100, Jakub Jelinek wrote:
> On Tue, Feb 10, 2004 at 09:18:30AM -0800, H. J. Lu wrote:
> > On Tue, Feb 10, 2004 at 03:48:19PM +0100, Jakub Jelinek wrote:
> > > On Wed, Feb 04, 2004 at 04:11:26PM -0800, H. J. Lu wrote:
> > > > This patch optimizes memcmp for ia32. I got an average speedup of
> > > > around 400%.
> > > 
> > > If nothing else, you should certainly handle PIC vs. !PIC differently
> > > (for !PIC you don't need to call the thunk etc.).
> > 
> > I can change it.
> > 

It will require 2 jump tables since the current one is PIC. We can't
use it for !PIC code. I can change it after we decide if we want to
change the jump table.
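
As a rough C analogue of the two table styles (a sketch only, not code
from the patch): the function names are hypothetical, and it assumes the
GNU C labels-as-values extension. The relative table stores distances from
a base label, so it holds no absolute addresses, whereas the absolute
table stores addresses directly and would need relocations in PIC code.
The patch's table is relative to the table itself rather than to a label.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dispatcher: entries are offsets from the label case0,
   so the table itself is position independent.  GNU C extension.  */
__attribute__ ((noinline)) static int
dispatch_relative (unsigned int index)
{
  static const int table[] = {
    &&case0 - &&case0,
    &&case1 - &&case0,
    &&case2 - &&case0,
  };
  assert (index < sizeof table / sizeof table[0]);
  goto *(&&case0 + table[index]);

case0:
  return 0;
case1:
  return 10;
case2:
  return 20;
}

/* Hypothetical dispatcher: entries are absolute label addresses,
   the simpler form that non-PIC code can use directly.  */
__attribute__ ((noinline)) static int
dispatch_absolute (unsigned int index)
{
  static void *const table[] = { &&case0, &&case1, &&case2 };
  assert (index < sizeof table / sizeof table[0]);
  goto *table[index];

case0:
  return 0;
case1:
  return 10;
case2:
  return 20;
}
```

Both functions return the same value for the same index; only the way the
target address is formed differs.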

> > > Also, why do you need to use %ebx register when for example %eax is always
> > > available?
> > 
> > I will take a look.

I used ebx for __i686.get_pc_thunk.bx. I can use __i686.get_pc_thunk.ax
for eax if we really don't need ebx.

> > 
> > > Why do you need 4 separate L(Nbytes) sequences, the only difference between
> > > them is in the last few instructions?  The bigger the routine is, the more
> > > other instructions will be kicked out of the caches (especially for a
> > > routine which is not the topmost in the benchmarks).
> > > I'd say avoiding the table_32bytes table altogether, using just one of the
> > > 4 sequences (with adjusted start) and computing the jump destination in
> > > registers shouldn't slow things down.
> > 
> > The adjustment may cause a slowdown. With the jump table, we don't
> > need to adjust anything at all for memory blocks smaller than 32 bytes.
> > That is where the speedup comes from.
> 
> I meant instead of
>         addl    %ecx, %edx
>         addl    %ecx, %esi
> do:
>         andl    $-4, %ecx
>         addl    %ecx, %edx
>         addl    %ecx, %esi
> or something like that (then you'd just start with -28(%esi) -> for 4
> cases).  The %ecx & 3 previous value would need to be preserved till
> the end, e.g. in the %ebx register which could be replaced with %eax
> and you could hardcode that it jumps to L(28bytes) + 14 * (INDEX / 4).
> 

You may be trading speed for space. It will save some bytes, but the
code may be slower since it has to do more.

BTW, I can't hard code L(28bytes) + 14 * (INDEX / 4) since it won't
work with PIC.
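
To make the split in the quoted suggestion concrete, here is a minimal C
sketch (an illustration only, not code from the patch; small_memcmp is a
hypothetical name). It shows n & -4 selecting the part handled four bytes
at a time and n & 3 as the tail that has to be kept separately. It says
nothing about which variant is faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compare n < 32 bytes by splitting n into a
   multiple of 4 and a tail of 0..3 bytes.  */
static int
small_memcmp (const void *s1, const void *s2, size_t n)
{
  const unsigned char *p1 = s1;
  const unsigned char *p2 = s2;
  size_t words = n & ~(size_t) 3;   /* What "andl $-4, %ecx" leaves.  */
  size_t tail = n & 3;              /* Must survive in another register.  */
  size_t i;

  assert (n < 32);

  /* Compare four bytes at a time; stop at the first differing word.  */
  for (i = 0; i < words; i += 4)
    {
      uint32_t a, b;
      memcpy (&a, p1 + i, 4);
      memcpy (&b, p2 + i, 4);
      if (a != b)
        break;
    }

  /* Finish bytewise: either inside the differing word or over the tail.  */
  size_t end = (i < words) ? i + 4 : words + tail;
  for (; i < end; i++)
    if (p1[i] != p2[i])
      return p1[i] < p2[i] ? -1 : 1;
  return 0;
}
```

For example, small_memcmp ("abcdefgh", "abcdxfgh", 8) is negative because
the first differing byte, 'e', is smaller than 'x'.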

Here is the proposed patch for the other changes. Any comments?


H.J.
---
--- memcmp.S.p4	2004-02-09 10:06:19.000000000 -0800
+++ memcmp.S	2004-02-23 10:25:53.000000000 -0800
@@ -351,6 +351,7 @@ L(set):
 	popl	%esi
 	RETURN
 
+	.section	.rodata
 	ALIGN (2)
 L(table_32bytes) :
 	.long	L(0bytes) - . + 0x0

