This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] faster strcmp by avoiding sse42.
- From: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Wed, 7 Aug 2013 14:28:03 +0200
- Subject: Re: [RFC] faster strcmp by avoiding sse42.
- References: <20130806213033 dot GA5290 at domone dot kolej dot mff dot cuni dot cz>
On Tue, Aug 06, 2013 at 11:30:33PM +0200, Ondřej Bílka wrote:
> Hi,
>
> Continuing to improving implementation that needlessly use sse42 we move
> to strcmp. A strcmp_sse42 is actually faster than existing
> implementations. It is mostly caused by lack of unrolling in other
> implementations than sse4 itself.
>
Hi, I recalled that in strlen I could improve loop speed by using memory operands instead of registers.
This improves performance, especially when the data is in L2 cache or further out in the memory hierarchy.
Updated graphs are at
http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile.html
http://kam.mff.cuni.cz/~ondra/strcmp_profile070813.tar.bz2
The optimized loop that I use is the following:
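.L17:
        addq    $64, %rdi               # advance both strings by one 64-byte block
        addq    $64, %rsi
.L12:
        movdqu  (%rsi), %xmm4           # unaligned load of 16 bytes from %rsi
        pcmpeqb (%rdi), %xmm4           # 0xff where bytes are equal, 0 where they differ
        pminub  (%rdi), %xmm4           # byte is 0 iff it differs or is NUL
        movdqu  16(%rsi), %xmm3
        pcmpeqb 16(%rdi), %xmm3
        pminub  16(%rdi), %xmm3
        movdqu  32(%rsi), %xmm2
        pcmpeqb 32(%rdi), %xmm2
        pminub  32(%rdi), %xmm2
        movdqu  48(%rsi), %xmm0
        pcmpeqb 48(%rdi), %xmm0
        pminub  48(%rdi), %xmm0
        pminub  %xmm4, %xmm0            # fold the four 16-byte results into one
        pminub  %xmm3, %xmm0
        pminub  %xmm2, %xmm0
        pcmpeqb %xmm6, %xmm0            # %xmm6 presumably holds zero (set outside this excerpt)
        pmovmskb %xmm0, %eax
        testl   %eax, %eax
        je      .L17                    # no zero byte found, continue with the next block
        jmp     .L15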
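
For readers who prefer C, the per-block test in the loop above can be sketched with SSE2 intrinsics. This is only an illustration under assumptions, not code from the patch: the helper name block64_has_diff_or_nul is hypothetical, %xmm6 is assumed to hold all zeros, and the %rdi side is assumed to be 16-byte aligned, since non-VEX SSE instructions with a memory operand require that.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper, not part of glibc.  Returns 1 when the 64-byte
   block at s1/s2 contains a mismatching byte or a NUL byte, else 0.
   Assumptions: s1 is 16-byte aligned (the role of %rdi in the loop),
   s2 may be unaligned (the role of %rsi), and all 64 bytes of both
   blocks are readable.  */
static int
block64_has_diff_or_nul (const unsigned char *s1, const unsigned char *s2)
{
  assert (((uintptr_t) s1 & 15) == 0);

  const __m128i zero = _mm_setzero_si128 ();    /* assumed content of %xmm6 */
  __m128i acc = _mm_set1_epi8 ((char) 0xFF);    /* neutral element for min */

  for (int off = 0; off < 64; off += 16)
    {
      __m128i a = _mm_load_si128 ((const __m128i *) (s1 + off));
      __m128i b = _mm_loadu_si128 ((const __m128i *) (s2 + off));
      /* pcmpeqb: 0xFF where the bytes are equal, 0x00 where they differ.  */
      __m128i eq = _mm_cmpeq_epi8 (a, b);
      /* pminub: min (eq, a) is 0 iff the bytes differ or a is NUL.  */
      acc = _mm_min_epu8 (acc, _mm_min_epu8 (eq, a));
    }

  /* Final pcmpeqb/pmovmskb: any zero byte in acc sets a bit in the mask.  */
  return _mm_movemask_epi8 (_mm_cmpeq_epi8 (acc, zero)) != 0;
}
```

A caller would advance both pointers by 64 while the helper returns 0 and then locate the exact byte in a slower path, which is presumably what the code at .L15 does; that part is not shown in this message.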