Re: [RFC] faster strcmp by avoiding sse42.

On Tue, Aug 06, 2013 at 11:30:33PM +0200, Ondřej Bílka wrote:
> Hi,
> Continuing to improve implementations that needlessly use sse42, we move
> to strcmp. strcmp_sse42 is actually faster than the other existing
> implementations, but that is mostly caused by the lack of unrolling in the
> other implementations rather than by sse4 itself.
Hi, I recalled that in strlen I could improve loop speed by using memory operands instead of registers.
This improves performance, especially when the data is in L2 cache or further out.
Updated graphs are at 

The optimized loop that I use is the following:

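        addq    $64, %rdi               /* advance both pointers by 64 bytes */
        addq    $64, %rsi
        movdqu  (%rsi), %xmm4           /* unaligned load of 16 bytes from %rsi */
        pcmpeqb (%rdi), %xmm4           /* 0xff where the bytes are equal */
        pminub  (%rdi), %xmm4           /* 0 where bytes differ or (%rdi) has NUL */
        movdqu  16(%rsi), %xmm3         /* same three steps for bytes 16..31 */
        pcmpeqb 16(%rdi), %xmm3
        pminub  16(%rdi), %xmm3
        movdqu  32(%rsi), %xmm2         /* bytes 32..47 */
        pcmpeqb 32(%rdi), %xmm2
        pminub  32(%rdi), %xmm2
        movdqu  48(%rsi), %xmm0         /* bytes 48..63 */
        pcmpeqb 48(%rdi), %xmm0
        pminub  48(%rdi), %xmm0
        pminub  %xmm4, %xmm0            /* fold the four results into %xmm0 */
        pminub  %xmm3, %xmm0
        pminub  %xmm2, %xmm0
        pcmpeqb %xmm6, %xmm0            /* %xmm6 is presumably all zeros here */
        pmovmskb        %xmm0, %eax     /* nonzero mask: mismatch or NUL found */
        testl   %eax, %eax
        je      .L17
        jmp     .L15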
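
For readers who do not want to decode the assembly, here is a rough C sketch of what one iteration of the loop checks, written with SSE2 intrinsics. It is an illustration only, not the code from the patch: the function name block64_has_diff_or_nul is made up, it uses unaligned loads for both inputs (the assembly reads %rdi through memory operands instead), and it assumes %xmm6 holds zero in the loop above. It needs an x86 target with SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns nonzero if the 64-byte blocks at a and b
   differ in some byte, or if a contains a NUL byte in those 64 bytes.
   Both pointers must have 64 readable bytes. */
int block64_has_diff_or_nul(const char *a, const char *b)
{
    const __m128i zero = _mm_setzero_si128();   /* assumed role of %xmm6 */
    __m128i acc = _mm_set1_epi8((char)0xFF);    /* identity for unsigned min */

    for (int i = 0; i < 64; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i eq = _mm_cmpeq_epi8(va, vb);    /* pcmpeqb: 0xff where equal */
        __m128i m  = _mm_min_epu8(eq, va);      /* pminub: va where equal, else 0 */
        acc = _mm_min_epu8(acc, m);             /* fold into one register */
    }

    /* A zero byte in acc means a mismatch or a NUL somewhere in the block. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(acc, zero)) != 0;
}
```

The point of the pminub step is that a single zero test then covers both exit conditions: a byte of the folded result is zero either because the two strings differ there or because the string ended there.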

