This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Faster strlen
+ pmovmskb %xmm3, %edx
+ sub %rdi, %rax
+ movq %rdx, %rcx
+ negq %rcx
+ andq %rdx, %rcx
Please use the <tab>instruction<tab> format consistently instead of a
different style on each line.
I also suggest using the L macro for new labels, to improve readability and
to match the style of the other assembler files in glibc.
+ add $16, %rax
+ .p2align 4
+ .align64_loop:
L(align64_loop):
--
Liubov Dmitrieva
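
For readers following the quoted hunk: the movq/negq/andq sequence appears to isolate the lowest set bit of the mask in %rdx (mask & -mask). Below is a minimal C sketch of how such an isolated bit can be turned into a bit index without bsf, assuming the textbook de Bruijn multiply-and-lookup scheme. The constant, the table, and all names are illustrative assumptions; they are not taken from the patch, whose own table lookup is not shown in this message.

```c
#include <assert.h>
#include <stdint.h>

/* A commonly used 64-bit de Bruijn constant (an assumption for this
   sketch, not a value from the patch). */
#define DEBRUIJN64 UINT64_C(0x03f79d71b4cb0a89)

static unsigned char bit_index_table[64];
static int table_ready;

/* Fill the table: each single-bit value 1 << i hashes to a distinct
   6-bit slot, which stores i. */
static void init_bit_index_table(void)
{
    for (unsigned i = 0; i < 64; i++)
        bit_index_table[((UINT64_C(1) << i) * DEBRUIJN64) >> 58] =
            (unsigned char)i;
    table_ready = 1;
}

/* Index of the lowest set bit of a nonzero mask, without bsf. */
unsigned first_set_bit_no_bsf(uint64_t mask)
{
    assert(mask != 0);
    if (!table_ready)
        init_bit_index_table();
    /* Same effect as the negq/andq pair: keep only the lowest set bit. */
    uint64_t lowest = mask & ((uint64_t)0 - mask);
    /* Multiply, keep the top 6 bits, and look the index up. */
    return bit_index_table[(lowest * DEBRUIJN64) >> 58];
}
```

Note that the quoted message reports 64-bit multiplication as slow on Atom, so a lookup tuned for that CPU may well avoid the multiply used in this sketch.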
2012/10/9 H.J. Lu <hjl.tools@gmail.com>:
> On Sun, Oct 7, 2012 at 10:27 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> Hello, I investigated strlen a bit more and improved the pminub variant.
>>
>> I got up to a 10% speedup by unrolling the main loop. I did not measure
>> any difference when I unrolled the loop further.
>>
>> I also benchmarked on Atom and added a variant that is identical to
>> strlen-sse2-pminub except that bsf is replaced by a table lookup.
>>
>> The last addition is an attempt to generate a VEX-encoded strlen. I only
>> need to pass the -mavx flag when compiling strlen_avx.S, but I do not know how.
>>
>> Benchmarks are at the usual place. To fit all functions, consider only random
>> alignment. I also increased the granularity of sampling.
>>
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/
>>
>> Results for this patch are
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/benchmark_strlen_7_10_2012.tar.bz2
>>
>> On Sandy Bridge
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_sandy_bridge/strlen/html/test_r.html
>> there is a phase change around sizes 1500-2000. Do you know what caused it?
>>
>> Another optimization is prefetching. Most of the time the prefetching variant
>> is slower than the non-prefetching one (as large strings are rare).
>> On Sandy Bridge prefetching is free. I need an additional flag for ifunc to
>> indicate that.
>>
>> I disabled prefetching in my patch.
>>
>> Ironically, on Atom strlen-sse2-no-bsf was slower than the pminub variant
>> except for strings less than 16 bytes long.
>>
>> For the exit from the main loop of the no-bsf variant, using bsfq instead of
>> binary search saves 10 cycles. Multiplication + table lookup is also slow on
>> Atom because 64-bit multiplication is slow.
>>
>> I used the pminub variant with the bsf instruction replaced by my table
>> lookup. This is about 8 cycles faster on Atom.
>>
>> I did not reschedule instructions for Atom, to make review easier.
>>
>> The sse2, pminub, no-bsf, and sse4 variants are slower than my patch
>> everywhere, so I remove them. pminub and no-bsf are used in strcat and will
>> be removed in a separate patch.
>>
>> 2012-10-07 Ondrej Bilka <neleai@seznam.cz>
>> * sysdeps/x86_64/strlen.S:
>> Use unrolled pminub variant by default.
>> * sysdeps/x86_64/multiarch/strlen_avx.S:
>> Recode default variant using VEX prefix.
>> * sysdeps/x86_64/multiarch/strlen_atom.S:
>> New variant tailored to atom.
>> * sysdeps/x86_64/strlen.S: Updated function selection.
>> * sysdeps/x86_64/multiarch/strlen-sse4.S: deleted
>> * sysdeps/x86_64/multiarch/Makefile: updated
>>
>
> Please rename strlen_atom.S to strlen-no-bsf.S since it
> depends on bit_Slow_BSF, not Atom.
>
> Thanks.
>
> --
> H.J.
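
The main-loop idea described in the quoted message (combine several 16-byte chunks with pminub, then do one zero test per iteration) can be sketched in C with SSE2 intrinsics. This is a rough illustration under assumptions, not the patch's code: the function name, the 64-byte unroll factor, and the use of the GCC/Clang builtin __builtin_ctzll in place of bsf are choices made for this sketch. Like the assembly, it reads whole aligned 64-byte blocks and may therefore touch bytes outside the string, which strict C does not permit.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Bit i is set when byte i of the aligned 16-byte chunk at p is zero
   (pcmpeqb + pmovmskb). */
static unsigned zero_mask16(const char *p)
{
    __m128i v = _mm_load_si128((const __m128i *)p);
    return (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(v, _mm_setzero_si128()));
}

/* 64-bit zero-byte mask for the aligned 64-byte block at p. */
static uint64_t zero_mask64(const char *p)
{
    return (uint64_t)zero_mask16(p)
         | (uint64_t)zero_mask16(p + 16) << 16
         | (uint64_t)zero_mask16(p + 32) << 32
         | (uint64_t)zero_mask16(p + 48) << 48;
}

/* Hypothetical name; a sketch of a 64-byte unrolled pminub loop. */
size_t strlen_pminub_sketch(const char *s)
{
    assert(s != NULL);
    /* Round down to a 64-byte boundary and ignore the bytes before s. */
    const char *block = (const char *)((uintptr_t)s & ~(uintptr_t)63);
    uint64_t mask = zero_mask64(block) >> (unsigned)(s - block);
    if (mask != 0)
        return (size_t)__builtin_ctzll(mask);

    for (;;) {
        block += 64;
        const __m128i *v = (const __m128i *)block;
        /* The byte-wise minimum of four chunks has a zero byte exactly
           when at least one of the chunks does. */
        __m128i m = _mm_min_epu8(
            _mm_min_epu8(_mm_load_si128(v), _mm_load_si128(v + 1)),
            _mm_min_epu8(_mm_load_si128(v + 2), _mm_load_si128(v + 3)));
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(m, _mm_setzero_si128())) != 0) {
            /* Slow path, taken once: locate the first zero byte. */
            mask = zero_mask64(block);
            return (size_t)(block - s) + (size_t)__builtin_ctzll(mask);
        }
    }
}
```

The point of the pminub step is that the common case costs one compare and one branch per 64 bytes; the per-chunk masks are only computed once a terminator is known to be inside the block.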