Created attachment 12507 [details] report When the two strings being compared are at the end of their pages, __strncmp_avx2 will fall back to a one-byte-at-a-time loop named "cross_page_loop". This loop is incorrect if the length of the comparison exactly matches VEC_SIZE*4, which is 128 on my machine. Full report in attached pdf.
Created attachment 12508 [details] test case I reduced the bug to a stand-alone test case, now attached.
By extending your testing to check for more alignments and sizes: for (size_t s = 99; s <= 4 * VEC_SIZE; s++) for (size_t s1a = 31; s1a < 32; s1a++) for (size_t s2a = 30; s2a < 32; s2a++) { ret = strncmp (s1 + PAGE_SIZE - s - s1a, s1 + PAGE_SIZE - s - s2a, s); assert (ret == 0); } It seems that another page cross also requires fixing: 580 xorl %r8d, %r8d 581 /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes. */ 582 subl $(VEC_SIZE * 2), %ecx 583 jle 1f 584 /* Skip ECX bytes. */ 585 shrq %cl, %rdi 586 /* R8 has number of bytes skipped. */ 587 movl %ecx, %r8d 588 1: 589 /* Before jumping back to the loop, set ESI to the number of 590 VEC_SIZE * 4 blocks before page crossing. */ 591 movl $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi 592 593 testq %rdi, %rdi 594 je L(back_to_loop) 595 tzcntq %rdi, %rcx 596 addq %r10, %rcx 597 /* Adjust for number of bytes skipped. */ It should not jump back to loop if the ecx is negative (as some cases).
There is a test case at https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr25933/master
Created attachment 12601 [details] strncmp_avx2 patch for pr25933 Tested attached patch on https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr25933/master
(In reply to Sunil Pandey from comment #4) > Created attachment 12601 [details] > strncmp_avx2 patch for pr25933 > > Tested attached patch on > > https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr25933/master Looks good. Please try this diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S index 48d03a9f46..dabc3e7590 100644 --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S @@ -256,6 +256,11 @@ L(next_3_vectors): vpmovmskb %ymm0, %ecx testl %ecx, %ecx jne L(return_3_vec_size) +# ifdef USE_AS_STRNCMP + /* Check if VEC_SIZE * 4 already exceeded max compare count %r11 */ + cmpq $(VEC_SIZE * 4), %r11 + jbe L(zero) +# endif L(main_loop_header): leaq (VEC_SIZE * 4)(%rdi), %rdx movl $PAGE_SIZE, %ecx
(In reply to H.J. Lu from comment #5) > (In reply to Sunil Pandey from comment #4) > > Created attachment 12601 [details] > > strncmp_avx2 patch for pr25933 > > > > Tested attached patch on > > > > https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr25933/master > > Looks good. Please try this > > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S > b/sysdeps/x86_64/multiarch/strcmp-avx2.S > index 48d03a9f46..dabc3e7590 100644 > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S > @@ -256,6 +256,11 @@ L(next_3_vectors): > vpmovmskb %ymm0, %ecx > testl %ecx, %ecx > jne L(return_3_vec_size) > +# ifdef USE_AS_STRNCMP > + /* Check if VEC_SIZE * 4 already exceeded max compare count %r11 */ > + cmpq $(VEC_SIZE * 4), %r11 > + jbe L(zero) > +# endif > L(main_loop_header): > leaq (VEC_SIZE * 4)(%rdi), %rdx > movl $PAGE_SIZE, %ecx It fixes the issue on my setup as expected. $ ./test-strncmp simple_strncmp stupid_strncmp __strncmp_avx2 __strncmp_sse42 __strncmp_ssse3 __strncmp_sse2 $ echo $? 0 $ git diff diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S index 48d03a9f46..84ffe2cd5c 100644 --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S @@ -256,6 +256,11 @@ L(next_3_vectors): vpmovmskb %ymm0, %ecx testl %ecx, %ecx jne L(return_3_vec_size) +# ifdef USE_AS_STRNCMP + /* Check if VEC_SIZE * 4 already exceeded max compare count %r11 */ + cmpq $(VEC_SIZE * 4), %r11 + jbe L(zero) +# endif L(main_loop_header): leaq (VEC_SIZE * 4)(%rdi), %rdx movl $PAGE_SIZE, %ecx
(In reply to Sunil Pandey from comment #6) > (In reply to H.J. Lu from comment #5) > > (In reply to Sunil Pandey from comment #4) > > > Created attachment 12601 [details] > > > strncmp_avx2 patch for pr25933 > > > > > > Tested attached patch on > > > > > > https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr25933/master > > > > Looks good. Please try this > > > > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S > > b/sysdeps/x86_64/multiarch/strcmp-avx2.S > > index 48d03a9f46..dabc3e7590 100644 > > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S > > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S > > @@ -256,6 +256,11 @@ L(next_3_vectors): > > vpmovmskb %ymm0, %ecx > > testl %ecx, %ecx > > jne L(return_3_vec_size) > > +# ifdef USE_AS_STRNCMP > > + /* Check if VEC_SIZE * 4 already exceeded max compare count %r11 */ > > + cmpq $(VEC_SIZE * 4), %r11 > > + jbe L(zero) > > +# endif > > L(main_loop_header): > > leaq (VEC_SIZE * 4)(%rdi), %rdx > > movl $PAGE_SIZE, %ecx > > It fixes the issue on my setup as expected. > > $ ./test-strncmp > simple_strncmp stupid_strncmp __strncmp_avx2 > __strncmp_sse42 __strncmp_ssse3 __strncmp_sse2 > $ echo $? > 0 > Did you run "make check"?
(In reply to H.J. Lu from comment #7) > (In reply to Sunil Pandey from comment #6) > > (In reply to H.J. Lu from comment #5) > > > (In reply to Sunil Pandey from comment #4) > > > > Created attachment 12601 [details] > > > > strncmp_avx2 patch for pr25933 > > > > > > > > Tested attached patch on > > > > > > > > https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr25933/master > > > > > > Looks good. Please try this > > > > > > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S > > > b/sysdeps/x86_64/multiarch/strcmp-avx2.S > > > index 48d03a9f46..dabc3e7590 100644 > > > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S > > > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S > > > @@ -256,6 +256,11 @@ L(next_3_vectors): > > > vpmovmskb %ymm0, %ecx > > > testl %ecx, %ecx > > > jne L(return_3_vec_size) > > > +# ifdef USE_AS_STRNCMP > > > + /* Check if VEC_SIZE * 4 already exceeded max compare count %r11 */ > > > + cmpq $(VEC_SIZE * 4), %r11 > > > + jbe L(zero) > > > +# endif > > > L(main_loop_header): > > > leaq (VEC_SIZE * 4)(%rdi), %rdx > > > movl $PAGE_SIZE, %ecx > > > > It fixes the issue on my setup as expected. > > > > $ ./test-strncmp > > simple_strncmp stupid_strncmp __strncmp_avx2 > > __strncmp_sse42 __strncmp_ssse3 __strncmp_sse2 > > $ echo $? > > 0 > > > > Did you run "make check"? yes. it fixes strncmp and no new failure.
(In reply to Sunil Pandey from comment #8) > > yes. it fixes strncmp and no new failure. I got FAIL: wcsmbs/test-wcsncmp
(In reply to H.J. Lu from comment #9) > (In reply to Sunil Pandey from comment #8) > > > > yes. it fixes strncmp and no new failure. > > I got > > FAIL: wcsmbs/test-wcsncmp Please rebase users/hjl/pr25933/master branch.
I think L(loop_cross_page) block is incorrect. Please compare it against L(loop_cross_page) block in strcmp-sse2-unaligned.S.
The bug is around 593 testq %rdi, %rdi 594 je L(back_to_loop) At this point, there may be less than 4 vector length remaining: Breakpoint 1, __strncmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:594 594 je L(back_to_loop) (gdb) p $r11 $2 = 97 (gdb) next 303 vmovdqa (%rax), %ymm0 (gdb) 304 vmovdqa VEC_SIZE(%rax), %ymm3 (gdb) 305 VPCMPEQ (%rdx), %ymm0, %ymm4 (gdb) 306 VPCMPEQ VEC_SIZE(%rdx), %ymm3, %ymm1 (gdb) 307 VPMINU %ymm0, %ymm4, %ymm4 (gdb) 308 VPMINU %ymm3, %ymm1, %ymm1 (gdb) 309 vmovdqa (VEC_SIZE * 2)(%rax), %ymm2 (gdb) 310 VPMINU %ymm1, %ymm4, %ymm0 (gdb) 311 vmovdqa (VEC_SIZE * 3)(%rax), %ymm3 (gdb) 312 VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm2, %ymm5 (gdb) 313 VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm3, %ymm6 (gdb) Program received signal SIGSEGV, Segmentation fault. __strncmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:313 313 VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm3, %ymm6 (gdb)
There are L(loop_cross_page_2_vec): /* The first VEC_SIZE * 2 bytes match or are ignored. */ vmovdqu (VEC_SIZE * 2)(%rax, %r10), %ymm2 vmovdqu (VEC_SIZE * 3)(%rax, %r10), %ymm3 VPCMPEQ (VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5 VPMINU %ymm2, %ymm5, %ymm5 VPCMPEQ (VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6 VPCMPEQ %ymm7, %ymm5, %ymm5 VPMINU %ymm3, %ymm6, %ymm6 VPCMPEQ %ymm7, %ymm6, %ymm6 vpmovmskb %ymm5, %edi vpmovmskb %ymm6, %esi salq $32, %rsi xorq %rsi, %rdi xorl %r8d, %r8d /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes. */ subl $(VEC_SIZE * 2), %ecx jle 1f /* Skip ECX bytes. */ shrq %cl, %rdi /* R8 has number of bytes skipped. */ movl %ecx, %r8d 1: /* Before jumping back to the loop, set ESI to the number of VEC_SIZE * 4 blocks before page crossing. */ movl $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi testq %rdi, %rdi je L(back_to_loop) When this branch is taken, there are (VEC_SIZE * 4) + %r10 matching bytes starting at %rax, which may be >= the maximum offset.
Created attachment 12610 [details] strncmp_avx2 patch.1 for pr25933 I ran glibc make check and all test pass with this patch.
(In reply to Sunil Pandey from comment #14) > Created attachment 12610 [details] > strncmp_avx2 patch.1 for pr25933 > > I ran glibc make check and all test pass with this patch. You removed loop unrolling. Please provide all relevant glibc micro benchmarks data before and after your change.
Created attachment 12612 [details] strcmp_avx2 micro benchmark comparison. strcmp_avx2 micro benchmark before and after test with RDTSCP and CLOCK time.
Created attachment 12613 [details] strncmp_avx2 micro benchmark comparison. strncmp_avx2 micro benchmark before and after test with RDTSCP and CLOCK time.
Created attachment 12614 [details] wcscmp_avx2 micro benchmark comparison. wcscmp_avx2 micro benchmark before and after test with RDTSCP and CLOCK time.
Created attachment 12615 [details] wcsncmp_avx2 micro benchmark comparison. wcsncmp_avx2 micro benchmark before and after test with RDTSCP and CLOCK time.
(In reply to H.J. Lu from comment #15) > (In reply to Sunil Pandey from comment #14) > > Created attachment 12610 [details] > > strncmp_avx2 patch.1 for pr25933 > > > > I ran glibc make check and all test pass with this patch. > > You removed loop unrolling. Please provide all relevant glibc micro > benchmarks data before and after your change. I beleive strncmp_avx2 changes affects strcmp strncmp wcscmp wcsncmp Let me know if I miss any other relevant micro benchmark corresponding to this change.
(In reply to Sunil Pandey from comment #20) > (In reply to H.J. Lu from comment #15) > > (In reply to Sunil Pandey from comment #14) > > > Created attachment 12610 [details] > > > strncmp_avx2 patch.1 for pr25933 > > > > > > I ran glibc make check and all test pass with this patch. > > > > You removed loop unrolling. Please provide all relevant glibc micro > > benchmarks data before and after your change. > > I beleive strncmp_avx2 changes affects > > strcmp > strncmp > wcscmp > wcsncmp > > Let me know if I miss any other relevant micro benchmark corresponding to > this change. I added more bench tests to users/hjl/pr25933/master branch. Please re-collect numbers.
A patch is posted at https://sourceware.org/pipermail/libc-alpha/2020-June/115012.html
The release/2.31/master branch has been updated by H.J. Lu <hjl@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4e8a33a9590edc5c3a2cc5e726a3f2a73b66cdc0 commit 4e8a33a9590edc5c3a2cc5e726a3f2a73b66cdc0 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sat Jul 4 09:45:21 2020 -0700 NEWS: Mention BZ 25933 fix
The release/2.30/master branch has been updated by H.J. Lu <hjl@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=94abcef26ebbe89861128a9a62741e410104a342 commit 94abcef26ebbe89861128a9a62741e410104a342 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sat Jul 4 09:45:21 2020 -0700 NEWS: Mention BZ 25933 fix
The release/2.29/master branch has been updated by H.J. Lu <hjl@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=83aaa1714428ba3b29315c8c5d14b1766b2ca3aa commit 83aaa1714428ba3b29315c8c5d14b1766b2ca3aa Author: H.J. Lu <hjl.tools@gmail.com> Date: Sat Jul 4 09:45:21 2020 -0700 NEWS: Mention BZ 25933 fix
The release/2.28/master branch has been updated by H.J. Lu <hjl@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f82072183ad5b328f6a7cb91868cb1709e85d96c commit f82072183ad5b328f6a7cb91868cb1709e85d96c Author: H.J. Lu <hjl.tools@gmail.com> Date: Sat Jul 4 09:45:21 2020 -0700 NEWS: Mention BZ 25933 fix
Fixed for 2.32 and on 2.31/2.30/2.29/2.28 branches.