This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[PING][PATCH neleai/string-x64] Improve sse4 implementation of strpbrk, strspn and strcspn
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: libc-alpha at sourceware dot org
- Date: Thu, 2 Jul 2015 13:10:58 +0200
- Subject: [PING][PATCH neleai/string-x64] Improve sse4 implementation of strpbrk, strspn and strcspn
- Authentication-results: sourceware.org; auth=none
- References: <20150626204713 dot GA8001 at domone>
ping
On Fri, Jun 26, 2015 at 10:47:13PM +0200, Ondřej Bílka wrote:
> Hi,
>
> I had an idea to optimize strspn further by exploiting the fact that 75% of
> the time the mismatch happens in the first character. I looked at the sse4.2
> implementation and found that its control flow is suboptimal. I replaced it
> with the common scheme of first checking 64 bytes with unaligned loads,
> unless they cross a page boundary, and then using a 16-byte loop.
>
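For readers following along, the page-boundary test described above can be sketched in C. This is only an illustration, not code from the patch: the helper name may_cross_page and the constant SKETCH_PAGE_SIZE are made up here, and a 4096-byte page is assumed (the same assumption the patch encodes in its $4095, $4032 and $4080 constants).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on typical x86-64 configurations. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical helper: returns nonzero when a read of n bytes starting at
   addr would extend past the end of the page that contains addr.  */
static int
may_cross_page (uintptr_t addr, size_t n)
{
  /* Precondition of this sketch: the read is no larger than one page.  */
  assert (n <= SKETCH_PAGE_SIZE);

  /* Offset of addr within its page, then compare against the last offset
     from which n bytes still fit.  */
  return (addr & (SKETCH_PAGE_SIZE - 1)) > SKETCH_PAGE_SIZE - n;
}
```

With n = 64 this corresponds to the "cmp $4032" test on s, and with n = 16 to the "cmp $4080" test on the accept string; when either test fires, the patch falls back to copying the bytes into a stack buffer first.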
> When I checked the generated assembly, gcc had messed that up. To test a
> flag, gcc first zeroed a register, then used a conditional move to set that
> register when the flag was set, and finally checked whether the register
> was nonzero.
>
> Just fixing that mistake and using the flags directly makes the loop 25-40%
> faster.
>
> The new control flow makes strpbrk and strcspn around 10% faster in practice.
>
> Results are here.
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strpbrk_profile.html
>
> On my todo list is extending the sse4.2 code to handle accept longer than
> 16 bytes. Second is an sse2 optimization. That relates to the strpbrk prolog:
> the idea is to compare the first 16 bytes of s against each byte of a and OR
> the resulting vectors. Now I need to compare that with the overhead of
> constructing a 256-byte table. Depending on the accept size I might do it
> several times until it costs the same as the table construction. On core2
> it's 40% faster for the gcc workload and also faster when accept is less
> than 6 bytes.
>
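A scalar model of the trade-off described above, for illustration only. The prolog compares the first 16 bytes of s against each byte of accept and ORs the per-byte match masks (a stand-in for pcmpeqb, por and pmovmskb); the alternative builds a 256-byte membership table up front. All function names are hypothetical, and this is portable C rather than the SSE2 code the mail has in mind, so it says nothing about the actual timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for pcmpeqb + pmovmskb: bit i is set when block[i] == c.  */
static uint32_t
eq_mask16 (const unsigned char block[16], unsigned char c)
{
  uint32_t m = 0;
  for (int i = 0; i < 16; i++)
    m |= (uint32_t) (block[i] == c) << i;
  return m;
}

/* Hypothetical prolog: looks only at the first 16 bytes of s.  Returns the
   strcspn result if it is decided within those bytes, or 16 if the caller
   has to continue (for example with the table-based loop).  */
static size_t
strcspn_prolog16 (const char *s, const char *a)
{
  assert (s != NULL && a != NULL);

  /* Copy at most 16 bytes into a zero-padded block so that this sketch
     never reads past the terminator of s.  */
  unsigned char block[16] = { 0 };
  for (int i = 0; i < 16 && s[i] != '\0'; i++)
    block[i] = (unsigned char) s[i];

  /* Start with the terminator positions, then OR in one mask per
     accept byte.  */
  uint32_t hits = eq_mask16 (block, 0);
  for (; *a != '\0'; a++)
    hits |= eq_mask16 (block, (unsigned char) *a);

  /* The lowest set bit is the first stopping position.  */
  for (size_t i = 0; i < 16; i++)
    if (hits & (1u << i))
      return i;
  return 16;
}

/* Table approach: build a 256-entry membership table once, then scan.  */
static size_t
strcspn_table (const char *s, const char *a)
{
  assert (s != NULL && a != NULL);

  unsigned char table[256] = { 0 };
  table[0] = 1;			/* The terminator always stops the scan.  */
  for (; *a != '\0'; a++)
    table[(unsigned char) *a] = 1;

  size_t n = 0;
  while (!table[(unsigned char) s[n]])
    n++;
  return n;
}
```

The prolog costs one compare-and-OR pass per accept byte, while the table costs a fixed setup regardless of accept length, which is why the break-even point depends on the accept size and on the cpu.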
> The problem is that these constants are quite cpu-dependent. How should I
> handle that?
>
> So is it ok to check this patch?
>
> * sysdeps/x86_64/multiarch/Makefile: Updated.
> * sysdeps/x86_64/multiarch/strcspn-c.c: Deleted.
> * sysdeps/x86_64/multiarch/strpbrk-c.c: Likewise.
> * sysdeps/x86_64/multiarch/strspn-c.c: Likewise.
> * sysdeps/x86_64/multiarch/varshift.c: Likewise.
> * sysdeps/x86_64/multiarch/varshift.h: Likewise.
> * sysdeps/x86_64/multiarch/strpbrk_sse42.S: New file.
> * sysdeps/x86_64/multiarch/strspn_sse42.S: Likewise.
> * sysdeps/x86_64/multiarch/strcspn_sse42.S: Likewise.
>
> ---
> sysdeps/x86_64/multiarch/Makefile | 6 +-
> sysdeps/x86_64/multiarch/strcspn-c.c | 173 --------------------------
> sysdeps/x86_64/multiarch/strcspn_sse42.S | 3 +
> sysdeps/x86_64/multiarch/strpbrk-c.c | 8 --
> sysdeps/x86_64/multiarch/strpbrk_sse42.S | 204 +++++++++++++++++++++++++++++++
> sysdeps/x86_64/multiarch/strspn-c.c | 145 ----------------------
> sysdeps/x86_64/multiarch/strspn_sse42.S | 3 +
> sysdeps/x86_64/multiarch/varshift.c | 25 ----
> sysdeps/x86_64/multiarch/varshift.h | 30 -----
> 9 files changed, 211 insertions(+), 386 deletions(-)
> delete mode 100644 sysdeps/x86_64/multiarch/strcspn-c.c
> create mode 100644 sysdeps/x86_64/multiarch/strcspn_sse42.S
> delete mode 100644 sysdeps/x86_64/multiarch/strpbrk-c.c
> create mode 100644 sysdeps/x86_64/multiarch/strpbrk_sse42.S
> delete mode 100644 sysdeps/x86_64/multiarch/strspn-c.c
> create mode 100644 sysdeps/x86_64/multiarch/strspn_sse42.S
> delete mode 100644 sysdeps/x86_64/multiarch/varshift.c
> delete mode 100644 sysdeps/x86_64/multiarch/varshift.h
>
> diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> index 8094162..05d5c9b 100644
> --- a/sysdeps/x86_64/multiarch/Makefile
> +++ b/sysdeps/x86_64/multiarch/Makefile
> @@ -22,11 +22,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c strcmp-ssse3 \
> strchr-sse2-no-bsf memcmp-ssse3 strstr-sse2-unaligned
>
> ifeq (yes,$(config-cflags-sse4))
> -sysdep_routines += strcspn-c strpbrk-c strspn-c varshift
> -CFLAGS-varshift.c += -msse4
> -CFLAGS-strcspn-c.c += -msse4
> -CFLAGS-strpbrk-c.c += -msse4
> -CFLAGS-strspn-c.c += -msse4
> +sysdep_routines += strcspn_sse42 strpbrk_sse42 strspn_sse42
> endif
>
> ifeq (yes,$(config-cflags-avx2))
> diff --git a/sysdeps/x86_64/multiarch/strcspn-c.c b/sysdeps/x86_64/multiarch/strcspn-c.c
> deleted file mode 100644
> index 60b2ed7..0000000
> --- a/sysdeps/x86_64/multiarch/strcspn-c.c
> +++ /dev/null
> @@ -1,173 +0,0 @@
> -/* strcspn with SSE4.2 intrinsics
> - Copyright (C) 2009-2015 Free Software Foundation, Inc.
> - Contributed by Intel Corporation.
> - This file is part of the GNU C Library.
> -
> - The GNU C Library is free software; you can redistribute it and/or
> - modify it under the terms of the GNU Lesser General Public
> - License as published by the Free Software Foundation; either
> - version 2.1 of the License, or (at your option) any later version.
> -
> - The GNU C Library is distributed in the hope that it will be useful,
> - but WITHOUT ANY WARRANTY; without even the implied warranty of
> - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> - Lesser General Public License for more details.
> -
> - You should have received a copy of the GNU Lesser General Public
> - License along with the GNU C Library; if not, see
> - <http://www.gnu.org/licenses/>. */
> -
> -#include <nmmintrin.h>
> -#include <string.h>
> -#include "varshift.h"
> -
> -/* We use 0x2:
> - _SIDD_SBYTE_OPS
> - | _SIDD_CMP_EQUAL_ANY
> - | _SIDD_POSITIVE_POLARITY
> - | _SIDD_LEAST_SIGNIFICANT
> - on pcmpistri to compare xmm/mem128
> -
> - 0 1 2 3 4 5 6 7 8 9 A B C D E F
> - X X X X X X X X X X X X X X X X
> -
> - against xmm
> -
> - 0 1 2 3 4 5 6 7 8 9 A B C D E F
> - A A A A A A A A A A A A A A A A
> -
> - to find out if the first 16byte data element has any byte A and
> - the offset of the first byte. There are 3 cases:
> -
> - 1. The first 16byte data element has the byte A at the offset X.
> - 2. The first 16byte data element has EOS and doesn't have the byte A.
> - 3. The first 16byte data element is valid and doesn't have the byte A.
> -
> - Here is the table of ECX, CFlag, ZFlag and SFlag for 2 cases:
> -
> - 1 X 1 0/1 0
> - 2 16 0 1 0
> - 3 16 0 0 0
> -
> - We exit from the loop for cases 1 and 2 with jbe which branches
> - when either CFlag or ZFlag is 1. If CFlag == 1, ECX has the offset
> - X for case 1. */
> -
> -#ifndef STRCSPN_SSE2
> -# define STRCSPN_SSE2 __strcspn_sse2
> -# define STRCSPN_SSE42 __strcspn_sse42
> -#endif
> -
> -#ifdef USE_AS_STRPBRK
> -# define RETURN(val1, val2) return val1
> -#else
> -# define RETURN(val1, val2) return val2
> -#endif
> -
> -extern
> -#ifdef USE_AS_STRPBRK
> -char *
> -#else
> -size_t
> -#endif
> -STRCSPN_SSE2 (const char *, const char *);
> -
> -
> -#ifdef USE_AS_STRPBRK
> -char *
> -#else
> -size_t
> -#endif
> -__attribute__ ((section (".text.sse4.2")))
> -STRCSPN_SSE42 (const char *s, const char *a)
> -{
> - if (*a == 0)
> - RETURN (NULL, strlen (s));
> -
> - const char *aligned;
> - __m128i mask;
> - int offset = (int) ((size_t) a & 15);
> - if (offset != 0)
> - {
> - /* Load masks. */
> - aligned = (const char *) ((size_t) a & -16L);
> - __m128i mask0 = _mm_load_si128 ((__m128i *) aligned);
> -
> - mask = __m128i_shift_right (mask0, offset);
> -
> - /* Find where the NULL terminator is. */
> - int length = _mm_cmpistri (mask, mask, 0x3a);
> - if (length == 16 - offset)
> - {
> - /* There is no NULL terminator. */
> - __m128i mask1 = _mm_load_si128 ((__m128i *) (aligned + 16));
> - int index = _mm_cmpistri (mask1, mask1, 0x3a);
> - length += index;
> -
> - /* Don't use SSE4.2 if the length of A > 16. */
> - if (length > 16)
> - return STRCSPN_SSE2 (s, a);
> -
> - if (index != 0)
> - {
> - /* Combine mask0 and mask1. We could play games with
> - palignr, but frankly this data should be in L1 now
> - so do the merge via an unaligned load. */
> - mask = _mm_loadu_si128 ((__m128i *) a);
> - }
> - }
> - }
> - else
> - {
> - /* A is aligned. */
> - mask = _mm_load_si128 ((__m128i *) a);
> -
> - /* Find where the NULL terminator is. */
> - int length = _mm_cmpistri (mask, mask, 0x3a);
> - if (length == 16)
> - {
> - /* There is no NULL terminator. Don't use SSE4.2 if the length
> - of A > 16. */
> - if (a[16] != 0)
> - return STRCSPN_SSE2 (s, a);
> - }
> - }
> -
> - offset = (int) ((size_t) s & 15);
> - if (offset != 0)
> - {
> - /* Check partial string. */
> - aligned = (const char *) ((size_t) s & -16L);
> - __m128i value = _mm_load_si128 ((__m128i *) aligned);
> -
> - value = __m128i_shift_right (value, offset);
> -
> - int length = _mm_cmpistri (mask, value, 0x2);
> - /* No need to check ZFlag since ZFlag is always 1. */
> - int cflag = _mm_cmpistrc (mask, value, 0x2);
> - if (cflag)
> - RETURN ((char *) (s + length), length);
> - /* Find where the NULL terminator is. */
> - int index = _mm_cmpistri (value, value, 0x3a);
> - if (index < 16 - offset)
> - RETURN (NULL, index);
> - aligned += 16;
> - }
> - else
> - aligned = s;
> -
> - while (1)
> - {
> - __m128i value = _mm_load_si128 ((__m128i *) aligned);
> - int index = _mm_cmpistri (mask, value, 0x2);
> - int cflag = _mm_cmpistrc (mask, value, 0x2);
> - int zflag = _mm_cmpistrz (mask, value, 0x2);
> - if (cflag)
> - RETURN ((char *) (aligned + index), (size_t) (aligned + index - s));
> - if (zflag)
> - RETURN (NULL,
> - /* Find where the NULL terminator is. */
> - (size_t) (aligned + _mm_cmpistri (value, value, 0x3a) - s));
> - aligned += 16;
> - }
> -}
> diff --git a/sysdeps/x86_64/multiarch/strcspn_sse42.S b/sysdeps/x86_64/multiarch/strcspn_sse42.S
> new file mode 100644
> index 0000000..3e4e659
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/strcspn_sse42.S
> @@ -0,0 +1,3 @@
> +#define AS_STRCSPN
> +#define __strpbrk_sse42 __strcspn_sse42
> +#include "strpbrk_sse42.S"
> diff --git a/sysdeps/x86_64/multiarch/strpbrk-c.c b/sysdeps/x86_64/multiarch/strpbrk-c.c
> deleted file mode 100644
> index bbf5c49..0000000
> --- a/sysdeps/x86_64/multiarch/strpbrk-c.c
> +++ /dev/null
> @@ -1,8 +0,0 @@
> -/* Don't define multiple versions for strpbrk in static library since we
> - need strpbrk before the initialization happened. */
> -#ifdef SHARED
> -# define USE_AS_STRPBRK
> -# define STRCSPN_SSE2 __strpbrk_sse2
> -# define STRCSPN_SSE42 __strpbrk_sse42
> -# include "strcspn-c.c"
> -#endif
> diff --git a/sysdeps/x86_64/multiarch/strpbrk_sse42.S b/sysdeps/x86_64/multiarch/strpbrk_sse42.S
> new file mode 100644
> index 0000000..512ac19
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/strpbrk_sse42.S
> @@ -0,0 +1,204 @@
> +/* strcspn (str, ss) -- Return the length of the initial segment of STR
> + which contains no characters from SS.
> + Copyright (C) 2015 Free Software Foundation, Inc.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +
> +#ifdef AS_STRSPN
> +# define AS_STRCSPN
> +# define MATCH_ALL $18
> +#else
> +# define MATCH_ALL $2
> +#endif
> +
> +ENTRY(__strpbrk_sse42)
> + movq %rdi, %rax
> + andl $4095, %eax
> + cmp $4032, %eax
> + ja L(cross_page)
> + movq %rsi, %rax
> + andl $4095, %eax
> + cmp $4080, %eax
> + ja L(cross_page)
> + movdqu (%rsi), %xmm4
> + movdqu (%rdi), %xmm1
> + movdqu 16(%rdi), %xmm5
> + movdqu 32(%rdi), %xmm6
> + movdqu 48(%rdi), %xmm7
> +
> +L(back_from_crosspage):
> + pxor %xmm3, %xmm3
> + pxor %xmm2, %xmm2
> +
> + pcmpeqb %xmm4, %xmm2
> + pmovmskb %xmm2, %eax
> + testl %eax, %eax
> + je L(call)
> + pcmpistri MATCH_ALL, %xmm1, %xmm4
> + jc L(rx0)
> + je L(ret0)
> + pcmpistri MATCH_ALL, %xmm5, %xmm4
> + jc L(rx16)
> + je L(ret16)
> + pcmpistri MATCH_ALL, %xmm6, %xmm4
> + jc L(rx32)
> + je L(ret32)
> + pcmpistri MATCH_ALL, %xmm7, %xmm4
> + jc L(rx48)
> + je L(ret48)
> +
> + movq %rdi, %rax
> + andq $-16, %rax
> + addq $16, %rax
> + .p2align 4,,10
> + .p2align 3
> +L(loop):
> + pcmpistri MATCH_ALL, (%rax), %xmm4
> + lea 16(%rax), %rax
> + jc L(rx_loop)
> + jne L(loop)
> +#ifdef AS_STRCSPN
> + movdqa -16(%rax), %xmm1
> + pcmpistri $58, %xmm1, %xmm1
> + lea -16(%rcx, %rax), %rax
> + sub %rdi, %rax
> +#else
> + xor %eax, %eax
> +#endif
> + ret
> +L(rx_loop):
> + lea -16(%rcx, %rax), %rax
> +#ifdef AS_STRCSPN
> + sub %rdi, %rax
> +#endif
> + ret
> + .p2align 4,,10
> + .p2align 3
> +#ifndef AS_STRCSPN
> +L(ret0):
> +L(ret16):
> +L(ret32):
> +L(ret48):
> + xorl %eax, %eax
> + ret
> +#endif
> +L(call):
> +#ifdef AS_STRCSPN
> +# ifdef AS_STRSPN
> + jmp __strspn_sse2
> +# else
> + jmp __strcspn_sse2
> +# endif
> +#else
> + jmp __strpbrk_sse2
> +#endif
> + .p2align 4,,10
> + .p2align 3
> +#ifdef AS_STRCSPN
> +L(ret0):
> + pcmpistri $58, %xmm1, %xmm1
> +L(rx0):
> + lea 0(%rcx), %rax
> +#else
> +L(rx0):
> + leaq (%rdi,%rcx), %rax
> +#endif
> + ret
> +#ifdef AS_STRCSPN
> +L(ret16):
> + pcmpistri $58, %xmm5, %xmm5
> +L(rx16):
> + lea 16(%rcx), %rax
> +#else
> +L(rx16):
> + leaq 16(%rdi,%rcx), %rax
> +#endif
> + ret
> +#ifdef AS_STRCSPN
> +L(ret32):
> + pcmpistri $58, %xmm6, %xmm6
> +L(rx32):
> + lea 32(%rcx), %rax
> +#else
> +L(rx32):
> + leaq 32(%rdi,%rcx), %rax
> +#endif
> + ret
> +#ifdef AS_STRCSPN
> +L(ret48):
> + pcmpistri $58, %xmm7, %xmm7
> +L(rx48):
> + lea 48(%rcx), %rax
> +#else
> +L(rx48):
> + leaq 48(%rdi,%rcx), %rax
> +#endif
> + ret
> +
> + .p2align 4,,10
> + .p2align 3
> +L(cross_page):
> + movzbl (%rdi), %ecx
> + xorl %eax, %eax
> + leaq -80(%rsp), %r8
> + testb %cl, %cl
> + je L(sloop_end)
> + leaq -80(%rsp), %r8
> + xorl %edx, %edx
> + xorl %eax, %eax
> + .p2align 4,,10
> + .p2align 3
> +L(sloop):
> + movb %cl, (%r8,%rdx)
> + movzbl 1(%rdi,%rdx), %ecx
> + addl $1, %eax
> + testb %cl, %cl
> + je L(sloop_end)
> + addq $1, %rdx
> + cmpl $64, %eax
> + jne L(sloop)
> +L(sloop_end):
> + movzbl (%rsi), %ecx
> + cltq
> + movb $0, -80(%rsp,%rax)
> + movdqu (%r8), %xmm1
> + movdqu 16(%r8), %xmm5
> + movdqu 32(%r8), %xmm6
> + movdqu 48(%r8), %xmm7
> +
> + xorl %eax, %eax
> + testb %cl, %cl
> + je L(aloop_end)
> + xorl %edx, %edx
> + .p2align 4,,10
> + .p2align 3
> +L(aloop):
> + movb %cl, (%r8,%rdx)
> + movzbl 1(%rsi,%rdx), %ecx
> + addl $1, %eax
> + testb %cl, %cl
> + je L(aloop_end)
> + addq $1, %rdx
> + cmpl $16, %eax
> + jne L(aloop)
> +L(aloop_end):
> + cltq
> + movb $0, -80(%rsp,%rax)
> + movdqu (%r8), %xmm4
> + jmp L(back_from_crosspage)
> +END(__strpbrk_sse42)
> diff --git a/sysdeps/x86_64/multiarch/strspn-c.c b/sysdeps/x86_64/multiarch/strspn-c.c
> deleted file mode 100644
> index 6b0c80a..0000000
> --- a/sysdeps/x86_64/multiarch/strspn-c.c
> +++ /dev/null
> @@ -1,145 +0,0 @@
> -/* strspn with SSE4.2 intrinsics
> - Copyright (C) 2009-2015 Free Software Foundation, Inc.
> - Contributed by Intel Corporation.
> - This file is part of the GNU C Library.
> -
> - The GNU C Library is free software; you can redistribute it and/or
> - modify it under the terms of the GNU Lesser General Public
> - License as published by the Free Software Foundation; either
> - version 2.1 of the License, or (at your option) any later version.
> -
> - The GNU C Library is distributed in the hope that it will be useful,
> - but WITHOUT ANY WARRANTY; without even the implied warranty of
> - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> - Lesser General Public License for more details.
> -
> - You should have received a copy of the GNU Lesser General Public
> - License along with the GNU C Library; if not, see
> - <http://www.gnu.org/licenses/>. */
> -
> -#include <nmmintrin.h>
> -#include <string.h>
> -#include "varshift.h"
> -
> -/* We use 0x12:
> - _SIDD_SBYTE_OPS
> - | _SIDD_CMP_EQUAL_ANY
> - | _SIDD_NEGATIVE_POLARITY
> - | _SIDD_LEAST_SIGNIFICANT
> - on pcmpistri to compare xmm/mem128
> -
> - 0 1 2 3 4 5 6 7 8 9 A B C D E F
> - X X X X X X X X X X X X X X X X
> -
> - against xmm
> -
> - 0 1 2 3 4 5 6 7 8 9 A B C D E F
> - A A A A A A A A A A A A A A A A
> -
> - to find out if the first 16byte data element has any non-A byte and
> - the offset of the first byte. There are 2 cases:
> -
> - 1. The first 16byte data element has the non-A byte, including
> - EOS, at the offset X.
> - 2. The first 16byte data element is valid and doesn't have the non-A
> - byte.
> -
> - Here is the table of ECX, CFlag, ZFlag and SFlag for 2 cases:
> -
> - case ECX CFlag ZFlag SFlag
> - 1 X 1 0/1 0
> - 2 16 0 0 0
> -
> - We exit from the loop for case 1. */
> -
> -extern size_t __strspn_sse2 (const char *, const char *);
> -
> -
> -size_t
> -__attribute__ ((section (".text.sse4.2")))
> -__strspn_sse42 (const char *s, const char *a)
> -{
> - if (*a == 0)
> - return 0;
> -
> - const char *aligned;
> - __m128i mask;
> - int offset = (int) ((size_t) a & 15);
> - if (offset != 0)
> - {
> - /* Load masks. */
> - aligned = (const char *) ((size_t) a & -16L);
> - __m128i mask0 = _mm_load_si128 ((__m128i *) aligned);
> -
> - mask = __m128i_shift_right (mask0, offset);
> -
> - /* Find where the NULL terminator is. */
> - int length = _mm_cmpistri (mask, mask, 0x3a);
> - if (length == 16 - offset)
> - {
> - /* There is no NULL terminator. */
> - __m128i mask1 = _mm_load_si128 ((__m128i *) (aligned + 16));
> - int index = _mm_cmpistri (mask1, mask1, 0x3a);
> - length += index;
> -
> - /* Don't use SSE4.2 if the length of A > 16. */
> - if (length > 16)
> - return __strspn_sse2 (s, a);
> -
> - if (index != 0)
> - {
> - /* Combine mask0 and mask1. We could play games with
> - palignr, but frankly this data should be in L1 now
> - so do the merge via an unaligned load. */
> - mask = _mm_loadu_si128 ((__m128i *) a);
> - }
> - }
> - }
> - else
> - {
> - /* A is aligned. */
> - mask = _mm_load_si128 ((__m128i *) a);
> -
> - /* Find where the NULL terminator is. */
> - int length = _mm_cmpistri (mask, mask, 0x3a);
> - if (length == 16)
> - {
> - /* There is no NULL terminator. Don't use SSE4.2 if the length
> - of A > 16. */
> - if (a[16] != 0)
> - return __strspn_sse2 (s, a);
> - }
> - }
> -
> - offset = (int) ((size_t) s & 15);
> - if (offset != 0)
> - {
> - /* Check partial string. */
> - aligned = (const char *) ((size_t) s & -16L);
> - __m128i value = _mm_load_si128 ((__m128i *) aligned);
> -
> - value = __m128i_shift_right (value, offset);
> -
> - int length = _mm_cmpistri (mask, value, 0x12);
> - /* No need to check CFlag since it is always 1. */
> - if (length < 16 - offset)
> - return length;
> - /* Find where the NULL terminator is. */
> - int index = _mm_cmpistri (value, value, 0x3a);
> - if (index < 16 - offset)
> - return length;
> - aligned += 16;
> - }
> - else
> - aligned = s;
> -
> - while (1)
> - {
> - __m128i value = _mm_load_si128 ((__m128i *) aligned);
> - int index = _mm_cmpistri (mask, value, 0x12);
> - int cflag = _mm_cmpistrc (mask, value, 0x12);
> - if (cflag)
> - return (size_t) (aligned + index - s);
> - aligned += 16;
> - }
> -}
> diff --git a/sysdeps/x86_64/multiarch/strspn_sse42.S b/sysdeps/x86_64/multiarch/strspn_sse42.S
> new file mode 100644
> index 0000000..d460167
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/strspn_sse42.S
> @@ -0,0 +1,3 @@
> +#define AS_STRSPN
> +#define __strpbrk_sse42 __strspn_sse42
> +#include "strpbrk_sse42.S"
> diff --git a/sysdeps/x86_64/multiarch/varshift.c b/sysdeps/x86_64/multiarch/varshift.c
> deleted file mode 100644
> index 0007ef7..0000000
> --- a/sysdeps/x86_64/multiarch/varshift.c
> +++ /dev/null
> @@ -1,25 +0,0 @@
> -/* Helper for variable shifts of SSE registers.
> - Copyright (C) 2010-2015 Free Software Foundation, Inc.
> - This file is part of the GNU C Library.
> -
> - The GNU C Library is free software; you can redistribute it and/or
> - modify it under the terms of the GNU Lesser General Public
> - License as published by the Free Software Foundation; either
> - version 2.1 of the License, or (at your option) any later version.
> -
> - The GNU C Library is distributed in the hope that it will be useful,
> - but WITHOUT ANY WARRANTY; without even the implied warranty of
> - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> - Lesser General Public License for more details.
> -
> - You should have received a copy of the GNU Lesser General Public
> - License along with the GNU C Library; if not, see
> - <http://www.gnu.org/licenses/>. */
> -
> -#include "varshift.h"
> -
> -const int8_t ___m128i_shift_right[31] attribute_hidden =
> - {
> - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
> - -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
> - };
> diff --git a/sysdeps/x86_64/multiarch/varshift.h b/sysdeps/x86_64/multiarch/varshift.h
> deleted file mode 100644
> index 30ace3d..0000000
> --- a/sysdeps/x86_64/multiarch/varshift.h
> +++ /dev/null
> @@ -1,30 +0,0 @@
> -/* Helper for variable shifts of SSE registers.
> - Copyright (C) 2010-2015 Free Software Foundation, Inc.
> - This file is part of the GNU C Library.
> -
> - The GNU C Library is free software; you can redistribute it and/or
> - modify it under the terms of the GNU Lesser General Public
> - License as published by the Free Software Foundation; either
> - version 2.1 of the License, or (at your option) any later version.
> -
> - The GNU C Library is distributed in the hope that it will be useful,
> - but WITHOUT ANY WARRANTY; without even the implied warranty of
> - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> - Lesser General Public License for more details.
> -
> - You should have received a copy of the GNU Lesser General Public
> - License along with the GNU C Library; if not, see
> - <http://www.gnu.org/licenses/>. */
> -
> -#include <stdint.h>
> -#include <tmmintrin.h>
> -
> -extern const int8_t ___m128i_shift_right[31] attribute_hidden;
> -
> -static __inline__ __m128i
> -__m128i_shift_right (__m128i value, unsigned long int offset)
> -{
> - return _mm_shuffle_epi8 (value,
> - _mm_loadu_si128 ((__m128i *) (___m128i_shift_right
> - + offset)));
> -}
> --
> 1.8.4.rc3
--
Too much radiation coming from the soil.