This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH 27/27] S390: Optimize memrchr.
- From: Stefan Liebler <stli at linux dot vnet dot ibm dot com>
- To: libc-alpha at sourceware dot org
- Date: Thu, 02 Jul 2015 12:59:19 +0200
- Subject: Re: [PATCH 27/27] S390: Optimize memrchr.
- Authentication-results: sourceware.org; auth=none
- References: <1435319512-22245-1-git-send-email-stli at linux dot vnet dot ibm dot com> <1435319512-22245-28-git-send-email-stli at linux dot vnet dot ibm dot com> <20150701231138 dot GB20947 at domone>
On 07/02/2015 01:11 AM, Ondřej Bílka wrote:
Currently I'm rewriting the optimized functions in order to load the
first 16 bytes unaligned or only up to the page boundary.
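The decision between an unaligned 16-byte load and a load only up to the page boundary can be sketched in C like this (my portable illustration, not the actual z13 code; the 4 KiB page size is an assumption):

```c
#include <stdint.h>

/* Sketch of the page-crossing test behind "load the first 16 bytes
   unaligned or up to page boundary": an unaligned 16-byte load is
   safe whenever it cannot run into the next (possibly unmapped)
   page.  4 KiB pages are assumed here.  */
#define PAGE_SIZE 4096

static int load16_is_safe(const void *p)
{
    /* The offset within the page must leave room for 16 bytes.  */
    return ((uintptr_t)p & (PAGE_SIZE - 1)) <= PAGE_SIZE - 16;
}
```

If the check fails, the implementation would instead load only up to the boundary and continue into the next page afterwards.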
On Fri, Jun 26, 2015 at 01:51:52PM +0200, Stefan Liebler wrote:
This patch provides an optimized version of memrchr with the z13 vector instructions.
+ vlgvb %r1,%v17,7 /* Index of c or 16 if not found. */
+ lgr %r2,%r5
+ lghi %r4,15 /* Highest index of a full vr. */
+ clr %r1,%r4
+ la %r2,0(%r1,%r2) /* Store current pointer to found c. */
+ ber %r14 /* Return if found c is last loaded byte. */
+ /* Shift vector elements left and start search again. */
+ aghi %r1,1 /* Start search after current index. */
+ slr %r4,%r1 /* New highest index. */
+ sll %r1,3 /* Calculate byte count for vector shift
+ left. */
+ vlvgg %v17,%r1,0
+ vslb %v16,%v16,%v17 /* Vector shift left by byte by number of bytes
+ specified in bits 1-4 of byte 7 in v17. */
+ vfeeb %v17,%v16,%v18 /* Find c. */
+ la %r5,1(%r2) /* Save start-address of shifted v16. */
+ vlgvb %r1,%v17,7 /* Index of c or 16 if not found. */
+ clr %r1,%r4
+ locgrle %r2,%r5 /* Use stored address as base if c found. */
+ jle .Lfound_lt16 /* Found c within loaded bytes. */
+ br %r14 /* No further c found, return last stored c. */
+ la %r2,0(%r4,%r2) /* Return pointer to found c. */
+ br %r14
This sequence looks slow. Both here and in strrchr you should use
something better than a loop. One possibility is to use a shuffle to do a
16-byte byteswap, but that needs a loaded constant.
A second way is to use the same trick as in calculating clz with ctz. If
an instruction produces nonzero for a match and 0 for a nonmatch, then to
find the highest nonzero byte you first OR all lower bytes with it:
x |= x >> 8;
x |= x >> 4;
x |= x >> 2;
x |= x >> 1;
Then you look for the index of the first zero byte, which is one more than
the highest nonzero byte. When the roles of 1 and 0 are reversed and all
nonzero bytes have a bit in common, you can use & instead.
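As a scalar illustration of this trick (my sketch, not glibc code): treat a 64-bit word as 8 byte lanes, lane 0 least significant, with 0xff in every matching lane. Smearing the matches into all lower lanes turns "highest nonzero byte" into "first zero byte minus one":

```c
#include <stdint.h>

/* Scalar stand-in for the vector trick: 8 byte lanes in a uint64_t,
   lane i at bits [8*i, 8*i+8), 0xff in every lane that matched.
   After ORing each lane into all lower lanes, lanes 0..H are nonzero,
   where H is the highest matching lane, so the first zero lane is
   H + 1.  */
static int highest_match_lane(uint64_t mask)
{
    if (mask == 0)
        return -1;                 /* no match in this block */
    mask |= mask >> 32;            /* smear by 4, 2, 1 lanes */
    mask |= mask >> 16;
    mask |= mask >> 8;
    int first_zero = 0;            /* find the first zero lane */
    while (first_zero < 8 && ((mask >> (8 * first_zero)) & 0xff))
        first_zero++;
    return first_zero - 1;
}
```

On the vector side, a find-first-zero-byte style instruction would replace the final loop, just as in the clz-from-ctz trick.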
For strrchr the idea is the same. When handling the trailing zero you can
apply a similar trick to the zero mask:
x |= x << 8;
x |= x << 4;
x |= x << 2;
x |= x << 1;
x = ~x;
to construct a vector that is 255 before and at the first zero byte and 0
after it. It is important that the first zero byte gets 255, so that
strrchr(x, 0) is handled without a separate check for c being zero.
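In the same scalar lane model (lane i = string position i; a sketch of the stated property rather than the actual vector sequence), the zero-mask construction can be written as:

```c
#include <stdint.h>

/* zmask has 0xff in every lane whose string byte is NUL.  Move the
   marks one lane past each NUL, smear them toward higher lanes, and
   invert: the result is 0xff up to and including the first NUL and
   0x00 after it, so strrchr(s, 0) needs no separate check.  */
static uint64_t keep_through_first_zero(uint64_t zmask)
{
    uint64_t x = zmask << 8;       /* one lane past each NUL */
    x |= x << 32;                  /* smear by 4, 2, 1 lanes */
    x |= x << 16;
    x |= x << 8;
    return ~x;                     /* 0xff through the first NUL */
}
```

ANDing this mask with the match vector discards any matches in the garbage after the terminating NUL while keeping a match on the NUL itself.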
For strrchr I'm now searching for c in 16-byte blocks and marking the last
block with a found c, as you recommended.
After reaching the end of the string, I check once for further occurrences
of c in the marked block: the bytes in the marked vector register are
shifted right to mask out the garbage after a possible zero, the bytes in
the vector register are reversed, and "the first" c is searched for again.
The result is the last occurrence of c in the marked 16-byte block.
And you are right: the mentioned worst case, a string consisting only of
c's, is now much faster.
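The block-marking strategy can be sketched portably like this (plain C byte loops standing in for the z13 vector instructions; strrchr_blocks is a hypothetical name, not the glibc entry point):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the strategy described above: scan forward in 16-byte
   blocks, remember the start of the last block containing c, then
   search only that block backward.  The terminating NUL is included
   in the scan so that strrchr_blocks(s, 0) works without a special
   case.  */
static char *strrchr_blocks(const char *s, int c)
{
    size_t len = strlen(s) + 1;          /* include the NUL */
    const char *last_block = NULL;
    for (size_t i = 0; i < len; i += 16) {
        size_t n = len - i < 16 ? len - i : 16;
        if (memchr(s + i, c, n) != NULL) /* does this block contain c? */
            last_block = s + i;
    }
    if (last_block == NULL)
        return NULL;
    size_t n = (size_t)(s + len - last_block);
    if (n > 16)
        n = 16;
    while (n-- > 0)                      /* backward scan of one block */
        if ((unsigned char)last_block[n] == (unsigned char)c)
            return (char *)last_block + n;
    return NULL;
}
```

The worst case of a string containing only c's then touches each block once in the forward pass and scans a single block backward, instead of restarting the backward search for every match.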
I'll post the new version of the patch series after wcsrchr and memrchr are
reworked.