This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
On 07/02/2015 01:11 AM, OndÅej BÃlka wrote:
Currently I'm rewriting the optimized functions in order to load the first 16 bytes unaligned or up to page boundary.On Fri, Jun 26, 2015 at 01:51:52PM +0200, Stefan Liebler wrote:This patch provides optimized version of memrchr with the z13 vector instructions.++.Lfound_g16: + vlgvb %r1,%v17,7 /* Index of c or 16 if not found. */ + lgr %r2,%r5 + lghi %r4,15 /* Highest index of a full vr. */ + clr %r1,%r4 +.Lfound_lt16: + la %r2,0(%r1,%r2) /* Store current pointer to found c. */ + ber %r14 /* Return if found c is last loaded byte. */ + + /* Shift vector elements left and start search again. */ + aghi %r1,1 /* Start search after current index. */ + slr %r4,%r1 /* New highest index. */ + sll %r1,3 /* Calculate byte count for vector shift + left. */ + vlvgg %v17,%r1,0 + vslb %v16,%v16,%v17 /* Vector shift left by byte by number of bytes + specified in bits 1-4 of byte 7 in v17. */ + vfeeb %v17,%v16,%v18 /* Find c. */ + la %r5,1(%r2) /* Save start-address of shifted v16. */ + vlgvb %r1,%v17,7 /* Index of c or 16 if not found */ + clr %r1,%r4 + locgrle %r2,%r5 /* Use stored address as base if c found. */ + jle .Lfound_lt16 /* Found c within loaded bytes. */ + br %r14 /* No further c found, return last stored c. */ + +.Lfound_end: + la %r2,0(%r4,%r2) /* Return pointer to found c */ + br %r14 +This sequence looks slow. Both here and in strrchr you should use something better than loop. One possibility is to use shuffle to do 16-byte byteswap but it needs load constant. A second way is to use same trick as in calculating clz with ctz. If instruction produces nonzero for match an 0 for nonmatch then to find highest nonzero byte you first need or all lower bytes with it for example by x |= x >> 8; x |= x >> 4; x |= x >> 2; x |= x >> 1; Then you look for index of first zero byte which is one more than highest nonzero byte. When roles of 1 and 0 are reversed and all nonzero bytes have bit in common you could use &. For strrchr idea is same. When handling trailing zero you could make similar trick with zero mask x |= x << 8; x |= x << 4; x |= x << 2; x |= x << 1; x = ~x; to construct vector that is 255 before and at first zero byte and 0 after that. Its important that first zero will get 255 to handle strrchr(x,0) without separate check if c is zero.
For strrchr I'm now searching for c in 16byte blocks and mark the last block with an found c - as you recommended.
After reaching end of string, I check once for further occurrences of c in the marked block by shifting right the bytes in marked vector register to mask out the garbage after a possible zero, reversing the bytes in vector-register and searching for "the first" c again. The result is the last occurrence of c in the marked 16byte block and the whole string.
And you are right, the mentioned worst case - string with only c's - is now much faster.
I post the new version of patch-series after wcsrchr, memrchr are changed, too.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |