This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 27/27] S390: Optimize memrchr.




On 07/02/2015 01:11 AM, OndÅej BÃlka wrote:
On Fri, Jun 26, 2015 at 01:51:52PM +0200, Stefan Liebler wrote:
This patch provides optimized version of memrchr with the z13 vector
instructions.
+
+.Lfound_g16:
+	vlgvb	%r1,%v17,7	/* Index of c or 16 if not found.  */
+	lgr	%r2,%r5
+	lghi	%r4,15		/* Highest index of a full vr.  */
+	clr	%r1,%r4
+.Lfound_lt16:
+	la	%r2,0(%r1,%r2)	/* Store current pointer to found c.  */
+	ber	%r14		/* Return if found c is last loaded byte.  */
+
+	/* Shift vector elements left and start search again.  */
+	aghi	%r1,1		/* Start search after current index.  */
+	slr	%r4,%r1		/* New highest index.  */
+	sll	%r1,3		/* Calculate byte count for vector shift
+				   left.  */
+	vlvgg	%v17,%r1,0
+	vslb	%v16,%v16,%v17	/* Vector shift left by byte by number of bytes
+				   specified in bits 1-4 of byte 7 in v17.  */
+	vfeeb	%v17,%v16,%v18	/* Find c.  */
+	la	%r5,1(%r2)	/* Save start-address of shifted v16.  */
+	vlgvb	%r1,%v17,7	/* Index of c or 16 if not found  */
+	clr	%r1,%r4
+	locgrle	%r2,%r5		/* Use stored address as base if c found.  */
+	jle	.Lfound_lt16	/* Found c within loaded bytes.  */
+	br	%r14		/* No further c found, return last stored c.  */
+
+.Lfound_end:
+	la	%r2,0(%r4,%r2)	/* Return pointer to found c  */
+	br	%r14
+
This sequence looks slow. Both here and in strrchr you should use
something better than loop. One possibility is to use shuffle to do
16-byte byteswap but it needs load constant.

A second way is to use same trick as in calculating clz with ctz. If
instruction produces nonzero for match an 0 for nonmatch then to find
highest nonzero byte you first need or all lower bytes with it for
example by

x |= x >> 8;
x |= x >> 4;
x |= x >> 2;
x |= x >> 1;

Then you look for index of first zero byte which is one more than
highest nonzero byte. When roles of 1 and 0 are reversed and all nonzero
bytes have bit in common you could use &.



For strrchr idea is same. When handling trailing zero you could make
similar trick with zero mask
x |= x << 8;
x |= x << 4;
x |= x << 2;
x |= x << 1;
x = ~x;
to construct vector that is 255 before and at first zero byte and 0 after
that. Its important that first zero will get 255 to handle strrchr(x,0)
without separate check if c is zero.


Currently I'm rewriting the optimized functions in order to load the first 16 bytes unaligned or up to page boundary.

For strrchr I'm now searching for c in 16byte blocks and mark the last block with an found c - as you recommended.

After reaching end of string, I check once for further occurrences of c in the marked block by shifting right the bytes in marked vector register to mask out the garbage after a possible zero, reversing the bytes in vector-register and searching for "the first" c again. The result is the last occurrence of c in the marked 16byte block and the whole string.

And you are right, the mentioned worst case - string with only c's - is now much faster.

I post the new version of patch-series after wcsrchr, memrchr are changed, too.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]