This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH, AArch64] Add optimized strchrnul
- From: Richard Earnshaw <rearnsha at arm dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Date: Fri, 13 Jun 2014 16:20:38 +0100
- Subject: Re: [PATCH, AArch64] Add optimized strchrnul
- Authentication-results: sourceware.org; auth=none
- References: <539AD11E dot 50507 at arm dot com> <20140613121210 dot GA3001 at domone dot podge>
On 13/06/14 13:12, Ondřej Bílka wrote:
> On Fri, Jun 13, 2014 at 11:23:26AM +0100, Richard Earnshaw wrote:
>> Here is an optimized implementation of __strchrnul. The simplification
>> that we don't have to track precisely why the loop terminates (match or
>> end-of-string) means we have to do less work in both setup and the core
>> inner loop. That means this should never be slower than strchr.
>>
>> As with strchr, the use of LD1 means we do not need different versions
>> for big-/little-endian.
>>
>> <date> Richard Earnshaw <rearnsha@arm.com>
>>
>> * sysdeps/aarch64/strchrnul.S: New file.
>>
>> OK?
>
> A few comments: the hot path in strchrnul is the first 64 bytes, so you
> should focus on those.
>
> First, get a profiler here. This is a simple program that collects the
> sizes of strchr calls and then runs them again. It is a good first
> approximation of real performance.
>
> http://kam.mff.cuni.cz/~ondra/dryrun_strchrnul.tar.bz2
>
> After you collect calls from programs that interest you, try to compare
> them. Comparing the old and new implementations is the minimum, but I
> have several questions.
>
> First, what is the latency of unaligned loads? One performance problem on
> x64 was small strings that cross a 64-byte boundary. It turned out to be
> faster to first check that we do not cross a page and then do an unaligned
> comparison on 64 bytes. That needs to be checked.
>
> The second trick is to first check for page crossing, then align to 16
> bytes and do a 32-byte compare, so you always compare at least 16 valid
> bytes in the header.
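The two tricks above can be modelled in C. This is only a sketch of the idea, not glibc code; the function names and the 4 KiB minimum page size are my assumptions:

```c
#include <stdint.h>

#define MIN_PAGE_SIZE 4096 /* assumed architectural minimum; real pages may be larger */

/* Trick 1: a 64-byte unaligned probe at the start of the string is safe
   only if it cannot run into the next (possibly unmapped) page.  */
static int header_probe_is_safe (const char *s)
{
  return ((uintptr_t) s % MIN_PAGE_SIZE) <= MIN_PAGE_SIZE - 64;
}

/* Trick 2: align the start pointer down to 16 bytes; a 32-byte compare
   from here then always covers at least 16 valid string bytes.  */
static const char *aligned_header_start (const char *s)
{
  return (const char *) ((uintptr_t) s & ~(uintptr_t) 15);
}
```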
>
Thanks for the hints and the link. I'll try to look into this more next
week.
On the question of latency for unaligned loads, the answer is that
there's no single answer; ARM just defines the architecture, and
implementations are then derived from it with different
micro-architectural trade-offs (I might be able to work out what our own
implementations do, but not what implementations by architecture
licensees do). Furthermore, answers to questions such as cache-line
length and even page size are similarly vague. I can probably assume
pages will not be smaller than 4k, but there's no guarantee that they
aren't bigger, and a test asking the kernel for the page size would
undoubtedly cost more time than we'd save. Similarly, cache lines might
be 64 bytes long, but there's no architectural guarantee that they
aren't shorter.
>
>
>> +
>> +ENTRY (__strchrnul)
>> + /* Magic constant 0x40100401 to allow us to identify which lane
>> + matches the termination condition. */
>> + mov wtmp2, #0x0401
>> + movk wtmp2, #0x4010, lsl #16
>> + dup vrepchr.16b, chrin
>> + bic src, srcin, #31 /* Work with aligned 32-byte hunks. */
>> + dup vrepmask.4s, wtmp2
>> + ands tmp1, srcin, #31
>> + b.eq L(loop)
>> +
> Omitting this branch would improve performance if the code below still
> works on an aligned hunk. In the loop there is a risk of branch
> misprediction, which is slower than staying in the header. The condition
> could also cause a branch misprediction, but when I looked at the data,
> 32-byte-aligned strings are rare.
>
Hmm, interesting observation. I agree that even with strings regularly
aligned to 8 bytes, they often will not be 32-byte aligned (a 1 in 4
chance), decreasing to 1 in 32 when the input alignment is fully random.
The trade-off here is that for the first non-aligned iteration we have
to do more work in the Neon unit (calculating an accurate syndrome so
that we can correctly mask it); once we hit the main loop we can trade a
faster check for a bit more work in the epilogue.
On the other hand, removing that branch gives me more freedom to pull
the initial load forward and otherwise reorganise the code, in addition
to the benefit of eliminating a branch that won't in practice be very
predictable. It should be possible to share the entry sequence: at most
it means forcing the mask to enable all lanes in the vector, and that's
just a conditional select instruction.
R.
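For reference, the neg/lsl/lsr/bic padding-mask sequence in the quoted hunk below can be modelled in C. The helper name is mine; it assumes, as set up by the repeated 0x40100401 mask, that the syndrome carries two bits per input byte with the first byte in the low bits:

```c
#include <stdint.h>

/* Clear the 2*misalign low syndrome bits corresponding to the padding
   bytes before the string start.  misalign = srcin & 31, nonzero on
   this path.  */
static uint64_t mask_padding (uint64_t syndrome, unsigned misalign)
{
  uint64_t all = ~UINT64_C (0);
  /* lsr by -2*misalign: AArch64 takes the shift amount modulo 64,
     so this shifts right by 64 - 2*misalign, leaving the low
     2*misalign bits set.  */
  uint64_t pad = all >> ((64 - 2 * misalign) & 63);
  return syndrome & ~pad;   /* bic: syndrome AND NOT pad */
}
```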
>> + /* Input string is not 32-byte aligned. Rather than forcing
>> + the padding bytes to a safe value, we calculate the syndrome
>> + for all the bytes, but then mask off those bits of the
>> + syndrome that are related to the padding. */
>> + ld1 {vdata1.16b, vdata2.16b}, [src], #32
>> + neg tmp1, tmp1
>> + cmeq vhas_nul1.16b, vdata1.16b, #0
>> + cmeq vhas_chr1.16b, vdata1.16b, vrepchr.16b
>> + cmeq vhas_nul2.16b, vdata2.16b, #0
>> + cmeq vhas_chr2.16b, vdata2.16b, vrepchr.16b
>> + orr vhas_chr1.16b, vhas_chr1.16b, vhas_nul1.16b
>> + orr vhas_chr2.16b, vhas_chr2.16b, vhas_nul2.16b
>> + and vhas_chr1.16b, vhas_chr1.16b, vrepmask.16b
>> + and vhas_chr2.16b, vhas_chr2.16b, vrepmask.16b
>> + lsl tmp1, tmp1, #1
>> + addp vend1.16b, vhas_chr1.16b, vhas_chr2.16b // 256->128
>> + mov tmp3, #~0
>> + addp vend1.16b, vend1.16b, vend1.16b // 128->64
>> + lsr tmp1, tmp3, tmp1
>> +
>> + mov tmp3, vend1.2d[0]
>> + bic tmp1, tmp3, tmp1 // Mask padding bits.
>> + cbnz tmp1, L(tail)
>> +
>> +L(loop):
>> + ld1 {vdata1.16b, vdata2.16b}, [src], #32
>> + cmeq vhas_nul1.16b, vdata1.16b, #0
>> + cmeq vhas_chr1.16b, vdata1.16b, vrepchr.16b
>> + cmeq vhas_nul2.16b, vdata2.16b, #0
>> + cmeq vhas_chr2.16b, vdata2.16b, vrepchr.16b
>> + /* Use a fast check for the termination condition. */
>> + orr vhas_chr1.16b, vhas_nul1.16b, vhas_chr1.16b
>> + orr vhas_chr2.16b, vhas_nul2.16b, vhas_chr2.16b
>> + orr vend1.16b, vhas_chr1.16b, vhas_chr2.16b
>> + addp vend1.2d, vend1.2d, vend1.2d
>> + mov tmp1, vend1.2d[0]
>> + cbz tmp1, L(loop)
>> +
>> + /* Termination condition found. Now need to establish exactly why
>> + we terminated. */
>> + and vhas_chr1.16b, vhas_chr1.16b, vrepmask.16b
>> + and vhas_chr2.16b, vhas_chr2.16b, vrepmask.16b
>> + addp vend1.16b, vhas_chr1.16b, vhas_chr2.16b // 256->128
>> + addp vend1.16b, vend1.16b, vend1.16b // 128->64
>> +
>> + mov tmp1, vend1.2d[0]
>> +L(tail):
>> + /* Count the trailing zeros, by bit reversing... */
>> + rbit tmp1, tmp1
>> + /* Re-bias source. */
>> + sub src, src, #32
>> + clz tmp1, tmp1 /* ... and counting the leading zeros. */
>> + /* tmp1 is twice the offset into the fragment. */
>> + add result, src, tmp1, lsr #1
>> + ret
>> +
>> +END(__strchrnul)
>> +weak_alias (__strchrnul, strchrnul)
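For comparison, the semantics the patch implements, and the tail's offset computation, can be sketched in scalar C. These are illustrative models only (the names are mine), not glibc's code:

```c
#include <stddef.h>

/* strchrnul semantics: return a pointer to the first occurrence of c
   in s, or to the terminating NUL if c does not occur.  */
static char *scalar_strchrnul (const char *s, int c)
{
  while (*s != (char) c && *s != '\0')
    s++;
  return (char *) s;
}

/* Model of the tail: rbit + clz is AArch64's count-trailing-zeros
   idiom; each byte contributes two syndrome bits, so halving the
   count gives the byte offset.  The syndrome is nonzero on this path
   (ctz of zero is undefined).  */
static unsigned syndrome_to_offset (unsigned long long syndrome)
{
  return (unsigned) __builtin_ctzll (syndrome) >> 1;
}
```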