This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH, AArch64] Add optimized strchrnul
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Richard Earnshaw <rearnsha at arm dot com>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Date: Wed, 25 Jun 2014 03:28:59 +0200
- Subject: Re: [PATCH, AArch64] Add optimized strchrnul
- Authentication-results: sourceware.org; auth=none
- References: <539AD11E dot 50507 at arm dot com> <20140613121210 dot GA3001 at domone dot podge> <539B16C6 dot 3060401 at arm dot com>
On Fri, Jun 13, 2014 at 04:20:38PM +0100, Richard Earnshaw wrote:
> On 13/06/14 13:12, Ondřej Bílka wrote:
> > On Fri, Jun 13, 2014 at 11:23:26AM +0100, Richard Earnshaw wrote:
> >> Here is an optimized implementation of __strchrnul. The simplification
> >> that we don't have to track precisely why the loop terminates (match or
> >> end-of-string) means we have to do less work in both setup and the core
> >> inner loop. That means this should never be slower than strchr.
> >> As with strchr, the use of LD1 means we do not need different versions
> >> for big-/little-endian.
> >> <date> Richard Earnshaw <firstname.lastname@example.org>
> >> * sysdeps/aarch64/strchrnul.S: New file.
> >> OK?
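For readers of the archive, the semantics being implemented can be stated as a portable C reference (ref_strchrnul is a name used only in this sketch; the actual patch is AArch64 assembly):

```c
/* Portable C reference for the semantics the AArch64 assembly
   implements: return a pointer to the first occurrence of c in s, or
   to the terminating NUL if c does not occur.  Unlike strchr, the
   caller never gets NULL back, which is the simplification that lets
   the loop ignore *why* it terminated (match or end-of-string).  */
static char *
ref_strchrnul (const char *s, int c)
{
  while (*s != '\0' && *s != (char) c)
    s++;
  return (char *) s;
}
```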
> > A few comments. The hot path in strchrnul is the first 64 bytes, so you
> > should focus on those.
> > First, get a profiler here. This is a simple program that collects the
> > sizes of strchrnul calls and then runs them again. It is a good first
> > approximation of real performance.
> > http://kam.mff.cuni.cz/~ondra/dryrun_strchrnul.tar.bz2
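The idea behind the dryrun tool (the real thing is in the tarball above) can be sketched as a logging wrapper: record how far each call scanned, then replay that size distribution as a benchmark. The names logged_strchrnul, record, and ncalls are invented for this sketch:

```c
#include <string.h>

#define MAX_CALLS 1024
static size_t record[MAX_CALLS];   /* bytes scanned per call */
static size_t ncalls;

/* Behaves like strchrnul but logs how many bytes each call scanned;
   the log can later be replayed against old and new implementations to
   compare them on a realistic call-size distribution.  */
static char *
logged_strchrnul (const char *s, int c)
{
  char *r = strchr (s, c);
  if (r == NULL)
    r = (char *) s + strlen (s);   /* strchrnul semantics: point at NUL */
  if (ncalls < MAX_CALLS)
    record[ncalls++] = (size_t) (r - s) + 1;
  return r;
}
```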
> > After you collect calls from the programs that interest you, try to
> > compare them. An old-vs-new implementation comparison is the minimum,
> > but I have several questions.
> > First, what is the latency of unaligned loads? One performance problem on
> > x64 was small strings that cross a 64-byte boundary. It turned out to be
> > faster to first check that we do not cross a page and then do an unaligned
> > comparison on 64 bytes. That needs to be checked.
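The page-crossing check can be sketched in C (can_read_64 is a name invented here, and the sketch assumes a 4096-byte page):

```c
#include <stdint.h>

#define PAGE_SIZE 4096

/* A 64-byte unaligned read starting at s cannot fault as long as all
   64 bytes lie in the same page, i.e. the offset of s within its page
   is at most PAGE_SIZE - 64.  If this holds, the implementation may
   speculatively read 64 bytes even past the end of the string.  */
static int
can_read_64 (const void *s)
{
  return ((uintptr_t) s & (PAGE_SIZE - 1)) <= PAGE_SIZE - 64;
}
```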
> > The second trick is to first check for page crossing, then align to 16
> > bytes and do a 32-byte compare, so you always compare at least 16 valid
> > bytes in the header.
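The alignment step of that second trick can be sketched as follows (header_base is a name invented here):

```c
#include <stdint.h>

/* Round s down to a 16-byte boundary; the header compare then covers
   32 bytes from this base.  Since s is at most 15 bytes past the base,
   at least 32 - 15 = 17 of the compared bytes are at or after s, so
   the 32-byte header compare always covers at least 16 valid string
   bytes, while the aligned accesses stay within the page already
   checked for crossing.  */
static const char *
header_base (const char *s)
{
  return (const char *) ((uintptr_t) s & ~(uintptr_t) 15);
}
```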
> Thanks for the hints and the link. I'll try to look into this more.
> On the question of latency for unaligned loads, the answer is that
> there's no single answer; ARM just defines the architecture and then
> implementations are derived from it with different trade-offs in
> micro-architecture (I might be able to work out what our own
> implementations do, but not implementations by architecture licencees).
> Furthermore, answers to questions such as cache line length and even
> page size are similarly vague -- I can probably assume pages will not be
> less than 4k, but there's no guarantee that they aren't bigger; a test
> to ask the kernel what the page size is would undoubtedly cost more time
> than we'd save. Similarly cache lines might be 64 bytes long, but
> there's no architectural guarantee that they aren't shorter.
Of course it is faster to hardcode a 4096-byte page size and 64-byte cache
lines, as different values do not make that much difference.
The best answer would be to have several implementations and then, for
each processor, run a benchmark that says which function should be
used. That has two problems: first, we need to cache the benchmark
results and check whether the cpu changed; second, it is hard to write a
representative benchmark.
A start would be to have several variants that you could try.
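That per-processor selection could be sketched with a one-time timing run and a function pointer (a stand-in for what glibc does with IFUNC resolvers; variant_a, variant_b, time_variant, and pick_impl are names invented here, and a real version would cache the result and revalidate it if the cpu changed):

```c
#include <time.h>

typedef char *(*strchrnul_fn) (const char *, int);

/* Two toy strchrnul variants standing in for real implementations.  */
static char *
variant_a (const char *s, int c)
{
  while (*s != '\0' && *s != (char) c)
    s++;
  return (char *) s;
}

static char *
variant_b (const char *s, int c)
{
  for (;; s++)
    if (*s == (char) c || *s == '\0')
      return (char *) s;
}

/* Time one variant on a sample input; a real benchmark would replay a
   recorded call-size distribution instead of one fixed string.  */
static clock_t
time_variant (strchrnul_fn f, const char *sample, int c)
{
  clock_t t0 = clock ();
  for (int i = 0; i < 100000; i++)
    f (sample, c);
  return clock () - t0;
}

/* Pick whichever variant ran faster on this machine.  */
static strchrnul_fn
pick_impl (const char *sample, int c)
{
  return time_variant (variant_a, sample, c)
         <= time_variant (variant_b, sample, c)
         ? variant_a : variant_b;
}
```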