This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 4/*] Generic string memchr and strnlen


On Mon, Jul 27, 2015 at 02:54:21PM +0100, Wilco Dijkstra wrote:
> > OndÅej BÃlka wrote:
> > On Fri, Jul 24, 2015 at 07:03:10PM +0100, Wilco Dijkstra wrote:
> > > > OndÅej BÃlka wrote:
> > > > On Fri, Jul 24, 2015 at 05:38:43PM +0100, Wilco Dijkstra wrote:
> > > > > > OndÅej BÃlka wrote:
> > > > > > On Fri, Jul 24, 2015 at 04:10:24PM +0100, Wilco Dijkstra wrote:
> 
> > > Which alternatives? I didn't see a mention of an alternative that would
> > > actually be faster.
> > >
> > As I wrote before for example derive strnlen from memchr assembly.
> 
> That's not a realistic alternative at all for most targets. I am talking
> about a generic C solution that applies to all targets, similar to what I
> did for strcpy, strcat, mempcpy calling optimized strlen and memcpy.
>
While this is good first step it still leaves considerable performance
improvement. For that you need to do assembly tricks.
 
> > > I'd find it hard to believe you can beat assembly implementations. Do you
> > > have any performance results for your patches? There were a lot of patches
> > > posted but I don't recall any performance results in any.
> > >
> > That isn't hard to believe, actually its quite easy. If current assembly
> > isn't particulary good then c implementation will beat assembly. 
> 
> That's extremely unlikely. The generic C implementations that we haven't
> fixed already are quite inefficient, while the assembly implementations I've
> looked at were all efficient.
> 
Then you didn't look at many implementations. In lot of cases strlen is
just copypasted output of gcc string/strlen.c -S with fixed gcc errors.
While its faster than current generic one a better generic will beat it.

Best example is strrchr where almost all architectures got it wrong
first time until I tell them otherwise. Problem is that they calculate
position of byte at each iteration instead just at end which is
ineffective. Now tile and powerpc have this problem. My proposed
alternative

return memrchr (s, c, strlen (s) + 1);

would beat these on suitable inputs. I don't know if in practice as
there is extra strlen call.

Second would be strcpy and strncpy where assembly implementations of
some rchitectures look dubious and you could beat these with memcpy
call + strlen. I know definitely about powerpc power7 and powerppc
implementations.

Third would be strcmp et al, there writing a correct loop is very tricky
and only mine x64 implementation does it.

> > And as I read assembly it isn't particulary well optimized for most
> > architectures. In most cases code looks like you would take current
> > generic implementation and ran gcc -S. For example most assemlbly
> > implementations don't do loop unrolling.
> 
> Loop unrolling doesn't always increase performance.
> 
But often does by allowing other optimizations like having only one
addition instead four, you could do four loads in parallel without
worrying that speculative load would fault etc.

While it doesn't have to improve performance in most cases it does.

> > As for performance results I asked maintainers to run them but didn't
> > get response. I don't have access to exotic architectures so I couldn't
> > provide it.
> 
> All we need is results for a few popular architectures. Without any result
> I am very sceptical that it beats any assembly implementation.
> 
> > > That's only possible in a few cases. I'm talking about missing optimized
> > > implementations. Are you saying we should continue to use slow C code rather
> > > than trying to call an optimized assembler function?
> > >
> > What cases are you talking about. All of strnlen, strlen, rawmemchr
> > could be obtained from memchr assembly by doing dead code elimination of
> > unused code.
> 
> A case where it may be possible is strchrnul vs strchr for example where it is
> just a slightly different return value. Other cases are non-trivial and if you
> don't spend a significant effort on optimizing each function you end up with
> something inefficient.
> 
No, cases where its trivial are widespread. For strnlen its also just a
different return value and changing parameters.
For strlen and rawmemchr you could use strchr and delete zero
check/character check.

> We have lots of targets which are missing optimized functions, but which do have
> some related ones that can be called from generic code instead. Clearly doing
> that is a far more practical and realistic approach than talking about how easy
> it is to do dead-code elimination in assembly code when nobody does that. 
> 
> Or are you planning to do the dead-code elimination yourself? If not, we need
> to fix the generic code.
> 
That isn't completely true as I could do maintainers to do that. That
could save them lot of time if they want to add optimized
implementations as if you know tricks then most functions could be
trivialy derived from another.

> > > > > > Suggestion to express strlen as memchr would just cause regression. On
> > > > > > my system there happened 9535682 calls of strlen while memchr was called
> > > > > > just 11633 times and rawmemchr 1742 times.
> > > > >
> > > > > Why would it cause a regression? If you don't have an optimized strlen,
> > > > > what other implementation would be the fastest alternative?
> > > > >
> > > > It would be my generic strlen implementation. If you don't have
> > > > optimized strlen then you certainly don't have optimized memchr that is
> > > > called 819 times less often.
> > >
> > > Well I'd like to see results that show a C version of strlen beating an
> > > optimized memchr on x64. Still it seems to me there is no real need for an
> > > optimized C version of strlen - every target already provides an optimized
> > > version and it is hard to believe it is possible to beat those.
> > >
> > Thats trivial, I attached what I have. When I compile it I get
> > following:
> 
> Please reread the subject, we're talking about generic C code...
>
No, I wrote original post so we don't. My proposal counts that
maintainer will supply platform specific primitives if there is
instruction to do paralel byte comparison...
 
> > And for strlen being available it was exactly my point as its more important to
> > have assembly strlen than using memchr/rawmechr so your case wont happen.
> 
> Strlen is implemented in most targets so that is not an issue indeed, however
> there are quite a few targets which do have memchr but no strnlen, so that 
> scenario is certainly common.
> 
> > > That's only true for few cases. Note given its rarity, it seems better
> > > to change any call to rawmemchr into memchr(s, c, SIZE_MAX) - the gain
> > > due to cache sharing should far outweigh the small loss due to the extra
> > > length checks.
> > >
> > Thats also false as saving is much more than few cycles. Also that it
> > doesn't follow that if function has low number of total calls it would
> > be cold, just that there are few applications that use them. And in
> > application that uses them it could use it only in one tight loop. Here
> > rawmemchr is used only in mutt while strlen is used everywhere.
> > 
> > A performance difference could be quite big in x64, from benchtests;
> 
> It looks like that implementation isn't very optimal then. Decrementing a
> counter every 64 bytes is not a lot of overhead.
> 
No, it shows that implementation is optimal. On haswell rawmemchr needs
4 cycles per 64 bytes so extra overhead is very noticable.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]