This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH 4/*] Generic string memchr and strnlen

From: OndÅej BÃlka <neleai at seznam dot cz>
To: Wilco Dijkstra <wdijkstr at arm dot com>
Cc: 'GNU C Library' <libc-alpha at sourceware dot org>
Date: Fri, 24 Jul 2015 21:15:55 +0200
Subject: Re: [PATCH 4/*] Generic string memchr and strnlen
Authentication-results: sourceware.org; auth=none
References: <003b01d0c622$dd96f530$98c4df90$ at com> <20150724153758 dot GA855 at domone> <003c01d0c62f$3440aa00$9cc1fe00$ at com> <20150724170425 dot GA10041 at domone> <003d01d0c63b$0080b730$01822590$ at com>

On Fri, Jul 24, 2015 at 07:03:10PM +0100, Wilco Dijkstra wrote:
> > OndÅej BÃlka wrote:
> > On Fri, Jul 24, 2015 at 05:38:43PM +0100, Wilco Dijkstra wrote:
> > > > OndÅej BÃlka wrote:
> > > > On Fri, Jul 24, 2015 at 04:10:24PM +0100, Wilco Dijkstra wrote:
> > > > > Getting back to this, if you don't have an optimized strnlen then
> > > > > it is always better to try to use memchr (there are 14 optimized
> > > > > implementations of memchr but only 6 for strnlen).
> > > > >
> > > > > So I'd suggest changing strnlen in an independent patch as:
> > > > >
> > > > > __strnlen (const char *str, size_t n)
> > > > > {
> > > > >   char *ret = __memchr (str, 0, n);
> > > > >   return ret ? ret - str : n;
> > > > > }
> > > > >
> > > > > It also looks worthwhile to express strlen and rawmemchr as memchr
> > > > > so that you only need one highly optimized function rather than many.
> > > > > Deferring to more widely implemented optimized assembler functions
> > > > > should result in better performance than trying to optimize these
> > > > > functions in C.
> > > > >
> > > > No, that is bad idea. Unless you inline strnlen or memchr then you add
> > > > extra call overhead.
> > >
> > > The goal is to call the optimized assembler version of memchr when there
> > > isn't one for strnlen - you could inline the above in headers if a target
> > > decides that there will only be an optimized memchr and not a strnlen
> > > (assuming that strnlen shows similar performance as memchr on a particular
> > > target).
> > >
> > Which as I explained is worse than alternatives, unless saving size.
> 
> Which alternatives? I didn't see a mention of an alternative that would
> actually be faster.
>
As I wrote before for example derive strnlen from memchr assembly.
 
> > > > That is unless you want to claim that you want to save size.
> > > >
> > > > As for optimized implementations of strnlen vs memchr it isn't clear
> > > > that we will delete all of them as they are slower.
> > >
> > > Delete what? We could certainly decide on a core set of functions which
> > > every target should implement in assembler. Candidates are memcpy, memset,
> > > memmove, memchr, strchr, strlen. Then for those we do not try to provide
> > > an optimized C implementation as it won't ever be used. But deleting them
> > > seems a bridge too far.
> > >
> > This patch is about generic string functions. When they have good
> > performance they will replace current ones for architectures. So soon
> > there won't be architecture where it holds.
> 
> I'd find it hard to believe you can beat assembly implementations. Do you
> have any performance results for your patches? There were a lot of patches
> posted but I don't recall any performance results in any.
> 
That isn't hard to believe, actually its quite easy. If current assembly
isn't particulary good then c implementation will beat assembly. And as
I read assembly it isn't particulary well optimized for most
architectures. In most cases code looks like you would take current
generic implementation and ran gcc -S. For example most assemlbly
implementations don't do loop unrolling. 

As for performance results I asked maintainers to run them but didn't
get response. I don't have access to exotic architectures so I couldn't
provide it.

> > > > Also its wrong way to solve it, a architecture maintainer should add
> > > > optimized strnlen implementations, that quite easy when you have memchr
> > > > implementation, add few macros to initially add start and different end
> > > > handling.
> > >
> > > The problem with the non-standard functions that are rarely used is that
> > > there are very few optimized implementations. We can't force maintainers to
> > > implement all string functions in assembler, so the generic code should use
> > > the fastest possible alternative if there isn't an optimized implementation.
> > > And that is pretty much always a more commonly used function which does
> > > have an optimized implementation.
> > >
> > But that isn't about what I said. I said that if there is optimized
> > memchr implementation then other function assembly is trivial to add for
> > maintainer. That gives you better performance.
> 
> That's only possible in a few cases. I'm talking about missing optimized
> implementations. Are you saying we should continue to use slow C code rather
> than trying to call an optimized assembler function?
> 
What cases are you talking about. All of strnlen, strlen, rawmemchr
could be obtained from memchr assembly by doing dead code elimination of
unused code.


> > > > Suggestion to express strlen as memchr would just cause regression. On
> > > > my system there happened 9535682 calls of strlen while memchr was called
> > > > just 11633 times and rawmemchr 1742 times.
> > >
> > > Why would it cause a regression? If you don't have an optimized strlen,
> > > what other implementation would be the fastest alternative?
> > >
> > It would be my generic strlen implementation. If you don't have
> > optimized strlen then you certainly don't have optimized memchr that is
> > called 819 times less often.
> 
> Well I'd like to see results that show a C version of strlen beating an
> optimized memchr on x64. Still it seems to me there is no real need for an
> optimized C version of strlen - every target already provides an optimized
> version and it is hard to believe it is possible to beat those.
> 
Thats trivial, I attached what I have. When I compile it I get
following:

gcc -O3 -S s.c
gcc -O3 t.c s.c
gcc -O3 t.c s.o
time ./a.out

real	0m0.934s
user	0m0.804s
sys	0m0.008s

gcc -O3 t.c -DSTRLEN s.o
time ./a.out

real	0m0.309s
user	0m0.308s
sys	0m0.004s


And for strlen being available it was exactly my point as its more important to 
have assembly strlen than using memchr/rawmechr so your case wont happen.

> > > > Also purpose of strlen and rawmechr is to be faster than memchr. Again
> > > > these should be implemented by architecture maintainer by removing size
> > > > checks from memchr implementation.
> > >
> > > Yes it would be perfect if we had optimized assembler implementations for
> > > all functions. However that's unfortunately not the case given there is a
> > > high cost for creating assembler implementations.
> > 
> > No, there isn't. If you have optimized memchr then deriving these is
> > simple mechanic work. Just do equivalent of dead code elimination on
> > memchr and you will get strlen.
> 
> That's only true for few cases. Note given its rarity, it seems better
> to change any call to rawmemchr into memchr(s, c, SIZE_MAX) - the gain
> due to cache sharing should far outweigh the small loss due to the extra
> length checks.
>
Thats also false as saving is much more than few cycles. Also that it
doesn't follow that if function has low number of total calls it would
be cold, just that there are few applications that use them. And in
application that uses them it could use it only in one tight loop. Here
rawmemchr is used only in mutt while strlen is used everywhere.

A performance difference could be quite big in x64, from benchtests;

 memchr  simple_memchr
Length   32, alignment  0:      41.6719 176.562
Length   64, alignment  2:      38.7969 395.953
Length  128, alignment  0:      53.6406 705.078
Length   64, alignment  3:      41.2656 395.969
Length  128, alignment  0:      51.9531 705.328
Length   64, alignment  3:      38.9219 396.234
Length  256, alignment  0:      93.3594 1373.05
Length   64, alignment  4:      38.9375 395.703
Length  256, alignment  0:      87.7656 1375.91
Length   64, alignment  4:      38.9219 395.969
Length  512, alignment  0:      131.25  2684.64

                        rawmemchr       simple_rawmemchr
Length   32, alignment  0:      34.5    156.906
Length   64, alignment  2:      33.7188 333.594
Length  128, alignment  0:      50.3906 613.672
Length   64, alignment  3:      35.0156 333.453
Length  128, alignment  0:      46.2188 615.625
Length   64, alignment  3:      34.8906 333.594
Length  256, alignment  0:      79.1719 1133.47
Length   64, alignment  4:      33.4688 351.828
Length  256, alignment  0:      74.0938 1156.77
Length   64, alignment  4:      33.4531 333.594
Length  512, alignment  0:      108.734 2262.38

Attachment: s.c
Description: Text document

Attachment: t.c
Description: Text document

Follow-Ups:
- RE: [PATCH 4/*] Generic string memchr and strnlen
  - From: Wilco Dijkstra

References:
- [PATCH 4/*] Generic string memchr and strnlen
  - From: Wilco Dijkstra
- Re: [PATCH 4/*] Generic string memchr and strnlen
  - From: OndÅej BÃlka
- RE: [PATCH 4/*] Generic string memchr and strnlen
  - From: Wilco Dijkstra
- Re: [PATCH 4/*] Generic string memchr and strnlen
  - From: OndÅej BÃlka
- RE: [PATCH 4/*] Generic string memchr and strnlen
  - From: Wilco Dijkstra

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]