This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH 4/*] Generic string memchr and strnlen
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: libc-alpha at sourceware dot org
- Date: Mon, 27 Jul 2015 15:37:28 -0300
- Subject: Re: [PATCH 4/*] Generic string memchr and strnlen
- Authentication-results: sourceware.org; auth=none
- References: <003b01d0c622$dd96f530$98c4df90$ at com> <20150724153758 dot GA855 at domone> <003c01d0c62f$3440aa00$9cc1fe00$ at com> <20150724170425 dot GA10041 at domone> <003d01d0c63b$0080b730$01822590$ at com> <20150724191248 dot GA2889 at domone> <004201d0c873$bd3b9fe0$37b2dfa0$ at com> <20150727165642 dot GA22842 at domone>
On 27-07-2015 13:56, OndÅej BÃlka wrote:
> On Mon, Jul 27, 2015 at 02:54:21PM +0100, Wilco Dijkstra wrote:
>>> OndÅej BÃlka wrote:
>>> On Fri, Jul 24, 2015 at 07:03:10PM +0100, Wilco Dijkstra wrote:
>>>>> OndÅej BÃlka wrote:
>>>>> On Fri, Jul 24, 2015 at 05:38:43PM +0100, Wilco Dijkstra wrote:
>>>>>>> OndÅej BÃlka wrote:
>>>>>>> On Fri, Jul 24, 2015 at 04:10:24PM +0100, Wilco Dijkstra wrote:
>>>> Which alternatives? I didn't see a mention of an alternative that would
>>>> actually be faster.
>>> As I wrote before for example derive strnlen from memchr assembly.
>> That's not a realistic alternative at all for most targets. I am talking
>> about a generic C solution that applies to all targets, similar to what I
>> did for strcpy, strcat, mempcpy calling optimized strlen and memcpy.
> While this is good first step it still leaves considerable performance
> improvement. For that you need to do assembly tricks.
>>>> I'd find it hard to believe you can beat assembly implementations. Do you
>>>> have any performance results for your patches? There were a lot of patches
>>>> posted but I don't recall any performance results in any.
>>> That isn't hard to believe, actually its quite easy. If current assembly
>>> isn't particulary good then c implementation will beat assembly.
>> That's extremely unlikely. The generic C implementations that we haven't
>> fixed already are quite inefficient, while the assembly implementations I've
>> looked at were all efficient.
> Then you didn't look at many implementations. In lot of cases strlen is
> just copypasted output of gcc string/strlen.c -S with fixed gcc errors.
> While its faster than current generic one a better generic will beat it.
> Best example is strrchr where almost all architectures got it wrong
> first time until I tell them otherwise. Problem is that they calculate
> position of byte at each iteration instead just at end which is
> ineffective. Now tile and powerpc have this problem. My proposed
> return memrchr (s, c, strlen (s) + 1);
> would beat these on suitable inputs. I don't know if in practice as
> there is extra strlen call.
You do a lot of assumptions and I doubt very much that it will be faster
on every input. Assuming assembly implementations (no function call) I
do see that this could be faster for long string with the character being
found in the end of the string, but I see it could potentially slower
if it searches large strings with character in start/middle. It will
really depends in which will be the input char size and distribution.
> Second would be strcpy and strncpy where assembly implementations of
> some rchitectures look dubious and you could beat these with memcpy
> call + strlen. I know definitely about powerpc power7 and powerppc
>From previous discussion the idea of power7 strcpy/strncpy is to not use
unaligned read/writes due architecture constraints. And again you do
a lot of assumptions: I doubt that using power7 memcpy/strlen (which also
do only aligned accesses) would be faster than current strcpy/strncpy.
It could be the case for large strings, where memcpy will gain because the
use of VSX instructions; but even I would like some data before taking
> Third would be strcmp et al, there writing a correct loop is very tricky
> and only mine x64 implementation does it.
>>> And as I read assembly it isn't particulary well optimized for most
>>> architectures. In most cases code looks like you would take current
>>> generic implementation and ran gcc -S. For example most assemlbly
>>> implementations don't do loop unrolling.
>> Loop unrolling doesn't always increase performance.
> But often does by allowing other optimizations like having only one
> addition instead four, you could do four loads in parallel without
> worrying that speculative load would fault etc.
> While it doesn't have to improve performance in most cases it does.
It should be address case by case, like point out specific assembly
implementations where you think can be benefits from some loop