This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 4/*] Generic string memchr and strnlen

On Mon, Jul 27, 2015 at 02:42:47PM -0400, Chris Metcalf wrote:
> On 07/27/2015 12:56 PM, OndÅej BÃlka wrote:
> >Best example is strrchr where almost all architectures got it wrong
> >first time until I tell them otherwise. Problem is that they calculate
> >position of byte at each iteration instead just at end which is
> >ineffective. Now tile and powerpc have this problem. My proposed
> >alternative
> >
> >return memrchr (s, c, strlen (s) + 1);
> >
> >would beat these on suitable inputs.
> Your comment and your code snippet don't really match, so can I ask
> you to elucidate a little what your concern is?
It is mix of two concerns, so its both First a snippet was there 
mainly to show that generic code could be faster. I mentioned that
it applies for some inputs, as it would need larger inputs than usual
to overcome initialization and call overhead of assembly one.

> First, are you suggesting that at a high level the right thing to do
> is to do strlen() and then search backwards from the end?  This will
> certainly be faster if there is an instance of the character near the
> end of the string: on tilegx, strlen is 0.375 cycles/byte on hot cache
> vs 0.625 cycles/byte for strrchr.  But if the character is closer to
> the beginning of the string, you have to pay for the strlen, and then
> pay again for scanning the string backwards, and that cost is higher
> than just scanning forward for NUL and 'c' at the same time.  It's
> actually a bit slower to scan backwards since you have to do a counted
> loop, plus examine the loaded words, so you end up running at 0.750
> cycles/byte in reverse, so on average you would need to find the
> character in the last third of the string to make up the difference.
> This is what your strlen + memchrr code snippet suggested.
I didn't say that a generic is optimal only that it avoids problem with
tile implementation on some inputs.

A problem is that your analysis is incomplete. It assumes that character
occurs rarely which doesn't have to be case. A forward scan could be
very expensive if character has probability 1/16 which causes branch
misprediction each second iteration.

With strlen and memrchr you will likely have at most two mispredictions 
when they find zero and character.

> Or, are you concerned about the forward implementation of strrchr for
> tilegx itself?  I reviewed the tilegx implementation of strrchr (in C
> but using our multibyte "v1cmpeq" vector builtins for speed), and the
> inner loop is just four instruction bundles, operating on eight bytes
> at a time (with a one-cycle stall after the ld to fetch from L1D$):
> string/../sysdeps/tile/tilegx/strrchr.c:61
>   40:    { addi r10, r10, 8 ; bnez r11, 90 <__GI_strrchr+0x90> }
> string/../sysdeps/tile/tilegx/strrchr.c:64
>   48:    { ld r11, r10 }
> string/../sysdeps/tile/tilegx/strrchr.c:43
>   50:    { v1cmpeq r12, r11, r13 ; v1cmpeqi r11, r11, 0 }
> string/../sysdeps/tile/tilegx/strrchr.c:49
>   58:    { beqzt r12, 40 <__GI_strrchr+0x40> }
> This is what your English comment suggested (quoted above).  Despite
> the appearance of the C code, the compiler knows better than to waste
> time computing the precise character position at every iteration hit.
A problem that I described in english is that this isn't inner loop as
you jump from it each second iteration when c probability is 1/16.

For example if application uses strrchr to find last / in path name then 
its likely that it would be among last 8 characters and occurs
frequently previously.

Here you would need a conditional move instead of branch to avoid that

> So I'm not sure which point you are making but unless you know
> something about the average distribution of the characters in the
> strrchr() string to suggest they are likely to occur in the last third
> of the string more than 50% of the time, I don't think I'm convinced.
This mainly depends on size of string, if you assume that each character
has fixed probability it happens for sufficiently large strings.

A tricky part of writing these is that inputs are short so you need to
write separate header to get better performance.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]