Re: [PATCH 2/*] Optimize generic strchrnul and strchr


On Wed, May 27, 2015 at 01:35:58PM +0100, Wilco Dijkstra wrote:
> Ondřej Bílka wrote:
> > This is my generic strchr algorithm, resubmitted to use the skeleton.
> >
> > The idea of splitting into the cases c < 128 and c >= 128 didn't change.
> 
> Why do this?
> 
> > So, comments? How does this perform on different architectures?
> 
> In my view using 9 operations for a combined zero check and test 
> for another character is too much; it should be 5-7 operations at 
> most (the general form is (x - 0x01010101) & ~x & 0x80808080,
> which is just 3).
>
Yes, I lost track of that while debugging the skeleton and eliminating
possibilities, so it's now suboptimal. But it was still around 33% faster
than the current one on my Sandy Bridge. Now I use the following combined
check, exploiting ASCII:

((s - 0x0101) | ((s ^ c) - 0x0101)) & (~s) & 0x8080

I will submit that in v3.
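
For illustration, here is that check as a small helper (a sketch only, 64-bit
words and c < 128 assumed; the names are made up, not from the skeleton):

static inline unsigned long
zero_or_char_mask (unsigned long s, unsigned long cmask) /* cmask = c in every byte */
{
  const unsigned long ones = (unsigned long) -1 / 0xff;  /* 0x0101...01 */
  const unsigned long highs = ones << 7;                 /* 0x8080...80 */
  /* Nonzero iff some byte of S is 0 or equal to C; bytes past the first hit
     may be flagged spuriously because of borrow propagation.  */
  return ((s - ones) | ((s ^ cmask) - ones)) & ~s & highs;
}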

> You can optimize things further by calculating partial masks for each
> of the unrolled cases, ORing them together and only doing a single test
> per loop iteration rather than 4 or 8.

That's already done in the skeleton. I also rely on gcc using distributivity
to do only one & 0x8080 on the combined mask.

Based on my experience the code will likely be suboptimal: for similar loops
gcc decided that, when calculating the combined mask, it is beneficial to
spill a lot of registers to save the individual masks, even though they are
useless for all iterations except the last.

I need to recall how I bypassed that gcc bug when I used similar skeletons
as templates for x64 assembly.
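
To illustrate the combined-mask loop (a rough sketch only, not the actual
skeleton; 64-bit little-endian words, aligned input and a reachable zero
byte assumed, names made up):

#include <stddef.h>

static size_t
first_zero_offset (const unsigned long *s)
{
  const unsigned long ones = (unsigned long) -1 / 0xff;  /* 0x0101...01 */
  const unsigned long highs = ones << 7;                 /* 0x8080...80 */
  const unsigned long *p = s;
  unsigned long a, b;

  for (;;)
    {
      a = p[0];
      b = p[1];
      /* OR the raw masks; distributivity lets us do the & highs and the
         branch only once per iteration.  */
      if (((((a - ones) & ~a) | ((b - ones) & ~b)) & highs) != 0)
        break;
      p += 2;
    }
  /* Recheck the first word to see which of the two holds the zero byte.  */
  if (((a - ones) & ~a & highs) != 0)
    return (p - s) * sizeof (unsigned long)
           + __builtin_ctzl ((a - ones) & ~a & highs) / 8;
  return (p - s + 1) * sizeof (unsigned long)
         + __builtin_ctzl ((b - ones) & ~b & highs) / 8;
}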

> This also avoids adding a lot of
> code and branches to the inner loop which makes the unrolling pointless.
> 
> The other thing is support for big-endian - this is generally tricky as
> the mask returned by the zero check won't work even if byte-reversed.
> 
Nice catch, I didn't think about that. The short answer is that you need a
more complicated expression that doesn't cause carry propagation across bytes, like

(((x | 128) - 1) ^ 128) & ~x & 128

Then you could do a byte reversal, but it isn't needed, as it would be
faster to count leading zero bytes directly.
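
As a sketch, such a carry-free check could look like this (64-bit words
assumed; since every byte of x | highs is at least 0x80, subtracting ones
never borrows across bytes, so each result byte depends only on its own
input byte):

static inline unsigned long
zero_mask_nocarry (unsigned long x)
{
  const unsigned long ones = (unsigned long) -1 / 0xff;  /* 0x0101...01 */
  const unsigned long highs = ones << 7;                 /* 0x8080...80 */
  /* Byte i of the result is 0x80 iff byte i of X is 0, on either endianness.  */
  return (((x | highs) - ones) ^ highs) & ~x & highs;
}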

So we will need to add a separate BIG_ENDIAN_EXPRESSION macro to support
these. Possibly one could squeeze out some extra performance by being more
careful, but I don't care that much.

Then first_nonzero_byte would need some work to support that; you could
do it directly without reversing.
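
For example (a sketch, assuming a nonzero mask from a carry-free check): on
big-endian the first byte in memory is the most significant one, so counting
leading zero bits does the job without any reversal:

static inline unsigned int
first_nonzero_byte_be (unsigned long mask)
{
  /* Number of leading zero bytes == index of the first flagged byte.  */
  return __builtin_clzl (mask) / 8;
}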

> Finally first_nonzero_byte should just use __builtin_ffsl (yet another
> function that should be inlined by default in the generic string.h...).
> 
Yes, when the architecture supports it. When you need to emulate it in
software you can do better: you only have to distinguish 8 bit positions
instead of 64, and a generic function for that needs only one multiplication.
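
As a hedged sketch of that idea (not necessarily the generic function meant
above): when only the 0x80 bit of each byte can be set, the index of the
first set byte needs a single multiplication and no table (little-endian,
mask != 0 assumed):

static inline unsigned int
first_nonzero_byte_sketch (unsigned long mask)
{
  const unsigned long ones = (unsigned long) -1 / 0xff;  /* 0x0101...01 */
  /* (mask - 1) & ~mask keeps the bits below the lowest set bit; reducing
     them to one bit per byte and summing the bytes via the multiplication
     leaves index + 1 in the top byte.  */
  return ((((mask - 1) & ~mask & ones) * ones)
          >> (8 * sizeof (unsigned long) - 8)) - 1;
}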

For the full-width case you need something like the de Bruijn method, which
also needs an array lookup, like this one from Wikipedia (32-bit version):

static unsigned char table[32]; /* init: for (i = 0; i < 32; i++) table[(0x077CB531u * (1u << i)) >> 27] = i; */
static inline int
ctz_debruijn (unsigned int x)   /* x must be nonzero */
{
  return table[((x & -x) * 0x077CB531u) >> 27];
}


Also, it's possible to combine masks; the question is whether it saves
anything. On x64 you could use the pmovmskb instruction, which extracts the
high bit of each byte of an xmm register and packs them into a 16-bit integer.

You could emulate that, but it's more expensive, for example with:

unsigned long
pack (unsigned long p)
{
  /* P has 0x80 set in each matching byte; p >> 7 moves that to bit 0 of each
     byte, and the multiplication gathers those bits into the top byte, so
     bit i of the result is set iff byte i matched.  */
  return (0x0102040810204080 * (p >> 7)) >> 56;
}

So to check 32 bytes at once you would need to evaluate this:

ffsl (pack(mask0) | (pack(mask1) << 8) | (pack(mask2) << 16) | (pack(mask3) << 24))
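
Wrapped up it could look like this (sketch only; pack as above, each mask has
0x80 in matching bytes, at least one match present, little-endian):

static inline unsigned int
first_match_of_32 (unsigned long m0, unsigned long m1,
                   unsigned long m2, unsigned long m3)
{
  unsigned long bits = pack (m0) | (pack (m1) << 8)
                       | (pack (m2) << 16) | (pack (m3) << 24);
  return __builtin_ffsl (bits) - 1;  /* bit i corresponds to byte i of the 32 */
}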

