[Final testing with benchmark tests] Fastest String Search Algorithm.

Sat Jun 12 10:56:03 GMT 2021

On 6/12/21 1:31 PM, Paul Zimmermann wrote:
>         Dear Amit,
> 
>> I ran the benchmark tests for my algorithm and for strstr().
>>
>> Total number of tests = 332.
>>
>> My algorithm is faster 182 times.
>>
>> strstr() is faster 150 times.
> 
> and your algorithm can be up to 3531 times slower than strstr():
> 
> s: 3.75006e+08				106175
> 
> The largest strstr() time is 1.5e6, whereas the largest time of your
> algorithm is 9.8e8, i.e., 656 times larger.
> 
> Can you improve your algorithm so that the worst case time is similar to that
> of strstr()?

This is going around in circles so I am going to summarize the several 
problems (and not just the worst case time) here that need to be 
addressed before this proposal can even be considered with any 
seriousness.  You may consider this my objection to the proposal 
conditional these points being resolved.

Starting with the technical issues:

1. In the benchmark results shared, the geomean performance of the 
proposed algorithm is worse by 22.62%.  This is by no means (no pun 
intended) a better result.  In general, worse cases are much worse than 
when they're better.

2. The worst case performances are thousands of times slower as PaulZ 
pointed out.  But wait, it's worse because it's probably not an issue of 
scaling with length...

3. ... The benchmark results shared don't actually indicate which length 
pairs result in those worst case performances but if I had to assume 
that it's related to glibc benchmark inputs, (there's one input less 
than the glibc benchmarks) it appears that the performance starts going 
downhill from 224 byte haystacks, especially with misaligned strings.

4. Given that the benchmark program is outside of the benchtests 
infrastructure (because this is clearly not the benchtests program), 
there is no information about how the benchmark is run and what 
conditions affect its output.

5. The baseline for comparison is not clear.  Is it the default strstr() 
in string/strstr.c?  Is it the system strstr()?  In the latter case, it 
could be anything from string/strstr.c to the various 
microarchitecture-specific asm ifuncs.

6. Amit has made handwavy claims about why a certain set of inputs 
matter over others without providing any references to prior art or 
reviewable experimental data that can back his claims.  The proposal 
needs clearer commentary here with proper references.

7. Refusal to use the glibc-benchtests - either in its current form or 
by adding more inputs that the community agrees as relevant - is IMO a 
blocker to getting any changes accepted to strstr().  A better strstr() 
must show better results on the benchmarks, without which future 
maintenance of the function will be challenging.  If Amit doesn't write 
those, someone else needs to volunteer to do it for him.

Now on to the non-technical issues:

1. First and foremost, Amit has been insinuating racial discrimination 
or crying victim when someone makes suggestions for verification or 
points out flaws in the algorithm.  He was warned once[1] but has 
refused to treat community members with respect.  Let this not be the 
low we set for acceptable behaviour in the community.

2. There is a plain text algorithm and several C code snippets, none of 
which have proper copyright notices.  There is no indication even of 
whether Amit owns the copyright for the shared content (as opposed to 
his employer since many employers claim ownership for everything the 
employee creates) or whether he has given us permission to use the content.

3. On a related point, it is unclear if Amit or his employer have signed 
a copyright assignment with the FSF.  I personally would like to see 
glibc move away from the assignment like gcc did recently, but that's an 
orthogonal point at the moment.

4. Amit has on various occasions expressed disinterest in contributing 
to glibc and put conditions on what he will or will not do even before 
posting a single line patch.  I have no confidence that if we do end up 
accepting his contribution at some point, he will make himself available 
to maintain the code he contributed, leaving the community to deal with 
the technical debt.

Siddhesh

[1] https://sourceware.org/pipermail/libc-alpha/2021-June/127380.html