[Final testing with benchmark tests] Fastest String Search Algorithm.
Siddhesh Poyarekar
siddhesh@gotplt.org
Sat Jun 12 10:56:03 GMT 2021
On 6/12/21 1:31 PM, Paul Zimmermann wrote:
> Dear Amit,
>
>> I ran the benchmark tests for my algorithm and for strstr().
>>
>> Total number of tests = 332.
>>
>> My algorithm is faster 182 times.
>>
>> strstr() is faster 150 times.
>
> and your algorithm can be up to 3531 times slower than strstr():
>
> s: 3.75006e+08 106175
>
> The largest strstr() time is 1.5e6, whereas the largest time of your
> algorithm is 9.8e8, i.e., 656 times larger.
>
> Can you improve your algorithm so that the worst case time is similar to that
> of strstr()?
This is going around in circles so I am going to summarize the several
problems (and not just the worst case time) here that need to be
addressed before this proposal can even be considered with any
seriousness. You may consider this my objection to the proposal
conditional these points being resolved.
Starting with the technical issues:
1. In the benchmark results shared, the geomean performance of the
proposed algorithm is worse by 22.62%. This is by no means (no pun
intended) a better result. In general, worse cases are much worse than
when they're better.
2. The worst case performances are thousands of times slower as PaulZ
pointed out. But wait, it's worse because it's probably not an issue of
scaling with length...
3. ... The benchmark results shared don't actually indicate which length
pairs result in those worst case performances but if I had to assume
that it's related to glibc benchmark inputs, (there's one input less
than the glibc benchmarks) it appears that the performance starts going
downhill from 224 byte haystacks, especially with misaligned strings.
4. Given that the benchmark program is outside of the benchtests
infrastructure (because this is clearly not the benchtests program),
there is no information about how the benchmark is run and what
conditions affect its output.
5. The baseline for comparison is not clear. Is it the default strstr()
in string/strstr.c? Is it the system strstr()? In the latter case, it
could be anything from string/strstr.c to the various
microarchitecture-specific asm ifuncs.
6. Amit has made handwavy claims about why a certain set of inputs
matter over others without providing any references to prior art or
reviewable experimental data that can back his claims. The proposal
needs clearer commentary here with proper references.
7. Refusal to use the glibc-benchtests - either in its current form or
by adding more inputs that the community agrees as relevant - is IMO a
blocker to getting any changes accepted to strstr(). A better strstr()
must show better results on the benchmarks, without which future
maintenance of the function will be challenging. If Amit doesn't write
those, someone else needs to volunteer to do it for him.
Now on to the non-technical issues:
1. First and foremost, Amit has been insinuating racial discrimination
or crying victim when someone makes suggestions for verification or
points out flaws in the algorithm. He was warned once[1] but has
refused to treat community members with respect. Let this not be the
low we set for acceptable behaviour in the community.
2. There is a plain text algorithm and several C code snippets, none of
which have proper copyright notices. There is no indication even of
whether Amit owns the copyright for the shared content (as opposed to
his employer since many employers claim ownership for everything the
employee creates) or whether he has given us permission to use the content.
3. On a related point, it is unclear if Amit or his employer have signed
a copyright assignment with the FSF. I personally would like to see
glibc move away from the assignment like gcc did recently, but that's an
orthogonal point at the moment.
4. Amit has on various occasions expressed disinterest in contributing
to glibc and put conditions on what he will or will not do even before
posting a single line patch. I have no confidence that if we do end up
accepting his contribution at some point, he will make himself available
to maintain the code he contributed, leaving the community to deal with
the technical debt.
Siddhesh
[1] https://sourceware.org/pipermail/libc-alpha/2021-June/127380.html
More information about the Libc-alpha
mailing list