Fastest String Search Algorithm.

Thu Jun 10 11:08:22 GMT 2021

On Thu, Jun 10, 2021, 3:14 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> Hi Amit,
>
> For testing correctness yes. However in a benchmark there is no difference
> if the haystack
> ends immediately after a match or continues forever.
>

In my opinion, needle should be in the middle so as not to give any
algorithm any unfair advantage - like Boyer-Moore which searches from the
end.

> It won't since the benchmark explicitly removes all early matches (and of
> course it checks that
> the result is correct as well).
>

I have not seen what you are pointing out. I can attach the file which has
all haystacks and needles, but it is 82 MB. I did bzip2 on it but still it
is 564 KB. So, I have removed all haystacks/needles which were a
combination of 'a's and 'b's. I have run bzip2 on that file and now it is
115 KB large. However, the file format is not very good but it is still
easily readable.

I searched for the needle in the haystack using document editor's search
and found that needle was located in the beginning.

> > So, if some algorithm is faster for real world scenario but not for
> worst case (and worst case
> > will not occur in real world), then you don't want to include that
> algorithm.
>
> Large needles are not real-world, so for these the worst-case matters. The
> TwoWay algorithm
> for large needles is old and relatively inefficient. Anyone who has an
> algorithm that is not
> only faster on real-world needles but also guarantees good worst-case
> performance on huge
> needles would have no trouble getting their patches accepted.
>

I can do more tests and provide the inputs and results but I will not use
benchmark tests because I don't find them useful.

I will also not fix benchmark tests.

If all this is acceptable then please let me know. In this case I will
start my testing again.

The worst case scenario of my algorithm is O(mn) where m is the length of
the text and n is the length of the pattern.

And this worst case scenario will be hit in only one case:

haystack is: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA...
needle is: ABABA

But my theory is this: We should be fastest 100% if the time. If that is
not possible, then we should try to get as close to 100% as possible.

This means that if for average case, some algorithm is faster 99% of the
time but slower 1% of the time in the worst case scenario then we should go
for this algorithm rather than fretting over worst case scenario.

So, please let me know if you are still amenable to include my algorithm
provided the results for my algorithm are better than strstr(). You can
tell me what haystack / needle to pick.

I am thinking of having 2 long haystacks copied from Internet and then a
function call in the program will generate needle (equal to length passed
to it) from the middle of the haystack (Mostly, the length of needle will
be greater than or equal to 4 KB).

But, please let me know if you have other ideas.

Please find attached the haystacks/needles file from benchmark tests of
strstr (115 KB, bzipped2).

Regards,
Amit
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench-strstr.out.txt.bz2
Type: application/x-bzip
Size: 117063 bytes
Desc: not available
URL: <https://sourceware.org/pipermail/libc-alpha/attachments/20210610/c57f58f0/attachment-0001.bin>