This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: strstr()
On Fri, 14 Jul 2006, Petr Baudis wrote:
Dear diary, on Sun, Jul 09, 2006 at 02:38:56AM CEST, I got a letter
where Tor Myklebust <tmyklebu@caffeine.csclub.uwaterloo.ca> said that...
I've attached a strstr() implementation that uses the present code on
short patterns but switches over to a linear-time, constant-space
algorithm on longer patterns. It was made against the glibc-2.3.2 package
from Debian sarge. I've tested this patch extensively on both an i386
machine and an AMD64 machine, and have found no problems. (It is, of
course, possible that I have overlooked all of its bugs.)
did you do any benchmarks? It would be good to first see how does it
actually affect performance for various haystack-needle length
configurations. It seems that the performance impact on short haystack +
short needle would be non-negligible and it's more sensible to optimize
for that case; applications requiring large searches are more likely to
have own fast string search implementations.
What do you have in mind for "short"? For needles of length less than
eight characters, the old strstr() is used; the performance impact here
should be negligible. So, there should be very little difference in
performance for truly short needles.
Yes, I have benchmarked my code against the present code. My code fares
slightly worse on random cases (as should be expected), but is a big win
in the pathological cases. Here is the code I used:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <time.h>
char a[10000001],
b[ 5000001];
int main() {
int hl, nl;
int j;
printf("random caees:\n");
for (hl = 20; hl <= 100; hl += 20)
for (nl = 10; nl <= 50; nl += 10) {
int i;
int ticks = 0;
for (j = 0; j < 1000; j++) {
for (i = 0; i < hl; i++)
a[i] = random() % 2 + 'a';
for (i = 0; i < nl; i++)
b[i] = random() % 2 + 'a';
b[nl] = a[hl] = 0;
char *foo = strstr(a,b);
clock_t time0 = clock();
for (i = 0; i < 22222; i++)
assert(strstr(a,b) == foo);
clock_t time1 = clock();
ticks += time1 - time0;
}
printf("hl: %3d; nl: %3d; %.2lf\n", hl, nl,
ticks / (double)CLOCKS_PER_SEC);
fflush(stdout);
}
printf("bad case for svdb:\n");
for (hl = 20; hl <= 100; hl += 20)
for (nl = 10; nl <= 50; nl += 10) {
int i;
for (i = 0; i < hl; i++)
a[i] = 'a';
for (i = 0; i < nl; i++)
b[i] = 'a';
b[nl-1] = 'b';
b[nl] = a[hl] = 0;
char *foo = strstr(a,b);
clock_t time0 = clock();
for (i = 0; i < 2222222; i++)
assert(strstr(a,b) == foo);
clock_t time1 = clock();
printf("hl: %3d; nl: %3d; %.2lf\n", hl, nl,
(time1 - time0) / (double)CLOCKS_PER_SEC);
fflush(stdout);
}
}
Briefly, for each haystack length in {20, 40, 60, 80, 100} and each needle
length in {10, 20, 30, 40, 50}, we generate 1000 random (needle, haystack)
pairs of the appropriate lengths, call strstr() 22222 times on each pair,
add up the time taken for all of these, and report it. Then, for each
haystack length and each needle length, we generate a haystack of all a's
and a needle of many a's followed by a b, then call strstr() 2222222
times.
I ran this code on an "AMD Athlon(tm) 64 X2 Dual Core Processor 4200+"
against my strstr() and the old strstr(), then used the UNIX 'paste'
command, yielding the following output (my strstr() on the left, and the
old strstr() on the right):
random caees: random caees:
hl: 20; nl: 10; 2.37 hl: 20; nl: 10; 1.70
hl: 20; nl: 20; 1.52 hl: 20; nl: 20; 1.72
hl: 20; nl: 30; 1.53 hl: 20; nl: 30; 1.72
hl: 20; nl: 40; 1.54 hl: 20; nl: 40; 1.74
hl: 20; nl: 50; 1.49 hl: 20; nl: 50; 1.73
hl: 40; nl: 10; 4.76 hl: 40; nl: 10; 3.93
hl: 40; nl: 20; 5.14 hl: 40; nl: 20; 4.02
hl: 40; nl: 30; 3.50 hl: 40; nl: 30; 4.00
hl: 40; nl: 40; 3.32 hl: 40; nl: 40; 4.02
hl: 40; nl: 50; 3.67 hl: 40; nl: 50; 3.98
hl: 60; nl: 10; 7.52 hl: 60; nl: 10; 6.40
hl: 60; nl: 20; 8.47 hl: 60; nl: 20; 6.65
hl: 60; nl: 30; 8.73 hl: 60; nl: 30; 6.66
hl: 60; nl: 40; 7.04 hl: 60; nl: 40; 6.67
hl: 60; nl: 50; 7.05 hl: 60; nl: 50; 6.65
hl: 80; nl: 10; 10.54 hl: 80; nl: 10; 8.92
hl: 80; nl: 20; 11.40 hl: 80; nl: 20; 9.43
hl: 80; nl: 30; 12.50 hl: 80; nl: 30; 9.48
hl: 80; nl: 40; 12.57 hl: 80; nl: 40; 9.49
hl: 80; nl: 50; 9.78 hl: 80; nl: 50; 9.45
hl: 100; nl: 10; 13.82 hl: 100; nl: 10; 11.71
hl: 100; nl: 20; 14.99 hl: 100; nl: 20; 12.33
hl: 100; nl: 30; 15.70 hl: 100; nl: 30; 12.34
hl: 100; nl: 40; 16.65 hl: 100; nl: 40; 12.38
hl: 100; nl: 50; 15.76 hl: 100; nl: 50; 12.39
bad case for svdb: bad case for svdb:
hl: 20; nl: 10; 0.16 hl: 20; nl: 10; 0.66
hl: 20; nl: 20; 0.15 hl: 20; nl: 20; 1.17
hl: 20; nl: 30; 0.16 hl: 20; nl: 30; 1.15
hl: 20; nl: 40; 0.15 hl: 20; nl: 40; 1.16
hl: 20; nl: 50; 0.16 hl: 20; nl: 50; 1.16
hl: 40; nl: 10; 0.24 hl: 40; nl: 10; 1.30
hl: 40; nl: 20; 0.24 hl: 40; nl: 20; 2.90
hl: 40; nl: 30; 0.26 hl: 40; nl: 30; 3.48
hl: 40; nl: 40; 0.25 hl: 40; nl: 40; 3.70
hl: 40; nl: 50; 0.26 hl: 40; nl: 50; 3.74
hl: 60; nl: 10; 0.32 hl: 60; nl: 10; 1.97
hl: 60; nl: 20; 0.33 hl: 60; nl: 20; 4.62
hl: 60; nl: 30; 0.34 hl: 60; nl: 30; 5.92
hl: 60; nl: 40; 0.34 hl: 60; nl: 40; 6.88
hl: 60; nl: 50; 0.36 hl: 60; nl: 50; 7.48
hl: 80; nl: 10; 0.40 hl: 80; nl: 10; 2.65
hl: 80; nl: 20; 0.42 hl: 80; nl: 20; 6.37
hl: 80; nl: 30; 0.42 hl: 80; nl: 30; 8.35
hl: 80; nl: 40; 0.42 hl: 80; nl: 40; 9.97
hl: 80; nl: 50; 0.44 hl: 80; nl: 50; 11.37
hl: 100; nl: 10; 0.49 hl: 100; nl: 10; 3.31
hl: 100; nl: 20; 0.49 hl: 100; nl: 20; 8.10
hl: 100; nl: 30; 0.39 hl: 100; nl: 30; 10.84
hl: 100; nl: 40; 0.41 hl: 100; nl: 40; 13.11
hl: 100; nl: 50; 0.44 hl: 100; nl: 50; 15.27
As for your remark about it being more sensible to optimise for small
needle/haystack combinations, I'm not sure this is the case. A cursory
grep through the source code of several software packages (gcc, glibc,
perl, vim) reveals that strstr() does not appear to be called repeatedly
on short needles and haystacks very often. Do you know of any packages
that do call strstr() often with small arguments?
I'm sure that most mature applications that actually need to do string
matching on large strings will have an efficient strstr() implementation,
yes; given that the old strstr() does not run efficiently on large
strings, using a different strstr() would be a necessity. People do still
write code, though, and the presence of an efficient strstr() in the
standard C library ought to make this easier --- it seems rather wasteful
of programmer time for every application to reinvent strstr() simply
because the standard C library's strstr() is too slow.
I do not, by any means, claim that this is the "best" implementation of
strstr() in C. It is, however, considerably more efficient (in the worst
case) than the old code while taking only a modest performance hit in the
"average" case.
Tor Myklebust