Re: [PATCH v2] Improve memmem.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Paul Eggert <eggert at cs dot ucla dot edu>
- Cc: libc-alpha at sourceware dot org
- Date: Thu, 14 May 2015 11:29:26 +0200
- Subject: Re: [PATCH v2] Improve memmem.
- Authentication-results: sourceware.org; auth=none
- References: <20150513000329 dot GA23595 at domone> <55537C4A dot 20001 at cs dot ucla dot edu> <20150513185107 dot GA4100 at domone> <5553F532 dot 6060604 at cs dot ucla dot edu>
On Wed, May 13, 2015 at 06:06:58PM -0700, Paul Eggert wrote:
> Ondřej Bílka wrote:
>
> >You won't get illegal access.
>
> I don't see why not. If memchr returns haystack_end - 1,
> haystack[1] will be equivalent to *haystack_end, i.e., one past the
> end of the haystack.
>
I am using a different end here:
+ const unsigned char *haystack_end = (const unsigned char *)
+ haystack_start + haystack_len
+ - needle_len + 1;
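To spell out the arithmetic (a minimal sketch, not part of the patch, assuming
needle_len >= 1): with this end, the furthest position memchr can return is
haystack_end - 1, and reading needle_len bytes from there ends exactly on the
last byte of the haystack, never one past it.

#include <assert.h>
#include <stddef.h>

/* Check of the bound used above: comparing a full needle at the last
   candidate position never reads past the haystack.  */
static void
check_bound (const unsigned char *haystack_start, size_t haystack_len,
             size_t needle_len)
{
  const unsigned char *haystack_end =
    haystack_start + haystack_len - needle_len + 1;
  const unsigned char *last_candidate = haystack_end - 1;
  /* Last byte read when comparing needle_len bytes at last_candidate.  */
  const unsigned char *last_read = last_candidate + needle_len - 1;

  assert (last_read == haystack_start + haystack_len - 1);
}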
> >Here is an equivalent without goto. It uses the implementation behaviour
> >that memcmp(x,y,0) doesn't touch anything.
>
> Better, but I think it still has the problem. Also, come to think
> of it, it's not clear that we should bother with the 'needle_len ==
> 2' test (are needles of length 2 really that common?), and things
> are simpler and smaller without it, so how about the attached
> instead?
>
The reason is performance. The idea is to use this algorithm and switch to the
slow two-way algorithm only for pathological inputs.
It uses a simple heuristic: look only at the first k characters of the needle,
and once a candidate matches a character beyond the k'th, switch to the
two-way algorithm.
With a single memchr, as now, it is this heuristic with k=1. What you are
suggesting is k=2. My algorithm uses k=3.
The main motivation is that character pairs are still too common, so with k=2
you would quickly switch to the slow two-way algorithm.
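To make the k=3 idea concrete, here is a hedged sketch of how I think of it
(not the actual patch; the exact switch condition there may differ). It assumes
needle_len >= 3 and haystack_len >= needle_len, and two_way_memmem is a
placeholder for the existing slow path:

#include <string.h>

extern void *two_way_memmem (const unsigned char *hs, size_t hs_len,
                             const unsigned char *ne, size_t ne_len);

static void *
heuristic_memmem (const unsigned char *hs, size_t hs_len,
                  const unsigned char *ne, size_t ne_len)
{
  const unsigned char *p = hs;
  const unsigned char *end = hs + hs_len - ne_len + 1;

  while ((p = memchr (p, ne[0], end - p)) != NULL)
    {
      /* p < end, so p[1] and p[2] are still inside the haystack.  */
      if (p[1] == ne[1] && p[2] == ne[2])
        {
          /* memcmp with length 0 touches nothing, so ne_len == 3 is fine.  */
          if (memcmp (p + 3, ne + 3, ne_len - 3) == 0)
            return (void *) p;
          /* A candidate matched at least the first three characters without
             being a full match: likely pathological input, so switch.  */
          return two_way_memmem (p, hs_len - (size_t) (p - hs), ne, ne_len);
        }
      p++;
    }
  return NULL;
}

The cheap inline checks filter out almost all positions on ordinary inputs, and
in this sketch the two-way fallback runs at most once.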
I could use a more sophisticated accounting of the runtime cost, but it would
likely cost more than it gains, as most inputs are quite small.
Finally, there is a buy-or-rent problem. Computing the critical factorization
is very costly, often several times more expensive than running the naive
algorithm. A solution that limits the loss is the buy-or-rent approach: you
run the naive algorithm, even though it may be slower, until you reach a time
difference equal to the cost of doing the factorization; then you switch to
the two-way algorithm, and you are at most twice as slow as if you had
guessed correctly which algorithm to use.
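A hedged sketch of how that buy-or-rent accounting could look (my illustration,
not glibc code; RENT_BUDGET and two_way_memmem are placeholders, and it assumes
needle_len >= 1):

#include <stddef.h>

/* Stand-in estimate for the cost of computing the critical factorization.  */
#define RENT_BUDGET(ne_len) (16 * (ne_len))

extern void *two_way_memmem (const unsigned char *hs, size_t hs_len,
                             const unsigned char *ne, size_t ne_len);

static void *
buy_or_rent_memmem (const unsigned char *hs, size_t hs_len,
                    const unsigned char *ne, size_t ne_len)
{
  size_t budget = RENT_BUDGET (ne_len);
  size_t spent = 0;

  for (size_t off = 0; off + ne_len <= hs_len; off++)
    {
      size_t i = 0;
      while (i < ne_len && hs[off + i] == ne[i])
        i++;
      if (i == ne_len)
        return (void *) (hs + off);

      /* Charge the comparisons made for this failed candidate.  */
      spent += i + 1;
      if (spent > budget)
        /* We have "rented" for as long as the factorization would have
           cost; buying the two-way algorithm now bounds total work at
           roughly twice the better choice.  */
        return two_way_memmem (hs + off, hs_len - off, ne, ne_len);
    }
  return NULL;
}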