This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: strcasestr optimization




On 06/16/2015 12:51 PM, OndÅej BÃlka wrote:
On Thu, Jun 11, 2015 at 12:04:08AM +0530, Rajalakshmi Srinivasaraghavan wrote:


On 06/10/2015 04:18 PM, OndÅej BÃlka wrote:

  The onus for making all this happen is on the person who wants to change the code from its status quo.

That would mean that its Rajalakshmi and responsibility to
write benchmark that shows its performance improvement. I stated my
objections explicitly described which inputs to use.

Problem is that Steven failed to do his duty as a powerpc maintainer
which is to read and review change, check benchmark inputs and that
they are adequate. I could supply really bad implementation of strstr
which first checks special case that needle and haystack are periodic
abcdef sequences and returns result expected from benchmark.

I did just simple request that every freshmen should know and Steven
should as powerpc maintainer check how its handled. Its that naive
algorithm and proposed algorithm are slow on inputs of form

strstr("aaaaa....aaaa","aaaa...aaab");

As from your previous mail its their responsibility to add that case for
benchmark. A 'solution' on v2 would be to switch into two-way algorithm
when size is bigger than 2048. That isn't solution as you still have
problem with size 2047. I seen somewhere comment that its security
problem when you have website with search function on user comments and
implementation internally uses strstr then attacker could just post
comment with aaaaa...a and search few times for aa...ab.

2048 number is chosen by checking various length inputs and
comparing the performance. So for 2047 case, there wont be worst
performance.

What are you talking about? I am saying that aaaa haystack with aaa...ab
  needle of size 2047 is worst case for your algorithm performance.

I ran the attached test.c in a loop 10000 times.
proposed patch took 19s and existing code took 20s on power machine.


Roland as its really basic issue I think that its proposer
responsibility to check. Also for strstr thread Carlos said about strstr
quadratic case:

We should add gnulib's test to the microbenchmarks.


Then as I commented that algorithm is flawed its from my experience,
Raji's description was:

For case 1, I always compare 8 characters at a time using cmpb instruction.
instruction.
-> perform basic strlen, strnlen, checks. Move haystack based on strchr result.
strchr result.
-> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them.
needle and compare them.
If they are same proceed to next 8 bytes of haystack and needle.
Else load next doubleword from haystack.Form a doubleword with last
seven characters from first load and first character from
second load and compare again with needle. Proceed in a similar way.
This gives good improvement.

Lets not mix the review comments of strstr and strcasestr.(as the
algorithm is different)

From webster dictionary

"Full Definition of ALGORITHM :
a procedure for solving a mathematical problem (as of finding the
greatest common divisor) in a finite number of steps that frequently
involves repetition of an operation; broadly :  a step-by-step procedure
for solving a problem or accomplishing some end especially by a computer"

Changing techical few details like applying parallel tolower does not
make new algorithm.

Your strstr and strcasestr are really same algorithm.

The reason I mentioned them as different is, in strstr() patch the
input is loaded as double word ie. 8 bytes at a time by byte and in strcasestr its byte by byte load.The input is handled in different way.

And my point still applies, as you could change generic strcasestr from

while(s=strchr(s,n[0]))

to

if (bijective_locale)
   {
     char fn[3];
     fn[0]=tolower(n[0]);
     fn[1]=toupper(n[0]);
     fn[2]=0;
     while(s=strpbrk(s,fn))

and add optimized strpbrk(x,"xy") path. That as strpbrk would be at most
twice slower than strchr and match is twice likely would be faster than
what you gain and easier to implement.

My comment was that it wastes time most of time as in common case these
already differ in first byte so all these shifts were completely
unnecessary, a simple optimization of naive algorithm using

while(s=strchr(s,n[0]))
If I use while(s=strchr(s,n[0]) as mentioned in
https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html
the quadratic testcase gives worst peformance. Attached the testcase.

Thats irrelevant testcase. I asked two things. First is show how big
regression you get on pathological inputs. So submit data when you use
your testcase with m=2047 and strstr executed 10000 times in loop with
current algorithm and with your proposed one. That is what I ask.

Second is about average case perforance. There worst case is irrelevant
as you use buy-or-rent techniques to handle it. I sumbitted following
patch with easy way to handle that.

[PATCH] Improve generic strstr performance.

I applied your proposed patches and the benchtests and I could see proposed patch(__strstr_power7) is better than existing code (__strstr_ppc).
13363 out of 13442 combinations shows improvement.
Attached benchtest results.

git status shows:

On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   benchtests/bench-strcasestr.c
	modified:   benchtests/bench-strstr.c
	modified:   locale/C-ctype.c
	modified:   locale/categories.def
	modified:   locale/langinfo.h
	modified:   locale/programs/ld-ctype.c
	modified:   string/memmem.c
	modified:   string/strcasestr.c
	modified:   string/strstr.c
	modified:   string/test-strcasestr.c
	modified:   sysdeps/powerpc/powerpc64/multiarch/Makefile
	modified:   sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	sysdeps/generic/string_vector.h
	sysdeps/generic/string_vector_search.h
	sysdeps/generic/string_vector_skeleton.h
	sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S
	sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c
	sysdeps/powerpc/powerpc64/multiarch/strcasestr.c
	sysdeps/powerpc/powerpc64/multiarch/strstr-power7.S
	sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c
	sysdeps/powerpc/powerpc64/multiarch/strstr.c
	sysdeps/powerpc/powerpc64/power7/strcasestr.S
	sysdeps/powerpc/powerpc64/power7/strstr.S

its currently superseeded by generic vector bigraph search but it should
beat your strstr.


would be faster. Steven instead of doing his duty as powerpc maintainer
and checking result of benchtest that I send instead tried to bullshit
himself by telling that powerpc architecture is too special. Then he
asked me do do benchmark instead as stalling tactics.

If he did his duty of reading benchmark results there is already
evidence. First to notice is there are regressions in original
benchmark, like following:

Length   30/2, alignment  0/ 9, found:  20.7812 32.2656 13.125  19.7188
Length   30/2, alignment  0/ 9, fail:   32.4688 33.9219 36.75   31.4688

I agree there are few regression cases(79 out of 13442 cases) in
'proposed_benchresults.txt' attached in
https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html.
This is minimal when compared to the overall performance gain.

What overall performance gain? Did you try to find application and
measure performance difference before and after? Big sizes are not very
relevant in practice. You look behind 32'th byte of haystack only in
1.5% of calls so 10% regression at these sizes is much worse that 30%
gain on 256 byte range. In my profiler a simple collection shows
following
No I don't check on any specific application and I use the glibc benchtests for performance measurement which is considered to be the standard way of performance measurement before proposing a patch.


calls 13561
success probability   3.0%
average needle size:    8.9    ns <= 0:   0.4% ns <= 4:   0.8% ns <= 8: 5.9% ns <= 16:  99.9% ns <= 24: 100.0% ns <= 32: 100.0% ns <= 48: 100.0%
ns <= 64: 100.0%
average n:   10.6    n <= 0:   0.4% n <= 4:   1.3% n <= 8:   1.3% n <= 16:  97.0% n <= 24:  97.8% n <= 32:  98.5% n <= 48: 99.9% n <= 64:  99.9%
s aligned to 4 bytes:  98.6%  8 bytes:  98.6% 16 bytes:  96.0%




--
Thanks
Rajalakshmi S

Attachment: newbenchresults
Description: Text document

#include <string.h>		/* for strstr */
#include <stdlib.h>		/* for malloc */
#include <unistd.h>		/* for alarm */

int
main ()
{

  int result = 0;
  size_t m = 1000000;
  size_t n = 2043;
  char *haystack = (char *) malloc (2 * m + 2);
  char *needle = (char *) malloc (n + 2);
  if (haystack && needle)
    {
      memset (haystack, 'A', 2 * m);
      haystack[2 * m] = 'B';
      haystack[2 * m + 1] = 0;
      memset (needle, 'A', n);
      needle[n] = 'B';
      needle[n + 1] = 0;
      strstr (haystack, needle);
    }
  return result;


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]