This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On 06/16/2015 12:51 PM, Ondřej Bílka wrote:
> On Thu, Jun 11, 2015 at 12:04:08AM +0530, Rajalakshmi Srinivasaraghavan wrote:
>> On 06/10/2015 04:18 PM, Ondřej Bílka wrote:
>>>> The onus for making all this happen is on the person who wants to change the code from its status quo.
>>> That would mean that it is Rajalakshmi's responsibility to write a benchmark that shows the performance improvement. I stated my objections explicitly and described which inputs to use. The problem is that Steven failed to do his duty as a powerpc maintainer, which is to read and review the change and check that the benchmark inputs are adequate. I could supply a really bad implementation of strstr that first checks for the special case where needle and haystack are periodic abcdef sequences and returns the result the benchmark expects. I made just a simple request that every freshman should know and that Steven, as powerpc maintainer, should check is handled: both the naive algorithm and the proposed algorithm are slow on inputs of the form strstr("aaaaa....aaaa","aaaa...aaab");
>> As in your previous mail, it is their responsibility to add that case to the benchmark. A 'solution' in v2 would be to switch to the two-way algorithm when the size is bigger than 2048.
> That isn't a solution, as you still have the problem at size 2047. I have seen a comment somewhere that this is a security problem: if you have a website with a search function over user comments and the implementation internally uses strstr, an attacker could just post a comment of aaaaa...a and search a few times for aa...ab.
>> The number 2048 was chosen by checking inputs of various lengths and comparing the performance. So for the 2047 case, there won't be worst-case performance.
> What are you talking about? I am saying that an all-'a' haystack with an aaa...ab needle of size 2047 is the worst case for your algorithm's performance.
I ran the attached test.c in a loop 10000 times. The proposed patch took 19s and the existing code took 20s on a POWER machine.
> Roland, as it is a really basic issue, I think it is the proposer's responsibility to check. Also, in the strstr thread Carlos said about the strstr quadratic case:
>> We should add gnulib's test to the microbenchmarks.
> Then, my comment that the algorithm is flawed comes from my experience. Raji's description was:
>> For case 1, I always compare 8 characters at a time using the cmpb instruction.
>> -> Perform basic strlen, strnlen checks. Move the haystack based on the strchr result.
>> -> Load the first 8 bytes (taking care of aligned loads) from haystack and needle and compare them. If they are the same, proceed to the next 8 bytes of haystack and needle. Else load the next doubleword from the haystack. Form a doubleword with the last seven characters from the first load and the first character from the second load, and compare again with the needle. Proceed in a similar way. This gives a good improvement.
>> Let's not mix the review comments of strstr and strcasestr (as the algorithm is different).
> From Webster's dictionary: "Full Definition of ALGORITHM : a procedure for solving a mathematical problem (as of finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation; broadly : a step-by-step procedure for solving a problem or accomplishing some end especially by a computer". Changing a few technical details, like applying a parallel tolower, does not make a new algorithm. Your strstr and strcasestr are really the same algorithm.
The reason I mentioned them as different is that in the strstr() patch the input is loaded as a doubleword, i.e. 8 bytes at a time, while in strcasestr it is loaded byte by byte. The input is handled in a different way.
I applied your proposed patches and the benchtests, and I could see the proposed patch (__strstr_power7) is better than the existing code (__strstr_ppc).

> And my point still applies, as you could change generic strcasestr from while(s=strchr(s,n[0])) to
>   if (bijective_locale) { char fn[3]; fn[0]=tolower(n[0]); fn[1]=toupper(n[0]); fn[2]=0; while(s=strpbrk(s,fn)) ... }
> and add an optimized strpbrk(x,"xy") path. As strpbrk would be at most twice as slow as strchr, and a match is twice as likely, that would be faster than what you gain and easier to implement.
>>> My comment was that it wastes time most of the time: in the common case these already differ in the first byte, so all these shifts are completely unnecessary, and a simple optimization of the naive algorithm using while(s=strchr(s,n[0]))
>> If I use while(s=strchr(s,n[0])) as mentioned in https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html the quadratic testcase gives the worst performance. Attached the testcase.
> That's an irrelevant testcase. I asked two things. First, show how big a regression you get on pathological inputs: submit data from your testcase with m=2047 and strstr executed 10000 times in a loop, with the current algorithm and with your proposed one. That is what I ask. Second is about average-case performance. There the worst case is irrelevant, as you use buy-or-rent techniques to handle it. I submitted the following patch with an easy way to handle that: [PATCH] Improve generic strstr performance.
13363 out of 13442 combinations show improvement. Attached benchtest results.

git status shows:

    On branch master
    Your branch is up-to-date with 'origin/master'.
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working directory)

            modified:   benchtests/bench-strcasestr.c
            modified:   benchtests/bench-strstr.c
            modified:   locale/C-ctype.c
            modified:   locale/categories.def
            modified:   locale/langinfo.h
            modified:   locale/programs/ld-ctype.c
            modified:   string/memmem.c
            modified:   string/strcasestr.c
            modified:   string/strstr.c
            modified:   string/test-strcasestr.c
            modified:   sysdeps/powerpc/powerpc64/multiarch/Makefile
            modified:   sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c

    Untracked files:
      (use "git add <file>..." to include in what will be committed)

            sysdeps/generic/string_vector.h
            sysdeps/generic/string_vector_search.h
            sysdeps/generic/string_vector_skeleton.h
            sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S
            sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c
            sysdeps/powerpc/powerpc64/multiarch/strcasestr.c
            sysdeps/powerpc/powerpc64/multiarch/strstr-power7.S
            sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c
            sysdeps/powerpc/powerpc64/multiarch/strstr.c
            sysdeps/powerpc/powerpc64/power7/strcasestr.S
            sysdeps/powerpc/powerpc64/power7/strstr.S
No, I don't check any specific application. I use the glibc benchtests for performance measurement, which is considered the standard way of measuring performance before proposing a patch.

> It is currently superseded by the generic vector bigraph search, but it should beat your strstr. ... would be faster. Steven, instead of doing his duty as powerpc maintainer and checking the results of the benchtest that I sent, tried to bullshit himself by claiming that the powerpc architecture is too special. Then he asked me to do a benchmark instead, as a stalling tactic. If he had done his duty of reading the benchmark results, there is already evidence. The first thing to notice is that there are regressions in the original benchmark, like the following:
>
>   Length 30/2, alignment 0/ 9, found: 20.7812 32.2656 13.125  19.7188
>   Length 30/2, alignment 0/ 9, fail:  32.4688 33.9219 36.75   31.4688

I agree there are a few regression cases (79 out of 13442) in 'proposed_benchresults.txt', attached in https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html. This is minimal compared to the overall performance gain.

> What overall performance gain? Did you try to find an application and measure the performance difference before and after? Big sizes are not very relevant in practice. You look past the 32nd byte of the haystack in only 1.5% of calls, so a 10% regression at those sizes is much worse than a 30% gain in the 256-byte range. In my profiler, a simple collection shows the following:
> calls 13561   success probability 3.0%
> average needle size: 8.9
>   ns <= 0: 0.4%    ns <= 4: 0.8%    ns <= 8: 5.9%    ns <= 16: 99.9%
>   ns <= 24: 100.0% ns <= 32: 100.0% ns <= 48: 100.0% ns <= 64: 100.0%
> average n: 10.6
>   n <= 0: 0.4%     n <= 4: 1.3%     n <= 8: 1.3%     n <= 16: 97.0%
>   n <= 24: 97.8%   n <= 32: 98.5%   n <= 48: 99.9%   n <= 64: 99.9%
> s aligned to 4 bytes: 98.6%   8 bytes: 98.6%   16 bytes: 96.0%
--
Thanks
Rajalakshmi S
Attachment:
newbenchresults
Description: Text document
#include <string.h>   /* for strstr */
#include <stdlib.h>   /* for malloc */
#include <unistd.h>   /* for alarm */

int
main (void)
{
  int result = 0;
  size_t m = 1000000;
  size_t n = 2043;
  char *haystack = (char *) malloc (2 * m + 2);
  char *needle = (char *) malloc (n + 2);
  if (haystack && needle)
    {
      memset (haystack, 'A', 2 * m);
      haystack[2 * m] = 'B';
      haystack[2 * m + 1] = 0;
      memset (needle, 'A', n);
      needle[n] = 'B';
      needle[n + 1] = 0;
      strstr (haystack, needle);
    }
  return result;
}