This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: strstr optimization
- From: Rajalakshmi Srinivasaraghavan <raji at linux dot vnet dot ibm dot com>
- To: OndÅej BÃlka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org, sjmunroe at us dot ibm dot com
- Date: Mon, 04 May 2015 11:03:21 +0530
- Subject: Re: [PATCH] powerpc: strstr optimization
- Authentication-results: sourceware.org; auth=none
- References: <1429254584-124997-1-git-send-email-raji at linux dot vnet dot ibm dot com> <20150501120510 dot GB2203 at domone>
On 05/01/2015 05:35 PM, OndÅej BÃlka wrote:
On Fri, Apr 17, 2015 at 02:09:44AM -0500, Rajalakshmi Srinivasaraghavan wrote:
This patch optimizes strstr function for power >= 7 systems. Performance
gain is obtained using doubleword aligned memory access and usage of cmpb
instruction for quicker comparison. The average improvement of this
optimization is ~40%.
As I did same with x64 I will add improvements, I didn't look at patch
in detail.
First problem looks there is quadratic worst case. That is solved by
buy-or-rent technique. You count number of comparisons and if that
exceeds C*N where N is number of haystack bytes read so far and C
suitable constant then you switch to slower generic algorithm.
Main improvement in x64 case was checking for pair of characters. A pair
of characters is less likely than single character even vectorized
branch would be rarely triggered saving misprediction cost that
dominates performance.
Though I did not understand your program completely, I agree with
checking pair of characters
can improve the performance.
In my optimization there are totally three branches.
case 1# if needle len >= 8 (common case)
case 2# if needle < 8 (common case)
case 3# For page boundary cross haystack. (rare case)
For case 1, I always compare 8 characters at a time using cmpb instruction.
-> perform basic strlen, strnlen, checks. Move haystack based on strchr
result.
-> Load first 8 bytes(taking care of aligned load)from haystack and
needle and compare them.
If they are same proceed to next 8 bytes of haystack and needle.
Else load next doubleword from haystack.Form a doubleword with last
seven characters from first load and first character from
second load and compare again with needle. Proceed in a similar way.
This gives good improvement.
For case2, always compare 'n' characters at a time, 'n' is needle length.
-> Algorithm is same as case 1, however before doing cmpb,create input
mask based on needle len.
For case 1 and case 2, input is always loaded as double word.
For case 3, its byte by byte comparison.
--
Thanks
Rajalakshmi S