This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] powerpc: strstr optimization

From: OndÅej BÃlka <neleai at seznam dot cz>
To: Rajalakshmi Srinivasaraghavan <raji at linux dot vnet dot ibm dot com>
Cc: libc-alpha at sourceware dot org, sjmunroe at us dot ibm dot com
Date: Tue, 5 May 2015 23:14:14 +0200
Subject: Re: [PATCH] powerpc: strstr optimization
Authentication-results: sourceware.org; auth=none
References: <1429254584-124997-1-git-send-email-raji at linux dot vnet dot ibm dot com> <20150501120510 dot GB2203 at domone> <554704A1 dot 3010007 at linux dot vnet dot ibm dot com>

On Mon, May 04, 2015 at 11:03:21AM +0530, Rajalakshmi Srinivasaraghavan wrote:
> 
> 
> On 05/01/2015 05:35 PM, OndÅej BÃlka wrote:
> >On Fri, Apr 17, 2015 at 02:09:44AM -0500, Rajalakshmi Srinivasaraghavan wrote:
> >>This patch optimizes strstr function for power >= 7 systems.  Performance
> >>gain is obtained using doubleword aligned memory access and usage of cmpb
> >>instruction for quicker comparison.  The average improvement of this
> >>optimization is ~40%.
> >As I did same with x64 I will add improvements, I didn't look at patch
> >in detail.
> >
> >First problem looks there is  quadratic worst case. That is solved by
> >buy-or-rent technique. You count number of comparisons and if that
> >exceeds C*N where N is number of haystack bytes read so far and C
> >suitable constant then you switch to slower generic algorithm.
> >
> >Main improvement in x64 case was checking for pair of characters. A pair
> >of characters is less likely than single character even vectorized
> >branch would be rarely triggered saving misprediction cost that
> >dominates performance.
> >
> Though I did not understand your program completely, I agree with
> checking pair of characters
> can improve the performance.
> 
It does. AFAIR for practical strings on x64 its only around 3 times slower than
theoretical lower bound. 

A lower bound is strlen call as you must find terminating null when
needle isn't present.

I hope that main idea was clear and you will
use it. Code I send is complex as it contains ten different speedups
that improve performance.


> In my optimization there are totally three branches.
> case 1# if needle len >= 8 (common case)
> case 2# if needle < 8 (common case)
> case 3# For page boundary cross haystack. (rare case)
>
Checking if its case 1 or 2 is relatively costy as in practice haystacks
tend to be short. Most apps that use strstr have 90% call haystack less than 64 bytes long.

You don't need to know needle size if you know that it doesn't contain
first pair. Also its rare that it contains first pair of characters twice.

> For case 1, I always compare 8 characters at a time using cmpb instruction.
> -> perform basic strlen, strnlen, checks. Move haystack based on
> strchr result.
> -> Load first 8 bytes(taking care of aligned load)from haystack and
> needle and compare them.
> If they are same proceed to next 8 bytes of haystack and needle.
> Else load next doubleword from haystack.Form a doubleword with last
> seven characters from first load and first character from
> second load and compare again with needle. Proceed in a similar way.
> This gives good improvement.
> 
Thats looks worse than naive algorithm on practical strings. Most
instructions you do are useless as you have mismatch at first character.

Following function should beat it.

char *strstr (char *s, char *n)
{
  if (n[0] == 0)
    return s;
  while (s = strchr (s, n[0]))
    {
      long i;
      for (i=1; n[i] && n[i] == s[i]; i++);
      if (n[i] == 0)
        return s;
      s++;
    }
}

Or use musl strstr idea. While implementation is terrible as it in
practice unnecessary spends 90% time on preprocessing their loop is
relatively good for portable implementation.

> For case2, always compare 'n'  characters at a time, 'n' is needle length.
> -> Algorithm is same as case 1, however before doing cmpb,create
> input mask based on needle len.
> 
> For case 1 and case 2, input is always loaded as double word.
> 
> For case 3, its byte by byte comparison.
>

Follow-Ups:
- Re: [PATCH] powerpc: strstr optimization
  - From: Rajalakshmi Srinivasaraghavan

References:
- Re: [PATCH] powerpc: strstr optimization
  - From: OndÅej BÃlka
- Re: [PATCH] powerpc: strstr optimization
  - From: Rajalakshmi Srinivasaraghavan

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]