This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Status of strchr
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 14 Aug 2013 22:37:24 +0200
- Subject: Status of strchr
- References: <20130807140911 dot GA31968 at domone dot kolej dot mff dot cuni dot cz> <CAHjhQ926EE-MYDJR5Eftf+DUefBg-Gox0pw57vZ7XUwsO3OPJg at mail dot gmail dot com> <20130808190716 dot GA4589 at domone dot kolej dot mff dot cuni dot cz> <CAHjhQ92+C6uXyrUhTd3OWuoa6v2SeUaKLBuqaNX5Sqtn4ANBdg at mail dot gmail dot com> <CAHjhQ90S-1uBhwV44KODTcQkr=0U-P+_9Pu0O=RbYYY9e82JCA at mail dot gmail dot com> <20130809164420 dot GB4972 at domone dot kolej dot mff dot cuni dot cz> <CAHjhQ91rFwppQ4ixhPNuB9xe8FH9OrEoz3=eFrTQTscwOvSBCA at mail dot gmail dot com>
On Wed, Aug 14, 2013 at 11:46:23AM +0400, Liubov Dmitrieva wrote:
A problem here is that my 64-byte loop is fastest after 512 characters,
but there is a big constant overhead that makes the 16-byte loop better
in that interval.
It is partly caused by the fact that I did not do much tuning of this
implementation. I wrote strchr_new_v2, which decreases the overhead
somewhat, but it has a catch.
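To make the trade-off concrete, here is a rough C sketch (illustration
only, not the actual glibc code, and the helper names are invented). It
contrasts a 16-byte-per-iteration SSE2 loop with a 64-byte unrolled one:
the wide loop pays its loop overhead once per 64 bytes but needs a longer
setup, which is the constant cost above. It assumes the pointer is already
16-byte aligned and skips the final NUL-versus-match check.

#include <emmintrin.h>

/* Bit i of the result is set when byte i of the 16-byte block at P is
   either the searched character or '\0'.  */
static inline unsigned int
chunk_mask (const char *p, __m128i vc)
{
  __m128i x = _mm_load_si128 ((const __m128i *) p);
  __m128i hit = _mm_or_si128 (_mm_cmpeq_epi8 (x, vc),
                              _mm_cmpeq_epi8 (x, _mm_setzero_si128 ()));
  return _mm_movemask_epi8 (hit);
}

/* 16 bytes per iteration: one load, one test, one branch.  */
const char *
scan16 (const char *p, __m128i vc)
{
  unsigned int m;
  while ((m = chunk_mask (p, vc)) == 0)
    p += 16;
  return p + __builtin_ctz (m);
}

/* 64 bytes per iteration: the loop overhead is paid once per four loads,
   which is why it only wins on longer strings.  */
const char *
scan64 (const char *p, __m128i vc)
{
  for (;;)
    {
      unsigned int m0 = chunk_mask (p, vc);
      unsigned int m1 = chunk_mask (p + 16, vc);
      unsigned int m2 = chunk_mask (p + 32, vc);
      unsigned int m3 = chunk_mask (p + 48, vc);
      if (m0 | m1 | m2 | m3)
        {
          if (m0) return p + __builtin_ctz (m0);
          if (m1) return p + 16 + __builtin_ctz (m1);
          if (m2) return p + 32 + __builtin_ctz (m2);
          return p + 48 + __builtin_ctz (m3);
        }
      p += 64;
    }
}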
It is possible; another variant is to write a header that does not use
unaligned loads.
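As for what such a header could look like, one way to avoid unaligned
loads (a sketch under that assumption, not necessarily what either
implementation does) is to round the start pointer down to a 16-byte
boundary, do a single aligned load, and drop the match bits for the bytes
before the real start. Reading those extra bytes is safe because they lie
in the same 16-byte block as the start and so cannot cross a page
boundary. Continuing the previous sketch, with chunk_mask and scan16 from
above (the NUL-versus-match check is again left out):

#include <stdint.h>

/* Header step without unaligned loads: aligned load of the block that
   contains S, then discard the bits for bytes before S.  */
const char *
header_then_scan16 (const char *s, __m128i vc)
{
  const char *block = (const char *) ((uintptr_t) s & ~(uintptr_t) 15);
  unsigned int m = chunk_mask (block, vc) >> ((uintptr_t) s & 15);
  if (m != 0)
    return s + __builtin_ctz (m);   /* hit within the first block */
  return scan16 (block + 16, vc);   /* continue with the aligned loop */
}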
It is not only a problem on Atom; also on old Athlons a no-bsf variant
looks 10% faster than both the currently selected implementation and my
improvement.
http://kam.mff.cuni.cz/~ondra/benchmark_string/athlon_x2/strchr_profile/results_gcc/result.html
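For readers not familiar with the no-bsf variants: the bsf in question is
the bit scan that turns the pmovmskb match mask into the index of the
first matching byte, and the no-bsf variants avoid it because bsf is slow
on some older cores. In C it is the __builtin_ctz step used in the
sketches above:

/* The step the no-bsf variants avoid: a bit scan over the match mask.
   __builtin_ctz compiles to bsf (or tzcnt) on x86; MASK must be nonzero.  */
static inline unsigned int
first_match_index (unsigned int mask)
{
  return __builtin_ctz (mask);
}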
Silvermont is problematic, as the problem looks to be in the loop
overhead. One possibility is to wait and see how much the 64-byte loop can
be optimized. A second possibility is to try a 32-byte loop and see how it
fares.
This test stresses the implementation with code that trashes the
instruction cache, branch target buffer, and so on. These factors matter
when a function tends to be called rarely.
As for strchr, I think the gain is worth the increase in code size.
We will see.
Also, the catch of the optimized implementation that I mentioned earlier
is that performance is sensitive to scheduling. I spent a lot of time
figuring out why the optimized implementation is about 5% slower at big
sizes (after the switch to block mode) on AMD processors. It turned out
that the unoptimized version was aligned to 8 bytes but not to 16. This
leads to scheduling that is faster than when the code is aligned to
16 bytes, which is what my strchr_new_v2 does.
Ondra