This is the mail archive of the mailing list for the glibc project.

RE: bzero/bcopy/bcmp/mempcpy (was: Improve strncpy performance further)

Roland McGrath wrote:
> Wilco Dijkstra wrote:
> > We need something like this in string.h so we always optimize all calls to
> > standard optimized functions, irrespective of the compiler and options used:
> We would need that if we wanted to do that.  But these entrypoints are all
> old and deprecated.  They are only for the benefit of old code.  Any code
> so old that it hasn't been touched since there were actually systems to
> build it on that don't have the C89 standard functions surely has worse
> performance issues than this.  Making the deprecated functions optimal only
> encourages people to keep using them.

Agreed; however, they appear to be used in a lot of code, including benchmarks.
For example, a quick grep shows a large number of occurrences of bzero and
bcopy in SPEC2006.
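
To make the idea concrete, here is a rough sketch of the kind of mapping I
mean. It is not the actual patch: the sketch_ prefix is hypothetical and only
there so the example does not clash with the real declarations in <strings.h>.
Each deprecated entry point simply forwards to the standard function that
every target already optimizes.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: the sketch_ names are hypothetical, not GLIBC's.  */

/* bzero (s, n) is memset (s, 0, n) with the result discarded.  */
static inline void
sketch_bzero (void *s, size_t n)
{
  memset (s, 0, n);
}

/* bcopy takes (src, dest, n) and must handle overlap, so it maps to
   memmove with the first two arguments swapped.  */
static inline void
sketch_bcopy (const void *src, void *dest, size_t n)
{
  memmove (dest, src, n);
}

/* bcmp only promises zero/non-zero, which memcmp also satisfies.  */
static inline int
sketch_bcmp (const void *s1, const void *s2, size_t n)
{
  return memcmp (s1, s2, n);
}
```

With something along these lines, old code calling bzero or bcopy ends up in
the optimized memset/memmove without any per-target work.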

> > Now the only remaining one to deal with is mempcpy - I'd like something like
> > this in string/strings2.h:
> Why?  It's trivial enough for each memcpy implementation to implement
> mempcpy too, and for many implementations rolling it in might save an
> instruction or two over the generic addition.  It doesn't seem worth
> the complexity to bother with anything in the header files.

OK, so the goal of many of the changes I've been making is as follows:

By default, GLIBC should provide the most efficient generic implementations
so that a new target is not forced to write a large number of optimized
assembler functions in order to get reasonable performance. Additionally,
since all targets add optimized versions of a few key functions (such as
memcpy, memset, strlen), the generic code should use those whenever feasible
rather than less widely used variants.

Back to mempcpy, not only is inlining mempcpy simple and a good idea, it is
also the most efficient implementation. If you create a separate optimized
implementation of mempcpy, it requires 1-2 extra instructions and increases
pressure on caches and branch predictors. Another approach would be to set
the return value at the start of memcpy so that mempcpy can jump past it. 
This means 1 extra instruction in every memcpy invocation plus an extra
branch for mempcpy. Neither option is clearly better than just inlining.
And that ignores the additional effort to write and test mempcpy, which
could be spent on more important things. It appears most targets have not
bothered with mempcpy as a result.
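
For reference, the inline version amounts to roughly the following. This is a
minimal sketch rather than the exact header text: sketch_mempcpy is a
hypothetical name chosen to avoid clashing with the real mempcpy declaration.
It calls the target's optimized memcpy and adds the length to the result.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: sketch_mempcpy is a hypothetical name.
   Copy n bytes via memcpy and return a pointer just past the last
   byte written, which is what mempcpy is defined to return.  */
static inline void *
sketch_mempcpy (void *dest, const void *src, size_t n)
{
  return (char *) memcpy (dest, src, n) + n;
}
```

The only cost over a plain memcpy call is the single addition at the call
site, and the compiler can often fold that into later address computations.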

So to me adding the inline version is a no-brainer and should have been done
a long time ago.

