This is the mail archive of the libc-alpha at sourceware dot org
mailing list for the glibc project.
RE: bzero/bcopy/bcmp/mempcpy (was: Improve strncpy performance further)
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: 'Ondřej Bílka' <neleai at seznam dot cz>
- Cc: "'Roland McGrath'" <roland at hack dot frob dot com>, <libc-alpha at sourceware dot org>
- Date: Wed, 4 Feb 2015 16:30:43 -0000
> Ondřej Bílka wrote:
> On Thu, Jan 15, 2015 at 03:48:47PM -0000, Wilco Dijkstra wrote:
> > Roland McGrath wrote:
> > > Wilco Dijkstra wrote:
> > > > We need something like this in string.h so we always optimize all calls to
> > > > standard optimized functions, irrespective of the compiler and options used:
> > >
> > > We would need that if we wanted to do that. But these entrypoints are all
> > > old and deprecated. They are only for the benefit of old code. Any code
> > > so old that it hasn't been touched since there were actually systems to
> > > build it on that don't have the C89 standard functions surely has worse
> > > performance issues than this. Making the deprecated functions optimal only
> > > encourages people to keep using them.
> > Agreed, however they appear to be used in a lot of code, including benchmarks.
> > For example a quick grep shows there are a large number of occurrences of
> > bzero and bcopy in SPEC2006.
> Also gcc could optimize memset to __bzero, I will probably write a patch
> for x64 to save a few cycles. There is a complication that gcc could use
> the memset return value so we need to check if it's dead or create new
That is certainly a good idea - I added _memclr to armcc a long time ago, as 99% of
uses of memset set memory to zero and don't use the return value (and on most targets
the cost of saving/restoring the return value inside memcpy/memset is higher than
just recomputing it).
However, this means we first need to make sure all targets have a decent __bzero
implementation, as otherwise you penalize everybody with an extra veneer to memset...
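
To make the trade-off concrete, here is a minimal C sketch of the rewrite being
discussed. All names (sketch_memclr, sketch_bzero_veneer and the two callers) are
hypothetical; this is not armcc's _memclr or glibc's __bzero, and the byte loop
only stands in for a tuned routine. It shows the two call-site cases (return
value dead, return value live and recomputed) and what a veneer-only target
would end up with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dedicated zeroing routine: it takes no fill
   value and returns nothing, so no return value has to be preserved. */
static void sketch_memclr(void *dst, size_t n)
{
    unsigned char *p = dst;

    assert(n == 0 || dst != NULL);
    while (n-- > 0)
        *p++ = 0;
}

/* Hypothetical veneer: what a target without a dedicated routine provides.
   Every rewritten call pays for this extra hop on its way to memset. */
static void sketch_bzero_veneer(void *dst, size_t n)
{
    memset(dst, 0, n);
}

/* Case 1: the memset result is dead, so the call can be replaced directly. */
static void clear_result_unused(char *buf, size_t n)
{
    sketch_memclr(buf, n);      /* originally: memset(buf, 0, n); */
}

/* Case 2: the memset result is live, so the caller recomputes it (it is
   simply the destination pointer) instead of the callee preserving it. */
static char *clear_result_used(char *buf, size_t n)
{
    sketch_memclr(buf, n);      /* originally: return memset(buf, 0, n); */
    return buf;
}
```

On a target where only the veneer exists, case 1 and case 2 both become a call
to sketch_bzero_veneer followed by a call to memset, which is the penalty
described above.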
> > > > Now the only remaining one to deal with is mempcpy - I'd like something like
> > > > this in string/strings2.h:
> > >
> > > Why? It's trivial enough for each memcpy implementation to implement
> > > mempcpy too, and for many implementations rolling it in might save an
> > > instruction or two over the generic addition. It doesn't seem worth
> > > the complexity to bother with anything in the header files.
> > Back to mempcpy, not only is inlining mempcpy simple and a good idea, it is
> > also the most efficient implementation. If you create a separate optimized
> > implementation of mempcpy, it requires 1-2 extra instructions and increases
> > pressure on caches and branch predictors. Another approach would be to set
> That was previously mentioned in the parent thread. With a separate mempcpy
> you will likely pay an additional 100-cycle penalty as mempcpy is not called
> > the return value at the start of memcpy so that mempcpy can jump past it.
> > This means 1 extra instruction in every memcpy invocation plus an extra
> > branch for mempcpy.
> That is false. You need to copy the starting memcpy fragment until you set
> the return value and then jump, which gives no overhead to memcpy.
That's not how memcpy implementations work. You never set the return value
explicitly; you either don't change the destination register (which on most ABIs
is also the return value register) or save/restore it on targets with few registers.
Additionally, for small/medium copies you use the destination (and return value)
unchanged, so to support a different return value you need an extra instruction
to make a copy of the destination ...
> That could be problematic on some architectures as you need to do it
> without spilling extra register.
... which could also mean an extra spill/restore in the small/medium copy cases.
So I don't think merging mempcpy and memcpy is a good idea on any target.
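
For reference, the inline form of mempcpy argued for earlier in this thread can
be sketched in a few lines of C. The names sketch_mempcpy and sketch_join are
hypothetical, and this is not the actual header change proposed for glibc; it
only shows that mempcpy reduces to a plain memcpy call plus one addition at the
call site, leaving memcpy itself untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical inline mempcpy: call the ordinary memcpy, then add the length
   to its result. memcpy returns dst, so the sum is the end of the copy. */
static inline void *sketch_mempcpy(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    return (char *)memcpy(dst, src, n) + n;
}

/* Hypothetical usage: append two strings without recomputing the write
   position after each copy. The caller must provide enough space in out.
   Returns the length of the joined string, excluding the terminator. */
static size_t sketch_join(char *out, const char *a, const char *b)
{
    char *p = out;

    p = sketch_mempcpy(p, a, strlen(a));
    p = sketch_mempcpy(p, b, strlen(b));
    *p = '\0';
    return (size_t)(p - out);
}
```

With this form the addition happens in the caller, where the length is already
in a register, so no separate mempcpy entry point competes with memcpy for
cache and branch predictor resources.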