This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC] Improving memcpy and memset.


I tinkered with rep stosq/movsq; it could improve performance on older
processors. It took me a while to realize that the big constant factor for
__builtin_memcpy/memset was caused by poor code generation rather than
instruction startup cost.
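For reference, a minimal sketch of what a rep-based copy looks like (not the actual patch; the name memcpy_rep and the split into a movsq bulk plus movsb tail are illustrative, x86-64 GCC/Clang inline asm only):

```c
#include <stddef.h>

/* Hypothetical sketch of a rep-based memcpy: bulk copy with rep movsq,
   remaining 0-7 bytes with rep movsb.  rep movsq/movsb consume and
   update rdi ("D"), rsi ("S") and rcx ("c") directly.  */
static void *memcpy_rep(void *dst, const void *src, size_t n)
{
    void *d = dst;
    const void *s = src;
    size_t qwords = n / 8;
    size_t tail = n % 8;

    __asm__ volatile ("rep movsq"
                      : "+D" (d), "+S" (s), "+c" (qwords)
                      : : "memory");
    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (s), "+c" (tail)
                      : : "memory");
    return dst;
}
```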

With an effective header I could get around a 5% speedup on older machines.
As i7* handles unaligned loads well, my implementation is fastest there.
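To illustrate the header idea (a sketch under my reading of the text, not the actual code; the name memcpy_header and the 8-byte granularity are assumptions): the first and last 8 bytes are done with unaligned 8-byte accesses, which are cheap on i7-class cores, so the main loop needs no byte-granularity prologue or epilogue:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical header sketch: load the first and last 8 bytes up front
   with unaligned accesses (they overlap when n < 16), copy the
   interior, then store head and tail.  Assumes n >= 8; a real version
   would use a vector or rep loop for the interior.  memcpy of a fixed
   8 bytes compiles to a single unaligned move.  */
static void *memcpy_header(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t head, tail;

    memcpy(&head, s, 8);
    memcpy(&tail, s + n - 8, 8);

    for (size_t i = 8; i + 8 < n; i += 8)
        memcpy(d + i, s + i, 8);

    memcpy(d, &head, 8);
    memcpy(d + n - 8, &tail, 8);
    return dst;
}
```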

A rep implementation is best up to a certain size, where a vector loop takes
over. I switch at 512 bytes for now; a bit more could be squeezed out by
finding architecture-specific optimums.
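The dispatch itself is trivial; a sketch with the 512-byte crossover from the text (both paths are stand-ins here, plain memcpy, and all names are made up):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative crossover dispatch only: a rep-style path below the
   tuned threshold, a vector loop above it.  512 is the value from the
   text; the real optimum is architecture-specific.  */
enum { REP_CROSSOVER = 512 };

static void *rep_path(void *d, const void *s, size_t n)
{ return memcpy(d, s, n); }   /* stand-in for the rep variant */

static void *vec_path(void *d, const void *s, size_t n)
{ return memcpy(d, s, n); }   /* stand-in for the vector loop */

static void *dispatch_memcpy(void *dst, const void *src, size_t n)
{
    return n < REP_CROSSOVER ? rep_path(dst, src, n)
                             : vec_path(dst, src, n);
}
```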

One exception is Silvermont, where profiling showed that rep movsq is the
best course of action from 512 bytes on. It is probably always best when used
with the new header.

There is one thing that I do not understand: core2 now looks to predict the
unlikely branches as taken, which leads to bad performance in the random
test.

I decided to split the variants into two classes; the loop improvements are
found here:

And here is an updated version of the profiler.

One thing left is to add an ssse3 implementation for bigger sizes. The size
needs to be big to pay for the overhead caused by a computed jump.
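For context, a computed jump in this setting looks roughly like the following sketch (GCC labels-as-values, not the ssse3 code itself; here it handles n <= 3 bytes and all names are made up). The indirect jump through the table is the overhead in question:

```c
#include <stddef.h>

/* Illustrative computed jump: dispatch once through a label table
   indexed by the byte count, then fall through the stores.  The
   indirect branch is a single jump, but it is mispredict-prone, which
   is why such dispatch only pays off when enough work follows it.  */
static void *copy_small_jump(void *dst, const void *src, size_t n)
{
    static void *const tab[4] = { &&c0, &&c1, &&c2, &&c3 };
    unsigned char *d = dst;
    const unsigned char *s = src;

    goto *tab[n];              /* computed jump on the size */
c3: d[2] = s[2];               /* fall through */
c2: d[1] = s[1];
c1: d[0] = s[0];
c0: return dst;
}
```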

The second is playing with unaligned loops. I added memcpy_cntloop, which
has an additional counter that gets decremented by 1 in each iteration. It
is the best that I have for i7s so far, as it is predicted correctly up to
512 bytes.
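In C terms the counted-loop idea looks roughly like this (a sketch under my reading, not the actual memcpy_cntloop; the 16-byte stride and the assumption that n is a multiple of 16 are mine):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a counted copy loop: drive the loop with a separate
   iteration counter decremented by 1 each pass, instead of comparing a
   running pointer against the end.  The exit branch then depends only
   on a simple countdown, which the branch predictor tracks well for
   moderate trip counts.  Assumes n is a multiple of 16; the fixed-size
   memcpy compiles to an unaligned 16-byte move.  */
static void *memcpy_cntloop_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t count = n / 16;     /* iterations, counted down to 0 */

    while (count) {
        memcpy(d, s, 16);
        d += 16;
        s += 16;
        count -= 1;
    }
    return dst;
}
```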


On Mon, Aug 12, 2013 at 02:40:01PM +0400, Liubov Dmitrieva wrote:

That should be fixed now.


Traceroute says that there is a routing problem in the backbone.  It's not our problem.
