Re: [RFC] Improving memcpy and memset.


Your first profiler hit memory corruption on Haswell:

double free or corruption (out): 0x00000000029bb050 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3f7ee7c00e]
/lib64/libfreetype.so.6(FT_Outline_Done_Internal+0x89)[0x3f84615059]
/lib64/libfreetype.so.6(FT_Done_Glyph+0x2b)[0x3f8461aedb]
/lib64/libfreetype.so.6(FT_Glyph_To_Bitmap+0x21c)[0x3f8461b12c]
/lib64/libgd.so.2(gdImageStringFTEx+0xb52)[0x3f8c41b012]
/lib64/libgd.so.2(gdImageStringFT+0x1b)[0x3f8c41bb6b]
gnuplot[0x4b1e45]
gnuplot[0x4dccbd]
gnuplot[0x445f92]
gnuplot[0x44d9c3]
gnuplot[0x46a6f5]
gnuplot[0x41f9e2]
gnuplot[0x41fc79]
gnuplot[0x415ee9]

The second one still does not work anywhere.

I've collected all the other results and attached them.


--
Liubov Dmitrieva
Intel Corporation

On Tue, Aug 13, 2013 at 12:18 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> Hi,
>
> I tinkered with rep stosq/movsq; it could improve performance on older
> processors. It took me a while to realize that the big constant factor for
> __builtin_memcpy/memset was caused by poor code generation rather than
> instruction startup cost.
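>
> A minimal sketch of the kind of rep movsq copy I mean (the function name
> and the tail handling are only illustrative, not the profiled code):
>
>     #include <stddef.h>
>
>     /* Copy n bytes with rep movsq plus a byte tail; x86-64, GCC/Clang asm.  */
>     static void *copy_rep_movsq (void *dst, const void *src, size_t n)
>     {
>       void *ret = dst;
>       size_t qwords = n / 8;
>       size_t tail = n % 8;
>       __asm__ volatile ("rep movsq"
>                         : "+D" (dst), "+S" (src), "+c" (qwords)
>                         : : "memory");
>       /* dst/src now point past the copied quadwords; finish the 0-7 bytes.  */
>       unsigned char *d = dst;
>       const unsigned char *s = src;
>       while (tail--)
>         *d++ = *s++;
>       return ret;
>     }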
>
> With an effective header I could get around a 5% speedup on older machines.
> As the i7* handles unaligned loads well, my implementation is fastest there.
>
> A rep implementation is best up to a certain size, where a vector loop takes
> over. I switch at 512 bytes for now; a bit more could be squeezed out by
> finding architecture-specific optimums.
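>
> Roughly the dispatch I mean, reusing the copy_rep_movsq sketch above; the
> vector loop, the memcpy_switched name and REP_THRESHOLD are illustrative,
> with the 512-byte crossover mentioned above:
>
>     #include <stddef.h>
>     #include <emmintrin.h>
>
>     /* Simple unaligned 16-byte SSE2 loop for the large-size path.  */
>     static void *copy_vector_loop (void *dst, const void *src, size_t n)
>     {
>       unsigned char *d = dst;
>       const unsigned char *s = src;
>       size_t i = 0;
>       for (; i + 16 <= n; i += 16)
>         _mm_storeu_si128 ((__m128i *) (d + i),
>                           _mm_loadu_si128 ((const __m128i *) (s + i)));
>       for (; i < n; i++)
>         d[i] = s[i];
>       return dst;
>     }
>
>     #define REP_THRESHOLD 512   /* per-architecture optimum still to be tuned */
>
>     void *memcpy_switched (void *dst, const void *src, size_t n)
>     {
>       if (n < REP_THRESHOLD)
>         return copy_rep_movsq (dst, src, n);   /* rep path below 512 bytes */
>       return copy_vector_loop (dst, src, n);   /* vector loop takes over */
>     }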
>
> One exception is Silvermont, where profiling showed that rep movsq is the
> best course of action from 512 bytes onward. It is probably always best when
> used with the new header.
>
> There is one thing that I do not understand: core2 now seems to predict the
> unlikely branches as taken, which leads to bad performance in the random
> test.
>
> I decided to split the variants into two classes; the loop improvements are
> found here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_loop.html
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_loop.html
>
> And here is an updated version of the profiler:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile120813.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile120813.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_loop120813.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_loop120813.tar.bz2
>
> One thing left is to add an ssse3 implementation for bigger sizes. The size
> needs to be big to pay for the overhead caused by the computed jump.
>
> Second is playing with unaligned loops. I added memcpy_cntloop, which has an
> additional counter that is decremented by 1 in each iteration. It is the
> best I have for i7's so far, as it is predicted correctly up to 512 bytes.
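>
> A sketch of what I mean by the counted loop (illustrative only; the real
> memcpy_cntloop is in the profiler tarballs):
>
>     #include <stddef.h>
>     #include <emmintrin.h>
>
>     /* Unaligned 16-byte loop driven by an iteration counter that is
>        decremented by one each round, instead of comparing a pointer
>        against the end of the buffer.  */
>     static void *copy_cntloop (void *dst, const void *src, size_t n)
>     {
>       unsigned char *d = dst;
>       const unsigned char *s = src;
>       size_t iters = n / 16;               /* the extra counter */
>       while (iters--)
>         {
>           _mm_storeu_si128 ((__m128i *) d,
>                             _mm_loadu_si128 ((const __m128i *) s));
>           d += 16;
>           s += 16;
>         }
>       for (size_t i = 0; i < n % 16; i++)  /* 0-15 byte tail */
>         d[i] = s[i];
>       return dst;
>     }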
>
> Comments?
>
> On Mon, Aug 12, 2013 at 02:40:01PM +0400, Liubov Dmitrieva wrote:
>
> That should be fixed now.
>
>
> --
>
> Traceroute says that there is a routing problem in the backbone.  It's not our problem.

Attachment: results_memcpy_atom_loop.tar.bz2
Description: BZip2 compressed data

Attachment: results_memcpy_haswell_loop.tar.bz2
Description: BZip2 compressed data

Attachment: results_memcpy_silvermot_loop.tar.bz2
Description: BZip2 compressed data

Attachment: results_memset_atom_loop.tar.bz2
Description: BZip2 compressed data

Attachment: results_memset_haswell_loop.tar.bz2
Description: BZip2 compressed data

Attachment: results_memset_silvermot_loop.tar.bz2
Description: BZip2 compressed data

Attachment: results_memcpy_atom_new.tar.bz2
Description: BZip2 compressed data

Attachment: results_memcpy_silvermot_new.tar.bz2
Description: BZip2 compressed data

Attachment: memset_fail.log
Description: Binary data

