This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Your first profiler got memory corruption on Haswell:

double free or corruption (out): 0x00000000029bb050 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3f7ee7c00e]
/lib64/libfreetype.so.6(FT_Outline_Done_Internal+0x89)[0x3f84615059]
/lib64/libfreetype.so.6(FT_Done_Glyph+0x2b)[0x3f8461aedb]
/lib64/libfreetype.so.6(FT_Glyph_To_Bitmap+0x21c)[0x3f8461b12c]
/lib64/libgd.so.2(gdImageStringFTEx+0xb52)[0x3f8c41b012]
/lib64/libgd.so.2(gdImageStringFT+0x1b)[0x3f8c41bb6b]
gnuplot[0x4b1e45]
gnuplot[0x4dccbd]
gnuplot[0x445f92]
gnuplot[0x44d9c3]
gnuplot[0x46a6f5]
gnuplot[0x41f9e2]
gnuplot[0x41fc79]
gnuplot[0x415ee9]

The second one still does not work anywhere. I've got all the other results and attached them.

--
Liubov Dmitrieva
Intel Corporation

On Tue, Aug 13, 2013 at 12:18 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> Hi,
>
> I tinkered with rep stosq/movsq; it could improve performance for older
> processors. It took me a while to realize that the big constant factor for
> __builtin_memcpy/memset was caused by poor code generation rather than
> instruction startup cost.
>
> With an effective header I could get around a 5% speedup on older machines.
> As the i7* handles unaligned loads well, my implementation is fastest there.
>
> A rep implementation is best until a certain size, where a vector loop takes
> over. I switch at 512 bytes for now; some more performance can be squeezed
> out by finding architecture-specific optimums.
>
> One exception is Silvermont, where profiling showed that rep movsq is the
> best course of action from 512 bytes on. It probably always is when used
> with the new header.
>
> There is one thing that I do not understand: core2 now seems to pick the
> unlikely branches as the predicted ones, which leads to bad performance
> in the random test.
>
> I decided to split the variants into two classes; the loop improvements are
> found here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_loop.html
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_loop.html
>
> And here is the updated version of the profiler:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile120813.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile120813.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_loop120813.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_loop120813.tar.bz2
>
> One thing left is to add an ssse3 implementation for bigger sizes. The size
> needs to be big to pay for the overhead caused by the computed jump.
>
> The second is playing with unaligned loops. I added memcpy_cntloop, which has
> an additional counter that gets subtracted by 1 in each iteration. It is the
> best that I have for i7s so far, as it is predicted up to 512 bytes.
>
> Comments?
>
> On Mon, Aug 12, 2013 at 02:40:01PM +0400, Liubov Dmitrieva wrote:
> > That should be fixed now.
>
> --
> Traceroute says that there is a routing problem in the backbone. It's not our problem.
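For illustration, here is a minimal C sketch of the dispatch scheme the quoted mail describes: a rep movsq path below the 512-byte cutoff and a vector loop above it. The function names, the REP_CUTOFF constant, and the use of plain memcpy as a stand-in for the real SSE/AVX loop are assumptions, not code from the patch.

    #include <stddef.h>
    #include <string.h>

    /* Sketch only: copy n bytes with rep movsq plus a byte tail.  */
    static void *
    memcpy_rep (void *dst, const void *src, size_t n)
    {
      void *d = dst;
      const void *s = src;
      size_t quads = n >> 3;          /* number of 8-byte quadwords */
      __asm__ volatile ("rep movsq"
                        : "+D" (d), "+S" (s), "+c" (quads)
                        :
                        : "memory");
      memcpy (d, s, n & 7);           /* 0-7 remaining tail bytes */
      return dst;
    }

    /* Hypothetical dispatcher following the 512-byte switch point mentioned
       in the mail; on Silvermont the rep path would instead be preferred
       from 512 bytes upward.  */
    #define REP_CUTOFF 512

    void *
    memcpy_sketch (void *dst, const void *src, size_t n)
    {
      if (n < REP_CUTOFF)
        return memcpy_rep (dst, src, n);
      return memcpy (dst, src, n);    /* stand-in for the vector loop */
    }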
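Similarly, a rough sketch of a counter-driven unaligned loop in the spirit of memcpy_cntloop: the loop is controlled by a separate iteration count that is subtracted by 1 each round, rather than by a pointer comparison. The 64-byte step and the helper name are assumptions.

    #include <emmintrin.h>            /* SSE2 unaligned loads/stores */
    #include <string.h>

    /* Sketch of a counter-controlled unaligned copy loop (not the patch code).  */
    static void
    copy_cntloop (unsigned char *dst, const unsigned char *src, size_t n)
    {
      size_t iters = n / 64;          /* extra counter, decremented each round */
      while (iters)
        {
          __m128i a = _mm_loadu_si128 ((const __m128i *) (src + 0));
          __m128i b = _mm_loadu_si128 ((const __m128i *) (src + 16));
          __m128i c = _mm_loadu_si128 ((const __m128i *) (src + 32));
          __m128i d = _mm_loadu_si128 ((const __m128i *) (src + 48));
          _mm_storeu_si128 ((__m128i *) (dst + 0), a);
          _mm_storeu_si128 ((__m128i *) (dst + 16), b);
          _mm_storeu_si128 ((__m128i *) (dst + 32), c);
          _mm_storeu_si128 ((__m128i *) (dst + 48), d);
          src += 64;
          dst += 64;
          iters -= 1;                 /* the additional counter from the mail */
        }
      memcpy (dst, src, n % 64);      /* copy the tail */
    }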
Attachment:
results_memcpy_atom_loop.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memcpy_haswell_loop.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memcpy_silvermot_loop.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memset_atom_loop.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memset_haswell_loop.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memset_silvermot_loop.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memcpy_atom_new.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memcpy_silvermot_new.tar.bz2
Description: BZip2 compressed data
Attachment:
memset_fail.log
Description: Binary data