This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
I've got a segmentation fault from the memset profiler everywhere; I've attached the log. I also got memory corruption (a double free) when running the memcpy profiler on Haswell; log attached. On Atom and Silvermont it works.

--
Liubov Dmitrieva
Intel Corporation

On Sat, Aug 10, 2013 at 1:02 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> Hi,
>
> Another area of optimization that I want to return to is memset and
> memcpy. Results are here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile.html
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile.html
>
> Ljuba, could you also test them on Haswell? It would be useful to know.
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile090813.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile090813.tar.bz2
>
> I did several experiments.
>
> The first uses stosq/movsq as generated with
> gcc -mstringop-strategy=rep_8byte memset_rep8.c -S -o memset_rep8.s
> The results here are chaotic.
>
> When the data is already in L1 cache, our loops are more effective.
> When I increased cache pressure so that the data sits in L2 cache, our
> loops are still better. But when the data is in L3 cache or in main
> memory, then on Nehalem, Core 2, and Ivy Bridge a rep implementation
> is significantly faster.
>
> On the other hand, on Bulldozer the rep implementation is always slower.
>
> For the rest of the architectures the results are chaotic.
>
> Whether to switch to rep foosq depends on memory behavior; I do not
> have a simple answer other than deciding by profile-based optimization.
>
> The second experiment was how effective a computed jump could be.
> I eliminated several overheads of the jump table and improved cache
> usage. My table goes up to size 1024, but if I cut it to the same size
> as the headers of the other implementations, a table would be more
> space-effective. But performance is still inferior.
> See files memset/cpy_tbl.s
>
> The third one is Haswell-specific: I am considering increasing the
> memset/memcpy header to handle up to 512 bytes. For memcpy I need to
> store additional data in ymm registers. A memset could be done without
> that, but I am still interested in whether this extension
> (memset/cpy_512.s) is successful.
>
> Then I discovered that my memcpy implementation could be improved in
> several cases, so I wrote memcpy_new_tuned.s, which tries to use a
> more effective control flow.
>
> Comments, new ideas?
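[Editor's note: a minimal sketch of the rep_8byte idea discussed above, for readers without the generated memset_rep8.s at hand. The function name and structure are illustrative, not the gcc output or the glibc code; it assumes x86-64 with GCC/Clang extended inline asm.]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the rep_8byte strategy: fill eight bytes at
   a time with rep stosq, then finish the tail with rep stosb.
   x86-64 only; the x86-64 ABI guarantees the direction flag is clear. */
static void *memset_rep8(void *dst, int c, size_t n)
{
    /* Replicate the byte into all eight lanes of a qword. */
    uint64_t pattern = 0x0101010101010101ULL * (uint8_t)c;
    void *d = dst;
    size_t qwords = n / 8, tail = n % 8;

    /* rep stosq: RDI = destination, RCX = qword count, RAX = value.
       RDI is left pointing just past the filled region. */
    __asm__ volatile ("rep stosq"
                      : "+D" (d), "+c" (qwords)
                      : "a" (pattern)
                      : "memory");
    /* rep stosb finishes the remaining 0..7 bytes from AL. */
    __asm__ volatile ("rep stosb"
                      : "+D" (d), "+c" (tail)
                      : "a" ((uint8_t)c)
                      : "memory");
    return dst;
}
```

This matches the thread's observation that the interesting variable is not the instruction sequence itself but where the data lives in the cache hierarchy.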
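[Editor's note: the computed-jump experiment can be sketched in C with GCC's "labels as values" extension. This toy version only handles sizes 0..8, whereas the table in the experiment went up to 1024; the name memset_small is mine.]

```c
#include <stddef.h>

/* Hypothetical sketch of a computed jump: dispatch on the size through
   a table of label addresses, so each small size runs a straight-line
   store sequence with no loop. GCC/Clang extension; n must be <= 8. */
static void memset_small(unsigned char *dst, int c, size_t n)
{
    static const void *const table[] = {
        &&s0, &&s1, &&s2, &&s3, &&s4, &&s5, &&s6, &&s7, &&s8
    };
    unsigned char b = (unsigned char)c;

    goto *table[n];
    /* Each label falls through to the next, storing one more byte. */
s8: dst[7] = b;
s7: dst[6] = b;
s6: dst[5] = b;
s5: dst[4] = b;
s4: dst[3] = b;
s3: dst[2] = b;
s2: dst[1] = b;
s1: dst[0] = b;
s0: return;
}
```

The trade-off the mail reports: the table costs a dependent load and extra cache footprint, which is why it stayed inferior to branchy headers despite being compact.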
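[Editor's note: the "header" being enlarged to 512 bytes relies on overlapping stores to cover a range of sizes without a loop. A minimal sketch of that technique for one size class, using plain 8-byte stores instead of the ymm registers the mail mentions; the function name is illustrative.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the overlapping-stores technique: any size in
   [8, 16] is handled by one store at the start and one at the end; for
   n < 16 the two stores overlap in the middle, which is harmless since
   both write the same pattern. The real headers do this with wider
   SSE/AVX stores across many size classes. */
static void memset_8_to_16(void *dst, int c, size_t n)
{
    uint64_t pattern = 0x0101010101010101ULL * (uint8_t)c;
    memcpy(dst, &pattern, 8);                  /* bytes [0, 8)      */
    memcpy((char *)dst + n - 8, &pattern, 8);  /* bytes [n-8, n)    */
}
```

Extending this scheme up to 512 bytes with 32-byte ymm stores is exactly why memcpy needs extra registers to hold the source data, while memset only needs the replicated pattern.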
Attachment:
log_memcpy_haswell.txt
Description: Text document
Attachment:
log_memset_silvermont.txt
Description: Text document
Attachment:
results_memcpy_atom.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memcpy_haswell_with_fail.tar.bz2
Description: BZip2 compressed data
Attachment:
results_memcpy_silvermot.tar.bz2
Description: BZip2 compressed data