This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] Improving memcpy and memset.


The memset profiler fails with a segmentation fault everywhere. I've
attached the log.
The memcpy profiler hits memory corruption (a double free) when I run it
on Haswell. I've attached that log as well.
On Atom and Silvermont it works.


--
Liubov Dmitrieva
Intel Corporation


On Sat, Aug 10, 2013 at 1:02 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> Hi,
>
> Another area of optimization that I want to return to is memset and
> memcpy. Results are here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile.html
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile.html
>
> Ljuba, could you also test them on Haswell? It would be useful to know.
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile090813.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile090813.tar.bz2
>
> I did several experiments.
>
> One is using stosq/movsq as generated with
> gcc -mstringop-strategy=rep_8byte memset_rep8.c -S -o memset_rep8.s
> Results here are chaotic.
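
For readers without the tarball: memset_rep8.c is not reproduced in this message, so the following is only a minimal sketch of what a "rep stosq" style memset does, not the benchmarked code. The name memset_rep8_sketch is hypothetical. It assumes GCC-style inline assembly on x86-64 and falls back to a plain loop of 8-byte stores elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch; NOT the memset_rep8.c used for the measurements. */
void *memset_rep8_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    /* Broadcast the fill byte into all eight bytes of a 64-bit word. */
    uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;
    size_t words = n / 8;   /* number of 8-byte stores */
    size_t tail = n % 8;    /* bytes left over after the word stores */

#if defined(__x86_64__) && defined(__GNUC__)
    /* rep stosq stores RAX at [RDI] RCX times, advancing RDI by 8 each time.
       The ABI guarantees the direction flag is clear on function entry. */
    __asm__ volatile("rep stosq"
                     : "+D"(p), "+c"(words)
                     : "a"(pattern)
                     : "memory");
#else
    /* Portable fallback: the same 8-byte stores written as a loop. */
    for (; words > 0; words--, p += 8)
        memcpy(p, &pattern, 8);
#endif

    /* Finish the remaining 0..7 bytes one at a time. */
    while (tail--)
        *p++ = (unsigned char)c;
    return dst;
}
```

A real implementation would also have to deal with alignment and small sizes, which this sketch ignores.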
>
> When the data is already in the L1 cache, our loops are more effective.
> When I increased cache pressure so that the data is in the L2 cache, the loops
> are still better.
> But when the data is in the L3 cache or in main memory, the rep implementation
> is significantly faster on Nehalem, Core 2 and Ivy Bridge.
>
> On the other hand, on Bulldozer the rep implementation is always slower.
>
> For the rest of the architectures the results are chaotic.
>
> Whether to switch to rep stosq/movsq depends on memory behavior. I do
> not have a simple answer other than deciding by profile-based optimization.
>
>
> The second experiment was how effective a computed jump could be.
> I eliminated several overheads, such as improving the jump table's cache usage.
> My table goes up to size 1024, but if I cut it to the same size as the headers of
> the other implementations, the table would be more space-efficient.
> But performance is still inferior. See files memset/cpy_tbl.s
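
To illustrate the computed-jump idea for readers without the attachments: the sketch below is not memset/cpy_tbl.s. It is a hypothetical C analogue (the names set_small and memset_tbl_sketch are made up here) with a table of only 9 entries instead of 1024. It relies on the assumption that a compiler can lower a dense switch to a jump table, which is common but not guaranteed.

```c
#include <stddef.h>

/* Hypothetical helper: fill n bytes (n <= 8) with one computed jump.
   Control jumps to the case for n and falls through the remaining stores. */
static void set_small(unsigned char *p, unsigned char c, size_t n)
{
    switch (n) {
    case 8: p[7] = c; /* fall through */
    case 7: p[6] = c; /* fall through */
    case 6: p[5] = c; /* fall through */
    case 5: p[4] = c; /* fall through */
    case 4: p[3] = c; /* fall through */
    case 3: p[2] = c; /* fall through */
    case 2: p[1] = c; /* fall through */
    case 1: p[0] = c; /* fall through */
    case 0: break;
    }
}

/* Hypothetical sketch; NOT the implementation measured in this thread. */
void *memset_tbl_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;

    /* Full 8-byte chunks take the longest path through the switch. */
    while (n > 8) {
        set_small(p, (unsigned char)c, 8);
        p += 8;
        n -= 8;
    }
    /* The remaining 0..8 bytes are handled by a single dispatch. */
    set_small(p, (unsigned char)c, n);
    return dst;
}
```

The trade-off named in the message shows up even here: a larger table removes loop iterations but costs more code space.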
>
>
> The third one is Haswell-specific: I am considering increasing the memset/memcpy
> header to handle up to 512 bytes. For memcpy I need to store additional
> data in ymm registers. A memset could be done without that, but I am
> still interested in whether this extension (memset/cpy_512.s) is successful.
>
>
> Then I discovered that my memcpy implementation could be improved in
> several cases, so I wrote memcpy_new_tuned.s, which tries to use more
> effective control flow.
>
> Comments, new ideas?

Attachment: log_memcpy_haswell.txt
Description: Text document

Attachment: log_memset_silvermont.txt
Description: Text document

Attachment: results_memcpy_atom.tar.bz2
Description: BZip2 compressed data

Attachment: results_memcpy_haswell_with_fail.tar.bz2
Description: BZip2 compressed data

Attachment: results_memcpy_silvermot.tar.bz2
Description: BZip2 compressed data

