This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[RFC] Improving memcpy and memset.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Fri, 9 Aug 2013 23:02:31 +0200
- Subject: [RFC] Improving memcpy and memset.
Hi,
Another area of optimization that I want to return to is memset and
memcpy. Results are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile.html
Ljuba, could you also test them on Haswell? It would be useful to know.
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile090813.tar.bz2
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile090813.tar.bz2
I did several experiments.
The first one uses stosq/movsq, as generated with
gcc -mstringop-strategy=rep_8byte memset_rep8.c -S -o memset_rep8.s
Results here are chaotic.
When data is already in L1 cache, our loops are more effective.
When I increased cache pressure so that data is in L2 cache, the loops
are still better.
But when data is in L3 cache or in main memory, then on Nehalem, Core 2 and Ivy Bridge
the rep implementation is significantly faster.
On the other hand, on Bulldozer the rep implementation is always slower.
For the rest of the architectures, results are chaotic.
Whether to switch to rep stosq/movsq depends on memory behavior. I do
not have a simple answer other than to decide by profile-based optimization.
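
For readers who have not seen the rep variant, here is a minimal sketch of
a memset built around rep stosq. It is not the code from memset_rep8.s and
not what glibc uses; the name memset_rep8_sketch, the inline asm and the
byte-wise tail are assumptions made only for illustration. It assumes the
direction flag is clear, as the x86-64 ABI requires at function entry.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from glibc: fill n bytes at dst with byte c,
   using rep stosq for the 8-byte bulk on x86-64. */
static void *memset_rep8_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    /* Broadcast the byte into all eight bytes of a 64-bit word. */
    uint64_t pattern = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;
    size_t qwords = n / 8;
    size_t tail = n % 8;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* rep stosq stores RAX to [RDI] RCX times, advancing RDI by 8 each time. */
    __asm__ volatile("rep stosq"
                     : "+D"(p), "+c"(qwords)
                     : "a"(pattern)
                     : "memory");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    for (; qwords != 0; qwords--, p += 8)
        memcpy(p, &pattern, 8);
#endif

    /* Finish the remaining 0..7 bytes one at a time. */
    while (tail--)
        *p++ = (unsigned char)c;
    return dst;
}
```

A real implementation would probably pick between such a routine and an
unrolled loop by size or by profile, since the measurements above show
neither is faster everywhere.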
The second experiment was to see how effective a computed jump could be.
I eliminated several overheads, for example I improved the jump table's cache usage.
My table covers sizes up to 1024, but if I cut it to the same size as the headers of
the other implementations, the table would be more space-efficient.
But performance is still inferior. See files memset/cpy_tbl.s
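
To show what a computed jump means here, a small C sketch follows. It is
not the code in memset/cpy_tbl.s: the name memset_small_tbl_sketch and the
limit of 8 bytes are assumptions for illustration, and whether the compiler
really turns the switch into a jump table depends on the compiler and its
options.

```c
#include <stddef.h>

/* Hypothetical helper: set n bytes with a single indexed jump for n <= 8. */
static void *memset_small_tbl_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)c;

    /* A dense switch like this is usually compiled to a jump table:
       the size selects an entry point and the stores fall through. */
    switch (n) {
    case 8: p[7] = b; /* fall through */
    case 7: p[6] = b; /* fall through */
    case 6: p[5] = b; /* fall through */
    case 5: p[4] = b; /* fall through */
    case 4: p[3] = b; /* fall through */
    case 3: p[2] = b; /* fall through */
    case 2: p[1] = b; /* fall through */
    case 1: p[0] = b; /* fall through */
    case 0: break;
    default:
        /* Sizes outside the table take a plain byte loop in this sketch. */
        for (size_t i = 0; i < n; i++)
            p[i] = b;
        break;
    }
    return dst;
}
```

The indirect jump replaces a chain of size comparisons, but it costs a
table load and a branch that can be hard to predict.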
The third one is Haswell-specific: I am considering increasing the memset/memcpy
header to handle up to 512 bytes. For memcpy I need to store additional
data in ymm registers. A memset could be done without that, but I am
still interested in whether this extension (memset/cpy_512.s) is successful.
Then I discovered that my memcpy implementation could be improved in
several cases, so I wrote memcpy_new_tuned.s, which tries to use more
effective control flow.
Comments, new ideas?