This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



[RFC] Improving memcpy and memset.


Hi,

Another area of optimization that I want to return to is memset and
memcpy. Profiling results are here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile.html

Ljuba, could you also test them on Haswell? The results would be useful to know.

http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile090813.tar.bz2
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile090813.tar.bz2

I did several experiments.

The first uses stosq/movsq, as generated with
gcc -mstringop-strategy=rep_8byte memset_rep8.c -S -o memset_rep8.s
Results here are chaotic.
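
For readers who have not seen the rep variant, here is a minimal sketch of
what a rep stosq based memset looks like. It is not the code gcc emits for
memset_rep8.c; the function name memset_rep8_sketch is hypothetical, the
inline asm assumes GCC or Clang on x86-64, and other targets fall back to a
plain byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: fill n bytes at dst with byte c.
   On x86-64 the 8-byte body is written with rep stosq. */
static void *memset_rep8_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    assert(dst != NULL || n == 0);
#if defined(__x86_64__) && defined(__GNUC__)
    /* Broadcast the byte into all eight lanes of a 64-bit word. */
    uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;
    size_t qwords = n / 8;
    /* rep stosq stores RAX to [RDI] RCX times, advancing RDI by 8
       each time (the ABI guarantees the direction flag is clear). */
    __asm__ volatile ("rep stosq"
                      : "+D"(p), "+c"(qwords)
                      : "a"(pattern)
                      : "memory");
    n %= 8;
#endif
    /* Remaining tail bytes (or the whole buffer on other targets). */
    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}
```

The sketch does no alignment of the destination before the rep prefix; a
tuned version would probably want to treat that separately.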

When data is already in L1 cache, our loops are more effective.
When I increased cache pressure so that data is in L2 cache, the loops
are still better.
But when data is in L3 cache or in main memory, then on Nehalem, Core 2 and Ivy Bridge
a rep implementation is significantly faster.

On the other hand, on Bulldozer the rep implementation is always slower.

For the rest of the architectures results are chaotic.

Whether to switch to rep stosq/movsq depends on memory behavior; I do
not have a simple answer other than deciding by profile-based optimization.


The second experiment was how effective a computed jump could be.
I eliminated several overheads, such as the cache usage of the jump table.
My table goes up to size 1024, but if I cut it to the same size as the headers of
other implementations the table would be more space efficient.
But performance is still inferior. See files memset/cpy_tbl.s
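
To illustrate the idea of a computed jump on the size, here is a small C
sketch. It is not the code in the _tbl.s files: the name memset_small_tbl
and the cutoff of 8 bytes are hypothetical (my table goes up to 1024), and
it relies on the compiler commonly lowering a dense switch to a jump table,
which is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: dispatch on n and fall through the stores,
   so sizes 0..8 need no loop. Larger sizes use a plain loop. */
static void *memset_small_tbl(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char v = (unsigned char)c;
    assert(dst != NULL || n == 0);
    switch (n) {
    case 8: p[7] = v; /* fall through */
    case 7: p[6] = v; /* fall through */
    case 6: p[5] = v; /* fall through */
    case 5: p[4] = v; /* fall through */
    case 4: p[3] = v; /* fall through */
    case 3: p[2] = v; /* fall through */
    case 2: p[1] = v; /* fall through */
    case 1: p[0] = v; /* fall through */
    case 0: return dst;
    default:
        /* Beyond the table: generic byte loop. */
        for (size_t i = 0; i < n; i++)
            p[i] = v;
        return dst;
    }
}
```

The indirect jump is the cost to watch here: its target depends on n, so
it only predicts well when sizes repeat.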


The third one is Haswell specific: I am considering increasing the memset/memcpy
header to handle up to 512 bytes. For memcpy I need to store additional
data in ymm registers. A memset could be done without that, but I am
still interested in whether this extension (memset/cpy_512.s) is successful.
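
As a scaled-down illustration of a header that keeps data in registers
before storing, here is a sketch that copies 32 to 64 bytes with two
possibly overlapping 32-byte loads followed by two stores. It is not taken
from the _512.s files: the name copy_32_to_64_sketch is hypothetical, and
the vec32 struct is only a portable stand-in for a ymm register.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Portable stand-in for one 32-byte ymm register. */
typedef struct { unsigned char b[32]; } vec32;

/* Hypothetical sketch: copy n bytes, 32 <= n <= 64, without a loop.
   Both loads happen before either store, and the tail block overlaps
   the head block whenever n < 64. */
static void *copy_32_to_64_sketch(void *dst, const void *src, size_t n)
{
    vec32 head, tail;
    assert(n >= 32 && n <= 64);
    memcpy(&head, src, 32);                                   /* first 32 bytes */
    memcpy(&tail, (const unsigned char *)src + n - 32, 32);   /* last 32 bytes */
    memcpy(dst, &head, 32);
    memcpy((unsigned char *)dst + n - 32, &tail, 32);
    return dst;
}
```

Extending the same pattern to 512 bytes needs many more temporaries live
at once, which is where the extra ymm registers come in for memcpy.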


Then I discovered that my memcpy implementation could be improved in
several cases, so I wrote memcpy_new_tuned.s, which tries to use more
effective control flow.

Comments, new ideas?

