This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

[RFC] Improving memcpy and memset.

From: Ondřej Bílka <neleai at seznam dot cz>
To: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
Cc: libc-alpha at sourceware dot org
Date: Fri, 9 Aug 2013 23:02:31 +0200
Subject: [RFC] Improving memcpy and memset.

Hi,

Another area of optimization that I want to return to is memset and
memcpy. Results are here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile.html

Ljuba, could you also test them on haswell? It would be useful to know.

http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile090813.tar.bz2
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile090813.tar.bz2

I did several experiments.

The first uses rep stosq/movsq, as generated with
gcc -mstringop-strategy=rep_8byte memset_rep8.c -S -o memset_rep8.s
Results here are chaotic.
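
For readers without the tarball, here is a minimal sketch of what a
rep stosq based memset looks like. It is hand-written GNU C inline
assembly, not the compiler output from memset_rep8.c, and the function
name memset_rep8 is only borrowed from that file name. It assumes
x86-64 and a GCC-compatible compiler, and falls back to the library
memset elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: fill n bytes at dst with the byte c using rep stosq. */
static void *memset_rep8(void *dst, int c, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    unsigned char *p = dst;
    /* Broadcast the fill byte into all eight bytes of a quadword. */
    uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;
    size_t qwords = n / 8;
    size_t tail = n % 8;

    /* rep stosq stores RAX to [RDI] RCX times, advancing RDI by 8
       after each store (the ABI guarantees the direction flag is clear). */
    __asm__ volatile("rep stosq"
                     : "+D"(p), "+c"(qwords)
                     : "a"(pattern)
                     : "memory");

    /* p now points just past the last full quadword; finish the tail. */
    while (tail--)
        *p++ = (unsigned char)c;
    return dst;
#else
    /* Assumption: on other targets just defer to the library. */
    return memset(dst, c, n);
#endif
}
```

The whole bulk fill is a single instruction, so how it performs is left
to the microcode of the particular CPU.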

When the data is already in L1 cache, our loops are more efficient.
When I increased cache pressure so that the data is in L2 cache, the
loops are still better.
But when the data is in L3 cache or in main memory, the rep
implementation is significantly faster on nehalem, core2 and ivy bridge.

On the other hand, on bulldozer the rep implementation is always slower.

For the rest of the architectures the results are chaotic.

Whether to switch to rep foosq depends on memory behavior. I do not
have a simple answer other than deciding by profile-based optimization.

The second experiment was to see how effective a computed jump could be.
I eliminated several overheads, for example by improving the cache usage
of the jump table. My table covers sizes up to 1024, but if I cut it to
the same size as the headers of the other implementations, the table
would be more space efficient.
But performance is still inferior. See the files memset/cpy_tbl.s
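
To illustrate the idea of a computed jump on the size, here is a small
C sketch, not the code from memset_tbl.s. The function name memset_tbl
and the cutoff TBL_MAX are made up for the example, and a compiler may
or may not turn the switch into an actual jump table.

```c
#include <stddef.h>
#include <string.h>

#define TBL_MAX 8   /* hypothetical cutoff; the real table goes up to 1024 */

/* Illustrative only: dispatch on n, then fall through the stores. */
static void *memset_tbl(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)c;

    if (n > TBL_MAX)
        return memset(dst, c, n);   /* large sizes take the ordinary path */

    /* Entering at case n executes exactly n byte stores. */
    switch (n) {
    case 8: p[7] = b; /* fall through */
    case 7: p[6] = b; /* fall through */
    case 6: p[5] = b; /* fall through */
    case 5: p[4] = b; /* fall through */
    case 4: p[3] = b; /* fall through */
    case 3: p[2] = b; /* fall through */
    case 2: p[1] = b; /* fall through */
    case 1: p[0] = b; /* fall through */
    case 0: break;
    }
    return dst;
}
```

The cost is one indirect branch per call plus the table itself, which
grows with the largest size handled.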

The third one is haswell specific. I am considering increasing the
memset/memcpy header to handle up to 512 bytes. For memcpy I need to
store additional data in ymm registers. A memset could be done without
that, but I am still interested in whether this extension
(memset/cpy_512.s) is successful.
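
The mail does not show how the header works, so the following is only a
guess at the general technique: cover a range of sizes without a loop by
loading the first and last block (which may overlap) and storing only
after all loads are done. The function name memcpy_head64 and the 32 to
64 byte range are assumptions for illustration. The temporaries stand in
for ymm registers, and widening the same scheme needs one more
temporary per extra 32 bytes.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 32   /* width of one ymm register in bytes */

/* Illustrative only: copy n bytes for BLOCK <= n <= 2 * BLOCK.
   Returns NULL for sizes outside that range. */
static void *memcpy_head64(void *dst, const void *src, size_t n)
{
    unsigned char first[BLOCK], last[BLOCK];
    const unsigned char *s = src;
    unsigned char *d = dst;

    if (n < BLOCK || n > 2 * BLOCK)
        return NULL;

    /* Load the first and the last block; they overlap when n < 2 * BLOCK. */
    memcpy(first, s, BLOCK);
    memcpy(last, s + n - BLOCK, BLOCK);

    /* Store both blocks only after every load has completed. */
    memcpy(d, first, BLOCK);
    memcpy(d + n - BLOCK, last, BLOCK);
    return dst;
}
```

A memset built the same way needs only one register holding the fill
pattern, since there is nothing to load.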

Then I discovered that my memcpy implementation could be improved in
several cases, so I wrote memcpy_new_tuned.s, which tries to use more
efficient control flow.

Comments, new ideas?

Follow-Ups:
- Re: [RFC] Improving memcpy and memset.
  - From: Liubov Dmitrieva
