[PATCH] ARM: Add Cortex-A15 optimized NEON and VFP memcpy routines, with IFUNC.

Will Newton will.newton@linaro.org
Mon Apr 15 10:38:00 GMT 2013


On 15 April 2013 11:06, Måns Rullgård <mans@mansr.com> wrote:

Hi Måns,

>> Add a high performance memcpy routine optimized for Cortex-A15 with
>> variants for use in the presence of NEON and VFP hardware, selected
>> at runtime using indirect function support.
>
> How does this perform on Cortex-A9?

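The code is also faster on A9 in most cases, although the gains are not
quite as pronounced as on A15, and a few cases (63-byte copies at
alignments 1, 2 and 4, for example) come out slower. A set of numbers is
attached (they line-wrap pretty horribly inline).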


--
Will Newton
Toolchain Working Group, Linaro
-------------- next part --------------
before:8:100000000:1:3.847382: took 3.847382 s for 100000000 calls to memcpy of 8 bytes.  ~258.630 MB/s corrected.
after:8:100000000:1:3.171783: took 3.171783 s for 100000000 calls to memcpy of 8 bytes.  ~335.458 MB/s corrected.
before:8:100000000:2:3.763550: took 3.763550 s for 100000000 calls to memcpy of 8 bytes.  ~266.195 MB/s corrected.
after:8:100000000:2:2.360168: took 2.360168 s for 100000000 calls to memcpy of 8 bytes.  ~521.594 MB/s corrected.
before:8:100000000:4:3.183990: took 3.183990 s for 100000000 calls to memcpy of 8 bytes.  ~333.667 MB/s corrected.
after:8:100000000:4:2.357422: took 2.357422 s for 100000000 calls to memcpy of 8 bytes.  ~522.575 MB/s corrected.
before:8:100000000:8:3.105652: took 3.105652 s for 100000000 calls to memcpy of 8 bytes.  ~345.504 MB/s corrected.
after:8:100000000:8:2.339081: took 2.339081 s for 100000000 calls to memcpy of 8 bytes.  ~529.223 MB/s corrected.
before:16:100000000:1:3.887695: took 3.887695 s for 100000000 calls to memcpy of 16 bytes.  ~510.287 MB/s corrected.
after:16:100000000:1:2.506378: took 2.506378 s for 100000000 calls to memcpy of 16 bytes.  ~948.388 MB/s corrected.
before:16:100000000:2:4.114410: took 4.114410 s for 100000000 calls to memcpy of 16 bytes.  ~474.325 MB/s corrected.
after:16:100000000:2:2.506226: took 2.506226 s for 100000000 calls to memcpy of 16 bytes.  ~948.478 MB/s corrected.
before:16:100000000:4:3.460236: took 3.460236 s for 100000000 calls to memcpy of 16 bytes.  ~595.401 MB/s corrected.
after:16:100000000:4:2.509155: took 2.509155 s for 100000000 calls to memcpy of 16 bytes.  ~946.754 MB/s corrected.
before:16:100000000:8:3.344055: took 3.344055 s for 100000000 calls to memcpy of 16 bytes.  ~623.674 MB/s corrected.
after:16:100000000:8:2.339264: took 2.339264 s for 100000000 calls to memcpy of 16 bytes.  ~1058.312 MB/s corrected.
before:20:100000000:1:4.080444: took 4.080444 s for 100000000 calls to memcpy of 20 bytes.  ~599.233 MB/s corrected.
after:20:100000000:1:3.094452: took 3.094452 s for 100000000 calls to memcpy of 20 bytes.  ~868.164 MB/s corrected.
before:20:100000000:2:4.399658: took 4.399658 s for 100000000 calls to memcpy of 20 bytes.  ~544.615 MB/s corrected.
after:20:100000000:2:3.091522: took 3.091522 s for 100000000 calls to memcpy of 20 bytes.  ~869.323 MB/s corrected.
before:20:100000000:4:3.512451: took 3.512451 s for 100000000 calls to memcpy of 20 bytes.  ~729.390 MB/s corrected.
after:20:100000000:4:3.094696: took 3.094696 s for 100000000 calls to memcpy of 20 bytes.  ~868.067 MB/s corrected.
before:20:100000000:8:3.579956: took 3.579956 s for 100000000 calls to memcpy of 20 bytes.  ~711.035 MB/s corrected.
after:20:100000000:8:2.339600: took 2.339600 s for 100000000 calls to memcpy of 20 bytes.  ~1322.583 MB/s corrected.
before:31:100000000:1:4.722931: took 4.722931 s for 100000000 calls to memcpy of 31 bytes.  ~772.817 MB/s corrected.
after:31:100000000:1:3.512634: took 3.512634 s for 100000000 calls to memcpy of 31 bytes.  ~1130.475 MB/s corrected.
before:31:100000000:2:4.926422: took 4.926422 s for 100000000 calls to memcpy of 31 bytes.  ~733.785 MB/s corrected.
after:31:100000000:2:3.700684: took 3.700684 s for 100000000 calls to memcpy of 31 bytes.  ~1054.640 MB/s corrected.
before:31:100000000:4:3.725647: took 3.725647 s for 100000000 calls to memcpy of 31 bytes.  ~1045.331 MB/s corrected.
after:31:100000000:4:3.430481: took 3.430481 s for 100000000 calls to memcpy of 31 bytes.  ~1167.140 MB/s corrected.
before:31:100000000:8:3.706085: took 3.706085 s for 100000000 calls to memcpy of 31 bytes.  ~1052.611 MB/s corrected.
after:31:100000000:8:2.669373: took 2.669373 s for 100000000 calls to memcpy of 31 bytes.  ~1668.474 MB/s corrected.
before:32:100000000:1:4.521362: took 4.521362 s for 100000000 calls to memcpy of 32 bytes.  ~842.119 MB/s corrected.
after:32:100000000:1:3.682373: took 3.682373 s for 100000000 calls to memcpy of 32 bytes.  ~1095.818 MB/s corrected.
before:32:100000000:2:4.879456: took 4.879456 s for 100000000 calls to memcpy of 32 bytes.  ~766.389 MB/s corrected.
after:32:100000000:2:3.680542: took 3.680542 s for 100000000 calls to memcpy of 32 bytes.  ~1096.539 MB/s corrected.
before:32:100000000:4:3.563934: took 3.563934 s for 100000000 calls to memcpy of 32 bytes.  ~1144.492 MB/s corrected.
after:32:100000000:4:3.679932: took 3.679932 s for 100000000 calls to memcpy of 32 bytes.  ~1096.779 MB/s corrected.
before:32:100000000:8:3.602142: took 3.602142 s for 100000000 calls to memcpy of 32 bytes.  ~1128.324 MB/s corrected.
after:32:100000000:8:2.703949: took 2.703949 s for 100000000 calls to memcpy of 32 bytes.  ~1689.331 MB/s corrected.
before:63:100000000:1:5.548370: took 5.548370 s for 100000000 calls to memcpy of 63 bytes.  ~1291.822 MB/s corrected.
after:63:100000000:1:5.854523: took 5.854523 s for 100000000 calls to memcpy of 63 bytes.  ~1212.038 MB/s corrected.
before:63:100000000:2:5.685883: took 5.685883 s for 100000000 calls to memcpy of 63 bytes.  ~1254.724 MB/s corrected.
after:63:100000000:2:6.084839: took 6.084839 s for 100000000 calls to memcpy of 63 bytes.  ~1158.224 MB/s corrected.
before:63:100000000:4:4.683136: took 4.683136 s for 100000000 calls to memcpy of 63 bytes.  ~1587.074 MB/s corrected.
after:63:100000000:4:5.771179: took 5.771179 s for 100000000 calls to memcpy of 63 bytes.  ~1232.765 MB/s corrected.
before:63:100000000:8:4.640594: took 4.640594 s for 100000000 calls to memcpy of 63 bytes.  ~1605.112 MB/s corrected.
after:63:100000000:8:4.098389: took 4.098389 s for 100000000 calls to memcpy of 63 bytes.  ~1877.002 MB/s corrected.
before:64:100000000:1:5.395660: took 5.395660 s for 100000000 calls to memcpy of 64 bytes.  ~1356.879 MB/s corrected.
after:64:100000000:1:4.349274: took 4.349274 s for 100000000 calls to memcpy of 64 bytes.  ~1768.205 MB/s corrected.
before:64:100000000:2:5.692108: took 5.692108 s for 100000000 calls to memcpy of 64 bytes.  ~1272.985 MB/s corrected.
after:64:100000000:2:4.457306: took 4.457306 s for 100000000 calls to memcpy of 64 bytes.  ~1714.545 MB/s corrected.
before:64:100000000:4:4.468567: took 4.468567 s for 100000000 calls to memcpy of 64 bytes.  ~1709.138 MB/s corrected.
after:64:100000000:4:4.772614: took 4.772614 s for 100000000 calls to memcpy of 64 bytes.  ~1575.038 MB/s corrected.
before:64:100000000:8:4.309143: took 4.309143 s for 100000000 calls to memcpy of 64 bytes.  ~1789.004 MB/s corrected.
after:64:100000000:8:3.262054: took 3.262054 s for 100000000 calls to memcpy of 64 bytes.  ~2581.210 MB/s corrected.
before:100:100000000:1:7.877625: took 7.877625 s for 100000000 calls to memcpy of 100 bytes.  ~1366.263 MB/s corrected.
after:100:100000000:1:4.935211: took 4.935211 s for 100000000 calls to memcpy of 100 bytes.  ~2361.895 MB/s corrected.
before:100:100000000:2:8.309174: took 8.309174 s for 100000000 calls to memcpy of 100 bytes.  ~1286.712 MB/s corrected.
after:100:100000000:2:4.851624: took 4.851624 s for 100000000 calls to memcpy of 100 bytes.  ~2411.823 MB/s corrected.
before:100:100000000:4:5.450745: took 5.450745 s for 100000000 calls to memcpy of 100 bytes.  ~2094.476 MB/s corrected.
after:100:100000000:4:5.515472: took 5.515472 s for 100000000 calls to memcpy of 100 bytes.  ~2065.119 MB/s corrected.
before:100:100000000:8:5.214142: took 5.214142 s for 100000000 calls to memcpy of 100 bytes.  ~2209.276 MB/s corrected.
after:100:100000000:8:4.516113: took 4.516113 s for 100000000 calls to memcpy of 100 bytes.  ~2635.440 MB/s corrected.
before:200:100000000:1:8.623077: took 8.623077 s for 100000000 calls to memcpy of 200 bytes.  ~2468.862 MB/s corrected.
after:200:100000000:1:7.694977: took 7.694977 s for 100000000 calls to memcpy of 200 bytes.  ~2805.949 MB/s corrected.
before:200:100000000:2:9.148895: took 9.148895 s for 100000000 calls to memcpy of 200 bytes.  ~2311.536 MB/s corrected.
after:200:100000000:2:7.444061: took 7.444061 s for 100000000 calls to memcpy of 200 bytes.  ~2913.494 MB/s corrected.
before:200:100000000:4:8.382385: took 8.382385 s for 100000000 calls to memcpy of 200 bytes.  ~2548.253 MB/s corrected.
after:200:100000000:4:7.862091: took 7.862091 s for 100000000 calls to memcpy of 200 bytes.  ~2738.621 MB/s corrected.
before:200:100000000:8:8.110168: took 8.110168 s for 100000000 calls to memcpy of 200 bytes.  ~2644.428 MB/s corrected.
after:200:100000000:8:6.816742: took 6.816742 s for 100000000 calls to memcpy of 200 bytes.  ~3222.264 MB/s corrected.
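
To compare the before/after pairs above without reading them by eye, something like the following could be used. It is only a sketch, not part of the patch or of the benchmark harness: the function names are made up for illustration, and it assumes each line starts with five colon-separated fields (variant, copy size in bytes, number of calls, a fourth field that looks like the alignment, and elapsed seconds). The size, call count and time are confirmed by the text on each line; reading the fourth field as alignment is a guess.

```python
from collections import defaultdict


def parse_line(line):
    """Split one log line into (variant, size, calls, alignment, seconds).

    Only the colon-separated prefix before ": took" is used.
    "alignment" is an assumed name for the fourth field.
    """
    head = line.split(": took", 1)[0]
    variant, size, calls, alignment, seconds = head.split(":")
    return variant, int(size), int(calls), int(alignment), float(seconds)


def speedups(lines):
    """Return {(size, alignment): before_seconds / after_seconds}.

    A value above 1.0 means the "after" run was faster.
    Lines that are not benchmark results are skipped.
    """
    times = defaultdict(dict)
    for line in lines:
        if not line.startswith(("before:", "after:")):
            continue
        variant, size, _calls, alignment, seconds = parse_line(line)
        times[(size, alignment)][variant] = seconds
    return {
        key: t["before"] / t["after"]
        for key, t in times.items()
        if "before" in t and "after" in t
    }


if __name__ == "__main__":
    import sys

    # Usage: python speedups.py < numbers.txt
    for (size, alignment), ratio in sorted(speedups(sys.stdin).items()):
        print(f"{size:4d} bytes, field 4 = {alignment}: {ratio:.2f}x")
```

Fed the first pair above (8 bytes, fourth field 1), this gives a ratio of about 1.21, i.e. 3.847382 s divided by 3.171783 s. Ratios below 1.0 mark the cases where the new code is slower.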

