This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
- From: "Zhangxuelei (Derek)" <zhangxuelei4 at huawei dot com>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, "yikunkero at gmail dot com" <yikunkero at gmail dot com>, jiangyikun <jiangyikun at huawei dot com>
- Cc: nd <nd at arm dot com>
- Date: Thu, 17 Oct 2019 12:49:50 +0000
- Subject: Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
> Well it looks like the dst_unaligned code (which deals with a specific issue on ThunderX2) is completely
> unnecessary on Kunpeng since the unaligned cases in eg. Falkor and generic aren't slower than the
> aligned cases. So I'd suggest to remove this code - it adds a lot of code, thus making memcpy
> unnecessarily large.
Yes, thanks for the reminder. We will remove the dst_unaligned code in the next patch. Note that we have tested the version without it, and the results are the same as before the removal.
> Well these results show a very significant 4% win for Falkor memcpy! It seems strange to only optimize
> for large sizes when the vast majority of copies in real code are very small (note the distribution of the
> sizes and alignment for the random benchmark come from SPEC).
Sure, we agree that the Falkor memcpy has a 4% win on small sizes. However, when we started optimizing memcpy for Kunpeng, one of the most important use cases was databases, which really need more improvement on large sizes.
So we used memcpy-bench-large to help us choose the baseline, and the results suggest that ThunderX2 performs better in the Kunpeng environment (see the large-size benchmark results below). It is really confusing to us why Falkor behaves so differently between bench-walk and bench-large.
We also know that small and medium sizes are very important cases for us, so we did more optimization on medium sizes in the branch L(copy2048_large). The remaining gap on small sizes is acceptable in our opinion, since the result is at least better than generic.
Function: memcpy
Variant: large
__memcpy_thunderx __memcpy_thunderx2 __memcpy_falkor __memcpy_kunpeng __memcpy_generic
========================================================================================================================
length=65543, align1=0, align2=0: 4238.12 (-101.40%) 2295.00 ( -9.06%) 2156.25 ( -2.46%) 2301.25 ( -9.36%) 2104.38
length=65551, align1=0, align2=3: 3101.88 ( -2.97%) 2283.75 ( 24.19%) 3562.50 (-18.26%) 3332.50 (-10.62%) 3012.50
length=65567, align1=3, align2=0: 2899.38 ( 18.38%) 2285.62 ( 35.66%) 3800.00 ( -6.97%) 3333.75 ( 6.16%) 3552.50
length=65599, align1=3, align2=5: 2866.88 ( 2.09%) 2258.75 ( 22.86%) 3333.75 (-13.85%) 3316.25 (-13.26%) 2928.12
length=131079, align1=0, align2=0: 6905.00 (-43.54%) 5153.75 ( -7.13%) 5848.75 (-21.58%) 5185.62 ( -7.80%) 4810.62
length=131087, align1=0, align2=3: 7232.50 ( -1.16%) 5138.75 ( 28.12%) 7966.88 (-11.43%) 7117.50 ( 0.45%) 7149.38
length=131103, align1=3, align2=0: 6889.38 ( -0.69%) 5145.00 ( 24.81%) 7562.50 (-10.52%) 7094.38 ( -3.68%) 6842.50
length=131135, align1=3, align2=5: 7096.88 ( 0.38%) 5116.88 ( 28.17%) 7501.88 ( -5.31%) 6989.38 ( 1.89%) 7123.75
length=262151, align1=0, align2=0: 15736.90 (-37.76%) 11581.90 ( -1.39%) 13088.10 (-14.58%) 11993.80 ( -5.00%) 11423.10
length=262159, align1=0, align2=3: 16581.20 ( -1.58%) 11655.00 ( 28.60%) 24495.60 (-50.07%) 16983.10 ( -4.05%) 16322.50
length=262175, align1=3, align2=0: 14940.60 ( -1.26%) 11769.40 ( 20.23%) 22577.50 (-53.02%) 16925.00 (-15.00%) 14755.00
length=262207, align1=3, align2=5: 16167.50 ( 0.64%) 11700.60 ( 28.09%) 23373.10 (-43.64%) 16959.40 ( -4.23%) 16271.90
length=524295, align1=0, align2=0: 59790.60 (-11.75%) 37941.90 ( 29.09%) 55081.90 ( -2.95%) 38127.50 ( 28.74%) 53505.00
length=524303, align1=0, align2=3: 58881.90 ( 11.26%) 38826.90 ( 41.48%) 82475.60 (-24.30%) 45186.20 ( 31.90%) 66351.20
length=524319, align1=3, align2=0: 52656.90 ( 13.39%) 36772.50 ( 39.52%) 79804.40 (-31.26%) 44925.60 ( 26.11%) 60800.60
length=524351, align1=3, align2=5: 60072.50 ( 9.98%) 38996.90 ( 41.56%) 80605.60 (-20.79%) 44833.80 ( 32.82%) 66731.90
length=1048583, align1=0, align2=0: 140744.00 ( 0.00%) 92834.40 ( 34.15%) 156891.00 (-12.00%) 92700.60 ( 34.25%) 140981.00
length=1048591, align1=0, align2=3: 138818.00 ( 14.00%) 93137.50 ( 42.85%) 181426.00 (-12.00%) 99160.00 ( 39.00%) 162969.00
length=1048607, align1=3, align2=0: 123231.00 ( 19.00%) 85615.00 ( 44.00%) 180895.00 (-19.00%) 99104.40 ( 35.31%) 153204.00
length=1048639, align1=3, align2=5: 139292.00 ( 14.00%) 93651.20 ( 42.53%) 181196.00 (-12.00%) 99200.00 ( 39.00%) 162948.00
length=2097159, align1=0, align2=0: 305711.00 ( -3.00%) 194030.00 ( 34.00%) 341827.00 (-15.00%) 193635.00 ( 35.00%) 298108.00
length=2097167, align1=0, align2=3: 284683.00 ( 14.00%) 194907.00 ( 41.00%) 369128.00 (-11.00%) 199646.00 ( 40.00%) 332992.00
length=2097183, align1=3, align2=0: 252849.00 ( 20.00%) 175724.00 ( 44.00%) 368446.00 (-17.00%) 199992.00 ( 36.00%) 316792.00
length=2097215, align1=3, align2=5: 284985.00 ( 14.00%) 194299.00 ( 41.00%) 370709.00 (-11.00%) 199978.00 ( 40.00%) 334178.00
length=4194311, align1=0, align2=0: 569834.00 ( 3.00%) 374032.00 ( 36.00%) 678767.00 (-16.00%) 372642.00 ( 36.00%) 588650.00
length=4194319, align1=0, align2=3: 576691.00 ( 20.00%) 378198.00 ( 47.00%) 872069.00 (-21.00%) 479148.00 ( 33.00%) 723708.00
length=4194335, align1=3, align2=0: 512557.00 ( 25.00%) 342394.00 ( 50.00%) 863184.00 (-26.00%) 480661.00 ( 30.00%) 690116.00
length=4194367, align1=3, align2=5: 576482.00 ( 19.00%) 380856.00 ( 46.00%) 860296.00 (-21.00%) 479336.00 ( 32.00%) 712510.00
length=8388615, align1=0, align2=0: 1324960.00 ( -4.74%) 805671.00 ( 36.31%) 1444260.00 (-14.17%) 784482.00 ( 37.99%) 1265040.00
length=8388623, align1=0, align2=3: 1217610.00 ( 18.78%) 771131.00 ( 48.56%) 1894040.00 (-26.35%) 1143710.00 ( 23.71%) 1499080.00
length=8388639, align1=3, align2=0: 1147310.00 ( 28.52%) 856928.00 ( 46.61%) 1974130.00 (-22.99%) 1281390.00 ( 20.17%) 1605130.00
length=8388671, align1=3, align2=5: 1124090.00 ( 27.46%) 846521.00 ( 45.37%) 1928810.00 (-24.47%) 1248630.00 ( 19.43%) 1549660.00
length=16777223, align1=0, align2=0: 4243330.00 ( -8.50%) 2821000.00 ( 27.87%) 3382290.00 ( 13.51%) 3635000.00 ( 7.05%) 3910780.00
length=16777231, align1=0, align2=3: 3503780.00 ( 19.16%) 2214300.00 ( 48.91%) 4848350.00 (-11.86%) 3703890.00 ( 14.55%) 4334450.00
length=16777247, align1=3, align2=0: 2411200.00 ( 44.45%) 1467160.00 ( 66.20%) 5724860.00 (-31.89%) 4081640.00 ( 5.97%) 4340700.00
length=16777279, align1=3, align2=5: 4414480.00 ( 12.29%) 2689160.00 ( 46.57%) 6632000.00 (-31.77%) 3824970.00 ( 24.00%) 5032940.00
length=33554439, align1=0, align2=0: 10333400.00 ( 5.39%) 7028430.00 ( 35.65%) 9881280.00 ( 9.53%) 7024200.00 ( 35.69%) 10921700.00
length=33554447, align1=0, align2=3: 11757800.00 ( 6.63%) 8676040.00 ( 31.10%) 15423100.00 (-22.47%) 8309720.00 ( 34.01%) 12592900.00
length=33554463, align1=3, align2=0: 6959020.00 ( 34.78%) 5700270.00 ( 46.58%) 15931700.00 (-49.32%) 8625250.00 ( 19.16%) 10669700.00
length=33554495, align1=3, align2=5: 11190000.00 ( 12.07%) 8721580.00 ( 31.47%) 15759900.00 (-23.84%) 8557940.00 ( 32.75%) 12726300.00
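For readers checking the table: the percentage in parentheses appears to be the improvement of each variant relative to __memcpy_generic (the last column), so positive means faster than generic and negative means slower. The small Python sketch below reproduces the first row under that reading. The function name `improvement_pct` is made up for illustration and is not part of the glibc benchmark scripts.

```python
# Sketch: reproduce the parenthesised percentages from the first table row,
# assuming they mean (generic - variant) / generic * 100.

def improvement_pct(variant_time, generic_time):
    """Relative improvement over the generic baseline, in percent."""
    return (generic_time - variant_time) / generic_time * 100.0

# Timings copied from the row "length=65543, align1=0, align2=0".
generic = 2104.38
row = {
    "__memcpy_thunderx": 4238.12,
    "__memcpy_thunderx2": 2295.00,
    "__memcpy_falkor": 2156.25,
    "__memcpy_kunpeng": 2301.25,
}

for name, timing in row.items():
    print(f"{name}: {timing:.2f} ({improvement_pct(timing, generic):7.2f}%)")
```

Under this reading, a lower timing is better, and the Kunpeng column can be compared directly with the ThunderX2 column row by row.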
> This has the effect of slowing down all small memmoves and no-overlap memmoves (ie. 99% of calls).
> Is there a reason to special case 96-512? I don't see an obvious difference between the cases, there is
> one extra prefetch but outside the loop. Even if it helps somehow, why not do the test for >512 in the
> move_long code? That removes 4 instructions (3 and a NOP) from the memmove fallthrough path.
+ .p2align 4
+ 1:
+ subs count, count, 64
+ stp A_q, B_q, [dstend, -32]
+ ldp A_q, B_q, [srcend, -32]
+ stp C_q, D_q, [dstend, -64]!
+ ldp C_q, D_q, [srcend, -64]!
+ b.hi 1b
The branch prediction algorithm of the Kunpeng chip requires that the branch instruction and the branch target address not be in the same 32 instruction cacheline. The original loop body (see above) causes mispredictions that degrade performance, especially below 512 bytes. So, for the 96-512 case, we replace the ldp/stp instructions with ldr/str instructions to lengthen the loop body and satisfy the branch prediction algorithm.
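To make the constraint concrete, here is a small Python sketch that checks whether a loop's backward branch and its target fall into the same aligned block. This is only an illustration: the block size is a parameter, the value of 32 bytes used in the example is an assumption (the exact unit is not spelled out above), and the addresses and the 10-instruction loop length are hypothetical. The only hard facts used are that AArch64 instructions are 4 bytes each and that the loop quoted above has 6 instructions.

```python
# Sketch: does a loop's closing branch share an aligned block with its target?
# block_bytes is an assumed parameter, not a documented Kunpeng value.

INSN_BYTES = 4  # every AArch64 instruction is 4 bytes long

def branch_addr(loop_start, n_insns):
    """Address of the last instruction (the backward branch) of the loop."""
    return loop_start + (n_insns - 1) * INSN_BYTES

def same_block(addr_a, addr_b, block_bytes):
    """True if both addresses lie in the same aligned block of block_bytes."""
    return addr_a // block_bytes == addr_b // block_bytes

BLOCK = 32  # assumption, for illustration only

# The 6-instruction ldp/stp loop, placed at a hypothetical 32-byte-aligned address:
start = 0x1000
print(same_block(start, branch_addr(start, 6), BLOCK))   # branch at 0x1014

# A hypothetical longer loop (10 instructions) at the same address:
print(same_block(start, branch_addr(start, 10), BLOCK))  # branch at 0x1024
```

With these assumed numbers the short loop keeps branch and target in one block while the longer loop does not, which is the effect the ldr/str replacement aims for.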
> Btw do you have any plans to post other string functions that you can discuss here? If so, would these
> add more ifuncs or improve the generic versions?
Yes, memcmp, strlen, strnlen, strcpy and memrchr will be included. We will submit the patches and test results as soon as possible.