Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor


> Well it looks like the dst_unaligned code (which deals with a specific issue on ThunderX2) is completely
> unnecessary on Kunpeng since the unaligned cases in e.g. Falkor and generic aren't slower than the
> aligned cases. So I'd suggest removing this code - it adds a lot of code, thus making memcpy
> unnecessarily large.

Yes, thanks for the reminder; we will remove the dst_unaligned code in the next patch. Note that we have already tested a version with the dst_unaligned code removed and the results are the same as before removing it.

> Well these results show a very significant 4% win for Falkor memcpy! It seems strange to only optimize
> for large sizes when the vast majority of copies in real code are very small (note the distribution of the
> sizes and alignment for the random benchmark come from SPEC).

Sure, we agree the Falkor memcpy has a 4% win at small sizes. However, when we started optimizing memcpy for Kunpeng, one of the most important cases was a database workload, which really needs more improvement at large sizes.

So we used bench-memcpy-large to help us choose the baseline, and the results suggest that the ThunderX2 variant is the better starting point in the Kunpeng environment (see the large-size benchset below). It really confuses us why Falkor behaves so differently between bench-memcpy-walk and bench-memcpy-large.

We also know that small/medium sizes are a very important case for us, so we do more optimization for medium sizes in the L(copy2048_large) branch, and in our opinion the remaining gap at small sizes is acceptable; it is at least better than generic.

Function: memcpy
Variant: large
                                    __memcpy_thunderx	__memcpy_thunderx2	__memcpy_falkor	__memcpy_kunpeng	__memcpy_generic
========================================================================================================================
    length=65543, align1=0, align2=0:      4238.12 (-101.40%)	     2295.00 ( -9.06%)	     2156.25 ( -2.46%)	     2301.25 ( -9.36%)	     2104.38	
    length=65551, align1=0, align2=3:      3101.88 ( -2.97%)	     2283.75 ( 24.19%)	     3562.50 (-18.26%)	     3332.50 (-10.62%)	     3012.50	
    length=65567, align1=3, align2=0:      2899.38 ( 18.38%)	     2285.62 ( 35.66%)	     3800.00 ( -6.97%)	     3333.75 (  6.16%)	     3552.50	
    length=65599, align1=3, align2=5:      2866.88 (  2.09%)	     2258.75 ( 22.86%)	     3333.75 (-13.85%)	     3316.25 (-13.26%)	     2928.12	
   length=131079, align1=0, align2=0:      6905.00 (-43.54%)	     5153.75 ( -7.13%)	     5848.75 (-21.58%)	     5185.62 ( -7.80%)	     4810.62	
   length=131087, align1=0, align2=3:      7232.50 ( -1.16%)	     5138.75 ( 28.12%)	     7966.88 (-11.43%)	     7117.50 (  0.45%)	     7149.38	
   length=131103, align1=3, align2=0:      6889.38 ( -0.69%)	     5145.00 ( 24.81%)	     7562.50 (-10.52%)	     7094.38 ( -3.68%)	     6842.50	
   length=131135, align1=3, align2=5:      7096.88 (  0.38%)	     5116.88 ( 28.17%)	     7501.88 ( -5.31%)	     6989.38 (  1.89%)	     7123.75	
   length=262151, align1=0, align2=0:     15736.90 (-37.76%)	    11581.90 ( -1.39%)	    13088.10 (-14.58%)	    11993.80 ( -5.00%)	    11423.10	
   length=262159, align1=0, align2=3:     16581.20 ( -1.58%)	    11655.00 ( 28.60%)	    24495.60 (-50.07%)	    16983.10 ( -4.05%)	    16322.50	
   length=262175, align1=3, align2=0:     14940.60 ( -1.26%)	    11769.40 ( 20.23%)	    22577.50 (-53.02%)	    16925.00 (-15.00%)	    14755.00	
   length=262207, align1=3, align2=5:     16167.50 (  0.64%)	    11700.60 ( 28.09%)	    23373.10 (-43.64%)	    16959.40 ( -4.23%)	    16271.90	
   length=524295, align1=0, align2=0:     59790.60 (-11.75%)	    37941.90 ( 29.09%)	    55081.90 ( -2.95%)	    38127.50 ( 28.74%)	    53505.00	
   length=524303, align1=0, align2=3:     58881.90 ( 11.26%)	    38826.90 ( 41.48%)	    82475.60 (-24.30%)	    45186.20 ( 31.90%)	    66351.20	
   length=524319, align1=3, align2=0:     52656.90 ( 13.39%)	    36772.50 ( 39.52%)	    79804.40 (-31.26%)	    44925.60 ( 26.11%)	    60800.60	
   length=524351, align1=3, align2=5:     60072.50 (  9.98%)	    38996.90 ( 41.56%)	    80605.60 (-20.79%)	    44833.80 ( 32.82%)	    66731.90	
  length=1048583, align1=0, align2=0:    140744.00 (  0.00%)	    92834.40 ( 34.15%)	   156891.00 (-12.00%)	    92700.60 ( 34.25%)	   140981.00	
  length=1048591, align1=0, align2=3:    138818.00 ( 14.00%)	    93137.50 ( 42.85%)	   181426.00 (-12.00%)	    99160.00 ( 39.00%)	   162969.00	
  length=1048607, align1=3, align2=0:    123231.00 ( 19.00%)	    85615.00 ( 44.00%)	   180895.00 (-19.00%)	    99104.40 ( 35.31%)	   153204.00	
  length=1048639, align1=3, align2=5:    139292.00 ( 14.00%)	    93651.20 ( 42.53%)	   181196.00 (-12.00%)	    99200.00 ( 39.00%)	   162948.00	
  length=2097159, align1=0, align2=0:    305711.00 ( -3.00%)	   194030.00 ( 34.00%)	   341827.00 (-15.00%)	   193635.00 ( 35.00%)	   298108.00	
  length=2097167, align1=0, align2=3:    284683.00 ( 14.00%)	   194907.00 ( 41.00%)	   369128.00 (-11.00%)	   199646.00 ( 40.00%)	   332992.00	
  length=2097183, align1=3, align2=0:    252849.00 ( 20.00%)	   175724.00 ( 44.00%)	   368446.00 (-17.00%)	   199992.00 ( 36.00%)	   316792.00	
  length=2097215, align1=3, align2=5:    284985.00 ( 14.00%)	   194299.00 ( 41.00%)	   370709.00 (-11.00%)	   199978.00 ( 40.00%)	   334178.00	
  length=4194311, align1=0, align2=0:    569834.00 (  3.00%)	   374032.00 ( 36.00%)	   678767.00 (-16.00%)	   372642.00 ( 36.00%)	   588650.00	
  length=4194319, align1=0, align2=3:    576691.00 ( 20.00%)	   378198.00 ( 47.00%)	   872069.00 (-21.00%)	   479148.00 ( 33.00%)	   723708.00	
  length=4194335, align1=3, align2=0:    512557.00 ( 25.00%)	   342394.00 ( 50.00%)	   863184.00 (-26.00%)	   480661.00 ( 30.00%)	   690116.00	
  length=4194367, align1=3, align2=5:    576482.00 ( 19.00%)	   380856.00 ( 46.00%)	   860296.00 (-21.00%)	   479336.00 ( 32.00%)	   712510.00	
  length=8388615, align1=0, align2=0:   1324960.00 ( -4.74%)	   805671.00 ( 36.31%)	  1444260.00 (-14.17%)	   784482.00 ( 37.99%)	  1265040.00	
  length=8388623, align1=0, align2=3:   1217610.00 ( 18.78%)	   771131.00 ( 48.56%)	  1894040.00 (-26.35%)	  1143710.00 ( 23.71%)	  1499080.00	
  length=8388639, align1=3, align2=0:   1147310.00 ( 28.52%)	   856928.00 ( 46.61%)	  1974130.00 (-22.99%)	  1281390.00 ( 20.17%)	  1605130.00	
  length=8388671, align1=3, align2=5:   1124090.00 ( 27.46%)	   846521.00 ( 45.37%)	  1928810.00 (-24.47%)	  1248630.00 ( 19.43%)	  1549660.00	
 length=16777223, align1=0, align2=0:   4243330.00 ( -8.50%)	  2821000.00 ( 27.87%)	  3382290.00 ( 13.51%)	  3635000.00 (  7.05%)	  3910780.00	
 length=16777231, align1=0, align2=3:   3503780.00 ( 19.16%)	  2214300.00 ( 48.91%)	  4848350.00 (-11.86%)	  3703890.00 ( 14.55%)	  4334450.00	
 length=16777247, align1=3, align2=0:   2411200.00 ( 44.45%)	  1467160.00 ( 66.20%)	  5724860.00 (-31.89%)	  4081640.00 (  5.97%)	  4340700.00	
 length=16777279, align1=3, align2=5:   4414480.00 ( 12.29%)	  2689160.00 ( 46.57%)	  6632000.00 (-31.77%)	  3824970.00 ( 24.00%)	  5032940.00	
 length=33554439, align1=0, align2=0:  10333400.00 (  5.39%)	  7028430.00 ( 35.65%)	  9881280.00 (  9.53%)	  7024200.00 ( 35.69%)	 10921700.00	
 length=33554447, align1=0, align2=3:  11757800.00 (  6.63%)	  8676040.00 ( 31.10%)	 15423100.00 (-22.47%)	  8309720.00 ( 34.01%)	 12592900.00	
 length=33554463, align1=3, align2=0:   6959020.00 ( 34.78%)	  5700270.00 ( 46.58%)	 15931700.00 (-49.32%)	  8625250.00 ( 19.16%)	 10669700.00	
 length=33554495, align1=3, align2=5:  11190000.00 ( 12.07%)	  8721580.00 ( 31.47%)	 15759900.00 (-23.84%)	  8557940.00 ( 32.75%)	 12726300.00	

> This has the effect of slowing down all small memmoves and no-overlap memmoves (ie. 99% of calls).
> Is there a reason to special case 96-512? I don't see an obvious difference between the cases, there is
> one extra prefetch but outside the loop. Even if it helps somehow, why not do the test for >512 in the
> move_long code? That removes 4 instructions (3 and a NOP) from the memmove fallthrough path.

+	.p2align 4
+1:
+	subs	count, count, 64
+	stp	A_q, B_q, [dstend, -32]
+	ldp	A_q, B_q, [srcend, -32]
+	stp	C_q, D_q, [dstend, -64]!
+	ldp	C_q, D_q, [srcend, -64]!
+	b.hi	1b

The branch prediction algorithm of the Kunpeng chip requires that the conditional branch and its target not fall within the same 32-byte instruction cache line. The original loop body (see above) causes prediction failures and hence performance degradation, especially below 512 bytes. So for the 96-512 byte case, ldr/str instructions replace the ldp/stp instructions to lengthen the loop body enough to satisfy the branch predictor.
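
For illustration only (this is a sketch, not the code from the actual patch; register names follow the quoted loop above, and the real code may order the accesses differently), splitting each ldp/stp into two ldr/str could look like this. The body grows from 6 instructions (24 bytes) to 10 instructions (40 bytes), so the loop entry and the backwards branch can no longer sit in the same 32-byte line:

	/* Illustrative sketch: same 64 bytes copied backwards per
	   iteration, but with single-register ldr/str so the loop
	   body is longer.  */
	.p2align 4
1:
	subs	count, count, 64
	str	B_q, [dstend, -16]
	str	A_q, [dstend, -32]
	ldr	B_q, [srcend, -16]
	ldr	A_q, [srcend, -32]
	str	D_q, [dstend, -48]
	str	C_q, [dstend, -64]!
	ldr	D_q, [srcend, -48]
	ldr	C_q, [srcend, -64]!
	b.hi	1b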

> Btw do you have any plans to post other string functions that you can discuss here? If so, would these
> add more ifuncs or improve the generic versions?

Yes, memcmp, strlen, strnlen, strcpy and memrchr will be included; we will submit the patches and test results as soon as possible.

