This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor


Hi Derek,

> We did base our version on ThunderX2 because of its good performance for large copies, which was our primary need.

But it's not clear that this is a win over the Falkor version according to the memcpy-walk output. Consider eg.

                                    __memcpy_thunderx	__memcpy_thunderx2	__memcpy_falkor	__memcpy_kunpeng	__memcpy_generic
     length=32768:      7629.73 (-46.76%)	     4473.26 ( 13.95%)	     3947.03 ( 24.08%)	     5201.79 ( -0.06%)	     5198.66
     length=32784:      7666.55 (-44.44%)	     4705.93 ( 11.34%)	     3782.04 ( 28.75%)	     4759.99 ( 10.32%)	     5307.92
     length=32769:      7589.72 (-42.41%)	     4776.54 ( 10.37%)	     3889.61 ( 27.02%)	     4789.02 ( 10.14%)	     5329.43
     length=32783:      7502.45 (-44.25%)	     4688.97 (  9.85%)	     3858.48 ( 25.81%)	     4714.48 (  9.36%)	     5201.10
     length=32770:      7438.46 (-43.39%)	     4647.77 ( 10.41%)	     3848.94 ( 25.81%)	     4680.76 (  9.77%)	     5187.71
     length=32782:      7225.10 (-38.35%)	     4609.10 ( 11.74%)	     3855.37 ( 26.17%)	     4643.70 ( 11.08%)	     5222.19
     length=32771:      7326.40 (-42.87%)	     4587.85 ( 10.53%)	     3828.78 ( 25.33%)	     4580.34 ( 10.68%)	     5127.85
     length=32781:      7261.12 (-41.38%)	     4548.17 ( 11.44%)	     3851.30 ( 25.01%)	     4584.78 ( 10.73%)	     5135.97
     length=32772:      7178.11 (-41.83%)	     4510.12 ( 10.89%)	     3802.44 ( 24.87%)	     4521.19 ( 10.67%)	     5061.20
     length=32780:      7186.99 (-42.34%)	     4481.01 ( 11.25%)	     3835.00 ( 24.00%)	     4532.33 ( 10.23%)	     5049.00
     length=32773:      7089.60 (-38.93%)	     4482.70 ( 12.15%)	     3830.79 ( 24.93%)	     4487.18 ( 12.07%)	     5102.88
     length=32779:      7076.27 (-25.65%)	     4498.21 ( 20.13%)	     3881.99 ( 31.07%)	     5371.42 (  4.62%)	     5631.73
     length=32774:      8362.27 (-48.81%)	     5190.76 (  7.63%)	     3845.61 ( 31.57%)	     5176.08 (  7.89%)	     5619.46
     length=32778:      8186.09 (-48.41%)	     5109.29 (  7.37%)	     3861.96 ( 29.98%)	     5211.88 (  5.51%)	     5515.68
     length=32775:      8186.51 (-49.43%)	     5096.24 (  6.98%)	     3846.12 ( 29.80%)	     5128.88 (  6.38%)	     5478.44
     length=32777:      8038.38 (-49.89%)	     5001.09 (  6.75%)	     3837.79 ( 28.44%)	     5095.39 (  4.99%)	     5362.88

Here the Falkor variant is 20-30% faster...

It doesn't help that the existing benchmarks don't report an average across all the inputs for each ifunc...
A colleague is working on a script to visualise benchmark results as graphs, which should make these
comparisons much easier.
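For what it's worth, a per-ifunc summary is straightforward to compute from the raw timings. A minimal C sketch (the helper names and the flat timing array are mine, not part of the benchmark harness; the percentage formula matches the deltas printed in the tables above):

```c
#include <stddef.h>

/* Mean of the raw timings for one ifunc across all measured lengths.  */
static double
mean (const double *timings, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += timings[i];
  return sum / (double) n;
}

/* Relative delta versus the generic timing, as printed in the tables
   above: positive means faster than generic.  */
static double
rel_improvement (double variant, double generic)
{
  return (generic - variant) / generic * 100.0;
}
```

For example, for length=32768 above, rel_improvement (3947.03, 5198.66) reproduces the 24.08% shown for __memcpy_falkor.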

> And we did find a defect in the ThunderX2 version for 96-byte to 2MB copies; at least when running on the
> Kunpeng arch, even the falkor version is not much better.

Well, it looks like the dst_unaligned code (which deals with a specific issue on ThunderX2) is completely
unnecessary on Kunpeng, since the unaligned cases in eg. the Falkor and generic versions aren't slower than
the aligned cases. So I'd suggest removing this code - it adds a lot of code, making memcpy
unnecessarily large.

> Therefore, a branch was written and we used the generic copy - the 64-byte loop, dst aligned, without prefetch -
> and it works. We also tried simply replacing the X registers with Q registers in this branch, but it made no difference.

Yes, using Q-register copies is best on modern micro-architectures (hence the idea to do this even in
the generic version).
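To illustrate the idea in C (a hypothetical helper, not the patch code): copying in 16-byte chunks lets the compiler emit Q-register loads and stores on AArch64, where an 8-byte loop would stay in X registers.

```c
#include <string.h>

/* Copy COUNT bytes (COUNT >= 16, non-overlapping buffers) in 16-byte
   chunks; on AArch64 the compiler can turn each fixed-size memcpy into
   a single Q-register load/store pair.  The tail is handled with one
   final, possibly overlapping, 16-byte copy - a common trick in
   optimized memcpy implementations to avoid a byte-by-byte loop.  */
static void
copy_q_chunks (unsigned char *dst, const unsigned char *src, size_t count)
{
  unsigned char *dend = dst + count - 16;
  const unsigned char *send = src + count - 16;

  while (dst < dend)
    {
      memcpy (dst, src, 16);
      dst += 16;
      src += 16;
    }
  memcpy (dend, send, 16);  /* last 16 bytes, may overlap the loop's copies */
}
```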

> And here is the result of memcpy-random benchmarks :

                                    __memcpy_thunderx   __memcpy_thunderx2      __memcpy_falkor __memcpy_kunpeng        __memcpy_generic   
   max-size=4096:     32558.90 ( -2.08%)            31987.80 ( -0.29%)       30474.30 (  4.46%)       31666.30 (  0.72%)       31896.60    
   max-size=8192:     31796.80 ( -1.18%)            31423.90 (  0.01%)       29974.40 (  4.62%)       30917.90 (  1.62%)       31427.40    
  max-size=16384:     33122.80 ( -1.05%)            32058.30 (  2.20%)       30470.40 (  7.05%)       31727.90 (  3.21%)       32779.90    
  max-size=32768:     32530.10 ( -1.22%)            31912.80 (  0.71%)       29960.80 (  6.78%)       31567.60 (  1.78%)       32139.40    
  max-size=65536:     33373.60 ( -0.40%)            32476.30 (  2.30%)       30957.70 (  6.87%)       32137.00 (  3.32%)       33240.10

Well, these results show a very significant 4% win for the Falkor memcpy! It seems strange to only optimize
for large sizes when the vast majority of copies in real code are very small (note that the distribution of
sizes and alignments for the random benchmark comes from SPEC).

+ENTRY_ALIGN (MEMMOVE, 6)
...
+	sub	tmp1, dstin, src
+	cmp	count, 512
+	ccmp	tmp1, count, 2, hi
+	b.lo	L(move_long)
+	cmp	count, 96
+	ccmp	tmp1, count, 2, hi
+	b.lo	L(move_middle)

This has the effect of slowing down all small memmoves and no-overlap memmoves (ie. 99% of calls).
Is there a reason to special-case 96-512 bytes? I don't see an obvious difference between the cases; there is
one extra prefetch, but outside the loop. Even if it helps somehow, why not do the test for >512 in the
move_long code? That removes 4 instructions (3 plus a NOP) from the memmove fall-through path.
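For reference, the sub/ccmp sequence above is the standard single-comparison overlap test; in C (hypothetical helper name, mine):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The sub/ccmp idiom in C: if (size_t) (dst - src) >= count, the
   destination does not overlap the source region [src, src + count)
   from above, so a plain forward copy is safe.  When dst < src the
   unsigned subtraction wraps to a huge value, so that case passes the
   test too - one compare covers both the no-overlap and the
   dst-below-src cases.  */
static bool
forward_copy_safe (const void *dst, const void *src, size_t count)
{
  return (size_t) ((uintptr_t) dst - (uintptr_t) src) >= count;
}
```

This is why hoisting the >512 test into move_long costs nothing for the common no-overlap case: the overlap check itself is just one subtract and compare.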

Btw, do you have any plans to post other string functions that we could discuss here? If so, would these
add more ifuncs or improve the generic versions?

Cheers,
Wilco


