For copies significantly larger than 32 MB, glibc's memcpy strategy seems to perform worse on Zen 4 than either REP MOVSB or a naive AVX-512 streaming copy.

Steps to reproduce:
1. Compile the microbenchmark at https://github.com/ska-sa/katgpucbf/blob/6176ed2e1f5eccf7f2acc97e4779141ac794cc01/scratch/memcpy_loop.cpp using the adjacent Makefile (or g++ -std=c++17 -Wall -O3 -pthread -o memcpy_loop memcpy_loop.cpp).
2. Run it as ./memcpy_loop -f memcpy -r 5
3. Run it again as ./memcpy_loop -f memcpy_rep_movsb -r 5
4. Run it again as ./memcpy_loop -f memcpy_stream_avx512 -r 5

On the system I'm testing, the first reports 19.2 GB/s, the second (which directly invokes REP MOVSB) reports 27-27.5 GB/s, and the third (a straightforward non-temporal AVX-512 implementation) reports 27.8 GB/s. This is for a 128 MiB copy (other sizes can be passed to the benchmark with -b).

Interestingly, I don't see this regression on a similarly configured Zen 3 system, where memcpy and memcpy_rep_movsb have roughly the same performance on large copies. This is in spite of the comment at https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/dl-cacheinfo.h;h=87486054f931e52f53123c672217f1903297ec76;hb=HEAD#l1031 claiming that Zen 3's REP MOVSB performs poorly on large copies.

System information: Epyc 9374F processor, Ubuntu 22.04, glibc compiled from git glibc-2.38.9000-185-g2aa0974d25.
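For reference, the two alternative strategies being compared are roughly the following (a simplified sketch, not the exact code in memcpy_loop.cpp; the function names here are only illustrative, and the AVX-512 variant needs -mavx512f):

#include <immintrin.h>
#include <cstddef>

// Copy n bytes with a single REP MOVSB (RSI = src, RDI = dst, RCX = count).
static void rep_movsb_copy(void *dst, const void *src, std::size_t n)
{
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(n) : : "memory");
}

// Naive non-temporal AVX-512 copy; assumes both pointers are 64-byte aligned
// and n is a multiple of 64.
static void stream_avx512_copy(void *dst, const void *src, std::size_t n)
{
    auto *d = static_cast<char *>(dst);
    const auto *s = static_cast<const char *>(src);
    for (std::size_t i = 0; i < n; i += 64)
    {
        __m512i v = _mm512_load_si512(s + i);                        // aligned load
        _mm512_stream_si512(reinterpret_cast<__m512i *>(d + i), v);  // non-temporal store
    }
    _mm_sfence();  // order the streaming stores before the buffer is reused
}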
I've only got access to the Zen 4 systems until the end of the week, so if there are any diagnostics that would be useful to capture, let me know ASAP. There is some further information attached to #30994.
This is likely a trade-off between whole-system performance and single-thread/single-process performance. To avoid evicting useful data from the higher-level shared caches, it is usually beneficial to switch from REP MOVSB to non-temporal stores at a certain point, even though it impacts single-thread performance.
> To avoid evicting useful data from the higher-level shared caches, it is usually beneficial to switch from REP MOVSB to non-temporal stores at a certain point, even though it impacts single-thread performance.

From what I can tell, REP MOVSB on Zen 4 already does this for large copies. I base that on the DRAM bandwidth counters, read with AMD uProf while running the copy benchmark. When copying 1 GB with a single REP MOVSB, the read and write counters match the rate of data transfer (no read-for-ownership overhead). When breaking the copy into smaller pieces (less than 32 MB), the read counter is roughly double the transfer rate due to read-for-ownership.

I've also tried running the benchmark on all 32 cores of the CPU; in that case glibc's memcpy is about 5% faster than using REP MOVSB (and my simple AVX-512 streaming copy with a linear access pattern gets pretty much the same performance as REP MOVSB). So you're correct that there is a trade-off, but being 5% faster when bandwidth-limited yet 30% slower on a single core (as well as taking more icache space) doesn't seem like a great trade-off. I appreciate that writing a memcpy that works well across a wide range of hardware is no easy task, though.
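To be concrete about what I mean by "breaking the copy into smaller pieces", the chunked case was along these lines (an illustrative sketch only, not the actual benchmark code; chunked_memcpy is just a name for this example):

#include <algorithm>
#include <cstddef>
#include <cstring>

// Issue one large copy as a series of sub-threshold memcpy calls, so each
// call stays below glibc's non-temporal threshold and takes the regular
// (read-for-ownership) store path.
static void chunked_memcpy(char *dst, const char *src, std::size_t n, std::size_t chunk)
{
    for (std::size_t off = 0; off < n; off += chunk)
        std::memcpy(dst + off, src + off, std::min(chunk, n - off));
}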
On Zen 3 I can confirm that REP MOVSB is not faster than the vectorized path, but with an unaligned destination the results are also subpar:

# Default non-temporal stores
$ ./memcpy_loop -f memcpy -D 1
4.19552

# GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=134217730
$ ./memcpy_loop -f memcpy -D 1
11.7379

# Modified glibc with tunables to force REP MOVSB
$ ./memcpy_loop -f memcpy -D 1
1.01945

With aligned stores I see ~20 GB/s on Zen 3. I am even more convinced that REP MOVSB is not really a good strategy for Zen 3. I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen 3. Another possibility would be to avoid unaligned stores, but that would require adding another code path that might not be optimal for all x86 CPUs.
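The extra code path I have in mind would look roughly like this (a sketch to illustrate the idea, not a proposed patch): peel a short head so the non-temporal stores themselves are 64-byte aligned, use unaligned loads for the source, and finish the tail with a regular copy.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: keep the non-temporal stores 64-byte aligned even when the
// destination pointer is not, by copying a short head and tail normally.
static void stream_copy_align_dst(char *dst, const char *src, std::size_t n)
{
    std::size_t head = (64 - reinterpret_cast<std::uintptr_t>(dst) % 64) % 64;
    if (head > n)
        head = n;
    std::memcpy(dst, src, head);  // bring dst up to 64-byte alignment
    dst += head;
    src += head;
    n -= head;

    std::size_t body = n & ~static_cast<std::size_t>(63);  // whole 64-byte blocks
    for (std::size_t i = 0; i < body; i += 64)
    {
        __m512i v = _mm512_loadu_si512(src + i);                       // src may be misaligned
        _mm512_stream_si512(reinterpret_cast<__m512i *>(dst + i), v);  // aligned NT store
    }
    _mm_sfence();
    std::memcpy(dst + body, src + body, n - body);  // remaining tail
}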
As a side note, the benchmark you are referring to does not have some of the options you are using (-r, memcpy_rep_movsb).
> As a side note, the benchmark you are referring to does not have some of the options you are using (-r, memcpy_rep_movsb).

Oh, I linked to a fixed commit; those features have since been added to main: https://github.com/ska-sa/katgpucbf/blob/main/scratch/memcpy_loop.cpp
> I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen 3.

As noted in comment 2, non-temporal stores for large copies have benefits that won't show up in a single-threaded microbenchmark: both less pollution of the shared cache and 1/3 less DRAM bandwidth (eliminating read-for-ownership).
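To spell out the 1/3 figure: copying N bytes with regular stores moves roughly 3N bytes through DRAM (read the source, read-for-ownership of the destination, write the destination), while non-temporal stores move roughly 2N (read the source, write the destination).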
(In reply to Bruce Merry from comment #7)
> > I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen 3.
>
> As noted in comment 2, non-temporal stores for large copies have benefits
> that won't show up in a single-threaded microbenchmark: both less pollution
> of the shared cache and 1/3 less DRAM bandwidth (eliminating
> read-for-ownership).

Indeed, after some tests, the performance difference with multiple issuers does seem to show the advantage of non-temporal stores even for unaligned arguments.