Bug 30994 - REP MOVSB performance suffers from page aliasing on Zen 4
Summary: REP MOVSB performance suffers from page aliasing on Zen 4
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: string
Version: 2.38
Importance: P2 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-24 06:18 UTC by Bruce Merry
Modified: 2024-04-04 10:36 UTC
CC List: 12 users

See Also:
Host: x86_64-linux-gnu
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Glibc's memcpy benchmark results (7.68 KB, application/gzip), 2023-10-24 06:19 UTC, Bruce Merry
Output of ld-linux.so.2 --list-tunables (520 bytes, text/plain), 2023-10-24 06:20 UTC, Bruce Merry
Output of ld-linux.so.2 --list-diagnostics (1.78 KB, text/plain), 2023-10-24 06:21 UTC, Bruce Merry

Description Bruce Merry 2023-10-24 06:18:38 UTC
When (dst-src)&0xFFF is small (but non-zero), the REP MOVSB path in memcpy performs extremely poorly (as much as 25x slower than the alternative path). I'm observing this on Zen 4 (Epyc 9374F). I'm running Ubuntu 22.04 with a glibc hand-built from glibc-2.38.9000-185-g2aa0974d25.

To reproduce:
1. Download the microbench at https://github.com/ska-sa/katgpucbf/blob/6176ed2e1f5eccf7f2acc97e4779141ac794cc01/scratch/memcpy_loop.cpp
2. Compile it with the adjacent Makefile (tl;dr: g++ -std=c++17 -O3 -pthread -o memcpy_loop memcpy_loop.cpp)
3. Run ./memcpy_loop -t mmap -f memcpy -b 8192 -p 100000 -D 1 -r 5
4. Run GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=10000 ./memcpy_loop -t mmap -f memcpy -b 8192 -p 100000 -D 1 -r 5

Step 3 reports a rate of 4.2 GB/s, while step 4 (which disables the rep_movsb path) reports a rate of 111 GB/s. The test uses 8192-byte memory copies, where the source is page-aligned and the destination starts 1 byte into a page.
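
For reference, a minimal standalone sketch of that layout (this is not memcpy_loop.cpp itself, just an illustration of the aliasing condition): both mappings are page-aligned and the destination is offset by 1 byte, so (dst - src) & 0xFFF == 1.

#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    const size_t page = 4096;
    const size_t len  = 8192;   /* copy size used in the report */

    /* Two independent anonymous mappings; both are page-aligned. */
    char *src = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *dstbase = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED || dstbase == MAP_FAILED)
        return 1;

    char *dst = dstbase + 1;    /* destination starts 1 byte into a page */
    printf("(dst - src) & 0xFFF = %#lx\n",
           (unsigned long)(((uintptr_t)dst - (uintptr_t)src) & 0xFFF));

    /* Timing this loop (e.g. with perf) shows the slow REP MOVSB path;
       raising glibc.cpu.x86_rep_movsb_threshold avoids it. */
    for (int i = 0; i < 100000; i++)
        memcpy(dst, src, len);
    return 0;
}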

I'll also attach bench-memcpy-large.out (from glibc's memcpy benchmark), which shows similar results.

I've previously filed this as an Ubuntu bug (https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515) but it doesn't seem to have received much attention.
Comment 1 Bruce Merry 2023-10-24 06:19:48 UTC
Created attachment 15193 [details]
Glibc's memcpy benchmark results
Comment 2 Bruce Merry 2023-10-24 06:20:33 UTC
Created attachment 15194 [details]
Output of ld-linux.so.2 --list-tunables
Comment 3 Bruce Merry 2023-10-24 06:21:12 UTC
Created attachment 15195 [details]
Output of ld-linux.so.2 --list-diagnostics
Comment 4 Bruce Merry 2023-10-24 06:32:39 UTC
This issue also affects Zen 3. Zen 2 doesn't advertise ERMS so memcpy isn't affected.
Comment 5 Bruce Merry 2023-10-25 13:37:58 UTC
FWIW, backwards REP MOVSB (std; rep movsb; cld) is still horribly slow on Zen 4 (4 GB/s even when the data is nicely aligned and cached).
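
To make that concrete, here is a rough sketch of what I mean by a backwards copy (GCC extended asm, purely illustrative and not taken from glibc):

#include <stdio.h>
#include <string.h>

/* Copy n bytes back-to-front with the direction flag set. */
static void copy_backward_rep_movsb(void *dst, const void *src, size_t n)
{
    if (n == 0)
        return;
    /* With DF set, RDI/RSI must point at the last byte and count down. */
    unsigned char *d = (unsigned char *)dst + n - 1;
    const unsigned char *s = (const unsigned char *)src + n - 1;
    __asm__ volatile ("std\n\t"
                      "rep movsb\n\t"
                      "cld"
                      : "+D" (d), "+S" (s), "+c" (n)
                      :
                      : "memory", "cc");
}

int main(void)
{
    char src[16] = "backward copy!", dst[16] = { 0 };
    copy_backward_rep_movsb(dst, src, sizeof src);
    printf("%s\n", dst);
    return 0;
}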
Comment 6 Adhemerval Zanella 2023-10-27 12:39:09 UTC
I have access to a Zen3 machine (5900X) and I can confirm that using REP MOVSB seems to be always worse than vector instructions.  ERMS is used for sizes between 2112 (rep_movsb_threshold) and 524288 (rep_movsb_stop_threshold, the L2 size for Zen3), and the '-S 0 -D 1' performance really seems to be a microcode issue, since I don't see a similar performance difference with other alignments.
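
As a rough model of the dispatch being described (a simplified sketch, not the actual glibc selection code; the thresholds are just the values observed above):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* On a real system these come from the glibc.cpu.x86_rep_movsb_threshold
   and glibc.cpu.x86_rep_movsb_stop_threshold tunables. */
static const size_t rep_movsb_threshold      = 2112;
static const size_t rep_movsb_stop_threshold = 524288;  /* ~L2 size on Zen3 */

/* Approximation: ERMS (REP MOVSB) only inside this window, vectorized
   copies below it, non-temporal stores above it. */
static bool memcpy_uses_erms(size_t n)
{
    return n > rep_movsb_threshold && n < rep_movsb_stop_threshold;
}

int main(void)
{
    printf("2113: %d  524287: %d  524288: %d\n",
           memcpy_uses_erms(2113), memcpy_uses_erms(524287),
           memcpy_uses_erms(524288));
    return 0;
}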

On Zen3 with REP MOVSB I see:

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
84.2448 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
506.099 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
990.845 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
57.1122 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
325.409 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
510.87 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
4.43104 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.4551 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
40.4088 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
4.34671 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.0829 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`


While with vectorized instructions I see:


$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
124.183 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
773.696 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
1413.02 GB/s


$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
58.3212 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
322.583 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
506.116 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
121.872 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
717.717 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
1318.17 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
58.5352 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
325.996 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`
498.552 GB/s

So it seems there is no gain in using REP MOVSB on Zen3/Zen4, especially at the sizes where it was supposed to be better. glibc 2.34 added a fix from AMD (6e02b3e9327b7dbb063958d2b124b64fcb4bbe3f), where the assumption is that ERMS performs poorly on data above the L2 cache size, so REP MOVSB is limited to the L2 cache size (from 2113 to 524287); but I think the AMD engineers did not really evaluate whether ERMS is indeed better than the vectorized instructions.

And I think BZ#30995 is the same issue, since __memcpy_avx512_unaligned_erms uses the same tunable to decide whether to use ERMS. I have created a patch that just disables ERMS usage on AMD cores [1]; can you check if it improves performance on Zen4 as well?

Also, I have noticed that memset shows subpar performance with ERMS as well, and I have also disabled it on my branch.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/bz30944-memcpy-zen
Comment 7 Bruce Merry 2023-10-27 13:04:12 UTC
Here's what I get on the Zen 4 system with the same parameters. I haven't had a chance to look at what it all means:

+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
80.6649 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
954.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1883.1 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
48.7753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
570.385 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
676.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
3.54696 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
42.5706 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
85.0753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
3.50689 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
41.5237 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
81.8951 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
102.05 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
1206.81 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2415.47 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
49.4859 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
583.279 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1066.54 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
97.1753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
991.128 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2257.42 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
49.3362 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
571.026 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1075.03 GB/s
Comment 8 Bruce Merry 2023-10-27 13:16:01 UTC
Ah, it looks like the GLIBC_TUNABLES environment variable didn't appear in the output. Let me try again:

+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
80.6649 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
954.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1883.1 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
48.7753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
570.385 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
676.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
3.54696 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
42.5706 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
85.0753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
3.50689 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
41.5237 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
81.8951 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
102.05 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
1206.81 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2415.47 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
49.4859 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
583.279 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1066.54 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
97.1753 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
991.128 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2257.42 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
49.3362 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
571.026 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1075.03 GB/s
Comment 9 Bruce Merry 2023-10-30 08:21:16 UTC
So in those cases REP MOVSB seems to be a slowdown, but there also seem to be cases where REP MOVSB is much faster (this is on Zen 4), e.g.

$ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
94.5295 GB/s
94.3382 GB/s
94.474 GB/s
94.2385 GB/s
94.5105 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
56.5062 GB/s
55.3669 GB/s
56.4723 GB/s
55.857 GB/s
56.5396 GB/s

When not using huge pages, the vectorised memcpy hits 115.5 GB/s. I'm seeing a lot of cases on Zen 4 where huge pages actually make things worse; maybe it's related to hardware prefetch reading past the end of the buffer?
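
For reference, the mmap_huge mode is meant to use huge pages; a minimal sketch of one way to get such a mapping (this may differ from what memcpy_loop.cpp actually does, and the helper name is just for illustration):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* MAP_HUGETLB needs hugepages reserved on the system (e.g. via
   /proc/sys/vm/nr_hugepages); fall back to ordinary 4 KiB pages if the
   huge-page mapping fails. */
static void *alloc_buffer(size_t size, int huge)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    if (huge)
        flags |= MAP_HUGETLB;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED && huge)
        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}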
Comment 10 Adhemerval Zanella 2023-10-30 13:30:58 UTC
On Zen3 I am not seeing such slowdown using vectorized instructions.  With a patched glibc that disables REP MOVSB I see:

$ ./testrun.sh ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000
146.593 GB/s

# Force REP MOVSB
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_stop_threshold=4097 ./testrun.sh ./memcpy_loop  -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000
116.298 GB/s

And I don't see difference between mmap and mmap_huge.
Comment 11 Bruce Merry 2023-10-30 14:21:56 UTC
> On Zen3 I am not seeing such slowdown using vectorized instructions.

Agreed, I'm also not seeing this huge-page slowdown on our Zen 3 servers (this is with Ubuntu 22.04's glibc 2.32; I haven't got a hand-built glibc handy on that server):

$ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
90.065 GB/s
89.9096 GB/s
89.9131 GB/s
89.8207 GB/s
89.952 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
116.997 GB/s
116.874 GB/s
116.937 GB/s
117.029 GB/s
117.007 GB/s

On the other hand, there seem to be other cases where REP MOVSB is faster on Zen 3:

$ ./memcpy_loop -D 512 -f memcpy_rep_movsb -r 5 -t mmap 0
Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
Using function memcpy_rep_movsb
22.045 GB/s
22.3135 GB/s
22.1144 GB/s
22.8571 GB/s
22.2688 GB/s

$ ./memcpy_loop -D 512 -f memcpy -r 5 -t mmap 0
Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
Using function memcpy
7.66155 GB/s
7.71314 GB/s
7.72952 GB/s
7.72505 GB/s
7.74309 GB/s

But overall it does seem like the vectorised copy performs better than REP MOVSB on Zen 3.
Comment 12 Adhemerval Zanella 2023-10-30 16:27:35 UTC
(In reply to Bruce Merry from comment #11)
> > On Zen3 I am not seeing such slowdown using vectorized instructions.
> 
> [...]
> 
> But overall it does seem like the vectorised copy performs better than REP
> MOVSB on Zen 3.

The main issue seems to be defining when ERMS is better than the vectorized path based on the arguments. Current glibc only takes the input size into consideration, whereas from this discussion it seems we also need to take the alignment of the arguments (both of them) into consideration.

Also, it seems that Zen3 ERMS is slightly better than non-temporal instructions, which is another tuning heuristic, since again only the size (currently x86_non_temporal_threshold) is used to decide when to use them.

In any case, I think that at least for the sizes where ERMS is currently being used, it would be better to use the vectorized path. Most likely some more tuning to switch to ERMS at larger sizes would be profitable for Zen cores.
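
To illustrate the kind of heuristic I mean (an illustration only, not a proposed patch): keep the existing size window, but also reject ERMS when the arguments have a small non-zero relative page offset, which is the pathological case reported here.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The 64-byte cutoff is an arbitrary placeholder; the right value would
   need benchmarking. */
static bool erms_likely_profitable(const void *dst, const void *src,
                                   size_t n, size_t lo, size_t hi)
{
    size_t page_delta = ((uintptr_t)dst - (uintptr_t)src) & 0xFFF;
    if (page_delta != 0 && page_delta < 64)
        return false;           /* likely hits the page-aliasing slow path */
    return n > lo && n < hi;    /* the existing size-only window */
}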

Does AMD provide any tuning manual describing such characteristics for instruction and memory operations?
Comment 13 Sourceware Commits 2024-02-13 16:54:16 UTC
The master branch has been updated by H.J. Lu <hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e

commit 0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Thu Feb 8 10:08:38 2024 -0300

    x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
    
    The REP MOVSB usage on memcpy/memmove does not show much performance
    improvement on Zen3/Zen4 cores compared to the vectorized loops.  Also,
    as from BZ 30994, if the source is aligned and the destination is not
    the performance can be 20x slower.
    
    The performance difference is noticeable with small buffer sizes, closer
    to the lower bounds limits when memcpy/memmove starts to use ERMS.  The
    performance of REP MOVSB is similar to vectorized instruction on the
    size limit (the L2 cache).  Also, there is no drawback to multiple cores
    sharing the cache.
    
    Checked on x86_64-linux-gnu on Zen3.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Comment 14 Sourceware Commits 2024-04-04 10:36:35 UTC
The release/2.39/master branch has been updated by Arjun Shankar <arjun@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aa4249266e9906c4bc833e4847f4d8feef59504f

commit aa4249266e9906c4bc833e4847f4d8feef59504f
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Thu Feb 8 10:08:38 2024 -0300

    x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
    
    The REP MOVSB usage on memcpy/memmove does not show much performance
    improvement on Zen3/Zen4 cores compared to the vectorized loops.  Also,
    as from BZ 30994, if the source is aligned and the destination is not
    the performance can be 20x slower.
    
    The performance difference is noticeable with small buffer sizes, closer
    to the lower bounds limits when memcpy/memmove starts to use ERMS.  The
    performance of REP MOVSB is similar to vectorized instruction on the
    size limit (the L2 cache).  Also, there is no drawback to multiple cores
    sharing the cache.
    
    Checked on x86_64-linux-gnu on Zen3.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    
    (cherry picked from commit 0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e)