REP MOVSB is slow on AMD Ryzen
- From: Denis Steckelmacher <steckdenis at yahoo dot fr>
- To: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Date: Fri, 24 Mar 2017 10:50:50 +0000 (UTC)
- Subject: REP MOVSB is slow on AMD Ryzen
- Reply-to: Denis Steckelmacher <steckdenis at yahoo dot fr>
Hello,
I am using an AMD Ryzen 1700 CPU, on which I am currently profiling all the applications I regularly use. My system is openSUSE 42.2 with glibc 2.22, but I think some of my profiling results are still applicable to the latest Git version of glibc.
During my testing, I noticed that __memcpy_avx_unaligned regularly appears among the top CPU consumers, especially when my workload copies large buffers of memory (it is the second-largest consumer, at 7%, when training big neural networks). The detailed profile for this function shows that nearly all the time is spent in "rep movsb":
        |2a0:   mov    __x86_shared_cache_size_half,%rcx
   0.01 |       shl    $0x3,%rcx
        |       cmp    %rcx,%rdx
        |       jae    2c0
        |       mov    %rdx,%rcx
  45.01 |       rep    movsb %ds:(%rsi),%es:(%rdi)
   0.04 |       retq
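If I read this snippet correctly, the logic is roughly the following (a C sketch of my reading, not glibc's actual code; the cache-size value is only illustrative):

#include <stddef.h>
#include <string.h>

/* Stand-in for glibc's __x86_shared_cache_size_half (half the shared
   cache size, detected at startup); the value here is only illustrative. */
static size_t shared_cache_size_half = 4u << 20;

static void *copy_sketch(void *dst, const void *src, size_t len)
{
    void *ret = dst;
    size_t threshold = shared_cache_size_half << 3;   /* shl $0x3,%rcx */

    if (len >= threshold)                 /* jae 2c0 */
        return memcpy(ret, src, len);     /* stand-in for the path at 2c0 */

    /* mov %rdx,%rcx; rep movsb */
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (len)
                      : : "memory");
    return ret;
}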
__memcpy_avx_unaligned no longer exists in Git; it has been replaced by variants of memmove. However, large copies still seem to be performed with "rep movsb". I have carried out some more tests and found that on my machine, "rep movsb" and "rep movsq" copy about 4 bytes per cycle, while vmovdqu-based AVX code copies around 11 bytes per cycle. Another person found the same issue: https://www.xaymar.com/2017/03/15/the-truth-about-amd-ryzens-performance-issues/.
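For what it's worth, here is a sketch of the kind of microbenchmark I used (timing, warm-up and error handling are simplified, and rdtsc counts reference cycles, so treat the numbers as rough):

/* build: gcc -O2 -mavx bench.c -o bench */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>          /* __rdtsc, _mm256_* */

#define SIZE (64u << 20)        /* 64 MiB, well above the cache threshold */

static void copy_rep_movsb(void *dst, const void *src, size_t len)
{
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (len)
                      : : "memory");
}

static void copy_avx(void *dst, const void *src, size_t len)
{
    /* plain unaligned 32-byte loop; len is assumed a multiple of 32 */
    for (size_t i = 0; i < len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)((const char *)src + i));
        _mm256_storeu_si256((__m256i *)((char *)dst + i), v);
    }
}

int main(void)
{
    char *src = aligned_alloc(64, SIZE);
    char *dst = aligned_alloc(64, SIZE);
    if (!src || !dst)
        return 1;
    memset(src, 1, SIZE);
    memset(dst, 0, SIZE);       /* fault all pages in before timing */

    uint64_t t0 = __rdtsc();
    copy_rep_movsb(dst, src, SIZE);
    uint64_t t1 = __rdtsc();
    copy_avx(dst, src, SIZE);
    uint64_t t2 = __rdtsc();

    printf("rep movsb: %.2f bytes/cycle\n", (double)SIZE / (double)(t1 - t0));
    printf("avx loop : %.2f bytes/cycle\n", (double)SIZE / (double)(t2 - t1));
    return 0;
}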
Could someone try to reproduce and investigate this? I do not yet know whether removing the "movsb" case in memmove and using AVX instructions instead speeds things up, but I may have time this weekend to install and test the latest glibc. Since I am far from fully understanding what sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S does, I doubt that I could produce a nice patch myself.
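One experiment that does not require rebuilding glibc is an LD_PRELOAD shim that forces a plain AVX copy; the memcpy below is only a rough test harness, not a proposed patch (it makes no attempt at alignment handling or at choosing per-size strategies):

/* build: gcc -shared -fPIC -O2 -mavx -fno-builtin shim.c -o shim.so
   run:   LD_PRELOAD=./shim.so <application>
   (-fno-builtin keeps gcc from turning the tail loop back into memcpy) */
#include <stddef.h>
#include <immintrin.h>

void *memcpy(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    size_t i = 0;

    for (; i + 32 <= n; i += 32)        /* 32-byte AVX copies */
        _mm256_storeu_si256((__m256i *)(d + i),
                            _mm256_loadu_si256((const __m256i *)(s + i)));
    for (; i < n; i++)                  /* byte tail */
        d[i] = s[i];
    return dst;
}

If applications get measurably faster with the shim, that would confirm that the "rep movsb" path is the problem rather than something else in my workload.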
By the way, string instructions seem to be generally slow on Ryzen. For instance, the kernel's clear_page function also appeared near the top of my profiles before I replaced its "rep stosq" with code based on clzero.
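For reference, a userspace sketch of the clzero-based clearing looks roughly like this (it assumes a 64-byte-aligned buffer whose size is a multiple of 64, a CPU that advertises CLZERO as Ryzen does, and a binutils recent enough to know the mnemonic):

#include <stddef.h>

static void clear_with_clzero(void *buf, size_t len)
{
    char *p = buf;

    for (size_t off = 0; off < len; off += 64)
        /* CLZERO zeroes the whole 64-byte cache line addressed by %rax */
        __asm__ volatile ("clzero" : : "a" (p + off) : "memory");

    /* CLZERO behaves like a streaming store, so order it explicitly */
    __asm__ volatile ("sfence" : : : "memory");
}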
Best regards,
Denis Steckelmacher