This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



REP MOVSB is slow on AMD Ryzen


Hello,

I am using an AMD Ryzen 1700 CPU, on which I'm currently profiling all the applications I regularly use. I use openSUSE 42.2 with glibc-2.22, but I think that some of my profiling results may still be applicable to the latest Git version of glibc.

During my testing, I noticed that __memcpy_avx_unaligned regularly appears near the top of the list of CPU consumers, especially when my workload copies large memory buffers (it is the second-largest consumer, at 7%, when training big neural networks). The detailed profile for this function shows that nearly all of its time is spent in "rep movsb" (sorry for the bad formatting):

       |2a0:   mov    __x86_shared_cache_size_half,%rcx
  0,01 |       shl    $0x3,%rcx
       |       cmp    %rcx,%rdx
       |       jae    2c0
       |       mov    %rdx,%rcx
       |       mov    %rdx,%rcx
 45,01 |       rep    movsb %ds:(%rsi),%es:(%rdi)
  0,04 |       retq
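
For readers who do not read AT&T syntax daily, the branch in this excerpt can be sketched in C. This is only a paraphrase of the instructions shown above, not glibc code: the function name, the enum, and its values are hypothetical, and COPY_OTHER_PATH stands for whatever lives at the branch target 2c0, which the listing does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the two outcomes of the excerpt. */
enum copy_path {
    COPY_REP_MOVSB,  /* falls through to "rep movsb" */
    COPY_OTHER_PATH  /* "jae 2c0" taken; target not shown in the profile */
};

/* shared_cache_size_half plays the role of __x86_shared_cache_size_half,
 * len plays the role of %rdx (the number of bytes to copy). */
static enum copy_path choose_copy_path(size_t shared_cache_size_half, size_t len)
{
    /* Guard against overflow of the shift in this sketch. */
    assert(shared_cache_size_half <= (SIZE_MAX >> 3));

    /* shl $0x3,%rcx : threshold = 8 * (half of the shared cache size) */
    size_t threshold = shared_cache_size_half << 3;

    /* cmp %rcx,%rdx ; jae 2c0 : unsigned "len >= threshold" leaves this path */
    if (len >= threshold)
        return COPY_OTHER_PATH;

    /* mov %rdx,%rcx ; rep movsb : copy len bytes with the string instruction */
    return COPY_REP_MOVSB;
}
```

In other words, within this excerpt any copy smaller than eight times __x86_shared_cache_size_half ends up in "rep movsb", which matches where the profile shows the time going.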

__memcpy_avx_unaligned no longer exists in Git; it has been replaced by variants of memmove. However, large copies still seem to be performed using "rep movsb". I have carried out some more tests and found that, on my machine, "rep movsb" and "rep movsq" copy about 4 bytes per cycle, while vmovdqu-based AVX code copies around 11 bytes per cycle. Another person found the same issue: https://www.xaymar.com/2017/03/15/the-truth-about-amd-ryzens-performance-issues/.
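
For anyone who wants to try reproducing the comparison, here is a minimal sketch of the two copy loops and a timing helper. It is not the test program behind the numbers above: all function names are hypothetical, it assumes GCC or Clang on x86-64 (falling back to memcpy elsewhere), and it reports seconds per copy rather than bytes per cycle, so the result still has to be divided by the buffer size and multiplied by the clock frequency.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_X86_64 1
#else
#define SKETCH_X86_64 0
#endif

/* Copy n bytes with the string instruction (direction flag clear per the ABI). */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if SKETCH_X86_64
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);  /* portable fallback, not a string instruction */
#endif
}

#if SKETCH_X86_64
/* 32 bytes per iteration with unaligned AVX loads/stores (vmovdqu). */
static __attribute__((target("avx")))
void copy_avx_impl(unsigned char *d, const unsigned char *s, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(s + i));
        _mm256_storeu_si256((__m256i *)(d + i), v);
    }
    for (; i < n; i++)  /* byte-wise tail */
        d[i] = s[i];
}
#endif

static void copy_avx_unaligned(void *dst, const void *src, size_t n)
{
#if SKETCH_X86_64
    if (__builtin_cpu_supports("avx")) {
        copy_avx_impl((unsigned char *)dst, (const unsigned char *)src, n);
        return;
    }
#endif
    memcpy(dst, src, n);  /* fallback when AVX is unavailable */
}

typedef void (*copy_fn)(void *, const void *, size_t);

/* Average wall-clock seconds for one call of fn over reps repetitions. */
static double seconds_per_copy(copy_fn fn, void *dst, const void *src,
                               size_t n, int reps)
{
    struct timespec t0, t1;
    assert(reps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9) / reps;
}
```

Calling seconds_per_copy(copy_rep_movsb, ...) and seconds_per_copy(copy_avx_unaligned, ...) on the same pair of large buffers, for instance a few hundred megabytes allocated with malloc and touched once beforehand, should give a rough idea of the ratio on a given machine. The buffer size, alignment, and repetition count all influence the result.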

Could someone try to reproduce and investigate this? I do not yet know whether removing the "movsb" case in memmove and using AVX instructions instead speeds things up, but I may have time this weekend to install and test the latest glibc. However, since I am far from fully understanding what sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S does, I doubt that I could produce a clean patch.


By the way, string instructions seem to be generally slow on Ryzen. For instance, the kernel's clear_page function also appeared at the top of my profiles until I replaced "rep stosq" with code based on clzero.

Best regards,
Denis Steckelmacher

