memcpy performance on skylake server

H.J. Lu hjl.tools@gmail.com
Wed Jul 14 13:26:49 GMT 2021


On Wed, Jul 14, 2021 at 5:58 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 06/07/2021 05:17, Ji, Cheng via Libc-help wrote:
> > Hello,
> >
> > I found that memcpy is slower on skylake server CPUs during our
> > optimization work, and I can't really explain what we got and need some
> > guidance here.
> >
> > The problem is that memcpy is noticeably slower than a simple for loop when
> > copying large chunks of data. This genuinely sounds like an amateur mistake
> > in our testing code but here's what we have tried:
> >
> > * The test data is large enough: 1 GiB per buffer.
> > * We noticed a change quite a while ago regarding Skylake and AVX-512:
> > https://patchwork.ozlabs.org/project/glibc/patch/20170418183712.GA22211@intel.com/
> > * We updated glibc from 2.17 to the latest 2.33; memcpy became about 5%
> > faster but is still slower than a simple loop.
> > * We tested on multiple bare-metal machines with different CPUs (Xeon Gold
> > 6132, Gold 6252, Silver 4114) as well as a virtual machine on Google Cloud;
> > the result is reproducible.
> > * On an older generation Xeon E5-2630 v3, memcpy is about 50% faster than
> > the simple loop. On my desktop (i7-7700k) memcpy is also significantly
> > faster.
> > * numactl is used to ensure everything is running on a single core.
> > * The code is compiled with gcc 10.3.
> >
> > The numbers on a Xeon Gold 6132, with glibc 2.33:
> > simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
> > simple_copy 3.68 seconds, 5.44 GiB/s, 5.70 GB/s
> > simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
> > simple_copy 3.68 seconds, 5.44 GiB/s, 5.71 GB/s
> >
> > The result is worse with the system-provided glibc 2.17:
> > simple_memcpy 4.38 seconds, 4.57 GiB/s, 4.79 GB/s
> > simple_copy 3.68 seconds, 5.43 GiB/s, 5.70 GB/s
> > simple_memcpy 4.38 seconds, 4.56 GiB/s, 4.78 GB/s
> > simple_copy 3.68 seconds, 5.44 GiB/s, 5.70 GB/s
> >
> >
> > The code to generate this result (compiled with g++ -O2 -g, run with: numactl
> > --membind 0 --physcpubind 0 -- ./a.out)
> > =====
> >
> > #include <chrono>
> > #include <cstring>
> > #include <functional>
> > #include <string>
> > #include <vector>
> >
> > class TestCase {
> >     using clock_t = std::chrono::high_resolution_clock;
> >     using sec_t = std::chrono::duration<double, std::ratio<1>>;
> >
> > public:
> >     static constexpr size_t NUM_VALUES = 128 * (1 << 20); // 128 Mi values * 8 bytes = 1 GiB
> >
> >     void init() {
> >         vals_.resize(NUM_VALUES);
> >         for (size_t i = 0; i < NUM_VALUES; ++i) {
> >             vals_[i] = i;
> >         }
> >         dest_.resize(NUM_VALUES);
> >     }
> >
> >     void run(std::string name, std::function<void(const int64_t *, int64_t *, size_t)> &&func) {
> >         // ignore the result from first run
> >         func(vals_.data(), dest_.data(), vals_.size());
> >         constexpr size_t count = 20;
> >         auto start = clock_t::now();
> >         for (size_t i = 0; i < count; ++i) {
> >             func(vals_.data(), dest_.data(), vals_.size());
> >         }
> >         auto end = clock_t::now();
> >         double duration = std::chrono::duration_cast<sec_t>(end - start).count();
> >         printf("%s %.2f seconds, %.2f GiB/s, %.2f GB/s\n", name.data(), duration,
> >                sizeof(int64_t) * NUM_VALUES / double(1 << 30) * count / duration,
> >                sizeof(int64_t) * NUM_VALUES / double(1e9) * count / duration);
> >     }
> >
> > private:
> >     std::vector<int64_t> vals_;
> >     std::vector<int64_t> dest_;
> > };
> >
> > void simple_memcpy(const int64_t *src, int64_t *dest, size_t n) {
> >     memcpy(dest, src, n * sizeof(int64_t));
> > }
> >
> > void simple_copy(const int64_t *src, int64_t *dest, size_t n) {
> >     for (size_t i = 0; i < n; ++i) {
> >         dest[i] = src[i];
> >     }
> > }
> >
> > int main(int, char **) {
> >     TestCase c;
> >     c.init();
> >
> >     c.run("simple_memcpy", simple_memcpy);
> >     c.run("simple_copy", simple_copy);
> >     c.run("simple_memcpy", simple_memcpy);
> >     c.run("simple_copy", simple_copy);
> > }
> >
> > =====
> >
> > The assembly of simple_copy generated by gcc is very simple:
> > Dump of assembler code for function _Z11simple_copyPKlPlm:
> >    0x0000000000401440 <+0>:     mov    %rdx,%rcx
> >    0x0000000000401443 <+3>:     test   %rdx,%rdx
> >    0x0000000000401446 <+6>:     je     0x401460 <_Z11simple_copyPKlPlm+32>
> >    0x0000000000401448 <+8>:     xor    %eax,%eax
> >    0x000000000040144a <+10>:    nopw   0x0(%rax,%rax,1)
> >    0x0000000000401450 <+16>:    mov    (%rdi,%rax,8),%rdx
> >    0x0000000000401454 <+20>:    mov    %rdx,(%rsi,%rax,8)
> >    0x0000000000401458 <+24>:    inc    %rax
> >    0x000000000040145b <+27>:    cmp    %rax,%rcx
> >    0x000000000040145e <+30>:    jne    0x401450 <_Z11simple_copyPKlPlm+16>
> >    0x0000000000401460 <+32>:    retq
> >
> > When compiling with -O3, gcc vectorizes the loop using xmm0; simple_copy
> > is then around 1% faster.
>
> Differences of that magnitude usually fall within measurement noise or come
> from OS jitter.
>
> >
> > I took a brief look at the glibc source code. Though I don't have enough
> > knowledge to understand it yet, I'm curious about the underlying mechanism.
> > Thanks.
>
> H.J., do you have any idea what might be happening here?

From the Intel optimization guide:

2.2.2 Non-Temporal Stores on Skylake Server Microarchitecture

Because of the change in the size of each bank of last level cache on Skylake
Server microarchitecture, if an application, library, or driver only considers
the last level cache to determine the size of on-chip cache per core, it may
see a reduction with Skylake Server microarchitecture and may use non-temporal
store with smaller blocks of memory writes. Since non-temporal stores evict
cache lines back to memory, this may result in an increase in the number of
subsequent cache misses and memory bandwidth demands on Skylake Server
microarchitecture, compared to the previous Intel Xeon processor family.

Also, because of a change in the handling of accesses resulting from
non-temporal stores by Skylake Server microarchitecture, the resources within
each core remain busy for a longer duration compared to similar accesses on
the previous Intel Xeon processor family. As a result, if a series of such
instructions are executed, there is a potential that the processor may run
out of resources and stall, thus limiting the memory write bandwidth from
each core.

The increase in cache misses due to overuse of non-temporal stores and the
limit on the memory write bandwidth per core for non-temporal stores may
result in reduced performance for some applications.

-- 
H.J.

