x86-64: memcpy performance reduced when running in a virtual machine

Shuo Wang wangshuo47@huawei.com
Mon Jan 11 08:41:57 GMT 2021


Performance is also reduced when memcpy enters __memmove_avx_unaligned_erms in the
VM compared with the host.
>memcpy performance is reduced when running in a virtual machine compared with the host.
>These are the test results:
>-----------------------
>|       | host |  vm  | 
>|cycle: |  78  | 1503 |
>-----------------------
>
>From perf, we believe that the host and the VM take the same branch:
>[host]
>  78.61%  libc-2.28.so     [.] __memmove_sse2_unaligned_erms
>  12.85%  [kernel]         [k] nmi
>   6.38%  hot_host_memcpy  [.] main
>   
>[virtual machine]
>  98.64%  libc-2.28.so   [.] __memmove_sse2_unaligned_erms
>   0.17%  hot_vm_memcpy  [.] main
>   
>This is our demo:
>#include <unistd.h>
>#include <stdlib.h>
>#include <stdio.h>
>#include <string.h>
>
>static __inline__ unsigned long long rdtsc(void)
>{
>  unsigned hi, lo;
>  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
>  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
>}
>
>int main(int argc, char **argv)
>{
>    int i, defs, lm_optb;
>    if (argc == 3) {
>        defs = atoi(argv[1]);
>        lm_optb = atoi(argv[2]);
>    } else {
>        printf("error input!\n");
>        return 1;
>    }
>    char *src = (char *)valloc(defs);
>    char *dest = (char *)valloc(defs);
>    int opts = defs;
>
>    memset(src, 1, defs);
>    memset(dest, 1, defs);
>
>    unsigned long long begin, end;
>    begin = rdtsc();
>
>//while (1) {
>    for (i = 0; i < lm_optb; i++) {
>        (void) memcpy(dest, src, opts);
>    }
>//}
>
>    end = rdtsc();
>    printf("all cycle = %llu, percall = %llu\n", end - begin, (end - begin) / lm_optb);
>
>    return (0);
>}
>
>This is the test log:
># taskset -c 2 ./host_memcpy 1024 1024000
>all cycle = 80149652, percall = 78
># taskset -c 2 ./host_memcpy 1024 1024000
>all cycle = 93075200, percall = 90
>
># taskset -c 2 ./vm_memcpy 1024 1024000
>all cycle = 1539990968, percall = 1503
># taskset -c 2 ./vm_memcpy 1024 1024000
>all cycle = 1541243316, percall = 1505
>
>We build it with:
># gcc -g -O0 memcpy.c -o host_memcpy
># gcc -g -O0 memcpy.c -o vm_memcpy
>
>
>The environment information is as follows:
>[host]
>- kernel version: 4.18.0
>- glibc version: 2.28
>- gcc version: 8.3.1
>- qemu version: 2.12.0
>- libvirtd version: 4.5.0
>
># lscpu
>Architecture:        x86_64
>CPU op-mode(s):      32-bit, 64-bit
>Byte Order:          Little Endian
>CPU(s):              60
>On-line CPU(s) list: 0-59
>Thread(s) per core:  2
>Core(s) per socket:  15
>Socket(s):           8
>NUMA node(s):        8
>Vendor ID:           GenuineIntel
>CPU family:          6
>Model:               62
>Model name:          Intel(R) Xeon(R) CPU E7-8870 v2 @ 2.30GHz
>Stepping:            7
>CPU MHz:             2294.529
>CPU max MHz:         2300.0000
>CPU min MHz:         1200.0000
>BogoMIPS:            4589.07
>Virtualization:      VT-x
>L1d cache:           32K
>L1i cache:           32K
>L2 cache:            256K
>L3 cache:            30720K
>NUMA node0 CPU(s):   0-14,30-44
>NUMA node1 CPU(s):   15-29,45-59
>Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear flush_l1d
>
>[virtual machine]
>- kernel version: 4.18.0
>- glibc version: 2.28
>- gcc version: 8.3.1
>- qemu version: 2.12.0
>- libvirtd version: 4.5.0
>
># lscpu
>Architecture:        x86_64
>CPU op-mode(s):      32-bit, 64-bit
>Byte Order:          Little Endian
>CPU(s):              4
>On-line CPU(s) list: 0-3
>Thread(s) per core:  1
>Core(s) per socket:  1
>Socket(s):           4
>NUMA node(s):        1
>Vendor ID:           GenuineIntel
>CPU family:          6
>Model:               62
>Model name:          Intel(R) Xeon(R) CPU E7-8870 v2 @ 2.30GHz
>Stepping:            7
>CPU MHz:             2294.468
>BogoMIPS:            4588.93
>Hypervisor vendor:   KVM
>Virtualization type: full
>L1d cache:           32K
>L1i cache:           32K
>L2 cache:            4096K
>L3 cache:            16384K
>NUMA node0 CPU(s):   0-3
>Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms xsaveopt arat umip md_clear arch_capabilities
>