This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
memcpy performance regressions 2.19 -> 2.24(5)
- From: Erich Elsen <eriche at google dot com>
- To: libc-alpha at sourceware dot org
- Date: Fri, 5 May 2017 10:09:19 -0700
- Subject: memcpy performance regressions 2.19 -> 2.24(5)
Hi everyone,
I've noticed some significant performance regressions for certain
processors and buffer sizes when moving from 2.19 to 2.24 (and 2.25).
In this spreadsheet
(https://docs.google.com/spreadsheets/d/1Mpu1Kr9CNaa9HQjzKGL0tb2x_Nsx8vtLK3b0QnKesHg/edit?usp=sharing)
the regressions are highlighted in red. The three benchmarks are:
readwritecache: both the read and write locations are cached (if possible)
nocache: neither the read nor the write location will be cached
readcache: only the read location will be cached (if possible)
The regressions on IvyBridge are especially concerning and can be
fixed by using __memcpy_avx_unaligned instead of the current default
(__sse2_unaligned_erms).
The regressions at large sizes on IvyBridge and SandyBridge seem to be
due to the use of non-temporal stores; avoiding them restores
performance to 2.19 levels.
The regressions on Haswell can be fixed by using
__memcpy_avx_unaligned instead of __memcpy_avx_unaligned_erms for
sizes in the range 32K <= N <= 4MB.
I had a couple of questions:
1) Are the regressions at large sizes for IvyBridge and SandyBridge
expected? Is avoiding non-temporal stores a reasonable solution?
2) Is it possible to fix the IvyBridge regressions by using CPU model
information to force a specific implementation? I'm not sure how
other CPUs (e.g., AMD) would be affected if the feature-flag-based
selection logic were modified.
Thanks,
Erich