This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



memcpy performance regressions 2.19 -> 2.24(5)


Hi everyone,

I've noticed what appear to be significant performance regressions
for certain processors and certain sizes when moving from glibc 2.19
to 2.24 (and 2.25).

In this (https://docs.google.com/spreadsheets/d/1Mpu1Kr9CNaa9HQjzKGL0tb2x_Nsx8vtLK3b0QnKesHg/edit?usp=sharing)
spreadsheet the regressions are highlighted in red.  The three
benchmarks are:

readwritecache: both read and write locations are cached (if possible)
nocache: neither the read nor the write locations will be cached
readcache: only the read location will be cached (if possible)

The regressions on IvyBridge are especially concerning and can be
fixed by using __memcpy_avx_unaligned instead of the current default
(__sse2_unaligned_erms).

The regressions at large sizes on IvyBridge and SandyBridge seem to be
due to the use of non-temporal stores; avoiding them restores the
performance to 2.19 levels.
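(To be concrete about what I mean by non-temporal stores: the
large-size path streams its writes past the caches, roughly like the
sketch below.  copy_nontemporal is my own name for illustration, and it
assumes 16-byte-aligned buffers and a size that is a multiple of 16.
Because the stores bypass the cache, a destination that *is* re-read
soon afterwards has to come back from memory, which matches the pattern
of these regressions:)

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Sketch of a non-temporal (streaming) copy.  _mm_stream_si128 writes
 * bypass the cache hierarchy, which helps when the destination will
 * not be re-read soon, and hurts when it will.  Assumes both pointers
 * are 16-byte aligned and n is a multiple of 16.  */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    __m128i *d = dst;
    const __m128i *s = src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
    _mm_sfence();   /* order the streaming stores before later code */
}
```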

The regressions on Haswell can be fixed by using
__memcpy_avx_unaligned instead of __memcpy_avx_unaligned_erms in the
region of 32K <= N <= 4MB.
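(For reference, the difference between the two variants is that the
_erms one switches to "rep movsb" above a threshold, roughly like the
sketch below; copy_erms is a made-up name, and this is x86-64 only:)

```c
#include <stddef.h>
#include <string.h>

/* Rough sketch of what the *_erms variants do above their threshold:
 * a single "rep movsb" instruction, which microcode with Enhanced REP
 * MOVSB expands into an optimized copy.  x86-64 inline asm; copy_erms
 * is an illustrative name, not a glibc symbol.  */
static void copy_erms(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
}
```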

I had a couple of questions:

1) Are the regressions at large sizes for IvyBridge and SandyBridge
expected?  Is avoiding non-temporal stores a reasonable solution?

2) Is it possible to fix the IvyBridge regressions by using CPU model
information to force a specific implementation?  I'm not sure how
other CPUs (e.g. AMD) would be affected if the selection logic were
modified based on feature flags.
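(Something like the following is what I had in mind: decode the
family/model fields from CPUID leaf 1 and special-case IvyBridge
(family 6, models 0x3A/0x3E) without touching the feature-flag logic
used for other vendors.  This is only a sketch of the idea with
made-up helper names, not glibc's actual selection code:)

```c
#include <cpuid.h>

/* Decode the display model from the CPUID leaf-1 EAX signature.  The
 * extended-model bits only apply for family 6 and family 15.  */
static unsigned model_from_eax(unsigned eax)
{
    unsigned model  = (eax >> 4) & 0xf;
    unsigned family = (eax >> 8) & 0xf;
    if (family == 0x6 || family == 0xf)
        model |= ((eax >> 16) & 0xf) << 4;   /* extended model bits */
    return model;
}

/* Hypothetical predicate a selector could use: IvyBridge is family 6,
 * model 0x3A (client) or 0x3E (server).  */
static int is_ivybridge(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    unsigned m = model_from_eax(eax);
    return m == 0x3A || m == 0x3E;
}
```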

Thanks,
Erich

