This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: memcpy performance regressions 2.19 -> 2.24(5)
On Tue, May 9, 2017 at 4:48 PM, Erich Elsen <eriche@google.com> wrote:
> I've created a shareable benchmark, available here:
> https://gist.github.com/ekelsen/b66cc085eb39f0495b57679cdb1874fa .
> This is not the one the numbers on the spreadsheet are generated from,
> but the results are similar.
I will take a look.
> I think libc 2.19 chooses sse2_unaligned for all the cpus on the spreadsheet.
>
> You can use this to see the difference on Haswell between
> avx_unaligned and avx_unaligned_erms on the readcache and nocache
> benchmarks. It's true that for readwritecache, which corresponds to
> the libc benchmarks, avx_unaligned_erms is always at least as fast.
I created the hjl/x86/optimize branch with memcpy-sse2-unaligned.S
from glibc 2.19 so that we can compare its performance against
the others with the glibc benchmark.
> You can also use it to see the regression on IvyBridge from 2.19 to 2.24.
That is expected, since memcpy-sse2-unaligned.S doesn't use
non-temporal stores.
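For context, here is a minimal sketch of what a copy loop with non-temporal stores looks like. This is illustrative only, not glibc's code (the real implementation is hand-written assembly that handles alignment, overlap, and tail bytes); it assumes x86-64 with SSE2:

```c
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stddef.h>

/* Copy with movntdq streaming stores, which write around the cache.
   Simplifying assumptions: dst is 16-byte aligned and n is a multiple
   of 16; a real memcpy handles the general case.  */
static void
copy_nontemporal (void *dst, const void *src, size_t n)
{
  __m128i *d = (__m128i *) dst;
  const __m128i *s = (const __m128i *) src;
  for (size_t i = 0; i < n / 16; i++)
    _mm_stream_si128 (&d[i], _mm_loadu_si128 (&s[i]));
  _mm_sfence ();  /* order the streaming stores before later accesses */
}
```

The destination side skips the cache fill, which is why the nocache and readcache cases behave so differently from readwritecache: streaming helps when the destination won't be re-read soon, and hurts when it will.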
> Are there standard benchmarks showing that using the non-temporal
> store is a net win even though it causes a 2-3x decrease in single
> threaded performance for some processors? Or how else is the decision
> about the threshold made?

How responsive is your glibc 2.19 machine while your memcpy benchmark
is running? I would expect the glibc 2.24 machine to be more responsive.

There is no perfect number that makes everyone happy. I am open
to suggestions to improve the compromise.
H.J.
> Thanks,
> Erich
>
> On Sat, May 6, 2017 at 8:41 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, May 5, 2017 at 5:57 PM, Erich Elsen <eriche@google.com> wrote:
>>> Hi Carlos,
>>>
>>> a/b) The number of runs depended on the time taken; the number of
>>> iterations was chosen so that each size took at least 500ms for all
>>> iterations. For many of the smaller sizes this means 10-100 million
>>> iterations; for the largest size, 64MB, it was ~60. 10 runs were
>>> launched separately, and the difference between the maximum and the
>>> minimum average was never more than 6% for any size; all of the
>>> regressions are larger than this difference (usually much larger).
>>> The times on the spreadsheet are from a randomly chosen run; it would
>>> be possible to use a median or average, but given the large size of
>>> the effect, it didn't seem necessary.
>>>
>>> b) The machines were idle (background processes only) except for the
>>> test being run. Boost was disabled. The benchmark is single
>>> threaded. I did not explicitly pin the process, but given that the
>>> machine was otherwise idle, it would be surprising if it were
>>> migrated. I can add this to see if the results change.
>>>
>>> c) The specific processors were E5-2699 (Haswell), E5-2696 (Ivy),
>>> E5-2689 (Sandy); I don't have motherboard or memory info. The kernel
>>> on the benchmark machines is 3.11.10.
>>>
>>> d) Only bench-memcpy-large would expose the problem at the largest
>>> sizes. 2.19 did not have bench-memcpy-large. The current benchmarks
>>> will not reveal the regressions on Ivy and Haswell in the intermediate
>>> size range because they only correspond to the readwritecache case on
>>> the spreadsheet. That is, they loop over the same src and dst buffers
>>> in the timing loop.
>>>
>>> nocache means that both the src and dst buffers go through memory with
>>> strides such that nothing will be cached.
>>> readcache means that the src buffer is fixed, but the dst buffer
>>> strides through memory.
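As I read those definitions, the access patterns look roughly like this. This is an assumption about the benchmark based on the descriptions above, not code from it; glibc's own bench-memcpy corresponds to the readwritecache pattern (same buffers every iteration):

```c
#include <stddef.h>
#include <string.h>

/* "nocache" pattern: src strides through the lower half of a pool
   much larger than the last-level cache, dst through the upper half,
   so every iteration touches cold lines.  For "readcache", src_off
   would simply stay at 0.  Assumes pool_size / 2 > copy_size.  */
static void
bench_nocache (char *pool, size_t pool_size,
               size_t copy_size, size_t iters)
{
  size_t half = pool_size / 2;
  size_t src_off = 0, dst_off = 0;
  for (size_t i = 0; i < iters; i++)
    {
      memcpy (pool + half + dst_off, pool + src_off, copy_size);
      /* Advance both offsets so nothing stays cached.  */
      src_off = (src_off + copy_size) % (half - copy_size);
      dst_off = (dst_off + copy_size) % (half - copy_size);
    }
}
```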
>>>
>>> To see the difference at the largest sizes with the bench-memcpy-large
>>> you can run it twice; once forcing __x86_shared_non_temporal_threshold
>>> to LONG_MAX so the non-temporal path is never taken.
>>
>> The purpose of using non-temporal stores is to avoid cache pollution
>> so that the cache is also available to other threads. We can improve the
>> heuristic for the non-temporal threshold, but we can't give all of the
>> cache to a single thread by default.
>>
>> As for Haswell, there are some cases where the SSSE3 memcpy in
>> glibc 2.19 is faster than the new AVX memcpy. But the new AVX
>> memcpy is faster than the SSSE3 memcpy in the majority of cases. The
>> new AVX memcpy in glibc 2.24 replaces the old AVX memcpy in glibc
>> 2.23. So there is no regression from 2.23 to 2.24.
>>
>> I also checked my glibc performance data. For data > 32K,
>> __memcpy_avx_unaligned is slower than __memcpy_avx_unaligned_erms.
>> We have
>>
>> /* Threshold to use Enhanced REP MOVSB. Since there is overhead to set
>> up REP MOVSB operation, REP MOVSB isn't faster on short data. The
>> memcpy micro benchmark in glibc shows that 2KB is the approximate
>> value above which REP MOVSB becomes faster than SSE2 optimization
>> on processors with Enhanced REP MOVSB. Since larger register size
>> can move more data with a single load and store, the threshold is
>> higher with larger register size. */
>> #ifndef REP_MOVSB_THRESHOLD
>> # define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
>> #endif
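For concreteness, the formula in that comment scales the ERMS cut-over with vector width; it works out to the following for the vector sizes the x86-64 implementations use:

```c
/* REP_MOVSB_THRESHOLD = 2048 * (VEC_SIZE / 16), from the glibc
   comment quoted above.  */
#define REP_MOVSB_THRESHOLD(vec_size) (2048 * ((vec_size) / 16))

/* SSE2    (VEC_SIZE = 16): 2048-byte threshold
   AVX2    (VEC_SIZE = 32): 4096-byte threshold
   AVX-512 (VEC_SIZE = 64): 8192-byte threshold  */
```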
>>
>> We can change it if there is improvement in glibc benchmarks.
>>
>>
>> H.J.
>>
>>> e) Yes, I can do this. It needs to go through approval to share
>>> publicly, which will take a few days.
>>>
>>> Thanks,
>>> Erich
>>>
>>> On Fri, May 5, 2017 at 11:09 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>>>> On 05/05/2017 01:09 PM, Erich Elsen wrote:
>>>>> I had a couple of questions:
>>>>>
>>>>> 1) Are the large regressions at large sizes for IvyBridge and
>>>>> SandyBridge expected? Is avoiding non-temporal stores a reasonable
>>>>> solution?
>>>>
>>>> No large regressions are expected.
>>>>
>>>>> 2) Is it possible to fix the IvyBridge regressions by using model
>>>>> information to force a specific implementation? I'm not sure how
>>>>> other cpus (AMD) would be affected if the selection logic was modified
>>>>> based on feature flags.
>>>>
>>>> A different memcpy can be used for any detectable difference in hardware.
>>>> What you can't do is select a different memcpy for a different range of
>>>> inputs. You have to make the choice upfront with only the knowledge of
>>>> the hardware as your input. Though today we could augment that choice
>>>> with a glibc tunable set by the shell starting the process.
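A toy model of the constraint Carlos describes (not glibc's code; the `cpu_has_avx` flag and function names here are hypothetical): the IFUNC-style resolver runs once, sees only hardware features, and never sees the size argument of any future call:

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stand-ins for variants like __memcpy_sse2_unaligned and
   __memcpy_avx_unaligned; both just forward to memcpy here.  */
static void *
memcpy_generic (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

static void *
memcpy_avx_like (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

/* Runs once at startup: the decision can use CPU feature bits (here
   a hypothetical cpu_has_avx flag, possibly influenced by a tunable),
   but one implementation must serve every size from then on.  */
static memcpy_fn
select_memcpy (int cpu_has_avx)
{
  return cpu_has_avx ? memcpy_avx_like : memcpy_generic;
}
```

Per-size choices (like the non-temporal threshold) therefore have to live inside the selected implementation itself, not in the selection step.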
>>>>
>>>> I have questions of my own:
>>>>
>>>> (a) How statistically relevant were your results?
>>>> - What are your confidence intervals?
>>>> - What is your standard deviation?
>>>> - How many runs did you average?
>>>>
>>>> (b) Was your machine hardware stable?
>>>> - See:
>>>> https://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/
>>>> - What methodology did you use to carry out your tests? Like CPU pinning.
>>>>
>>>> (c) Exactly what hardware did you use?
>>>> - You mention IvyBridge and SandyBridge, but what exact hardware did
>>>> you use for the tests, and what exact kernel version?
>>>>
>>>> (d) If you run glibc's own microbenchmarks do you see the same
>>>> performance problems? e.g. make bench, and look at the detailed
>>>> bench-memcpy, bench-memcpy-large, and bench-memcpy-random results.
>>>>
>>>> https://sourceware.org/glibc/wiki/Testing/Builds
>>>>
>>>> (e) Are you willing to publish your microbenchmark sources for others
>>>> to confirm the results?
>>>>
>>>> --
>>>> Cheers,
>>>> Carlos.
>>
>>
>>
>> --
>> H.J.
--
H.J.