This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: memcpy performance regressions 2.19 -> 2.24(5)


On Tue, May 9, 2017 at 4:48 PM, Erich Elsen <eriche@google.com> wrote:
> I've created a shareable benchmark, available here:
> https://gist.github.com/ekelsen/b66cc085eb39f0495b57679cdb1874fa .
> This is not the one used to generate the numbers on the spreadsheet,
> but the results are similar.

I will take a look.

> I think libc 2.19 chooses sse2_unaligned for all the cpus on the spreadsheet.
>
> You can use this to see the difference on Haswell between
> avx_unaligned and avx_unaligned_erms on the readcache and nocache
> benchmarks.  It's true that for readwritecache, which corresponds to
> the libc benchmarks, avx_unaligned_erms is always at least as fast.

I created hjl/x86/optimize branch with memcpy-sse2-unaligned.S
from glibc 2.19 so that we can compare its performance against
others with glibc benchmark.

> You can also use it to see the regression on IvyBridge from 2.19 to 2.24.

That is expected since memcpy-sse2-unaligned.S doesn't use
non-temporal stores.
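
For background, here is a rough C sketch, using SSE2 intrinsics rather
than the actual assembly, of what a non-temporal copy loop does
differently from an ordinary one; the function names are made up for
illustration, and it assumes dst is 16-byte aligned and n is a multiple
of 16:

/* Illustration only, not the glibc implementation.  */
#include <emmintrin.h>
#include <stddef.h>

/* Ordinary stores: the copied lines land in the cache and can evict
   data that other threads are using.  */
static void
copy_temporal (char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; i += 16)
    _mm_storeu_si128 ((__m128i *) (dst + i),
                      _mm_loadu_si128 ((const __m128i *) (src + i)));
}

/* Non-temporal (streaming) stores: the data bypasses the cache, which
   avoids pollution on very large copies but can be slower for a
   single thread.  */
static void
copy_nontemporal (char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; i += 16)
    _mm_stream_si128 ((__m128i *) (dst + i),
                      _mm_loadu_si128 ((const __m128i *) (src + i)));
  _mm_sfence ();   /* make the streaming stores globally visible */
}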

> Are there standard benchmarks showing that using the non-temporal

How responsive is your glibc 2.19 machine when your memcpy benchmark
is running?  I would expect the glibc 2.24 machine to be more responsive.

> store is a net win even though it causes a 2-3x decrease in single
> threaded performance for some processors?  Or how else is the decision
> about the threshold made?

There is no perfect number that makes everyone happy.  I am open
to suggestions for improving the compromise.
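
To make the compromise concrete, the selection is roughly as in the
simplified C model below; the names and numbers are placeholders, and
the two size cutoffs correspond to REP_MOVSB_THRESHOLD (quoted further
down in this thread) and __x86_shared_non_temporal_threshold:

/* Simplified model of how the _erms memcpy variants choose a strategy
   by copy size.  Placeholder values, not glibc internals.  */
#include <stddef.h>

enum strategy { VECTOR_LOOP, REP_MOVSB, NON_TEMPORAL };

/* Placeholders: 2048 * (VEC_SIZE / 16) with VEC_SIZE == 32 (AVX), and a
   non-temporal cutoff on the order of the shared cache size.  */
static size_t rep_movsb_threshold = 2048 * (32 / 16);
static size_t shared_non_temporal_threshold = 6 * 1024 * 1024;

static enum strategy
choose_strategy (size_t n)
{
  if (n < rep_movsb_threshold)
    return VECTOR_LOOP;      /* short and medium copies: SSE/AVX loop */
  if (n < shared_non_temporal_threshold)
    return REP_MOVSB;        /* ERMS is faster in this range */
  return NON_TEMPORAL;       /* very large copies: streaming stores,
                                to avoid polluting the shared cache */
}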

H.J.

> Thanks,
> Erich
>
> On Sat, May 6, 2017 at 8:41 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, May 5, 2017 at 5:57 PM, Erich Elsen <eriche@google.com> wrote:
>>> Hi Carlos,
>>>
>>> a/b) The number of runs is dependent on the time taken; the number of
>>> iterations was such that each size took at least 500ms across all
>>> iterations.  For many of the smaller sizes this means 10-100 million
>>> iterations, for the largest size, 64MB, it was ~60.  10 runs were
>>> launched separately, the difference between the maximum and the
>>> minimum average was never more than 6% for any size; all of the
>>> regressions are larger than this difference (usually much larger).
>>> The times on the spreadsheet are from a randomly chosen run - it would
>>> be possible to use a median or average, but given the large size of
>>> the effect, it didn't seem necessary.
>>>
>>> b) The machines were idle (background processes only) except for the
>>> test being run.  Boost was disabled.  The benchmark is single
>>> threaded.  I did not explicitly pin the process, but given that the
>>> machine was otherwise idle, it would be surprising if it were
>>> migrated.  I can add pinning to see if the results change.
>>>
>>> c) The specific processors were E5-2699 (Haswell), E5-2696 (Ivy),
>>> E5-2689 (Sandy); I don't have motherboard or memory info.  The kernel
>>> on the benchmark machines is 3.11.10.
>>>
>>> d)  Only bench-memcpy-large would expose the problem at the largest
>>> sizes.  2.19 did not have bench-memcpy-large.  The current benchmarks
>>> will not reveal the regressions on Ivy and Haswell in the intermediate
>>> size range because they only correspond to the readwritecache case on
>>> the spreadsheet.  That is, they loop over the same src and dst buffers
>>> in the timing loop.
>>>
>>> nocache means that both the src and dst buffers go through memory with
>>> strides such that nothing will be cached.
>>> readcache means that the src buffer is fixed, but the dst buffer
>>> strides through memory.
>>>
>>> To see the difference at the largest sizes with bench-memcpy-large,
>>> you can run it twice: once as-is, and once forcing
>>> __x86_shared_non_temporal_threshold to LONG_MAX so that the
>>> non-temporal path is never taken.
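
To make the three cases concrete, here is a rough C sketch of the access
patterns described above; the names and sizes are made up, and POOL is
assumed to be larger than the last-level cache so that the striding
cases actually defeat caching:

/* Illustration of the three benchmark access patterns.  */
#include <string.h>
#include <stddef.h>

#define POOL (256UL * 1024 * 1024)   /* assumed larger than the LLC */

/* readwritecache: same src and dst every iteration (this is what the
   glibc bench-memcpy loops do, so everything stays in cache).  */
static void
bench_readwritecache (char *src, char *dst, size_t n, long iters)
{
  for (long i = 0; i < iters; i++)
    memcpy (dst, src, n);
}

/* readcache: src is fixed, dst strides through a large pool.  */
static void
bench_readcache (char *src, char *dst_pool, size_t n, long iters)
{
  size_t off = 0;
  for (long i = 0; i < iters; i++)
    {
      memcpy (dst_pool + off, src, n);
      off = (off + n) % (POOL - n);
    }
}

/* nocache: both src and dst stride, so nothing is reused before it
   has been evicted.  */
static void
bench_nocache (char *src_pool, char *dst_pool, size_t n, long iters)
{
  size_t off = 0;
  for (long i = 0; i < iters; i++)
    {
      memcpy (dst_pool + off, src_pool + off, n);
      off = (off + n) % (POOL - n);
    }
}
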
>>
>> The purpose of using non-temporal stores is to avoid cache pollution
>> so that the cache is also available to other threads.  We can improve
>> the heuristic for the non-temporal threshold, but we can't give all of
>> the cache to a single thread by default.
>>
>> As for Haswell, there are some cases where the SSSE3 memcpy in
>> glibc 2.19 is faster than the new AVX memcpy.  But the new AVX
>> memcpy is faster than the SSSE3 memcpy in the majority of cases.  The
>> new AVX memcpy in glibc 2.24 replaces the old AVX memcpy in glibc
>> 2.23. So there is no regression from 2.23 to 2.24.
>>
>> I also checked my glibc performance data.  For data > 32K,
>> __memcpy_avx_unaligned is slower than __memcpy_avx_unaligned_erms.
>> We have
>>
>> /* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
>>    up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
>>    memcpy micro benchmark in glibc shows that 2KB is the approximate
>>    value above which REP MOVSB becomes faster than SSE2 optimization
>>    on processors with Enhanced REP MOVSB.  Since larger register size
>>    can move more data with a single load and store, the threshold is
>>    higher with larger register size.  */
>> #ifndef REP_MOVSB_THRESHOLD
>> # define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
>> #endif
>>
>> We can change it if there is improvement in glibc benchmarks.
>>
>>
>> H.J.
>>
>>> e) Yes, I can do this.  It needs to go through approval before it can
>>> be shared publicly, which will take a few days.
>>>
>>> Thanks,
>>> Erich
>>>
>>> On Fri, May 5, 2017 at 11:09 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>>>> On 05/05/2017 01:09 PM, Erich Elsen wrote:
>>>>> I had a couple of questions:
>>>>>
>>>>> 1) Are the large regressions at large sizes for IvyBridge and
>>>>> SandyBridge expected?  Is avoiding non-temporal stores a reasonable
>>>>> solution?
>>>>
>>>> No large regressions are expected.
>>>>
>>>>> 2) Is it possible to fix the IvyBridge regressions by using model
>>>>> information to force a specific implementation?  I'm not sure how
>>>>> other cpus (AMD) would be affected if the selection logic was modified
>>>>> based on feature flags.
>>>>
>>>> A different memcpy can be used for any detectable difference in hardware.
>>>> What you can't do is select a different memcpy for a different range of
>>>> inputs. You have to make the choice upfront with only the knowledge of
>>>> the hardware as your input. Though today we could augment that choice
>>>> with a glibc tunable set by the shell starting the process.
>>>>
>>>> I have questions of my own:
>>>>
>>>> (a) How statistically relevant were your results?
>>>> - What are your confidence intervals?
>>>> - What is your standard deviation?
>>>> - How many runs did you average?
>>>>
>>>> (b) Was your machine hardware stable?
>>>> - See:
>>>> https://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/
>>>> - What methodology did you use to carry out your tests? Like CPU pinning.
>>>>
>>>> (c) Exactly what hardware did you use?
>>>> - You mention IvyBridge and SandyBridge, but what exact hardware did
>>>>   you use for the tests, and what exact kernel version?
>>>>
>>>> (d) If you run glibc's own microbenchmarks do you see the same
>>>>     performance problems? e.g. make bench, and look at the detailed
>>>>     bench-memcpy, bench-memcpy-large, and bench-memcpy-random results.
>>>>
>>>> https://sourceware.org/glibc/wiki/Testing/Builds
>>>>
>>>> (e) Are you willing to publish your microbenchmark sources for others
>>>>     to confirm the results?
>>>>
>>>> --
>>>> Cheers,
>>>> Carlos.
>>
>>
>>
>> --
>> H.J.



-- 
H.J.

