[PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold

H.J. Lu hjl.tools@gmail.com
Thu Sep 24 21:54:43 GMT 2020


On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty
<patrick.mcgehearty@oracle.com> wrote:
>
>
>
> On 9/23/2020 6:13 PM, H.J. Lu wrote:
> > On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty
> > <patrick.mcgehearty@oracle.com> wrote:
> >>
> >>
> >> On 9/23/2020 4:37 PM, H.J. Lu wrote:
> >>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty
> >>> <patrick.mcgehearty@oracle.com> wrote:
> >>>>
> >>>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
> >>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
> >>>>> <libc-alpha@sourceware.org> wrote:
> >>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
> >>>>>> uses non_temporal stores to avoid pushing other data out of the last
> >>>>>> level cache.
> >>>>>>
> >>>>>> This patch proposes to revert the calculation change made by H.J. Lu's
> >>>>>> patch of June 2, 2017.
> >>>>>>
> >>>>>> H.J. Lu's patch selected a threshold suitable for a single thread
> >>>>>> getting maximum performance. It was tuned using the single-threaded
> >>>>>> large memcpy micro benchmark on an 8-core processor. That change
> >>>>>> moved the threshold from 3/4 of one thread's share of the cache to
> >>>>>> 3/4 of the entire cache of a multi-threaded system before switching
> >>>>>> to non-temporal stores. Multi-threaded systems with more than a few
> >>>>>> threads are server-class and typically have many active threads. If
> >>>>>> one thread consumes 3/4 of the available cache for all threads, it
> >>>>>> will cause other active threads to have data removed from the cache.
> >>>>>> Two examples show the range of the effect. John McCalpin's widely
> >>>>>> parallel Stream benchmark, which runs in parallel and fetches data
> >>>>>> sequentially, saw a 20% slowdown with the 2017 patch on an internal
> >>>>>> system test of 128 threads. This regression was discovered when
> >>>>>> comparing OL8 performance to OL7. An example that compares normal
> >>>>>> stores to non-temporal stores may be found at
> >>>>>> https://vgatherps.github.io/2018-09-02-nontemporal/ . A simple test
> >>>>>> there shows a slowdown of 4x to 5x due to a failure to use
> >>>>>> non-temporal stores. These performance losses are most likely to
> >>>>>> occur when the system load is heaviest and good performance is
> >>>>>> critical.
> >>>>>>
> >>>>>> The tunable x86_non_temporal_threshold can be used to override the
> >>>>>> default for the knowledgeable user who really wants maximum cache
> >>>>>> allocation to a single thread in a multi-threaded system. The
> >>>>>> manual entry for the tunable has been expanded to provide more
> >>>>>> information about its purpose.
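> >>>>>>
> >>>>>> As a usage sketch (the byte value is deliberately left as a
> >>>>>> placeholder), the override is applied at process startup through
> >>>>>> the GLIBC_TUNABLES environment variable, using the tunable name
> >>>>>> from the manual entry below:
> >>>>>> GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=<bytes> ./app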
> >>>>>>
> >>>>>>            modified: sysdeps/x86/cacheinfo.c
> >>>>>>            modified: manual/tunables.texi
> >>>>>> ---
> >>>>>>     manual/tunables.texi    |  6 +++++-
> >>>>>>     sysdeps/x86/cacheinfo.c | 12 +++++++-----
> >>>>>>     2 files changed, 12 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
> >>>>>> index b6bb54d..94d4fbd 100644
> >>>>>> --- a/manual/tunables.texi
> >>>>>> +++ b/manual/tunables.texi
> >>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
> >>>>>>
> >>>>>>     @deftp Tunable glibc.tune.x86_non_temporal_threshold
> >>>>>>     The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
> >>>>>> -to set threshold in bytes for non temporal store.
> >>>>>> +to set the threshold in bytes for non-temporal stores. Non-temporal
> >>>>>> +stores give a hint to the hardware to move data directly to memory
> >>>>>> +without displacing other data from the cache. This tunable is used
> >>>>>> +by some platforms to determine when to use non-temporal stores in
> >>>>>> +operations like memmove and memcpy.
> >>>>>>
> >>>>>>     This tunable is specific to i386 and x86-64.
> >>>>>>     @end deftp
> >>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> >>>>>> index b9444dd..c6767d9 100644
> >>>>>> --- a/sysdeps/x86/cacheinfo.c
> >>>>>> +++ b/sysdeps/x86/cacheinfo.c
> >>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
> >>>>>>           __x86_shared_cache_size = shared;
> >>>>>>         }
> >>>>>>
> >>>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
> >>>>>> -     shared cache size is the approximate value above which non-temporal
> >>>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
> >>>>>> -     total shared cache size.  */
> >>>>>> +  /* The default setting for the non_temporal threshold is 3/4
> >>>>>> +     of one thread's share of the chip's cache. While higher
> >>>>>> +     single thread performance may be observed with a higher
> >>>>>> +     threshold, having a single thread use more than its share
> >>>>>> +     of the cache will negatively impact the performance of
> >>>>>> +     other threads running on the chip. */
> >>>>>>       __x86_shared_non_temporal_threshold
> >>>>>>         = (cpu_features->non_temporal_threshold != 0
> >>>>>>            ? cpu_features->non_temporal_threshold
> >>>>>> -       : __x86_shared_cache_size * threads * 3 / 4);
> >>>>>> +       : __x86_shared_cache_size * 3 / 4);
> >>>>>>     }
> >>>>>>
> >>>>> Can we tune it with the number of threads and/or total cache
> >>>>> size?
> >>>>>
> >>>> When you say "total cache size", is that different from
> >>>> shared_cache_size * threads?
> >>>>
> >>>> I see a fundamental conflict of optimization goals:
> >>>> 1) Provide best single thread performance (current code)
> >>>> 2) Provide best overall system performance under full load (proposed patch)
> >>>> I don't know of any way to have default behavior meet both goals
> >>>> without knowledge of the system size/usage/requirements.
> >>>>
> >>>> Consider a hypothetical single-chip system with 64 threads and 128
> >>>> MB of total cache on the chip. That won't be uncommon in the coming
> >>>> years on server-class systems, especially in large databases or HPC
> >>>> environments (think vision processing or weather modeling, for
> >>>> example). Suppose a single app owns the whole chip, runs a
> >>>> multi-threaded application, and needs to memcpy a really large block
> >>>> of data when one phase of computation finishes before moving to the
> >>>> next phase. A common practice would be to have 64 parallel calls to
> >>>> memcpy. The Stream benchmark demonstrates with OpenMP that current
> >>>> compilers handle that with no trouble.
> >>>>
> >>>> In the example, the per-thread share of the cache is 2 MB, and the
> >>>> proposed formula will set the threshold at 1.5 Mbytes. If the total
> >>>> copy size is 96 Mbytes or less, all threads comfortably fit in
> >>>> cache. If the total copy size is over that, then non-temporal stores
> >>>> are used and all is well there too.
> >>>>
> >>>> The current formula would set the threshold at 96 Mbytes for each
> >>>> thread. Only when the total copy size reached 64*96 Mbytes = 6
> >>>> GBytes would non-temporal stores be used. We'd like to switch to
> >>>> non-temporal stores much sooner, as otherwise we will be thrashing
> >>>> all the threads' caches.
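> >>>>
> >>>> A minimal C sketch of that arithmetic follows; the values come from
> >>>> the hypothetical system above, not from measurements on any real
> >>>> machine:
> >>>>
> >>>>   #include <stdio.h>
> >>>>
> >>>>   int
> >>>>   main (void)
> >>>>   {
> >>>>     long per_thread_share = 2L * 1024 * 1024; /* 2 MB per thread.  */
> >>>>     int threads = 64;
> >>>>     /* Proposed: 3/4 of one thread's share, i.e. 1.5 MB.  */
> >>>>     long proposed = per_thread_share * 3 / 4;
> >>>>     /* Current: 3/4 of the whole chip's cache, i.e. 96 MB.  */
> >>>>     long current = per_thread_share * threads * 3 / 4;
> >>>>     printf ("proposed=%ld bytes, current=%ld bytes\n",
> >>>>             proposed, current);
> >>>>     return 0;
> >>>>   }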
> >>>>
> >>>> In practical terms, I've had access to typical memcpy copy lengths
> >>>> for a variety of commercial applications while studying memcpy on
> >>>> Solaris over the years. The vast majority of copies are for 64
> >>>> Kbytes or less. Most modern chips have much more than 64 Kbytes of
> >>>> cache per thread, allowing in-cache copies for the common case, even
> >>>> without borrowing cache from other threads. The occasional really
> >>>> large copies tend to happen when an application is passing a block
> >>>> of data to prepare for a new phase of computation or as a
> >>>> shared-memory communication to another thread. In these cases,
> >>>> having the data remain in cache is usually not relevant, and using
> >>>> non-temporal stores even when they are not strictly required does
> >>>> not have a negative effect on performance.
> >>>>
> >>>> A downside of tuning for a single thread shows up in cloud computing
> >>>> environments, where neighboring threads acting as cache hogs, even
> >>>> if relatively isolated in virtual machines, is a "bad thing" for
> >>>> stable system performance. Whatever we can do to provide consistent,
> >>>> reasonable performance regardless of what the neighboring threads
> >>>> might be doing is a "good thing".
> >>>>
> >>> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
> >>>
> >> I have not tested larger thresholds. I'd be more comfortable with a
> >> smaller one. We could construct specific tests to show either an
> >> advantage or a disadvantage to shifting from 3/4 of the cache to all
> >> of it, depending on what data access pattern was used between memcpy
> >> operations.
> >>
> >> I consider pushing the limit on cache usage to be a risky approach.
> >> Few applications only work on a single block of data. If all threads
> >> are doing a shared copy and they use all the available cache, then
> >> after the memcpy returns, any other active data would have been
> >> pushed out of the cache. That's likely to cost severe performance
> >> loss in more cases than the modest performance gains in the few cases
> >> where the application is only concerned with using the data that was
> >> just copied.
> >>
> >> Just to give a more detailed example where large copies are not
> >> followed by using the data, consider garbage collection followed by
> >> compaction. With a multi-age garbage collector, stable data that is
> >> active and has survived several garbage collections is in an 'old'
> >> region. It does not need to be copied. The current 'new' region is
> >> full but has both referenced and unreferenced data. After the marking
> >> phase, the individual elements of the referenced data are copied to
> >> the base of the 'new' region. When complete, the rest of the 'new'
> >> region becomes the new free pool. The total amount copied may far
> >> exceed the processor cache. Then the application exits garbage
> >> collection and resumes active use of mostly the stable data, with
> >> some accesses to the just-moved new data and fresh allocations. If we
> >> under-use non-temporal stores, we clear the cache and the whole
> >> application runs slower than otherwise.
> >>
> >> Individual memcpy benchmarks are useful for isolation testing and for
> >> comparing code patterns, but they can mislead about overall
> >> application performance in the context of potential cache abuse. I
> >> fell into that tarpit once while tuning memcpy for Solaris: my new,
> >> wonderfully fast copy code (ok, maybe 5% faster for in-cache data)
> >> caused a major customer application to run slower because it abused
> >> the cache. I modified my code to only use the new "in-cache fast
> >> copy" for copies less than a threshold (64 Kbytes or 128 Kbytes, if I
> >> remember right) and all was well.
> >>
> > The new threshold can be substantially smaller with a large core count.
> > Are you saying that even 3 / 4 may be too big?  Is there a reasonable
> > fixed threshold?
> >
>
> I don't have any evidence to say 3/4 is too big for typical
> applications and environments. In 2012, the default for memcpy was set
> to 1/2 of the shared_cache_size, which is still the default for Oracle
> el7 and Red Hat el7.
>
> Given the typically larger caches per thread today than 8 years ago,
> 3/4 may work out well, since the remaining 1/4 of today's larger cache
> is often greater than 1/2 of yesteryear's smaller cache.
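>
> (Illustrative numbers only: with a hypothetical 2 MB per-thread cache
> share today versus 512 KB in 2012, the 1/4 left over by a 3/4
> threshold is 512 KB, which exceeds the 256 KB left over by the old 1/2
> threshold on the smaller cache.)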
>

Please update the comment with your rationale for 3/4. Don't use
"today" or "current"; use 2020 instead.
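
Something along these lines, perhaps; the comment wording below is only
a sketch of the rationale from this thread, while the calculation is
unchanged from your patch:

  /* The default setting of the non_temporal threshold is 3/4 of one
     thread's share of the chip's cache.  As of 2020, a thread's share
     of the cache is typically large enough that this keeps the common
     small-copy case in cache, while preventing a single thread from
     evicting data that other threads on the chip are using.  */
  __x86_shared_non_temporal_threshold
    = (cpu_features->non_temporal_threshold != 0
       ? cpu_features->non_temporal_threshold
       : __x86_shared_cache_size * 3 / 4);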

Thanks.

-- 
H.J.

