[PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold

H.J. Lu hjl.tools@gmail.com
Thu Sep 24 23:57:54 GMT 2020


On Thu, Sep 24, 2020 at 4:22 PM Patrick McGehearty
<patrick.mcgehearty@oracle.com> wrote:
>
>
>
> On 9/24/2020 4:54 PM, H.J. Lu wrote:
> > On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty
> > <patrick.mcgehearty@oracle.com> wrote:
> >>
> >>
> >> On 9/23/2020 6:13 PM, H.J. Lu wrote:
> >>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty
> >>> <patrick.mcgehearty@oracle.com> wrote:
> >>>>
> >>>> On 9/23/2020 4:37 PM, H.J. Lu wrote:
> >>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty
> >>>>> <patrick.mcgehearty@oracle.com> wrote:
> >>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
> >>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
> >>>>>>> <libc-alpha@sourceware.org> wrote:
> >>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
> >>>>>>>> uses non_temporal stores to avoid pushing other data out of the last
> >>>>>>>> level cache.
> >>>>>>>>
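> >>>>>>>> As a minimal illustration of what the threshold gates (a hedged
> >>>>>>>> sketch, not glibc's actual memcpy; the names copy_with_threshold
> >>>>>>>> and nt_threshold are hypothetical, and it assumes AVX plus
> >>>>>>>> 32-byte-aligned buffers whose size is a multiple of 32):
> >>>>>>>>
> >>>>>>>> #include <immintrin.h>
> >>>>>>>> #include <stddef.h>
> >>>>>>>> #include <stdint.h>
> >>>>>>>>
> >>>>>>>> /* Below the threshold, ordinary stores keep the destination hot
> >>>>>>>>    in cache; above it, streaming (non-temporal) stores bypass the
> >>>>>>>>    cache so other cached data is not evicted.  */
> >>>>>>>> static void
> >>>>>>>> copy_with_threshold (uint8_t *dst, const uint8_t *src,
> >>>>>>>>                      size_t size, size_t nt_threshold)
> >>>>>>>> {
> >>>>>>>>   for (size_t i = 0; i < size; i += 32)
> >>>>>>>>     {
> >>>>>>>>       __m256i v = _mm256_load_si256 ((const __m256i *) (src + i));
> >>>>>>>>       if (size < nt_threshold)
> >>>>>>>>         _mm256_store_si256 ((__m256i *) (dst + i), v);
> >>>>>>>>       else
> >>>>>>>>         _mm256_stream_si256 ((__m256i *) (dst + i), v);
> >>>>>>>>     }
> >>>>>>>>   if (size >= nt_threshold)
> >>>>>>>>     _mm_sfence ();  /* order the weakly-ordered streaming stores */
> >>>>>>>> }
> >>>>>>>>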
> >>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's
> >>>>>>>> patch of June 2, 2017.
> >>>>>>>>
> >>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread
> >>>>>>>> getting maximum performance. It was tuned using the single threaded
> >>>>>>>> large memcpy micro benchmark on an 8-core processor. That change
> >>>>>>>> raised the threshold from 3/4 of one thread's share of the cache
> >>>>>>>> to 3/4 of the entire cache of a multi-threaded system before
> >>>>>>>> switching to non-temporal stores. Multi-threaded systems with
> >>>>>>>> more than a few threads are server-class and typically have many
> >>>>>>>> active threads. If one thread consumes 3/4 of the available cache for
> >>>>>>>> all threads, it will cause other active threads to have data removed
> >>>>>>>> from the cache. Two examples show the range of the effect. John
> >>>>>>>> McCalpin's Stream benchmark, which runs many threads in parallel,
> >>>>>>>> each fetching data sequentially, saw a 20% slowdown with this
> >>>>>>>> patch on an internal system test of 128 threads. This regression
> >>>>>>>> was discovered when comparing OL8 performance to OL7.  An example
> >>>>>>>> that compares normal stores to non-temporal stores may be found at
> >>>>>>>> https://vgatherps.github.io/2018-09-02-nontemporal/ .  A simple
> >>>>>>>> test there shows a 4x to 5x slowdown when nontemporal stores are
> >>>>>>>> needed but not used. These performance losses are most likely to
> >>>>>>>> occur when the system load is heaviest and good performance is
> >>>>>>>> critical.
> >>>>>>>>
> >>>>>>>> The tunable x86_non_temporal_threshold, set at process startup
> >>>>>>>> through the GLIBC_TUNABLES environment variable, can be used to
> >>>>>>>> override the default for the knowledgeable user who really wants
> >>>>>>>> maximum cache allocation to a single thread in a multi-threaded
> >>>>>>>> system. The manual entry for the tunable has been expanded to
> >>>>>>>> provide more information about its purpose.
> >>>>>>>>
> >>>>>>>>             modified: sysdeps/x86/cacheinfo.c
> >>>>>>>>             modified: manual/tunables.texi
> >>>>>>>> ---
> >>>>>>>>      manual/tunables.texi    |  6 +++++-
> >>>>>>>>      sysdeps/x86/cacheinfo.c | 12 +++++++-----
> >>>>>>>>      2 files changed, 12 insertions(+), 6 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
> >>>>>>>> index b6bb54d..94d4fbd 100644
> >>>>>>>> --- a/manual/tunables.texi
> >>>>>>>> +++ b/manual/tunables.texi
> >>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
> >>>>>>>>
> >>>>>>>>      @deftp Tunable glibc.tune.x86_non_temporal_threshold
> >>>>>>>>      The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
> >>>>>>>> -to set threshold in bytes for non temporal store.
> >>>>>>>> +to set the threshold in bytes for non-temporal stores. Non-temporal
> >>>>>>>> +stores give a hint to the hardware to move data directly to memory
> >>>>>>>> +without displacing other data from the cache. This tunable is used by
> >>>>>>>> +some platforms to determine when to use non-temporal stores in
> >>>>>>>> +operations like memmove and memcpy.
> >>>>>>>>
> >>>>>>>>      This tunable is specific to i386 and x86-64.
> >>>>>>>>      @end deftp
> >>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> >>>>>>>> index b9444dd..c6767d9 100644
> >>>>>>>> --- a/sysdeps/x86/cacheinfo.c
> >>>>>>>> +++ b/sysdeps/x86/cacheinfo.c
> >>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
> >>>>>>>>            __x86_shared_cache_size = shared;
> >>>>>>>>          }
> >>>>>>>>
> >>>>>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
> >>>>>>>> -     shared cache size is the approximate value above which non-temporal
> >>>>>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
> >>>>>>>> -     total shared cache size.  */
> >>>>>>>> +  /* The default setting for the non_temporal threshold is 3/4
> >>>>>>>> +     of one thread's share of the chip's cache. While higher
> >>>>>>>> +     single thread performance may be observed with a higher
> >>>>>>>> +     threshold, having a single thread use more than its share
> >>>>>>>> +     of the cache will negatively impact the performance of
> >>>>>>>> +     other threads running on the chip. */
> >>>>>>>>        __x86_shared_non_temporal_threshold
> >>>>>>>>          = (cpu_features->non_temporal_threshold != 0
> >>>>>>>>             ? cpu_features->non_temporal_threshold
> >>>>>>>> -       : __x86_shared_cache_size * threads * 3 / 4);
> >>>>>>>> +       : __x86_shared_cache_size * 3 / 4);
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>> Can we tune it with the number of threads and/or total cache
> >>>>>>> size?
> >>>>>>>
> >>>>>> When you say "total cache size", is that different from
> >>>>>> shared_cache_size * threads?
> >>>>>>
> >>>>>> I see a fundamental conflict of optimization goals:
> >>>>>> 1) Provide best single thread performance (current code)
> >>>>>> 2) Provide best overall system performance under full load (proposed patch)
> >>>>>> I don't know of any way to have default behavior meet both goals
> >>>>>> without knowledge of the system size/usage/requirements.
> >>>>>>
> >>>>>> Consider a hypothetical single-chip system with 64 threads and
> >>>>>> 128 MB of total cache on the chip. That won't be uncommon in the
> >>>>>> coming years on server-class systems, especially in large databases
> >>>>>> or HPC environments (think vision processing or weather modeling,
> >>>>>> for example). Suppose a single app owns the whole chip, runs
> >>>>>> multi-threaded, and needs to memcpy a really large block of data
> >>>>>> when one phase of computation finishes before moving to the next
> >>>>>> phase. A common practice would be to have 64 parallel calls to
> >>>>>> memcpy, as in the sketch below; the Stream benchmark, which uses
> >>>>>> OpenMP, demonstrates that current compilers handle that pattern
> >>>>>> with no trouble.
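> >>>>>>
> >>>>>> A minimal sketch of that pattern (hypothetical code, assuming
> >>>>>> OpenMP support and a total size evenly divisible by the thread
> >>>>>> count):
> >>>>>>
> >>>>>> #include <stddef.h>
> >>>>>> #include <string.h>
> >>>>>>
> >>>>>> /* Each of nthreads threads copies its own slice of one large
> >>>>>>    buffer; compile with -fopenmp.  */
> >>>>>> static void
> >>>>>> parallel_copy (char *dst, const char *src, size_t total, int nthreads)
> >>>>>> {
> >>>>>>   size_t chunk = total / nthreads;
> >>>>>> #pragma omp parallel for num_threads(nthreads)
> >>>>>>   for (int t = 0; t < nthreads; t++)
> >>>>>>     memcpy (dst + (size_t) t * chunk, src + (size_t) t * chunk, chunk);
> >>>>>> }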
> >>>>>>
> >>>>>> In the example, the per-thread share of the cache is 2 MB and the
> >>>>>> proposed formula will set the threshold at 1.5 MBytes. If the total
> >>>>>> copy size is 96 MBytes or less, all threads comfortably fit in
> >>>>>> cache. If the total copy size is over that, then non-temporal
> >>>>>> stores are used and all is well there too.
> >>>>>>
> >>>>>> The current formula would set the threshold at 96 MBytes for each
> >>>>>> thread. Only when the total copy size reached 64*96 MBytes = 6
> >>>>>> GBytes would non-temporal stores be used. We'd like to switch to
> >>>>>> non-temporal stores much sooner, as we will be thrashing all the
> >>>>>> threads' caches.
> >>>>>>
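> >>>>>> A quick back-of-the-envelope program (hypothetical, using the
> >>>>>> numbers from the example above) makes the gap concrete:
> >>>>>>
> >>>>>> #include <stdio.h>
> >>>>>>
> >>>>>> int
> >>>>>> main (void)
> >>>>>> {
> >>>>>>   /* Hypothetical chip: 128 MB total cache, 64 threads, so one
> >>>>>>      thread's share (__x86_shared_cache_size) is 2 MB.  */
> >>>>>>   long long share = 128LL * 1024 * 1024 / 64;
> >>>>>>   long long threads = 64;
> >>>>>>   printf ("2017 formula: %lld KB\n", share * threads * 3 / 4 / 1024);
> >>>>>>   printf ("proposed:     %lld KB\n", share * 3 / 4 / 1024);
> >>>>>>   /* Prints 98304 KB (96 MB) versus 1536 KB (1.5 MB).  */
> >>>>>>   return 0;
> >>>>>> }
> >>>>>>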
> >>>>>> In practical terms, I've had access to typical memcpy copy lengths
> >>>>>> for a variety of commercial applications while studying memcpy on
> >>>>>> Solaris over the years. The vast majority of copies are for
> >>>>>> 64Kbytes or less. Most modern chips have much more than 64Kbytes
> >>>>>> of cache per thread, allowing in-cache copies for the common case,
> >>>>>> even without borrowing cache from other threads. The occasional
> >>>>>> really large copies tend to happen when an application is passing
> >>>>>> a block of data to prepare for a new phase of computation or as a
> >>>>>> shared-memory communication to another thread. In these cases,
> >>>>>> having the data remain in cache is usually not relevant, and using
> >>>>>> non-temporal stores even when they are not strictly required does
> >>>>>> not have a negative effect on performance.
> >>>>>>
> >>>>>> A downside of tuning for a single thread comes in cloud computing
> >>>>>> environments, where having neighboring threads act as cache hogs,
> >>>>>> even if relatively isolated in virtual machines, is a "bad thing"
> >>>>>> for stable system performance. Whatever we can do to provide
> >>>>>> consistent, reasonable performance whatever the neighboring threads
> >>>>>> might be doing is a "good thing".
> >>>>>>
> >>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
> >>>>>
> >>>> I have not tested larger thresholds. I'd be more comfortable with a
> >>>> smaller one. We could construct specific tests to show either an
> >>>> advantage or a disadvantage to shifting from 3/4 of the cache to all
> >>>> of it, depending on what data access pattern was used between memcpy
> >>>> operations.
> >>>>
> >>>> I consider pushing the limit on cache usage to be a risky approach.
> >>>> Few applications only work on a single block of data.  If all
> >>>> threads are doing a shared copy and they use all the available
> >>>> cache, then after the memcpy returns, any other active data would
> >>>> have been pushed out of the cache. That's likely to cause severe
> >>>> performance loss in more cases than the modest performance gains in
> >>>> the few cases where the application is only concerned with using the
> >>>> data that was just copied.
> >>>>
> >>>> To give a more detailed example where large copies are not followed
> >>>> by use of the data, consider garbage collection followed by
> >>>> compaction. With a multi-age garbage collector, stable data that is
> >>>> active and has survived several garbage collections is in an 'old'
> >>>> region. It does not need to be copied. The current 'new' region is
> >>>> full but has both referenced and unreferenced data. After the
> >>>> marking phase, the individual elements of the referenced data are
> >>>> copied to the base of the 'new' region. When complete, the rest of
> >>>> the 'new' region becomes the new free pool. The total amount copied
> >>>> may far exceed the processor cache.  The application then exits
> >>>> garbage collection and resumes active use of mostly the stable
> >>>> data, with some accesses to the just-moved new data and fresh
> >>>> allocations. If we under-use non-temporal stores, we clear the
> >>>> cache and the whole application runs slower than otherwise.
> >>>>
> >>>> Individual memcpy benchmarks are useful for isolation testing and
> >>>> comparing code patterns but can mislead about overall application
> >>>> performance where there is potential for cache abuse. I fell into
> >>>> that tarpit once while tuning memcpy for Solaris, when my new,
> >>>> wonderfully fast copy code (ok, maybe 5% faster for in-cache data)
> >>>> caused a major customer application to run slower because it abused
> >>>> the cache.  I modified my code to only use the new "in-cache fast
> >>>> copy" for copies less than a threshold (64Kbytes or 128Kbytes, if I
> >>>> remember right) and all was well.
> >>>>
> >>> The new threshold can be substantially smaller with a large core count.
> >>> Are you saying that even 3 / 4 may be too big?  Is there a reasonable
> >>> fixed threshold?
> >>>
> >> I don't have any evidence to say 3/4 is too big for typical
> >> applications and environments. In 2012, the default for memcpy was
> >> set to 1/2 of the shared_cache_size, which remains the default for
> >> Oracle EL7 and Red Hat EL7.
> >>
> >> Given the typically larger caches per thread today than 8 years ago,
> >> 3/4 may work out well, since the remaining 1/4 of today's larger
> >> cache is often greater than 1/2 of yesteryear's smaller cache.
> >>
> > Please update the comment with your rationale for 3/4.  Don't use
> > "today" or "current"; use 2020 instead.
> >
> > Thanks.
> >
> I'm unsure about what needs to change in the comment, which does not
> mention any dates currently. I'm assuming you are referring to the
> following comment in cacheinfo.c:
>
>    /* The default setting for the non_temporal threshold is 3/4
>       of one thread's share of the chip's cache. While higher
>       single thread performance may be observed with a higher
>       threshold, having a single thread use more than its share
>       of the cache will negatively impact the performance of
>       other threads running on the chip. */
>
> While I could add a comment on why 3/4 vs 1/2 is the best choice, I
> don't have hard data to back it up. I'd be comfortable with either 3/4
> or 1/2. I selected 3/4 because it was closer to the formula you chose
> in 2017 than to the formula you chose in 2012.

The comment is for readers 5 years from now who may be wondering
where 3/4 came from.  Just add something close to what you have said above.
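
For example, a sketch (illustrative wording only, assembled from the
rationale you stated above, not committed text):

    /* The default setting for the non_temporal threshold is 3/4 of one
       thread's share of the chip's cache.  The default set in 2012 was
       1/2 of one thread's share; because caches per thread are typically
       larger in 2020, the 1/4 of a thread's share now left in reserve is
       often bigger than the 1/2 left in reserve in 2012.  Letting a
       single thread consume more than its share of the cache would hurt
       the other threads running on the chip.  */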

-- 
H.J.

