[PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold

Patrick McGehearty patrick.mcgehearty@oracle.com
Thu Sep 24 21:47:57 GMT 2020



On 9/23/2020 6:13 PM, H.J. Lu wrote:
> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty
> <patrick.mcgehearty@oracle.com> wrote:
>>
>>
>> On 9/23/2020 4:37 PM, H.J. Lu wrote:
>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty
>>> <patrick.mcgehearty@oracle.com> wrote:
>>>>
>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
>>>>> <libc-alpha@sourceware.org> wrote:
>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
>>>>>> uses non_temporal stores to avoid pushing other data out of the last
>>>>>> level cache.
>>>>>>
>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's
>>>>>> patch of June 2, 2017.
>>>>>>
>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread
>>>>>> getting maximum performance. It was tuned using the single threaded
>>>>>> large memcpy micro benchmark on an 8-core processor. That change
>>>>>> raised the threshold from 3/4 of one thread's share of the cache to
>>>>>> 3/4 of the entire cache of a multi-threaded system before switching
>>>>>> to non-temporal stores. Multi-threaded systems with
>>>>>> more than a few threads are server-class and typically have many
>>>>>> active threads. If one thread consumes 3/4 of the available cache for
>>>>>> all threads, it will cause other active threads to have data removed
>>>>>> from the cache. Two examples show the range of the effect. John
>>>>>> McCalpin's highly parallel STREAM benchmark, which runs in parallel
>>>>>> and fetches data sequentially, saw a 20% slowdown from the 2017
>>>>>> change in an internal test on a 128-thread system. This regression
>>>>>> was discovered when comparing OL8 performance to OL7.  An example
>>>>>> comparing normal stores to non-temporal stores may be found at
>>>>>> https://vgatherps.github.io/2018-09-02-nontemporal/ .  A simple test
>>>>>> there shows a 4x to 5x slowdown from failing to use non-temporal
>>>>>> stores. These performance losses are most likely to occur when the
>>>>>> system load is heaviest and good performance is critical.
>>>>>>
>>>>>> The tunable x86_non_temporal_threshold can be used to override the
>>>>>> default for the knowledgeable user who really wants maximum cache
>>>>>> allocation to a single thread in a multi-threaded system.
>>>>>> The manual entry for the tunable has been expanded to provide
>>>>>> more information about its purpose.
>>>>>>
>>>>>>            modified: sysdeps/x86/cacheinfo.c
>>>>>>            modified: manual/tunables.texi
>>>>>> ---
>>>>>>     manual/tunables.texi    |  6 +++++-
>>>>>>     sysdeps/x86/cacheinfo.c | 12 +++++++-----
>>>>>>     2 files changed, 12 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>>>>>> index b6bb54d..94d4fbd 100644
>>>>>> --- a/manual/tunables.texi
>>>>>> +++ b/manual/tunables.texi
>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>>>>>>
>>>>>>     @deftp Tunable glibc.tune.x86_non_temporal_threshold
>>>>>>     The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
>>>>>> -to set threshold in bytes for non temporal store.
>>>>>> +to set the threshold in bytes for non-temporal stores. Non-temporal
>>>>>> +stores give a hint to the hardware to move data directly to memory
>>>>>> +without displacing other data from the cache. This tunable is used by
>>>>>> +some platforms to determine when to use non-temporal stores in
>>>>>> +operations like memmove and memcpy.
>>>>>>
>>>>>>     This tunable is specific to i386 and x86-64.
>>>>>>     @end deftp
>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
>>>>>> index b9444dd..c6767d9 100644
>>>>>> --- a/sysdeps/x86/cacheinfo.c
>>>>>> +++ b/sysdeps/x86/cacheinfo.c
>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
>>>>>>           __x86_shared_cache_size = shared;
>>>>>>         }
>>>>>>
>>>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
>>>>>> -     shared cache size is the approximate value above which non-temporal
>>>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
>>>>>> -     total shared cache size.  */
>>>>>> +  /* The default setting for the non_temporal threshold is 3/4
>>>>>> +     of one thread's share of the chip's cache. While higher
>>>>>> +     single thread performance may be observed with a higher
>>>>>> +     threshold, having a single thread use more than its share
>>>>>> +     of the cache will negatively impact the performance of
>>>>>> +     other threads running on the chip. */
>>>>>>       __x86_shared_non_temporal_threshold
>>>>>>         = (cpu_features->non_temporal_threshold != 0
>>>>>>            ? cpu_features->non_temporal_threshold
>>>>>> -       : __x86_shared_cache_size * threads * 3 / 4);
>>>>>> +       : __x86_shared_cache_size * 3 / 4);
>>>>>>     }
>>>>>>
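To make the manual text above concrete, here is a minimal standalone
sketch of a streaming copy built on the SSE2 non-temporal store
intrinsic. The helper name and the 1.5 MB cutoff are illustrative
assumptions, not glibc's implementation:

#include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative cutoff standing in for the tunable; not a measured value.  */
#define NT_THRESHOLD (3UL << 19)   /* 1.5 MB */

static void *
copy_maybe_non_temporal (void *dst, const void *src, size_t n)
{
  /* Small or misaligned copies: let the regular, cache-allocating
     memcpy run.  */
  if (n < NT_THRESHOLD || ((uintptr_t) dst & 15) != 0)
    return memcpy (dst, src, n);

  /* Large copies: stream 16-byte chunks past the cache, then fence so
     the weakly-ordered streaming stores are visible before returning.  */
  char *d = dst;
  const char *s = src;
  while (n >= 16)
    {
      __m128i v;
      memcpy (&v, s, 16);                   /* unaligned-safe load */
      _mm_stream_si128 ((__m128i *) d, v);  /* store bypassing the cache */
      d += 16;
      s += 16;
      n -= 16;
    }
  if (n > 0)
    memcpy (d, s, n);                       /* copy the remaining tail */
  _mm_sfence ();
  return dst;
}
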
>>>>> Can we tune it with the number of threads and/or total cache
>>>>> size?
>>>>>
>>>> When you say "total cache size", is that different from
>>>> shared_cache_size * threads?
>>>>
>>>> I see a fundamental conflict of optimization goals:
>>>> 1) Provide best single thread performance (current code)
>>>> 2) Provide best overall system performance under full load (proposed patch)
>>>> I don't know of any way to have the default behavior meet both goals
>>>> without knowledge of the system size/usage/requirements.
>>>>
>>>> Consider a hypothetical single-chip system with 64 threads and 128 MB
>>>> of total cache on the chip. That won't be uncommon in the coming
>>>> years on server-class systems, especially in large databases or HPC
>>>> environments (think vision processing or weather modeling, for
>>>> example). Suppose a single app owns the whole chip, runs a
>>>> multi-threaded application, and needs to memcpy a really large block
>>>> of data when one phase of computation finishes before moving to the
>>>> next phase. A common practice would be to issue 64 parallel calls to
>>>> memcpy. The Stream benchmark demonstrates with OpenMP that current
>>>> compilers handle that with no trouble.
>>>>
>>>> In the example, the per-thread share of the cache is 2 MB and the
>>>> proposed formula will set the threshold at 1.5 MB. If the total copy
>>>> size is 96 MB or less, all threads comfortably fit in cache. If the
>>>> total copy size is over that, then non-temporal stores are used and
>>>> all is well there too.
>>>>
>>>> The current formula would set the threshold at 96 MB for each thread.
>>>> Only when the total copy size reached 64 * 96 MB = 6 GB would
>>>> non-temporal stores be used. We'd like to switch to non-temporal
>>>> stores much sooner, as we would otherwise be thrashing all the
>>>> threads' caches.
>>>>
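To spell out the arithmetic above, a small self-contained check of the
hypothetical numbers (all constants are assumptions from the example,
not measurements):

#include <assert.h>

int
main (void)
{
  long long threads = 64;
  long long total_cache = 128LL << 20;            /* 128 MB on-chip cache */
  long long share = total_cache / threads;        /* 2 MB per thread */
  long long new_threshold = share * 3 / 4;        /* 1.5 MB: proposed */
  long long old_threshold = total_cache * 3 / 4;  /* 96 MB: current */

  /* 64 threads each copying up to the proposed threshold still fit:
     64 * 1.5 MB = 96 MB <= 128 MB of cache.  */
  assert (threads * new_threshold <= total_cache);

  /* Under the current formula, non-temporal stores only kick in once a
     single thread copies 96 MB; 64 such copies touch 6 GB of data while
     thrashing every thread's cache.  */
  assert (threads * old_threshold == 6LL << 30);
  return 0;
}
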
>>>> In practical terms, I've had access to typical memcpy copy lengths
>>>> for a variety of commercial applications while studying memcpy on
>>>> Solaris over the years. The vast majority of copies are 64 KB or
>>>> less. Most modern chips have much more than 64 KB of cache per
>>>> thread, allowing in-cache copies for the common case, even without
>>>> borrowing cache from other threads. The occasional really large
>>>> copies tend to occur when an application is passing a block of data
>>>> to prepare for a new phase of computation, or as a shared-memory
>>>> communication to another thread. In these cases, having the data
>>>> remain in cache is usually not relevant, and using non-temporal
>>>> stores even when they are not strictly required does not have a
>>>> negative effect on performance.
>>>>
>>>> A downside of tuning for a single thread shows up in cloud computing
>>>> environments, where neighboring threads acting as cache hogs, even if
>>>> relatively isolated in virtual machines, is a "bad thing" for stable
>>>> system performance. Whatever we can do to provide consistent,
>>>> reasonable performance regardless of what the neighboring threads
>>>> might be doing is a "good thing".
>>>>
>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
>>>
>> I have not tested larger thresholds. I'd be more comfortable with a
>> smaller one. We could construct specific tests to show either an
>> advantage or a disadvantage to shifting from 3/4 to all of the cache,
>> depending on what data accesses occur between memcpy operations.
>>
>> I consider pushing the limit on cache usage to be a risky approach.
>> Few applications only work on a single block of data.  If all threads
>> are doing a shared copy and they use all the available cache, then
>> after the memcpy returns, any other active data will have been pushed
>> out of the cache. That's likely to cause severe performance loss in
>> more cases than the modest performance gains in the few cases where
>> the application is only concerned with the data that was just copied.
>>
>> To give a more detailed example where large copies are not followed by
>> use of the data, consider garbage collection followed by compaction.
>> With a multi-generation garbage collector, stable data that is active
>> and has survived several garbage collections sits in an 'old' region
>> and does not need to be copied. The current 'new' region is full but
>> holds both referenced and unreferenced data. After the marking phase,
>> the individual elements of the referenced data are copied to the base
>> of the 'new' region. When that completes, the rest of the 'new' region
>> becomes the new free pool. The total amount copied may far exceed the
>> processor cache.  The application then exits garbage collection and
>> resumes active use of mostly the stable data, with some accesses to
>> the just-moved new data and fresh allocations. If we under-use
>> non-temporal stores, we clear the cache and the whole application runs
>> slower than otherwise.
>>
>> Individual memcpy benchmarks are useful for isolated testing and for
>> comparing code patterns, but they can mislead about overall
>> application performance where there is potential for cache abuse. I
>> fell into that tarpit once while tuning memcpy for Solaris, when my
>> new, wonderfully fast copy code (ok, maybe 5% faster for in-cache
>> data) caused a major customer application to run slower because it
>> abused the cache.  I modified my code to only use the new "in-cache
>> fast copy" for copies less than a threshold (64 KB or 128 KB, if I
>> remember right) and all was well.
>>
> The new threshold can be substantially smaller with large core count.
> Are you saying that even 3 / 4 may be too big?  Is there a reasonable
> fixed threshold?
>

I don't have any evidence to say 3/4 is too big for typical applications
and environments. In 2012, the default for memcpy was set to 1/2 of the
shared_cache_size, which is still the default for Oracle EL7 and Red Hat
EL7.

Given the typically larger caches per thread today than 8 years ago, 3/4
may work out well, since the remaining 1/4 of today's larger cache is
often greater than 1/2 of yesteryear's smaller cache.
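
For comparison, the three defaults discussed in this thread, written out
as one sketch function. This is not the glibc source; following
cacheinfo.c, `shared' is one thread's share of the last-level cache and
`threads' the thread count:

/* Sketch of the default non-temporal threshold under each scheme.  */
static unsigned long
nt_threshold_default (unsigned long shared, unsigned long threads,
                      int scheme)
{
  switch (scheme)
    {
    case 2012:
      return shared / 2;               /* el7-era default */
    case 2017:
      return shared * threads * 3 / 4; /* one thread may fill 3/4 of
                                          the entire chip's cache */
    default:
      return shared * 3 / 4;           /* this patch: 3/4 of one
                                          thread's share */
    }
}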

- patrick


