[PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold

Patrick McGehearty patrick.mcgehearty@oracle.com
Wed Sep 23 22:39:15 GMT 2020



On 9/23/2020 4:37 PM, H.J. Lu wrote:
> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty
> <patrick.mcgehearty@oracle.com> wrote:
>>
>>
>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
>>>> uses non_temporal stores to avoid pushing other data out of the last
>>>> level cache.
>>>>
>>>> This patch proposes to revert the calculation change made by H.J. Lu's
>>>> patch of June 2, 2017.
>>>>
>>>> H.J. Lu's patch selected a threshold suitable for a single thread
>>>> getting maximum performance. It was tuned using the single threaded
>>>> large memcpy micro benchmark on an 8-core processor. That change moved
>>>> the threshold from 3/4 of one thread's share of the cache to 3/4 of the
>>>> entire cache of a multi-threaded system
>>>> before switching to non-temporal stores. Multi-threaded systems with
>>>> more than a few threads are server-class and typically have many
>>>> active threads. If one thread consumes 3/4 of the available cache for
>>>> all threads, it will cause other active threads to have data removed
>>>> from the cache. Two examples show the range of the effect. John
>>>> McCalpin's widely parallel Stream benchmark, which runs in parallel
>>>> and fetches data sequentially, saw a 20% slowdown with that change on
>>>> an internal system test of 128 threads. This regression was discovered
>>>> when comparing OL8 performance to OL7.  An example that compares
>>>> normal stores to non-temporal stores may be found at
>>>> https://vgatherps.github.io/2018-09-02-nontemporal/ .  A simple test
>>>> shows a 4x to 5x slowdown due to a failure to use non-temporal
>>>> stores. These performance losses are most likely to occur
>>>> when the system load is heaviest and good performance is critical.
>>>>
>>>> The tunable x86_non_temporal_threshold can be used to override the
>>>> default for the knowledgeable user who really wants maximum cache
>>>> allocation to a single thread in a multi-threaded system.
>>>> The manual entry for the tunable has been expanded to provide
>>>> more information about its purpose.
>>>>
>>>>           modified: sysdeps/x86/cacheinfo.c
>>>>           modified: manual/tunables.texi
>>>> ---
>>>>    manual/tunables.texi    |  6 +++++-
>>>>    sysdeps/x86/cacheinfo.c | 12 +++++++-----
>>>>    2 files changed, 12 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>>>> index b6bb54d..94d4fbd 100644
>>>> --- a/manual/tunables.texi
>>>> +++ b/manual/tunables.texi
>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>>>>
>>>>    @deftp Tunable glibc.tune.x86_non_temporal_threshold
>>>>    The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
>>>> -to set threshold in bytes for non temporal store.
>>>> +to set threshold in bytes for non temporal store. Non temporal stores
>>>> +give a hint to the hardware to move data directly to memory without
>>>> +displacing other data from the cache. This tunable is used by some
>>>> +platforms to determine when to use non temporal stores in operations
>>>> +like memmove and memcpy.
>>>>
>>>>    This tunable is specific to i386 and x86-64.
>>>>    @end deftp
>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
>>>> index b9444dd..c6767d9 100644
>>>> --- a/sysdeps/x86/cacheinfo.c
>>>> +++ b/sysdeps/x86/cacheinfo.c
>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
>>>>          __x86_shared_cache_size = shared;
>>>>        }
>>>>
>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
>>>> -     shared cache size is the approximate value above which non-temporal
>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
>>>> -     total shared cache size.  */
>>>> +  /* The default setting for the non_temporal threshold is 3/4
>>>> +     of one thread's share of the chip's cache. While higher
>>>> +     single thread performance may be observed with a higher
>>>> +     threshold, having a single thread use more than its share
>>>> +     of the cache will negatively impact the performance of
>>>> +     other threads running on the chip. */
>>>>      __x86_shared_non_temporal_threshold
>>>>        = (cpu_features->non_temporal_threshold != 0
>>>>           ? cpu_features->non_temporal_threshold
>>>> -       : __x86_shared_cache_size * threads * 3 / 4);
>>>> +       : __x86_shared_cache_size * 3 / 4);
>>>>    }
>>>>
>>> Can we tune it with the number of threads and/or total cache
>>> size?
>>>
>> When you say "total cache size", is that different from
>> shared_cache_size * threads?
>>
>> I see a fundamental conflict of optimization goals:
>> 1) Provide best single thread performance (current code)
>> 2) Provide best overall system performance under full load (proposed patch)
>> I don't know of any way to have default behavior meet both goals without
>> knowledge of the system size/usage/requirements.
>>
>> Consider a hypothetical single-chip system with 64 threads and 128 MB of
>> total cache on the chip. That won't be uncommon in the coming years on
>> server-class systems, especially in large database or HPC environments
>> (think vision processing or weather modeling, for example). Suppose a
>> single multi-threaded application owns the whole chip and needs to
>> memcpy a really large block of data when one phase of computation
>> finishes, before moving to the next phase. A common practice would be to
>> have 64 parallel calls to memcpy. The Stream benchmark demonstrates with
>> OpenMP that current compilers handle that with no trouble.
>>
>> In the example, the per-thread share of the cache is 2 MB and the
>> proposed formula sets the threshold at 1.5 Mbytes. If the total copy
>> size is 96 Mbytes or less, all threads comfortably fit in cache. If the
>> total copy size is over that, then non-temporal stores are used and all
>> is well there too.
>>
>> The current formula would set the threshold at 96 Mbytes for each
>> thread. Only when the total copy size reached 64*96 Mbytes = 6 GBytes
>> would non-temporal stores be used. We'd like to switch to non-temporal
>> stores much sooner, since long before that point we will be thrashing
>> all the threads' caches.
>>
>> In practical terms, I've had access to typical memcpy copy lengths for a
>> variety of commercial applications while studying memcpy on Solaris over
>> the years. The vast majority of copies are for 64 Kbytes or less. Most
>> modern chips have much more than 64 Kbytes of cache per thread, allowing
>> in-cache copies for the common case, even without borrowing cache from
>> other threads. The occasional really large copies tend to happen when an
>> application is passing a block of data to prepare for a new phase of
>> computation or as a shared-memory communication to another thread. In
>> these cases, having the data remain in cache is usually not relevant,
>> and using non-temporal stores even when they are not strictly required
>> does not have a negative effect on performance.
>>
>> A downside of tuning for a single thread shows up in cloud computing
>> environments, where neighboring threads that are cache hogs, even if
>> relatively isolated in virtual machines, are a "bad thing" for stable
>> system performance. Whatever we can do to provide consistent, reasonable
>> performance regardless of what the neighboring threads might be doing is
>> a "good thing".
>>
> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
>

I have not tested larger thresholds. I'd be more comfortable with a
smaller one. We could construct specific tests to show either an advantage
or a disadvantage in shifting from 3/4 of the cache to all of it, depending
on what data accesses occur between the memcpy operations.
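
For completeness, anyone who really does want the old single-thread
behavior can already get it with the tunable named in the manual entry
above; for example (the value here is only an illustration, 96 Mbytes):

  GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=100663296 ./app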

I consider pushing the limit on cache usage to be a risky approach. Few
applications work on only a single block of data.  If all threads are doing
a shared copy and they use all the available cache, then after the memcpy
returns, any other active data will have been pushed out of the cache.
That's likely to cause a severe performance loss in more cases than it
provides modest performance gains, i.e. the few cases where the application
is only concerned with using the data that was just copied.
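
To make the arithmetic from the 64-thread example above concrete, here is
a tiny standalone sketch (using the hypothetical cache sizes from that
example, not the actual cacheinfo.c code) of what the two formulas compute:

#include <stdio.h>

int
main (void)
{
  /* Hypothetical chip from the example: 64 threads sharing 128 MB of
     last-level cache, so a 2 MB share per thread.  */
  unsigned long per_thread_share = 2UL * 1024 * 1024;
  unsigned long threads = 64;

  /* Current formula: 3/4 of the entire chip's cache.  */
  unsigned long current_threshold = per_thread_share * threads * 3 / 4;

  /* Proposed formula: 3/4 of one thread's share.  */
  unsigned long proposed_threshold = per_thread_share * 3 / 4;

  printf ("current:  %lu bytes (96 Mbytes)\n", current_threshold);
  printf ("proposed: %lu bytes (1.5 Mbytes)\n", proposed_threshold);
  return 0;
}

With 64 threads copying in parallel, the current formula does not switch
to non-temporal stores until the combined copies approach 64 * 96 Mbytes
= 6 GBytes; the proposed formula switches as soon as each thread's copy
exceeds its own share of the cache.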

To give a more detailed example where large copies are not followed by use
of the copied data, consider garbage collection followed by compaction.
With a generational garbage collector, stable data that is active and has
survived several garbage collections is in an 'old' region. It does not
need to be copied. The current 'new' region is full but has both referenced
and unreferenced data. After the marking phase, the individual elements of
the referenced data are copied to the base of the 'new' region. When that
is complete, the rest of the 'new' region becomes the new free pool. The
total amount copied may far exceed the processor cache.  Then the
application exits garbage collection and resumes active use of mostly the
stable data, with some accesses to the just-moved new data and fresh
allocations. If we under-use non-temporal stores during that copy, we
clear the cache and the whole application runs slower than it otherwise
would.

Individual memcpy benchmarks are useful for isolated testing and for
comparing code patterns, but they can be misleading about overall
application performance when there is potential for cache abuse. I fell
into that tarpit once while tuning memcpy for Solaris: my new, wonderfully
fast copy code (ok, maybe 5% faster for in-cache data) caused a major
customer application to run slower because the new code abused the cache.
I modified my code to only use the new "in-cache fast copy" for copies
below a threshold (64 Kbytes or 128 Kbytes, if I remember right) and all
was well.
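
That same threshold dispatch, in miniature (just a sketch with SSE2
intrinsics and a made-up cutoff, assuming a 16-byte-aligned destination
and a size that is a multiple of 16 -- not glibc's actual memcpy code):

#include <string.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */

/* Made-up cutoff; in glibc this role is played by
   __x86_shared_non_temporal_threshold and the tunable above.  */
static const size_t nt_threshold = 3UL * 512 * 1024;  /* 1.5 Mbytes */

static void
copy_sketch (void *dst, const void *src, size_t n)
{
  if (n < nt_threshold)
    {
      /* Small or medium copy: ordinary stores, data stays in cache.  */
      memcpy (dst, src, n);
      return;
    }

  /* Large copy: streaming stores bypass the cache so other threads'
     working sets are not evicted.  */
  __m128i *d = dst;
  const __m128i *s = src;
  for (size_t i = 0; i < n / 16; i++)
    _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
  _mm_sfence ();  /* order the streaming stores before later accesses */
}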

- patrick


