This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 0/3] sparc M7 optimized memcpy/memset


On 10/5/2017 7:19 AM, Adhemerval Zanella wrote:

> On 04/10/2017 18:07, Patrick McGehearty wrote:
>> On 9/27/2017 11:41 PM, David Miller wrote:
>>> From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
>>> Date: Wed, 27 Sep 2017 16:09:56 -0400
>>>
>>>> The following patch set provides optimized versions of
>>>> memcpy/mempcpy/memmove/memset/bzero for Sparc versions M7 (and later).
>>>> Support for recognizing M7 is also provided.
>>>> An assembly version of memmove for ultra1+ is provided.
>>>>
>>>> Jose E. Marchesi (2):
>>>>     sparc: support the ADP hw capability.
>>>>     sparc: assembly version of memmove for ultra1+
>>>>
>>>> Patrick McGehearty (1):
>>>>     sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero.
>>> Looks good to me.
>> Could someone install the 9/28/2017 sparc M7 optimized memcpy/memset
>> patches to the repository?
>>
>> As a new submitter, I don't have permissions.
> What about https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html ?
> I am finishing up the ifunc C implementation and I plan to send it
> to review today.
In https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html,
I see three issues which I will cover here.

Performance:

I reported:
>>> For memcpy/mempcpy/memmove, performance comparison with niagara4 code:
>>> Long word aligned data
>>>    0-127 bytes - minimal changes
>>>    128-1023 bytes - 7-30% gain
>>>    1024+ bytes - 1-7% gain (in cache); 30-100% gain (out of cache)
>>> Word aligned data
>>>    0-127 bytes - 50%+ gain
>>>    128-1023 bytes - 10-200% gain
>>>    1024+ bytes - 0-15% gain (in cache); 5-50% gain (out of cache)
>>> Unaligned data
>>>    0-127 bytes - 0-70%+ gain
>>>    128-447 bytes - 40-80%+ gain
>>>    448-511 bytes - 1-3% loss
>>>    512-4096 bytes - 2-3% gain (in cache); 0-20% gain (out of cache)
>>>    4096+ bytes - +/- 3% (in cache); 20-50% gain (out of cache)
>>>
>>> For memset/bzero, performance comparison with niagara4 code:
>>> For memset nonzero data,
>>>    256-1023 bytes - 60-90% gain (in cache); 5% gain (out of cache)
>>>    1K+ bytes - 80-260% gain (in cache); 40-80% gain (out of cache)
>>> For memset zero data (and bzero),
>>>    256-1023 bytes - 80-120% gain (in cache), 0% gain (out of cache)
>>>    1024+ bytes - 2-4x gain (in cache), 10-35% gain (out of cache)

Adhemerval asked:
> Which benchmark did you use to get these values? If they were obtained
> with the benchtests, please attach the resulting files.  If not, please
> either indicate how to reproduce the data or work to improve benchtests
> with the required datapoints.

A fair question. As noted, the performance comparisons are with the
niagara4 code for memcpy and memset, the current default for M7/T7/S7
sparc platforms.

I used a personal perf test suite that I developed over a period of
years while tuning memcpy/memset for Sparc/Solaris on most Sparc
platforms going back to UltraSparc IV. The tests are run in isolation
and are not quickly adaptable to the existing Linux bench tests
structure. I also ran the Linux bench tests. The results from
the Linux bench tests are generally similar to the numbers shown
above, with some individual differences due to the different ordering
of test sizes, which causes different branch prediction and cache
warmth behavior. Because these tests give similar results, I see
no particular need to add my tests to the Linux bench tests.
I've continued to use my own tests because of my long familiarity with them.

Of possible interest to those designing "out of cache" tests, the
approach used in my personal tests is to allocate a "cache clearing
buffer" at least 4 times the size of the largest cache on the system
under test. In these tests, the L3 cache size is 64 MBytes and my
buffer size was 256 MBytes. Before each set of timing runs,
my program wrote new data to each long word address of the
cache clearing buffer. Then, during the timing test for each
combination of src and destination alignments and test lengths,
successive copies were executed with each successive src and dest
starting at least 512 bytes after the end of the previous copy.
The innermost loop did 20 copies. Each timing test was repeated
7 times, with the median used to minimize the noise effects
of other system activity.
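
For concreteness, here is a minimal sketch (in plain C) of a timing
harness along those lines, for a single copy length. The 256 MByte
flush buffer, the write to every long word, the 20-copy inner loop,
the 512 byte gap, and the median of 7 runs follow the description
above; the clock choice, buffer layout, and everything else are
illustrative assumptions, not my actual test code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define FLUSH_BYTES (256UL * 1024 * 1024)  /* >= 4x the 64 MByte L3 */
#define COPIES      20                     /* innermost loop count */
#define TRIALS      7                      /* median of 7 timing runs */
#define GAP         512                    /* bytes between successive buffers */

/* Write new data to every long word so previously cached lines are evicted. */
static void
flush_caches (uint64_t *buf)
{
  for (size_t i = 0; i < FLUSH_BYTES / sizeof (uint64_t); i++)
    buf[i] = i;
}

static int
cmp_double (const void *a, const void *b)
{
  double x = *(const double *) a, y = *(const double *) b;
  return (x > y) - (x < y);
}

int
main (void)
{
  size_t len = 4096;                  /* one example copy length */
  size_t stride = len + GAP;          /* each copy starts >= GAP past the last */
  uint64_t *flush = malloc (FLUSH_BYTES);
  unsigned char *region = malloc (2 * COPIES * stride);
  double times[TRIALS];

  if (flush == NULL || region == NULL)
    return 1;
  memset (region, 1, 2 * COPIES * stride);   /* initialize source data */

  for (int t = 0; t < TRIALS; t++)
    {
      flush_caches (flush);

      struct timespec t0, t1;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int c = 0; c < COPIES; c++)
        memcpy (region + (2 * c + 1) * stride,   /* dst */
                region + (2 * c) * stride,       /* src */
                len);
      clock_gettime (CLOCK_MONOTONIC, &t1);

      times[t] = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    }

  qsort (times, TRIALS, sizeof (double), cmp_double);
  printf ("len %zu: median of %d trials, %.0f ns for %d copies\n",
          len, TRIALS, times[TRIALS / 2], COPIES);

  free (region);
  free (flush);
  return 0;
}

The real suite sweeps all of the src/dest alignment and length
combinations described above rather than the single length shown here.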

> Adhemerval asked that memcpy and memset be in separate patches
> within the patch set.

If it is necessary to resubmit the patch set for other reasons,
I will split memcpy and memset. Without another reason to resubmit,
the split feels like make-work to me. The optimizations to both
files are logically similar and driven by the architectural
changes in the Block Init Store (BIS) instruction.

> Adhemerval asked that the ifunc selector code use C instead of assembly.
> He has prepared changes to replace all assembly uses of ifunc with C.

The existing ifunc code in that directory uses assembly.
Adhemerval's changes were not available when I submitted the patch.
As far as I understand, they are still not available in the main branch.
Making those changes increases risk for my memcpy/memset optimizations.
I consider such work, while worthy in its own right, to be out of
scope for my patch set.
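
For readers who have not seen the C form being discussed, a rough
sketch of an ifunc selector written in C follows. It uses the generic
GCC ifunc attribute plus getauxval; the two implementation functions
and the ADP hwcap bit value are placeholders for illustration, not the
symbols used in my patches or in Adhemerval's series, and glibc's own
selectors use internal hwcap plumbing rather than getauxval.

#include <stddef.h>
#include <stdio.h>
#include <sys/auxv.h>

/* Placeholder bit for illustration only; not the real ADP hwcap value. */
#define HWCAP_SPARC_ADP_PLACEHOLDER (1UL << 24)

/* Two stand-in memcpy implementations. */
static void *
memcpy_generic (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n--)
    *d++ = *s++;
  return dst;
}

static void *
memcpy_adp (void *dst, const void *src, size_t n)
{
  /* In the real patch set this would be the M7/BIS-optimized routine. */
  return memcpy_generic (dst, src, n);
}

/* The resolver runs when the ifunc symbol is relocated and returns
   the implementation to bind to, based on the hardware capabilities. */
static void *(*resolve_memcpy (void)) (void *, const void *, size_t)
{
  unsigned long hwcap = getauxval (AT_HWCAP);
  return (hwcap & HWCAP_SPARC_ADP_PLACEHOLDER) ? memcpy_adp : memcpy_generic;
}

void *my_memcpy (void *dst, const void *src, size_t n)
     __attribute__ ((ifunc ("resolve_memcpy")));

int
main (void)
{
  char buf[6];
  my_memcpy (buf, "hello", sizeof buf);
  puts (buf);
  return 0;
}

In glibc itself the selection would key off the ADP capability added
by the first patch in the series.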

Respectfully,
- Patrick McGehearty



