This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH 3/3] sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero.
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: libc-alpha at sourceware dot org
- Date: Tue, 3 Oct 2017 15:41:08 -0300
- Subject: Re: [PATCH 3/3] sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero.
- Authentication-results: sourceware.org; auth=none
- References: <1506542999-97895-1-git-send-email-patrick.mcgehearty@oracle.com> <1506542999-97895-2-git-send-email-patrick.mcgehearty@oracle.com> <1506542999-97895-3-git-send-email-patrick.mcgehearty@oracle.com> <1506542999-97895-4-git-send-email-patrick.mcgehearty@oracle.com> <125b565d-1c04-c532-e7f7-8a95ba97702c@linaro.org> <4f2d87d6-05ca-bef5-3f01-7a97ce19ea16@oracle.com>
On 28/09/2017 16:43, Patrick McGehearty wrote:
> Responses after the comments. Apologies for the earlier incomplete message.
>
> On 9/28/2017 11:17 AM, Adhemerval Zanella wrote:
>>
>> On 27/09/2017 13:09, Patrick McGehearty wrote:
>>> Tested in sparcv9-*-* and sparc64-*-* targets in both multi and
>>> non-multi arch configurations.
>>>
>>> Support added to identify Sparc M7/T7/S7/M8/T8 processor capability.
>>> Usual "make check" correctness tests run with no regressions.
>>> Performance tests run on Sparc S7 using new code and old niagara4 code.
>>>
>>> Optimizations for memcpy also apply to mempcpy and memmove
>>> where they share code. Optimizations for memset also apply
>>> to bzero as they share code.
>>>
>>> For memcpy/mempcpy/memmove, performance comparison with niagara4 code:
>>> Long word aligned data
>>> 0-127 bytes - minimal changes
>>> 128-1023 bytes - 7-30% gain
>>> 1024+ bytes - 1-7% gain (in cache); 30-100% gain (out of cache)
>>> Word aligned data
>>> 0-127 bytes - 50%+ gain
>>> 128-1023 bytes - 10-200% gain
>>> 1024+ bytes - 0-15% gain (in cache); 5-50% gain (out of cache)
>>> Unaligned data
>>> 0-127 bytes - 0-70%+ gain
>>> 128-447 bytes - 40-80%+ gain
>>> 448-511 bytes - 1-3% loss
>>> 512-4096 bytes - 2-3% gain (in cache); 0-20% gain (out of cache)
>>> 4096+ bytes - +/- 3% (in cache); 20-50% gain (out of cache)
>>>
>>> For memset/bzero, performance comparison with niagara4 code:
>>> For memset nonzero data,
>>> 256-1023 bytes - 60-90% gain (in cache); 5% gain (out of cache)
>>> 1K+ bytes - 80-260% gain (in cache); 40-80% gain (out of cache)
>>> For memset zero data (and bzero),
>>> 256-1023 bytes - 80-120% gain (in cache), 0% gain (out of cache)
>>> 1024+ bytes - 2-4x gain (in cache), 10-35% gain (out of cache)
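As a side note on reading the figures above, a "gain" percentage of this kind is normally the relative speedup of the new routine over the old one for the same operation. A tiny sketch of that arithmetic, with made-up timing numbers (nothing here comes from the actual benchmark runs):

```python
# Hypothetical numbers, only to pin down how a figure like "30% gain"
# is typically computed: relative speedup of new code vs. old code.
def gain_percent(old_ns: float, new_ns: float) -> float:
    """Percentage gain of the new routine over the old one, where each
    argument is the time taken for the same operation."""
    return (old_ns / new_ns - 1.0) * 100.0

# e.g. old code takes 130 ns, new code 100 ns for the same copy:
print(gain_percent(130.0, 100.0))  # 30% gain
```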
Which benchmark did you use to get these values? If they were obtained
with the benchtests, please attach the resulting files. If not, please
either indicate how to reproduce the data or work to improve the benchtests
with the required datapoints.
>>> ---
>>> ChangeLog | 20 +
>>> sysdeps/sparc/sparc32/sparcv9/multiarch/Makefile | 3 +-
>>> .../sparcv9/multiarch/memcpy-memmove-niagara7.S | 2 +
>>> sysdeps/sparc/sparc32/sparcv9/multiarch/memmove.S | 2 +
>>> .../sparc32/sparcv9/multiarch/memset-niagara7.S | 2 +
>>> .../sparc/sparc32/sparcv9/multiarch/rtld-memmove.c | 1 +
>>> sysdeps/sparc/sparc64/multiarch/Makefile | 3 +-
>>> sysdeps/sparc/sparc64/multiarch/ifunc-impl-list.c | 13 +
>>> .../sparc64/multiarch/memcpy-memmove-niagara7.S | 979 ++++++++++++++++++++
>>> sysdeps/sparc/sparc64/multiarch/memcpy.S | 28 +-
>>> sysdeps/sparc/sparc64/multiarch/memmove.S | 72 ++
>>> sysdeps/sparc/sparc64/multiarch/memset-niagara7.S | 334 +++++++
>>> sysdeps/sparc/sparc64/multiarch/memset.S | 28 +-
>>> sysdeps/sparc/sparc64/multiarch/rtld-memmove.c | 1 +
>>> 14 files changed, 1482 insertions(+), 6 deletions(-)
>>> create mode 100644 sysdeps/sparc/sparc32/sparcv9/multiarch/memcpy-memmove-niagara7.S
>>> create mode 100644 sysdeps/sparc/sparc32/sparcv9/multiarch/memmove.S
>>> create mode 100644 sysdeps/sparc/sparc32/sparcv9/multiarch/memset-niagara7.S
>>> create mode 100644 sysdeps/sparc/sparc32/sparcv9/multiarch/rtld-memmove.c
>>> create mode 100644 sysdeps/sparc/sparc64/multiarch/memcpy-memmove-niagara7.S
>>> create mode 100644 sysdeps/sparc/sparc64/multiarch/memmove.S
>>> create mode 100644 sysdeps/sparc/sparc64/multiarch/memset-niagara7.S
>>> create mode 100644 sysdeps/sparc/sparc64/multiarch/rtld-memmove.c
>> I would prefer if you split this patch in a memcpy/mempcpy/memmove and
>> a memset/bzero one since they are separated implementations.
> I can see your point, and if I had started that way, it would not have been a problem. At this point,
> it would take some effort to redo the patch structure given our internal development
> process, and it might significantly delay the completion of this patch due to requirements to make
> progress on other tasks. From a user's point of view, the optimizations for memcpy on M7/T7/S7
> and memset on M7/T7/S7 both are driven by the use of block initializing store as a 64 byte operation
> instead of a 32 byte operation. (side note: block initializing store on M8/T8 is also 64 bytes, so these
> optimizations are also appropriate for the new M8/T8 systems). Thus, a user should be equally
> willing to use both or neither of the new memcpy/memset optimizations.
This is not really a good reason to *not* structure the patchset
better, and the change does not really require extensive changes
(other than moving the memset bits into another patch). Please split
the patch in future submissions.
>
>
>>
>> I also noticed this whitespace issues, so you might want to check this out:
>>
>> ---
>>
>> patch03:886: trailing whitespace.
>> * lines from memory. Use ST_CHUNK stores to first element of each cache
>> patch03:890: space before tab in indent.
>> andn %o2, 0x3f, %o5 /* %o5 is multiple of block size */
>> patch03:891: space before tab in indent.
>> and %o2, 0x3f, %o2 /* residue bytes in %o2 */
>> patch03:892: trailing whitespace.
>> patch03:1216: new blank line at EOF.
>> +
>> warning: 5 lines add whitespace errors.
>> ---
> I've cleaned up the whitespace issues. Thank you for catching them.
>
>>
>>> diff --git a/sysdeps/sparc/sparc64/multiarch/memcpy.S b/sysdeps/sparc/sparc64/multiarch/memcpy.S
>>> index b6396ee..d72f4b1 100644
>>> --- a/sysdeps/sparc/sparc64/multiarch/memcpy.S
>>> +++ b/sysdeps/sparc/sparc64/multiarch/memcpy.S
>>> @@ -27,7 +27,19 @@ ENTRY(memcpy)
>>> # ifdef SHARED
>>> SETUP_PIC_REG_LEAF(o3, o5)
>>> # endif
>>> - set HWCAP_SPARC_CRYPTO, %o1
>>> + set HWCAP_SPARC_ADP, %o1
>>> + andcc %o0, %o1, %g0
>>> + be 1f
>>> + nop
>>> +# ifdef SHARED
>>> + sethi %gdop_hix22(__memcpy_niagara7), %o1
>>> + xor %o1, %gdop_lox10(__memcpy_niagara7), %o1
>>> +# else
>>> + set __memcpy_niagara7, %o1
>>> +# endif
>>> + ba 10f
>>> + nop
>>> +1: set HWCAP_SPARC_CRYPTO, %o1
>>> andcc %o0, %o1, %g0
>>> be 1f
>>> andcc %o0, HWCAP_SPARC_N2, %g0
>>> @@ -89,7 +101,19 @@ ENTRY(__mempcpy)
>>> # ifdef SHARED
>>> SETUP_PIC_REG_LEAF(o3, o5)
>>> # endif
>>> - set HWCAP_SPARC_CRYPTO, %o1
>>> + set HWCAP_SPARC_ADP, %o1
>>> + andcc %o0, %o1, %g0
>>> + be 1f
>>> + nop
>>> +# ifdef SHARED
>>> + sethi %gdop_hix22(__mempcpy_niagara7), %o1
>>> + xor %o1, %gdop_lox10(__mempcpy_niagara7), %o1
>>> +# else
>>> + set __mempcpy_niagara7, %o1
>>> +# endif
>>> + ba 10f
>>> + nop
>>> +1: set HWCAP_SPARC_CRYPTO, %o1
>>> andcc %o0, %o1, %g0
>>> be 1f
>>> andcc %o0, HWCAP_SPARC_N2, %g0
>> Since you are touching this file, I think it would be better to change the
>> ifunc selector to a C implementation by using the __ifunc/sparc_libc_ifunc
>> macros.
> Again, I agree philosophically, but the actual code changes for ADP were made many months ago.
> Revising those changes now would require more work with little perceived benefit.
SPARC is the only architecture still using ifunc resolvers coded in assembly,
and currently there is no gain in continuing to do so. You can rebase your next
version against azanella/sparc64-ifunc [1], which I intend to send upstream
shortly.
It refactors the current sparc64 memcpy and memset ifunc selection
to C. I think I got the selector logic right, and I checked it on an UltraSparc T5,
so please tell me if I got something wrong.
[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/sparc64-ifunc