This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove

From: Carlos O'Donell <carlos at redhat dot com>
To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, Tulio Magno Quites Machado Filho <tuliom at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
Cc: raji at linux dot vnet dot ibm dot com
Date: Thu, 19 Oct 2017 15:12:21 -0700
Subject: Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove
Authentication-results: sourceware.org; auth=none
References: <20171019182056.11179-1-tuliom@linux.vnet.ibm.com> <bd6351dd-b6d2-e672-ea07-8659ec6a28a9@linaro.org> <87y3o79doe.fsf@linux.vnet.ibm.com> <d50c0ba1-5887-e014-6fab-ac4a2ba31f34@linaro.org> <04e5f36e-e9bf-90ab-749a-39a4a67da151@redhat.com> <d85e00c1-e9b3-5e2f-b6d2-f6277c61ccf1@linaro.org>

On 10/19/2017 02:41 PM, Adhemerval Zanella wrote:
> 
> 
> On 19/10/2017 19:06, Carlos O'Donell wrote:
>> On 10/19/2017 01:19 PM, Adhemerval Zanella wrote:
>>>
>>>
>>> On 19/10/2017 16:48, Tulio Magno Quites Machado Filho wrote:
>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>
>>>>> On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
>>>>>> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
>>>>>>
>>>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>>>
>>>>>>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>>>>>>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>>>>>>> uses the same pipeline, have the same latency and same throughput.  The
>>>>>>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>>>>>>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>>>>>>> occur in the with aligned address).
>>>>>>>
>>>>>>> Why not change POWER7 implementation instead of dropping another one
>>>>>>> which is exactly the same for POWER9?
>>>>>>
>>>>>> We're trying to limit the impact of this requirement on other processors so
>>>>>> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
>>>>>>
>>>>>> However, we could avoid source code duplication with the macros LVX and STVX
>>>>>> I propose here in version 2.
>>>>>> That way, we will postpone the copy to when/if a P7 optimization is
>>>>>> contributed.
>>>>>
>>>>> And which benefit will be exactly? For this specific case current code 
>>>>> already only does aligned accesses, so it does not really matter whether 
>>>>> you use VSX or VMX instruction. If I recall correctly, both lxv2dx/lvx 
>>>>> and stxvd2x/stvx shows the same latency and throughput also for POWER7.  
>>>>>
>>>>> I see no gain on using this POWER9 specific case where you could adjust
>>>>> POWER7 one.
>>>>
>>>> There are no gains now.  The problem arises when contributing a new
>>>> optimization, e.g. a memcpy optimization for POWER8 using lxv2dx or stxvd2x.
>>>>
>>>> If POWER9 doesn't have its own implementation, this problem will appear again.
>>>>
>>>
>>> I think if eventually a POWER8 optimization could not be used as is for POWER9,
>>> then a new ifunc variant would make sense.  But I still think we current
>>> variant, a much simpler solutions (in code sense and maintainability) would be
>>> to just adapt POWER7 variant to use VMX instructions.
>>  
>> We are arguing about taste and style here. About duplication versus functionality.
>> I would leave it up the machine maintainer to decide how best to move forward.
>>
>> Tulio knows, and may not be able to say, if there are future optimizations coming
>> down the line. So we lack a clear picture for deciding on this issue of duplication.
>>
>> My opinion is that I would *rather* see a POWER7 version that is just for POWER7,
>> and a POWER8 or POWER9 version that is *just* for POWER8 or POWER9.
>>
>> The separation of the files allows for simpler incremental distro testing of the
>> changes without needing to revalidate the POWER7 code again.
> 
> I agree with you if the case of the new file justify a new implementation
> for instance by either using new ISA instructions, a new strategy (such 
> as unaligned memory access vs aligned ones), to hoist some internal checks
> which can lead to different implementations (such as the aarch64 memset), 
> or to avoid some chip limitation (such as the different selections on x86 
> for intel and amd chips).
> 
> However for this *specific* case there absolutely no gain by adding a
> similar copy for POWER9 where the same implementation will work perfectly
> fine on POWER7.  And my position is based with the provided information:
> the new implementation is to fix an *issue* within a chip revision.
> 
> Now Tulio told me that the idea is indeed adding a POWER8 optimization,
> but even if the idea is to have a POWER8 specialized implementation
> is does prevent glibc to select the POWER7 memcpy for POWER7 and POWER9.
> (if it will be case a better name for the memcpy implementation would be
> better, for instance __memcpy_vsx_aligned).
> 
> In fact I think have less possible ifunc implementation is indeed better
> for testing, in the memcpy case for instance a developer would not
> require to actually use a POWER9 to validate the algorithm correctness.
> This might not be the best strategy for an incremental testing if the
> idea is backport on distros, but even then I think having the minimum
> required ifunc variant is still a better way forward.

Let me summarize your positions if I can:

Tulio suggests:
* Add a POWER N specific optimization, which is largely a copy of POWER N-M
  but fixes one issue with POWER N, and in the future may be more optimized.
* Incremental testing only requires testing on POWER N.
* Full testing requires testing on POWER N, and POWER N-M. (No difference)
* Maintenance cost increased for additional POWER N specific assembly files.
* Future optimizations for POWER N possible with minor tweaks. (A little easier)

You suggest:
* Simplify the POWER N-M implementation to cover all POWER variants >M.
* Incremental testing requires testing POWER M and POWER N. (Increased cost)
* Full testing requires testing on POWER N, and POWER N-M. (No difference)
* Maintenance decreased with only one file the POWER M one to maintain.
* Future optimizations for POWER N require a full review again like this one.

These two positions seem, to me, to be a matter of development taste and
discretion when it comes to future changes, current incremental testing
cost versus maintenance burden in the short term for IBM.

Did I understand this correctly?

-- 
Cheers,
Carlos.

Follow-Ups:
- Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove
  - From: Adhemerval Zanella

References:
- [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove
  - From: Tulio Magno Quites Machado Filho
- Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove
  - From: Adhemerval Zanella
- Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove
  - From: Tulio Magno Quites Machado Filho
- Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove
  - From: Adhemerval Zanella
- Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove
  - From: Carlos O'Donell
- Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove
  - From: Adhemerval Zanella

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]