This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove



On 19/10/2017 19:06, Carlos O'Donell wrote:
> On 10/19/2017 01:19 PM, Adhemerval Zanella wrote:
>>
>>
>> On 19/10/2017 16:48, Tulio Magno Quites Machado Filho wrote:
>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>
>>>> On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
>>>>> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
>>>>>
>>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>>
>>>>>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>>>>>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>>>>>> uses the same pipeline, have the same latency and same throughput.  The
>>>>>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>>>>>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>>>>>> occur in the with aligned address).
>>>>>>
>>>>>> Why not change POWER7 implementation instead of dropping another one
>>>>>> which is exactly the same for POWER9?
>>>>>
>>>>> We're trying to limit the impact of this requirement on other processors so
>>>>> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
>>>>>
>>>>> However, we could avoid source code duplication with the macros LVX and STVX
>>>>> I propose here in version 2.
>>>>> That way, we will postpone the copy to when/if a P7 optimization is
>>>>> contributed.
>>>>
>>>> And which benefit will be exactly? For this specific case current code 
>>>> already only does aligned accesses, so it does not really matter whether 
>>>> you use VSX or VMX instruction. If I recall correctly, both lxv2dx/lvx 
>>>> and stxvd2x/stvx shows the same latency and throughput also for POWER7.  
>>>>
>>>> I see no gain on using this POWER9 specific case where you could adjust
>>>> POWER7 one.
>>>
>>> There are no gains now.  The problem arises when contributing a new
>>> optimization, e.g. a memcpy optimization for POWER8 using lxv2dx or stxvd2x.
>>>
>>> If POWER9 doesn't have its own implementation, this problem will appear again.
>>>
>>
>> I think if eventually a POWER8 optimization could not be used as is for POWER9,
>> then a new ifunc variant would make sense.  But I still think we current
>> variant, a much simpler solutions (in code sense and maintainability) would be
>> to just adapt POWER7 variant to use VMX instructions.
>  
> We are arguing about taste and style here. About duplication versus functionality.
> I would leave it up the machine maintainer to decide how best to move forward.
> 
> Tulio knows, and may not be able to say, if there are future optimizations coming
> down the line. So we lack a clear picture for deciding on this issue of duplication.
> 
> My opinion is that I would *rather* see a POWER7 version that is just for POWER7,
> and a POWER8 or POWER9 version that is *just* for POWER8 or POWER9.
> 
> The separation of the files allows for simpler incremental distro testing of the
> changes without needing to revalidate the POWER7 code again.

I agree with you if the case of the new file justify a new implementation
for instance by either using new ISA instructions, a new strategy (such 
as unaligned memory access vs aligned ones), to hoist some internal checks
which can lead to different implementations (such as the aarch64 memset), 
or to avoid some chip limitation (such as the different selections on x86 
for intel and amd chips).

However for this *specific* case there absolutely no gain by adding a
similar copy for POWER9 where the same implementation will work perfectly
fine on POWER7.  And my position is based with the provided information:
the new implementation is to fix an *issue* within a chip revision.

Now Tulio told me that the idea is indeed adding a POWER8 optimization,
but even if the idea is to have a POWER8 specialized implementation
is does prevent glibc to select the POWER7 memcpy for POWER7 and POWER9.
(if it will be case a better name for the memcpy implementation would be
better, for instance __memcpy_vsx_aligned).

In fact I think have less possible ifunc implementation is indeed better
for testing, in the memcpy case for instance a developer would not
require to actually use a POWER9 to validate the algorithm correctness.
This might not be the best strategy for an incremental testing if the
idea is backport on distros, but even then I think having the minimum
required ifunc variant is still a better way forward.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]