[PATCH] powerpc: Optimized memmove for POWER10
Matheus Castanho
msc@linux.ibm.com
Thu Apr 29 19:18:38 GMT 2021
Lucas A. M. Magalhaes <lamm@linux.ibm.com> writes:
> Quoting Matheus Castanho (2021-04-29 11:18:30)
>>
>> No new test failures after this patch.
>>
>> Some comments and questions below.
>>
>> Lucas A. M. Magalhaes via Libc-alpha <libc-alpha@sourceware.org> writes:
>>
>> > This patch was initially based on __memmove_power7, with some ideas
>> > from the strncpy implementation for Power 9.
>> >
>> > Improvements from __memmove_power7:
>> >
>> > 1. Use lxvl/stxvl for alignment code.
>> >
>> > The code for Power 7 uses branches when the input is not naturally
>> > aligned to the width of a vector. The new implementation uses
>> > lxvl/stxvl instead which reduces pressure on GPRs. It also allows
>> > the removal of branch instructions, implicitly removing branch stalls
>> > and mispredictions.
>> >
>> > 2. Use of lxv/stxv and lxvl/stxvl pair is safe to use on Cache Inhibited
>> > memory.
>> >
>> > On Power 10 vector load and stores are safe to use on CI memory for
>> > addresses unaligned to 16B. This code takes advantage of this to
>> > do unaligned loads.
>> >
>> > The unaligned loads don't have a significant performance impact by
>> > themselves. However, doing so decreases register pressure on GPRs
>> > and interdependence stalls on load/store pairs. This also improves
>> > readability, as there are now fewer code paths for different alignments.
>> > Finally, this reduces the overall code size.
>> >
>> > 3. Improved performance.
>> >
>> > This version runs on average about 30% better than memmove_power7
>> > for lengths larger than 8KB. For input lengths shorter than 8KB
>> > the improvement is smaller, averaging about 17% better performance.
>> >
>> > This version has a degradation of about 50% for input lengths
>> > in the 0 to 31 bytes range when dest is unaligned src.
>>
>> What do you mean? When dest is unaligned or when src is unaligned? Or both?
>>
> Fixed.
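
As a side note on item 1 of the commit message, for anyone who has not
used the length-controlled loads before: the lxvl/stxvl head/tail copy
can be modelled in C with the ISA 3.0 built-ins, roughly as below. This
is only a sketch assuming GCC/Clang with -mcpu=power9 or later, not code
from the patch, and copy_up_to_16 is just an illustrative name.

  #include <altivec.h>
  #include <stddef.h>

  /* Copy len bytes, 0 <= len <= 16, without branching on the exact
     length. vec_xl_len/vec_xst_len load and store exactly len bytes,
     which is what lxvl/stxvl do with the length placed in bits 0:7 of
     the length register (the sldi by 56 in the assembly). */
  static inline void
  copy_up_to_16 (unsigned char *dst, const unsigned char *src, size_t len)
  {
    vector unsigned char v = vec_xl_len ((unsigned char *) src, len);
    vec_xst_len (v, dst, len);
  }
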
>
>> > ---
>> > .../powerpc/powerpc64/le/power10/memmove.S | 313 ++++++++++++++++++
>> > sysdeps/powerpc/powerpc64/multiarch/Makefile | 3 +-
>> > .../powerpc64/multiarch/ifunc-impl-list.c | 7 +
>> > .../powerpc64/multiarch/memmove-power10.S | 24 ++
>> > sysdeps/powerpc/powerpc64/multiarch/memmove.c | 16 +-
>> > 5 files changed, 358 insertions(+), 5 deletions(-)
>> > create mode 100644 sysdeps/powerpc/powerpc64/le/power10/memmove.S
>> > create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
>> >
>> > diff --git a/sysdeps/powerpc/powerpc64/le/power10/memmove.S b/sysdeps/powerpc/powerpc64/le/power10/memmove.S
>> > new file mode 100644
>> > index 0000000000..7cff5ef2ac
>> > --- /dev/null
>> > +++ b/sysdeps/powerpc/powerpc64/le/power10/memmove.S
>> > @@ -0,0 +1,313 @@
>> > +/* Optimized memmove implementation for POWER10.
>> > + Copyright (C) 2021 Free Software Foundation, Inc.
>> > + This file is part of the GNU C Library.
>> > +
>> > + The GNU C Library is free software; you can redistribute it and/or
>> > + modify it under the terms of the GNU Lesser General Public
>> > + License as published by the Free Software Foundation; either
>> > + version 2.1 of the License, or (at your option) any later version.
>> > +
>> > + The GNU C Library is distributed in the hope that it will be useful,
>> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
>> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> > + Lesser General Public License for more details.
>> > +
>> > + You should have received a copy of the GNU Lesser General Public
>> > + License along with the GNU C Library; if not, see
>> > + <https://www.gnu.org/licenses/>. */
>> > +
>> > +#include <sysdep.h>
>> > +
>> > +
>> > +/* void* [r3] memmove (void *dest [r3], const void *src [r4], size_t len [r5])
>> > +
>> > + This optimization checks if 'src' and 'dst' overlap. If they do not
>>
>> Two spaces after '.'
>>
> Fixed.
>
>> > + or 'src' is ahead of 'dest' then it copies forward.
>> > + Otherwise, an optimized backward copy is used. */
>> > +
>> > +#ifndef MEMMOVE
>> > +# define MEMMOVE memmove
>> > +#endif
>> > + .machine power9
>> > +ENTRY_TOCLESS (MEMMOVE, 5)
>> > + CALL_MCOUNT 3
>> > +
>> > + .p2align 5
>> > + /* Check if there is overlap, if so it will branch to backward copy. */
>> > + subf r9,r4,r3
>> > + cmpld cr7,r9,r5
>> > + blt cr7,L(memmove_bwd)
>> > +
>> > + /* Fast path for length shorter than 16 bytes. */
>> > + sldi r7,r5,56
>> > + lxvl 32+v2,r4,r7
>> > + stxvl 32+v2,r3,r7
>> > + subic. r8,r5,16
>> > + blelr
>>
>> Ok.
>>
>> > +
>> > + cmpldi cr6,r5,256
>> > + bge cr6,L(ge_256)
>> > + /* Account for the first 16-byte copy. For shorter lengths the alignment
>> > + either slows down or is irrelevant. I'm making use of this comparison
>> > + to skip the alignment. */
>> > + addi r4,r4,16
>> > + addi r11,r3,16 /* use r11 to keep dest address on r3. */
>> > + subi r5,r5,16
>> > + b L(loop_head)
>>
>> Two spaces after '.' (x2)
>>
> Fixed.
>
>> Also, I'd suggest slightly rephrasing the big comment and splitting it
>> in two:
>>
>> /* For shorter lengths aligning the address to 16B either
>> decreases performance or is irrelevant. So make use of this
>> comparison to skip the alignment code in such cases. */
>> cmpldi cr6,r5,256
>> bge cr6,L(ge_256)
>>
>> /* Account for the first 16-byte copy. */
>> addi r4,r4,16
>> addi r11,r3,16 /* use r11 to keep dest address on r3. */
>> subi r5,r5,16
>> b L(loop_head)
>>
> Fixed.
>
>> > +
>> > + .p2align 5
>> > +L(ge_256):
>> > + /* Account for the first copy <= 16 bytes. This is necessary for
>> > + memmove because at this point the src address can be in front of the
>> > + dest address. */
>> > + clrldi r9,r5,56
>> > + li r8,16
>> > + cmpldi r9,16
>> > + iselgt r9,r8,r9
>> > + add r4,r4,r9
>> > + add r11,r3,r9 /* use r11 to keep dest address on r3. */
>> > + sub r5,r5,r9
>> > +
>> > + /* Align dest to 16 bytes. */
>> > + neg r7,r3
>> > + clrldi. r9,r7,60
>> > + beq L(loop_head)
>> > +
>> > + .p2align 5
>> > + sldi r6,r9,56
>> > + lxvl 32+v0,r4,r6
>> > + stxvl 32+v0,r11,r6
>> > + sub r5,r5,r9
>> > + add r4,r4,r9
>> > + add r11,r11,r9
>>
>> Ok. Only align to 16B for lengths >= 256.
>>
>> > +
>> > +L(loop_head):
>> > + cmpldi r5,63
>> > + ble L(final_64)
>> > +
>> > + srdi. r7,r5,7
>> > + beq L(loop_tail)
>> > +
>> > + mtctr r7
>> > +
>> > +/* Main loop that copies 128 bytes each iteration. */
>> > + .p2align 5
>> > +L(loop):
>> > + addi r9,r4,64
>> > + addi r10,r11,64
>> > +
>> > + lxv 32+v0,0(r4)
>> > + lxv 32+v1,16(r4)
>> > + lxv 32+v2,32(r4)
>> > + lxv 32+v3,48(r4)
>> > +
>> > + stxv 32+v0,0(r11)
>> > + stxv 32+v1,16(r11)
>> > + stxv 32+v2,32(r11)
>> > + stxv 32+v3,48(r11)
>> > +
>> > + addi r4,r4,128
>> > + addi r11,r11,128
>> > +
>> > + lxv 32+v4,0(r9)
>> > + lxv 32+v5,16(r9)
>> > + lxv 32+v6,32(r9)
>> > + lxv 32+v7,48(r9)
>> > +
>> > + stxv 32+v4,0(r10)
>> > + stxv 32+v5,16(r10)
>> > + stxv 32+v6,32(r10)
>> > + stxv 32+v7,48(r10)
>> > +
>> > + bdnz L(loop)
>> > + clrldi. r5,r5,57
>> > + beqlr
>> > +
>>
>> Ok. Copy 128 bytes per iteration using 2 pairs of pointers.
>>
>> > +/* Copy 64 bytes. */
>> > + .p2align 5
>> > +L(loop_tail):
>> > + cmpldi cr5,r5,63
>> > + ble cr5,L(final_64)
>> > +
>> > + lxv 32+v0,0(r4)
>> > + lxv 32+v1,16(r4)
>> > + lxv 32+v2,32(r4)
>> > + lxv 32+v3,48(r4)
>> > +
>> > + stxv 32+v0,0(r11)
>> > + stxv 32+v1,16(r11)
>> > + stxv 32+v2,32(r11)
>> > + stxv 32+v3,48(r11)
>> > +
>> > + addi r4,r4,64
>> > + addi r11,r11,64
>> > + subi r5,r5,64
>> > +
>>
>> Ok. Copy an extra 64B block if needed.
>>
>> > +/* Copies the last 1-63 bytes. */
>> > + .p2align 5
>> > +L(final_64):
>> > + /* r8 hold the number of bytes that will be copied with lxv/stxv. */
>>
>> r8 hold -> r8 holds
>>
> Fixed.
>
>> > + clrrdi. r8,r5,4
>> > + beq L(tail1)
>> > +
>> > + cmpldi cr5,r5,32
>> > + lxv 32+v0,0(r4)
>> > + blt cr5,L(tail2)
>> > +
>> > + cmpldi cr6,r5,48
>> > + lxv 32+v1,16(r4)
>> > + blt cr6,L(tail3)
>> > +
>> > + lxv 32+v2,32(r4)
>> > +
>> > + .p2align 5
>> > +L(tail4):
>> > + stxv 32+v2,32(r11)
>> > +L(tail3):
>> > + stxv 32+v1,16(r11)
>> > +L(tail2):
>> > + stxv 32+v0,0(r11)
>> > + sub r5,r5,r8
>> > + add r4,r4,r8
>> > + add r11,r11,r8
>> > + .p2align 5
>> > +L(tail1):
>> > + sldi r6,r5,56
>> > + lxvl v4,r4,r6
>> > + stxvl v4,r11,r6
>> > + blr
>> > +
>>
>> Ok.
>>
>> > +/* If dest and src overlap, we should copy backwards. */
>> > +L(memmove_bwd):
>> > + add r11,r3,r5
>> > + add r4,r4,r5
>> > +
>> > + /* Optimization for length smaller than 16 bytes. */
>> > + cmpldi cr5,r5,15
>> > + ble cr5,L(tail1_bwd)
>> > +
>> > + /* For shorter lengths the alignment either slows down or is irrelevant.
>> > + The forward copy uses a already need 256 comparison for that. Here
>> > + it's using 128 as it will reduce code and improve readability. */
>>
>> What about something like this instead:
>>
>> /* For shorter lengths the alignment either decreases
>> performance or is irrelevant. The forward copy applies the
>> alignment only for len >= 256. Here it's using 128 as it
>> will reduce code and improve readability. */
>>
>> But, I'm curious about why you chose a different threshold than the
>> forward case. I agree it reduces code and improves readability, but then
>> shouldn't the same be done for the forward case? If not, because of
>> performance reasons, why the same 256 threshold was not used here?
>>
>> The two codepaths are fairly similar (basically replacing sub by add and
>> vice-versa and using some different offsets), so I'd expect an
>> optimization like that to apply to both cases.
>>
>
> Doing a first copy here would need code like the one below.
>
> /* Calculate how many bytes will be copied to adjust r4 and r3. Saturate
> it to 0 if subi results in a negative number. */
> subi. r10,r5,16
> isellt r10,0,r10
>
> /* Adjust the pointers. */
> sub r4,r4,r10
> sub r11,r11,r10
> sldi r6,r10,56
> lxvl 32+v0,r4,r6
> stxvl 32+v0,r11,r6
> blelr /* Using the subi. result. */
>
> /* Account for the size copied here. Unlike in the forward path,
> I need to calculate the size before the copy. If we get here I will
> always have copied 16 bytes. */
> subi r5, r5, 16
>
> After this I would still have to decide below which size to skip the
> alignment. I would pick 128 bytes here anyway, for two reasons. First,
> empirically the alignment overhead does not decrease performance for
> sizes bigger than 100 bytes. Second, the code is already built around
> 128-byte blocks, so I can reuse the existing code to copy the last 127
> bytes.
>
> For sizes less than 16 bytes the code today is as shown below. I'm
> including the branch so we can see the whole code path.
>
> cmpldi cr5,r5,15
> ble cr5,L(tail1_bwd)
> /* Branches to this.*/
> L(tail1_bwd):
> sub r11,r11,r5
> sldi r6,r5,56
> lxvl v4,r4,r6
> stxvl v4,r11,r6
> blr
>
> IMHO both versions will have similar performance, although the second
> one is smaller and simpler. With that said, I don't think it's worth
> adding that optimization here.
>
> Maybe adding a comparison before adjusting the pointers would be a
> better option. Something like:
>
> L(memmove_bwd):
> cmpldi cr5,r5,16
> bgt cr5,L(bwd_gt_16)
> sldi r6,r5,56
> lxvl v4,r4,r6
> stxvl v4,r11,r6
> blr
>
> L(bwd_gt_16):
> /* Continue the code. */
>
> Or maybe I could save a lot of code by doing this comparison for the
> forward version as well. I would like to hear your thoughts about it.
>
I think I get it now. You used 256 in the fwd case because that's the
length limit for lxvl/stxvl. I'm OK with that. No changes needed.
But thanks for the detailed explanation anyways =)
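
For readers following along, the forward/backward dispatch this whole
discussion builds on boils down to a single unsigned compare at the top
of the function. A rough C model (illustrative only, the name
must_copy_backward is mine, this is not the glibc source):

  #include <stddef.h>
  #include <stdint.h>

  /* Mirrors "subf r9,r4,r3; cmpld cr7,r9,r5; blt cr7,L(memmove_bwd)":
     the unsigned difference dest - src is below len exactly when dest
     lands inside [src, src + len), i.e. a forward copy would overwrite
     source bytes that have not been read yet, so the backward path is
     taken. */
  static int
  must_copy_backward (const void *dest, const void *src, size_t len)
  {
    return (uintptr_t) dest - (uintptr_t) src < len;
  }

The compare being unsigned (cmpld) is what lets this single check cover
both the no-overlap case and the case where src is ahead of dest.
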
>> > + cmpldi cr7,r5,128
>> > + blt cr7,L(bwd_loop_tail)
>> > +
>> > + .p2align 5
>> > + clrldi. r9,r11,60
>> > + beq L(bwd_loop_head)
>> > + sub r4,r4,r9
>> > + sub r11,r11,r9
>> > + lxv 32+v0,0(r4)
>> > + sldi r6,r9,56
>> > + stxvl 32+v0,r11,r6
>> > + sub r5,r5,r9
>> > +
>>
>> This block would benefit from a one-line comment about what it does.
>>
> Fixed.
>
>> > +L(bwd_loop_head):
>> > + srdi. r7,r5,7
>> > + beq L(bwd_loop_tail)
>> > +
>> > + mtctr r7
>> > +
>> > +/* Main loop that copies 128 bytes every iteration. */
>> > + .p2align 5
>> > +L(bwd_loop):
>> > + addi r9,r4,-64
>> > + addi r10,r11,-64
>> > +
>> > + lxv 32+v0,-16(r4)
>> > + lxv 32+v1,-32(r4)
>> > + lxv 32+v2,-48(r4)
>> > + lxv 32+v3,-64(r4)
>> > +
>> > + stxv 32+v0,-16(r11)
>> > + stxv 32+v1,-32(r11)
>> > + stxv 32+v2,-48(r11)
>> > + stxv 32+v3,-64(r11)
>> > +
>> > + addi r4,r4,-128
>> > + addi r11,r11,-128
>> > +
>> > + lxv 32+v0,-16(r9)
>> > + lxv 32+v1,-32(r9)
>> > + lxv 32+v2,-48(r9)
>> > + lxv 32+v3,-64(r9)
>> > +
>> > + stxv 32+v0,-16(r10)
>> > + stxv 32+v1,-32(r10)
>> > + stxv 32+v2,-48(r10)
>> > + stxv 32+v3,-64(r10)
>> > +
>> > + bdnz L(bwd_loop)
>> > + clrldi. r5,r5,57
>> > + beqlr
>> > +
>>
>> Ok.
>>
>> > +/* Copy 64 bytes. */
>> > + .p2align 5
>> > +L(bwd_loop_tail):
>> > + cmpldi cr5,r5,63
>> > + ble cr5,L(bwd_final_64)
>> > +
>> > + addi r4,r4,-64
>> > + addi r11,r11,-64
>> > +
>> > + lxv 32+v0,0(r4)
>> > + lxv 32+v1,16(r4)
>> > + lxv 32+v2,32(r4)
>> > + lxv 32+v3,48(r4)
>> > +
>> > + stxv 32+v0,0(r11)
>> > + stxv 32+v1,16(r11)
>> > + stxv 32+v2,32(r11)
>> > + stxv 32+v3,48(r11)
>> > +
>> > + subi r5,r5,64
>> > +
>>
>> Ok.
>>
>> > +/* Copies the last 1-63 bytes. */
>> > + .p2align 5
>> > +L(bwd_final_64):
>> > + /* r8 hold the number of bytes that will be copied with lxv/stxv. */
>>
>> r8 hold -> r8 holds
>>
> Fixed.
>
>> > + clrrdi. r8,r5,4
>> > + beq L(tail1_bwd)
>> > +
>> > + cmpldi cr5,r5,32
>> > + lxv 32+v2,-16(r4)
>> > + blt cr5,L(tail2_bwd)
>> > +
>> > + cmpldi cr6,r5,48
>> > + lxv 32+v1,-32(r4)
>> > + blt cr6,L(tail3_bwd)
>> > +
>> > + lxv 32+v0,-48(r4)
>> > +
>> > + .p2align 5
>> > +L(tail4_bwd):
>> > + stxv 32+v0,-48(r11)
>> > +L(tail3_bwd):
>> > + stxv 32+v1,-32(r11)
>> > +L(tail2_bwd):
>> > + stxv 32+v2,-16(r11)
>> > + sub r4,r4,r5
>> > + sub r11,r11,r5
>> > + sub r5,r5,r8
>> > + sldi r6,r5,56
>> > + lxvl v4,r4,r6
>> > + stxvl v4,r11,r6
>> > + blr
>> > +
>> > +/* Copy last 16 bytes. */
>> > + .p2align 5
>> > +L(tail1_bwd):
>> > + sub r4,r4,r5
>> > + sub r11,r11,r5
>> > + sldi r6,r5,56
>> > + lxvl v4,r4,r6
>> > + stxvl v4,r11,r6
>> > + blr
>>
>> The last two blocks are almost identical; I imagine you separated them
>> to save one sub for the cases that go straight to tail1_bwd, because you
>> already know r8 will be 0 and thus the sub is not needed. So it probably
>> saves 1 cycle.
>>
> That's it. It will actually save more cycles, since the sub
> instructions would depend on the update of r5.
>
>> Ok.
>>
>> > +
>> > +
>> > +END (MEMMOVE)
>> > +
>> > +#ifdef DEFINE_STRLEN_HIDDEN_DEF
>> > +weak_alias (__memmove, memmove)
>> > +libc_hidden_builtin_def (memmove)
>> > +#endif
>> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
>> > index 8aa46a3702..16ad1ab439 100644
>> > --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
>> > +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
>> > @@ -24,7 +24,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>> > stpncpy-power8 stpncpy-power7 stpncpy-ppc64 \
>> > strcmp-power8 strcmp-power7 strcmp-ppc64 \
>> > strcat-power8 strcat-power7 strcat-ppc64 \
>> > - memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
>> > + memmove-power10 memmove-power7 memmove-ppc64 \
>> > + wordcopy-ppc64 bcopy-ppc64 \
>> > strncpy-power8 strstr-power7 strstr-ppc64 \
>> > strspn-power8 strspn-ppc64 strcspn-power8 strcspn-ppc64 \
>> > strlen-power8 strcasestr-power8 strcasestr-ppc64 \
>>
>> I agree with the issue Raoni pointed. This should be in the list of
>> LE-only ifuncs.
>>
> Fixed.
>
>> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
>> > index 1a6993616f..d1c159f2f7 100644
>> > --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
>> > +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
>> > @@ -67,6 +67,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>> >
>> > /* Support sysdeps/powerpc/powerpc64/multiarch/memmove.c. */
>> > IFUNC_IMPL (i, name, memmove,
>> > +#ifdef __LITTLE_ENDIAN__
>> > + IFUNC_IMPL_ADD (array, i, memmove,
>> > + hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
>> > + PPC_FEATURE2_HAS_ISEL)
>> > + && (hwcap & PPC_FEATURE_HAS_VSX),
>> > + __memmove_power10)
>> > +#endif
>> > IFUNC_IMPL_ADD (array, i, memmove, hwcap & PPC_FEATURE_HAS_VSX,
>> > __memmove_power7)
>> > IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_ppc))
>>
>> Ok.
>>
> Fixed.
>
>> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
>> > new file mode 100644
>> > index 0000000000..d6d8ea83ec
>> > --- /dev/null
>> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
>> > @@ -0,0 +1,24 @@
>> > +/* Optimized memmove implementation for POWER10.
>> > + Copyright (C) 2021 Free Software Foundation, Inc.
>> > + This file is part of the GNU C Library.
>> > +
>> > + The GNU C Library is free software; you can redistribute it and/or
>> > + modify it under the terms of the GNU Lesser General Public
>> > + License as published by the Free Software Foundation; either
>> > + version 2.1 of the License, or (at your option) any later version.
>> > +
>> > + The GNU C Library is distributed in the hope that it will be useful,
>> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
>> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> > + Lesser General Public License for more details.
>> > +
>> > + You should have received a copy of the GNU Lesser General Public
>> > + License along with the GNU C Library; if not, see
>> > + <https://www.gnu.org/licenses/>. */
>> > +
>> > +#define MEMMOVE __memmove_power10
>> > +
>> > +#undef libc_hidden_builtin_def
>> > +#define libc_hidden_builtin_def(name)
>> > +
>> > +#include <sysdeps/powerpc/powerpc64/le/power10/memmove.S>
>> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove.c b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
>> > index 9bec61a321..4704636f5d 100644
>> > --- a/sysdeps/powerpc/powerpc64/multiarch/memmove.c
>> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
>> > @@ -28,14 +28,22 @@
>> > # include "init-arch.h"
>> >
>> > extern __typeof (__redirect_memmove) __libc_memmove;
>> > -
>> > extern __typeof (__redirect_memmove) __memmove_ppc attribute_hidden;
>> > extern __typeof (__redirect_memmove) __memmove_power7 attribute_hidden;
>> > +#ifdef __LITTLE_ENDIAN__
>> > +extern __typeof (__redirect_memmove) __memmove_power10 attribute_hidden;
>> > +#endif
>> >
>> > libc_ifunc (__libc_memmove,
>> > - (hwcap & PPC_FEATURE_HAS_VSX)
>> > - ? __memmove_power7
>> > - : __memmove_ppc);
>> > +#ifdef __LITTLE_ENDIAN__
>> > + hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
>> > + PPC_FEATURE2_HAS_ISEL)
>> > + && (hwcap & PPC_FEATURE_HAS_VSX)
>> > + ? __memmove_power10 :
>> > +#endif
>>
>> This should be indented 1 level down.
>>
> Fixed.
>
>> > + (hwcap & PPC_FEATURE_HAS_VSX)
>> > + ? __memmove_power7
>> > + : __memmove_ppc);
>> >
>> > #undef memmove
>> > strong_alias (__libc_memmove, memmove);
>>
>>
>> --
>> Matheus Castanho
--
Matheus Castanho