[PATCH] x86_64: Optimize ffsll function code size.
Noah Goldstein
goldstein.w.n@gmail.com
Thu Jul 27 00:00:31 GMT 2023
On Wed, Jul 26, 2023 at 5:38 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
>
>
> On Wed, Jul 26, 2023 at 2:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>> >
>> >
>> >
>> > On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>> >>
>> >> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
>> >> <libc-alpha@sourceware.org> wrote:
>> >> >
>> >> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
>> >> > richard.henderson@linaro.org> wrote:
>> >> >
>> >> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
>> >> > > > The ffsll function is 17 bytes; this patch shrinks it to 16 bytes.
>> >> > > > Currently, ffsll randomly regresses by ~20%, depending on how the
>> >> > > > code gets aligned.
>> >> > > >
>> >> > > > This patch fixes the random performance regression in ffsll.
>> >> > > > ---
>> >> > > > sysdeps/x86_64/ffsll.c | 2 +-
>> >> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >> > > >
>> >> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>> >> > > > index a1c13d4906..dbded6f0a1 100644
>> >> > > > --- a/sysdeps/x86_64/ffsll.c
>> >> > > > +++ b/sysdeps/x86_64/ffsll.c
>> >> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
>> >> > > > long long int tmp;
>> >> > > >
>> >> > > >   asm ("bsfq %2,%0\n"   /* Count low bits in X and store in %1.  */
>> >> > > > -      "cmoveq %1,%0\n"  /* If number was zero, use -1 as result.  */
>> >> > > > +      "cmove %k1,%k0\n" /* If number was zero, use -1 as result.  */
>> >> > >
>> >> > > This no longer produces -1, but 0xffffffff in cnt. However, since the
>> >> > > return type is
>> >> > > 'int', cnt need not be 'long long int' either. I'm not sure why tmp
>> >> > > exists at all, since
>> >> > > cnt is the only register modified.
>> >> > >
>> >> >
>> >> > Here is the exact assembly produced with this change.
>> >> > ./build-x86_64-linux/string/ffsll.o: file format elf64-x86-64
>> >> >
>> >> >
>> >> > Disassembly of section .text:
>> >> >
>> >> > 0000000000000000 <ffsll>:
>> >> > 0: ba ff ff ff ff mov $0xffffffff,%edx
>> >> > 5: 48 0f bc c7 bsf %rdi,%rax
>> >> > 9: 0f 44 c2 cmove %edx,%eax
>> >> > c: 83 c0 01 add $0x1,%eax
>> >> > f: c3 ret
>> >> >
>> >>
>> >> FWIW it should be:
>> >> ```
>> >> 0000000000000000 <.text>:
>> >> 0: b8 ff ff ff ff mov $0xffffffff,%eax
>> >> 5: 48 0f bc c7 bsf %rdi,%rax
>> >> 9: ff c0 inc %eax
>> >> ```
>> >>
>> >> And since it's inline asm, there's no reason not to get that.
>> >
>> >
>> > We shouldn't remove the cmove because, per the Intel manual for the BSF
>> > instruction, if the content of the source operand is 0, the content of the
>> > destination operand is undefined. Also, removing the cmove doesn't provide
>> > any perf advantage.
>>
>> We rely on that behavior in other areas (memchr-evex512, for example).
>> It's been confirmed that it is defined to not change the destination
>> (as on AMD).
>>
>> It saves an instruction.
>>
>> Either way, however, I think Adhemerval is right and we should use
>> builtins for this.
>
>
> A few things to note here.
>
> This code is more than 20 years old; there's no telling which Intel
> processors it may be running on.
> A builtin will have the same issue unless it can produce code of 16 bytes
> or less.
> Also, the purpose of this patch is to fix a random, unexpected ~20% perf loss.
>
Likewise for the string/memory library....
Why not just update it without the cmov? It just seems like a waste not to
address it while we're at it.
>> >
>> >
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > > r~
>> >> > >