[PATCH] x86_64: Optimize ffsll function code size.
Noah Goldstein
goldstein.w.n@gmail.com
Thu Jul 27 00:00:31 GMT 2023
On Wed, Jul 26, 2023 at 5:38 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
>
>
> On Wed, Jul 26, 2023 at 2:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>> >
>> >
>> >
>> > On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>> >>
>> >> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
>> >> <libc-alpha@sourceware.org> wrote:
>> >> >
>> >> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
>> >> > richard.henderson@linaro.org> wrote:
>> >> >
>> >> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
>> >> > > > The ffsll function is 17 bytes; this patch shrinks it to 16 bytes.
>> >> > > > Currently, ffsll randomly regresses by ~20%, depending on how the
>> >> > > > code gets aligned.
>> >> > > >
>> >> > > > This patch fixes the random performance regression in ffsll.
>> >> > > > ---
>> >> > > > sysdeps/x86_64/ffsll.c | 2 +-
>> >> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >> > > >
>> >> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>> >> > > > index a1c13d4906..dbded6f0a1 100644
>> >> > > > --- a/sysdeps/x86_64/ffsll.c
>> >> > > > +++ b/sysdeps/x86_64/ffsll.c
>> >> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
>> >> > > > long long int tmp;
>> >> > > >
>> >> > > >   asm ("bsfq %2,%0\n"   /* Count low bits in X and store in %1.  */
>> >> > > > -      "cmoveq %1,%0\n"  /* If number was zero, use -1 as result.  */
>> >> > > > +      "cmove %k1,%k0\n" /* If number was zero, use -1 as result.  */
>> >> > >
>> >> > > This no longer produces -1, but 0xffffffff in cnt. However, since the
>> >> > > return type is
>> >> > > 'int', cnt need not be 'long long int' either. I'm not sure why tmp
>> >> > > exists at all, since
>> >> > > cnt is the only register modified.
>> >> > >
>> >> >
>> >> > Here is the exact assembly produced with this change.
>> >> > ./build-x86_64-linux/string/ffsll.o: file format elf64-x86-64
>> >> >
>> >> >
>> >> > Disassembly of section .text:
>> >> >
>> >> > 0000000000000000 <ffsll>:
>> >> > 0: ba ff ff ff ff mov $0xffffffff,%edx
>> >> > 5: 48 0f bc c7 bsf %rdi,%rax
>> >> > 9: 0f 44 c2 cmove %edx,%eax
>> >> > c: 83 c0 01 add $0x1,%eax
>> >> > f: c3 ret
>> >> >
>> >>
>> >> FWIW it should be:
>> >> ```
>> >> 0000000000000000 <.text>:
>> >> 0: b8 ff ff ff ff mov $0xffffffff,%eax
>> >> 5: 48 0f bc c7 bsf %rdi,%rax
>> >> 9: ff c0 inc %eax
>> >> ```
>> >>
>> >> And since it's inline asm, there's no reason not to get that.
>> >
>> >
>> > We shouldn't remove the cmove because, per the Intel manual for the BSF
>> > instruction, if the content of the source operand is 0, the content of the
>> > destination operand is undefined. Also, removing the cmove doesn't provide
>> > any perf advantage.
>>
>> We rely on that behavior in other areas (memchr-evex512, for example).
>> It's been confirmed that it is defined to not change the destination
>> (as on AMD).
>>
>> It saves an instruction.
>>
>> Either way, however, I think Adhemerval is right and we should use
>> builtins for this.
>
>
> A few things to note here.
>
> This code is more than 20 years old; there's no telling which Intel
> processors it may be running on.
> A builtin will have the same issue unless it can produce code of 16 bytes
> or less.
> Also, the purpose of this patch is to fix a random, unexpected ~20% perf loss.
>
Likewise for the string/memory library....
Why not just update it without the cmov? It just seems like a waste not to
address it while we're at it.
>> >
>> >
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > > r~
>> >> > >