Re: [PATCH 24/26] arm: Add optimized addmul_1
Richard Henderson <rth@twiddle.net> writes:
> +ENTRY(__mpn_addmul_1)
> +	push	{ r4, r5, r6 }
> +	cfi_adjust_cfa_offset (12)
> +	cfi_rel_offset (r4, 0)
> +	cfi_rel_offset (r5, 4)
> +	cfi_rel_offset (r6, 8)
> +
> +	ldr	r6, [r1], #4
> +	ldr	r5, [r0]
> +	mov	r4, #0			/* init carry in */
> +	b	1f
> +0:
> +	ldr	r6, [r1], #4		/* load next ul */
> +	adds	r4, r4, r5		/* (out, c) = cl + lpl */
> +	ldr	r5, [r0, #4]		/* load next rl */
> +	str	r4, [r0], #4
> +	adc	r4, ip, #0		/* cl = hpl + c */
You might gain a cycle here on some cores by using a register other than
r4 as the destination of the adds and reversing the order of the str and
the adc so the two can dual-issue.  As it stands, the str reads the r4
the adds just wrote and the adc then overwrites it, so the last two insns
cannot be swapped.  On most semi-modern cores the extra register is
effectively free: load/store multiple instructions transfer registers
pairwise, so pushing a fourth register in the prologue costs no
additional cycle.  I'd expect this to benefit the A8 and maybe the A9;
on the A15 it should make no difference.
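
Untested, but something along these lines is what I have in mind.  The
choice of r7 as the spare register is arbitrary; anything pushed
alongside r4-r6 will do:

	push	{ r4, r5, r6, r7 }
	cfi_adjust_cfa_offset (16)
	cfi_rel_offset (r4, 0)
	cfi_rel_offset (r5, 4)
	cfi_rel_offset (r6, 8)
	cfi_rel_offset (r7, 12)
	...
0:
	ldr	r6, [r1], #4		/* load next ul */
	adds	r7, r4, r5		/* (out, c) = cl + lpl */
	ldr	r5, [r0, #4]		/* load next rl */
	adc	r4, ip, #0		/* cl = hpl + c */
	str	r7, [r0], #4		/* store out; independent of the adc */
	...
	pop	{ r4, r5, r6, r7 }

The adc still picks up the carry from the adds (nothing in between
touches the flags), but it and the str now use different registers, so
a dual-issue core is free to pair them.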
> +1:
> +	mov	ip, #0			/* zero-extend rl */
> +	umlal	r5, ip, r6, r3		/* (hpl, lpl) = ul * vl + rl */
> +	subs	r2, r2, #1
> +	bne	0b
> +
> +	adds	r4, r4, r5		/* (out, c) = cl + llpl */
> +	str	r4, [r0]
> +	adc	r0, ip, #0		/* return hpl + c */
> +
> +	pop	{ r4, r5, r6 }
> +	DO_RET(lr)
> +END(__mpn_addmul_1)
--
Måns Rullgård
mans@mansr.com