This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.
Re: [PATCH 24/26] arm: Add optimized addmul_1
Richard Henderson <rth@twiddle.net> writes:
> On 02/28/2013 05:58 AM, Måns Rullgård wrote:
>>> > +0:
>>> > + ldr r6, [r1], #4 /* load next ul */
>>> > + adds r4, r4, r5 /* (out, c) = cl + lpl */
>>> > + ldr r5, [r0, #4] /* load next rl */
>>> > + str r4, [r0], #4
>>> > + adc r4, ip, #0 /* cl = hpl + c */
>> You might gain a cycle here on some cores by replacing r4 by something
>> else in the adds/str sequence and reversing the order of the last two
>> insns to better exploit dual-issue. On most semi-modern cores you can
>> get another register for free by pushing one more to the stack
>> (load/store multiple instructions transfer registers pairwise).
>>
>> I'd expect this to benefit the A8 and maybe A9. On A15 it should make
>> no difference.
>>
>
> To swap the adc and str, I'd have to add another move insn too. I guess the
> intent is that it would dual-issue with the store, giving us 6 insns in 3 cycles
> as opposed to 5 insns in 4 cycles?
I meant like this:
ldr r6, [r1], #4 /* load next ul */
adds r7, r4, r5 /* (out, c) = cl + lpl */
ldr r5, [r0, #4] /* load next rl */
adc r4, ip, #0 /* cl = hpl + c */
str r7, [r0], #4 /* store out */
It seems to me this leaves everything with the same values as your
version. r7 can be pushed/popped for free since you're currently
preserving an odd number of registers.
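
For example (the register list here is made up, not the one in your actual
prologue), since ldm/stm move two registers per cycle on these cores, adding
one register to an odd-sized list costs no extra transfer cycle, and the same
goes for the matching pop:

push {r4, r5, r6} /* odd-sized list: rounds up to 2 transfer cycles */
push {r4, r5, r6, r7} /* one more register: still 2 transfer cycles */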
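
And assuming the core can pair a load or store with an adjacent ALU op and
nothing else stalls (the exact pairing rules differ between A8 and A9, so take
this as a sketch rather than a promise), the five instructions above could
then issue in three cycles:

ldr r6, [r1], #4 /* cycle 1, load/store pipe */
adds r7, r4, r5 /* cycle 1, ALU pipe */
ldr r5, [r0, #4] /* cycle 2, load/store pipe */
adc r4, ip, #0 /* cycle 2, ALU pipe */
str r7, [r0], #4 /* cycle 3, load/store pipe */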
> Fair enough.
>
> I'm not willing to work *too* hard on this. If someone cares about the last
> cycle of performance on A[89], they should work on getting the real libgmp
> routines re-licensed for glibc. I'm not willing to do politics.
Nor am I.
--
Måns Rullgård
mans@mansr.com