Re: [PATCH 24/26] arm: Add optimized addmul_1

On 02/28/2013 05:58 AM, Måns Rullgård wrote:
>> > +0:
>> > +	ldr	r6, [r1], #4		/* load next ul */
>> > +	adds	r4, r4, r5		/* (out, c) = cl + lpl */
>> > +	ldr	r5, [r0, #4]		/* load next rl */
>> > +	str	r4, [r0], #4
>> > +	adc	r4, ip, #0		/* cl = hpl + c */
> You might gain a cycle here on some cores by replacing r4 by something
> else in the adds/str sequence and reversing the order of the last two
> insns to better exploit dual-issue.  On most semi-modern cores you can
> get another register for free by pushing one more to the stack
> (load/store multiple instructions transfer registers pairwise).
> I'd expect this to benefit the A8 and maybe A9.  On A15 it should make
> no difference.

To swap the adc and str, I'd have to add another move insn too.  I guess the
intent is that it would dual-issue with the store, giving us 6 insns in 3
cycles as opposed to 5 insns in 4 cycles?
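
However the adc and str end up ordered, the carry chain has to stay intact.
As a rough reference for checking a rescheduled loop, here is a minimal C
sketch of the limb arithmetic described by the comments in the quoted hunk.
The function name ref_addmul_1, the argument order, and the 32-bit limb type
are assumptions made for illustration; this is not the patch's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model: rp[0..n) += up[0..n) * vl.
   Returns the carry-out limb.  Assumes 32-bit limbs. */
static uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up,
                             size_t n, uint32_t vl)
{
    uint32_t cl = 0;                      /* carry limb between iterations */

    for (size_t i = 0; i < n; i++) {
        /* (hpl, lpl) = ul * vl + rl; this cannot overflow 64 bits */
        uint64_t p = (uint64_t)up[i] * vl + rp[i];
        uint32_t lpl = (uint32_t)p;
        uint32_t hpl = (uint32_t)(p >> 32);

        /* (out, c) = cl + lpl */
        uint64_t s = (uint64_t)cl + lpl;
        rp[i] = (uint32_t)s;

        /* cl = hpl + c; cannot overflow, since ul*vl + rl + cl < 2^64 */
        cl = hpl + (uint32_t)(s >> 32);
    }
    return cl;
}
```

Comparing an assembly variant against a model like this on all-ones inputs
exercises the worst-case carry propagation.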

Fair enough.

I'm not willing to work *too* hard on this.  If someone cares about the last
cycle of performance on A[89], they should work on getting the real libgmp
routines re-licensed for glibc.  I'm not willing to do politics.


