ARM floating point differences

Rick Mann rmann@latencyzero.com
Wed Jan 16 01:29:00 GMT 2008


I was posting about this problem to gcc-help, and then gnuarm, but  
I've not gotten many responses lately on gnuarm, so I thought I'd try  
here. The original messages are appended at the bottom.

Basically, an older set of tools I built generates much faster
floating point code than a newer set I built, and I'd like to figure
out how to rebuild the newer set so that it does too.


I've compared some of the floating point code in the disassembly of
our code. In one example, __addsf3, the code from one toolsuite is
markedly different from the other. I've included the disassembly below.

Clearly, the floating point code in the fast case is highly optimized.
It doesn't use the stack, it doesn't branch to other routines, etc.

Is there a configuration option I missed when I built the toolchain?
Or something else? The "slow" toolchain is built from the same or more  
recent versions of the tools. In practice, we probably won't use any  
floating point code, but it makes me wonder what other code lacks  
optimization.

Any help would be greatly appreciated.

TIA,
Rick


The slow code generated is:

80101a30 <__addsf3>:
80101a30:	e92d4030 	stmdb	sp!, {r4, r5, lr}
80101a34:	e24dd038 	sub	sp, sp, #56	; 0x38
80101a38:	e28d5020 	add	r5, sp, #32	; 0x20
80101a3c:	e58d0034 	str	r0, [sp, #52]
80101a40:	e58d1030 	str	r1, [sp, #48]
80101a44:	e28d0034 	add	r0, sp, #52	; 0x34
80101a48:	e1a01005 	mov	r1, r5
80101a4c:	e28d4010 	add	r4, sp, #16	; 0x10
80101a50:	eb0001b1 	bl	8010211c <__unpack_f>
80101a54:	e28d0030 	add	r0, sp, #48	; 0x30
80101a58:	e1a01004 	mov	r1, r4
80101a5c:	eb0001ae 	bl	8010211c <__unpack_f>
80101a60:	e1a01004 	mov	r1, r4
80101a64:	e1a0200d 	mov	r2, sp
80101a68:	e1a00005 	mov	r0, r5
80101a6c:	ebffff55 	bl	801017c8 <_fpadd_parts>
80101a70:	eb00014e 	bl	80101fb0 <__pack_f>
80101a74:	e28dd038 	add	sp, sp, #56	; 0x38
80101a78:	e8bd8030 	ldmia	sp!, {r4, r5, pc}
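
The call sequence in this listing (two calls to __unpack_f, one to
_fpadd_parts, one to __pack_f) reads like a generic unpack / operate /
pack scheme. As a rough model of that structure only, here is a Python
sketch. The function names are borrowed from the symbols in the
listing, but the bodies, the Parts type, and the f32_bits and addsf3
helpers are assumptions written for illustration, not the library's
source. Infinities and NaNs are deliberately left out.

```python
import struct
from typing import NamedTuple


class Parts(NamedTuple):
    """Hypothetical unpacked form: value = (-1)**sign * mag * 2**exp."""
    sign: int  # 0 or 1
    mag: int   # integer significand, implicit leading 1 made explicit
    exp: int   # power of two applied to mag


def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def unpack_f(bits: int) -> Parts:
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if biased == 0xFF:
        raise ValueError("infinities and NaNs are left out of this sketch")
    if biased == 0:  # zero or subnormal: no implicit leading 1
        return Parts(sign, frac, -149)
    return Parts(sign, frac | 0x800000, biased - 150)


def fpadd_parts(a: Parts, b: Parts) -> Parts:
    # Align both operands to the smaller exponent, then add exactly
    # (Python integers do not overflow).
    exp = min(a.exp, b.exp)
    total = ((-a.mag if a.sign else a.mag) << (a.exp - exp)) \
        + ((-b.mag if b.sign else b.mag) << (b.exp - exp))
    if total == 0:
        return Parts(a.sign & b.sign, 0, exp)  # -0 only if both are -0
    return Parts(1 if total < 0 else 0, abs(total), exp)


def pack_f(p: Parts) -> int:
    if p.mag == 0:
        return p.sign << 31
    # Exponent of the result's lowest bit: keep 24 significant bits,
    # but never go below the subnormal limit 2**-149.
    e_lsb = max(p.exp + p.mag.bit_length() - 24, -149)
    shift = e_lsb - p.exp
    if shift > 0:  # drop low bits, rounding to nearest, ties to even
        q, rem = p.mag >> shift, p.mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and (q & 1)):
            q += 1
    else:
        q = p.mag << -shift
    if q == (1 << 24):  # rounding carried into a 25th bit
        q, e_lsb = q >> 1, e_lsb + 1
    if q < (1 << 23):  # subnormal result: exponent field is 0
        return (p.sign << 31) | q
    biased = e_lsb + 150
    if biased >= 0xFF:  # overflow to infinity
        return (p.sign << 31) | 0x7F800000
    return (p.sign << 31) | (biased << 23) | (q & 0x7FFFFF)


def addsf3(a_bits: int, b_bits: int) -> int:
    """Same three-stage shape as the slow listing."""
    return pack_f(fpadd_parts(unpack_f(a_bits), unpack_f(b_bits)))


if __name__ == "__main__":
    print(hex(addsf3(f32_bits(1.5), f32_bits(2.25))))
```

Each stage is a separate function working on an intermediate record,
which is the shape the slow listing shows: operands stored to the
stack, unpacked through pointers, combined, then repacked.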


While the fast code (despite being much longer) is:

80101750 <__addsf3>:
80101750:	e1b02080 	lsls	r2, r0, #1
80101754:	11b03081 	lslsne	r3, r1, #1
80101758:	11320003 	teqne	r2, r3
8010175c:	11f0cc42 	mvnsne	ip, r2, asr #24
80101760:	11f0cc43 	mvnsne	ip, r3, asr #24
80101764:	0a00003c 	beq	8010185c <__addsf3+0x10c>
80101768:	e1a02c22 	lsr	r2, r2, #24
8010176c:	e0723c23 	rsbs	r3, r2, r3, lsr #24
80101770:	c0822003 	addgt	r2, r2, r3
80101774:	c0201001 	eorgt	r1, r0, r1
80101778:	c0210000 	eorgt	r0, r1, r0
8010177c:	c0201001 	eorgt	r1, r0, r1
80101780:	b2633000 	rsblt	r3, r3, #0	; 0x0
80101784:	e3530019 	cmp	r3, #25	; 0x19
80101788:	812fff1e 	bxhi	lr
8010178c:	e3100102 	tst	r0, #-2147483648	; 0x80000000
80101790:	e3800502 	orr	r0, r0, #8388608	; 0x800000
80101794:	e3c004ff 	bic	r0, r0, #-16777216	; 0xff000000
80101798:	12600000 	rsbne	r0, r0, #0	; 0x0
8010179c:	e3110102 	tst	r1, #-2147483648	; 0x80000000
801017a0:	e3811502 	orr	r1, r1, #8388608	; 0x800000
801017a4:	e3c114ff 	bic	r1, r1, #-16777216	; 0xff000000
801017a8:	12611000 	rsbne	r1, r1, #0	; 0x0
801017ac:	e1320003 	teq	r2, r3
801017b0:	0a000023 	beq	80101844 <__addsf3+0xf4>
801017b4:	e2422001 	sub	r2, r2, #1	; 0x1
801017b8:	e0900351 	adds	r0, r0, r1, asr r3
801017bc:	e2633020 	rsb	r3, r3, #32	; 0x20
801017c0:	e1a01311 	lsl	r1, r1, r3
801017c4:	e2003102 	and	r3, r0, #-2147483648	; 0x80000000
801017c8:	5a000001 	bpl	801017d4 <__addsf3+0x84>
801017cc:	e2711000 	rsbs	r1, r1, #0	; 0x0
801017d0:	e2e00000 	rsc	r0, r0, #0	; 0x0
801017d4:	e3500502 	cmp	r0, #8388608	; 0x800000
801017d8:	3a00000b 	bcc	8010180c <__addsf3+0xbc>
801017dc:	e3500401 	cmp	r0, #16777216	; 0x1000000
801017e0:	3a000004 	bcc	801017f8 <__addsf3+0xa8>
801017e4:	e1b000a0 	lsrs	r0, r0, #1
801017e8:	e1a01061 	rrx	r1, r1
801017ec:	e2822001 	add	r2, r2, #1	; 0x1
801017f0:	e35200fe 	cmp	r2, #254	; 0xfe
801017f4:	2a00002d 	bcs	801018b0 <__addsf3+0x160>
801017f8:	e3510102 	cmp	r1, #-2147483648	; 0x80000000
801017fc:	e0a00b82 	adc	r0, r0, r2, lsl #23
80101800:	03c00001 	biceq	r0, r0, #1	; 0x1
80101804:	e1800003 	orr	r0, r0, r3
80101808:	e12fff1e 	bx	lr
8010180c:	e1b01081 	lsls	r1, r1, #1
80101810:	e0a00000 	adc	r0, r0, r0
80101814:	e3100502 	tst	r0, #8388608	; 0x800000
80101818:	e2422001 	sub	r2, r2, #1	; 0x1
8010181c:	1afffff5 	bne	801017f8 <__addsf3+0xa8>
80101820:	e16fcf10 	clz	ip, r0
80101824:	e24cc008 	sub	ip, ip, #8	; 0x8
80101828:	e052200c 	subs	r2, r2, ip
8010182c:	e1a00c10 	lsl	r0, r0, ip
80101830:	a0800b82 	addge	r0, r0, r2, lsl #23
80101834:	b2622000 	rsblt	r2, r2, #0	; 0x0
80101838:	a1800003 	orrge	r0, r0, r3
8010183c:	b1830230 	orrlt	r0, r3, r0, lsr r2
80101840:	e12fff1e 	bx	lr
80101844:	e3320000 	teq	r2, #0	; 0x0
80101848:	e2211502 	eor	r1, r1, #8388608	; 0x800000
8010184c:	02200502 	eoreq	r0, r0, #8388608	; 0x800000
80101850:	02822001 	addeq	r2, r2, #1	; 0x1
80101854:	12433001 	subne	r3, r3, #1	; 0x1
80101858:	eaffffd5 	b	801017b4 <__addsf3+0x64>
8010185c:	e1a03081 	lsl	r3, r1, #1
80101860:	e1f0cc42 	mvns	ip, r2, asr #24
80101864:	11f0cc43 	mvnsne	ip, r3, asr #24
80101868:	0a000013 	beq	801018bc <__addsf3+0x16c>
8010186c:	e1320003 	teq	r2, r3
80101870:	0a000002 	beq	80101880 <__addsf3+0x130>
80101874:	e3320000 	teq	r2, #0	; 0x0
80101878:	01a00001 	moveq	r0, r1
8010187c:	e12fff1e 	bx	lr
80101880:	e1300001 	teq	r0, r1
80101884:	13a00000 	movne	r0, #0	; 0x0
80101888:	112fff1e 	bxne	lr
8010188c:	e31204ff 	tst	r2, #-16777216	; 0xff000000
80101890:	1a000002 	bne	801018a0 <__addsf3+0x150>
80101894:	e1b00080 	lsls	r0, r0, #1
80101898:	23800102 	orrcs	r0, r0, #-2147483648	; 0x80000000
8010189c:	e12fff1e 	bx	lr
801018a0:	e2922402 	adds	r2, r2, #33554432	; 0x2000000
801018a4:	32800502 	addcc	r0, r0, #8388608	; 0x800000
801018a8:	312fff1e 	bxcc	lr
801018ac:	e2003102 	and	r3, r0, #-2147483648	; 0x80000000
801018b0:	e383047f 	orr	r0, r3, #2130706432	; 0x7f000000
801018b4:	e3800502 	orr	r0, r0, #8388608	; 0x800000
801018b8:	e12fff1e 	bx	lr
801018bc:	e1f02c42 	mvns	r2, r2, asr #24
801018c0:	11a00001 	movne	r0, r1
801018c4:	01f03c43 	mvnseq	r3, r3, asr #24
801018c8:	11a01000 	movne	r1, r0
801018cc:	e1b02480 	lsls	r2, r0, #9
801018d0:	01b03481 	lslseq	r3, r1, #9
801018d4:	01300001 	teqeq	r0, r1
801018d8:	13800501 	orrne	r0, r0, #4194304	; 0x400000
801018dc:	e12fff1e 	bx	lr
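
The contrast between the two listings (calls and stack traffic in one,
neither in the other) can be checked mechanically. The following is a
minimal sketch, assuming the objdump-style line layout shown in this
message; the summarize function and the two sample strings are made up
for illustration. It counts only plain "bl" as a call, so a
conditional form such as "blne" would be missed.

```python
import re

# address: 8 hex digits of encoding, mnemonic, operands
LINE = re.compile(r"^\s*([0-9a-f]+):\s+([0-9a-f]{8})\s+(\S+)\s*(.*)$")

SLOW = """\
80101a30 <__addsf3>:
80101a30:  e92d4030  stmdb  sp!, {r4, r5, lr}
80101a34:  e24dd038  sub    sp, sp, #56 ; 0x38
80101a38:  e28d5020  add    r5, sp, #32 ; 0x20
80101a3c:  e58d0034  str    r0, [sp, #52]
80101a40:  e58d1030  str    r1, [sp, #48]
80101a44:  e28d0034  add    r0, sp, #52 ; 0x34
80101a48:  e1a01005  mov    r1, r5
80101a4c:  e28d4010  add    r4, sp, #16 ; 0x10
80101a50:  eb0001b1  bl     8010211c <__unpack_f>
80101a54:  e28d0030  add    r0, sp, #48 ; 0x30
80101a58:  e1a01004  mov    r1, r4
80101a5c:  eb0001ae  bl     8010211c <__unpack_f>
80101a60:  e1a01004  mov    r1, r4
80101a64:  e1a0200d  mov    r2, sp
80101a68:  e1a00005  mov    r0, r5
80101a6c:  ebffff55  bl     801017c8 <_fpadd_parts>
80101a70:  eb00014e  bl     80101fb0 <__pack_f>
80101a74:  e28dd038  add    sp, sp, #56 ; 0x38
80101a78:  e8bd8030  ldmia  sp!, {r4, r5, pc}
"""

# Only the first six instructions of the fast version, as a sample.
FAST_EXCERPT = """\
80101750 <__addsf3>:
80101750:  e1b02080  lsls    r2, r0, #1
80101754:  11b03081  lslsne  r3, r1, #1
80101758:  11320003  teqne   r2, r3
8010175c:  11f0cc42  mvnsne  ip, r2, asr #24
80101760:  11f0cc43  mvnsne  ip, r3, asr #24
80101764:  0a00003c  beq     8010185c <__addsf3+0x10c>
"""


def summarize(listing: str) -> dict:
    """Count instructions, bl calls, and operands that mention sp."""
    stats = {"instructions": 0, "calls": 0, "stack_refs": 0}
    for line in listing.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # symbol header or blank line
        mnemonic = match.group(3)
        operands = match.group(4).split(";")[0]  # drop the trailing comment
        stats["instructions"] += 1
        if mnemonic == "bl":
            stats["calls"] += 1
        if re.search(r"\bsp\b", operands):
            stats["stack_refs"] += 1
    return stats


if __name__ == "__main__":
    print("slow:", summarize(SLOW))
    print("fast:", summarize(FAST_EXCERPT))
```

Feeding it the complete fast listing instead of the excerpt would be
the real check; the excerpt is only there to keep the example short.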




A little more information: there seems to be a difference in the
floating point flags in the resulting binary's ELF header (which would
go a long way to explaining what I'm seeing). The ELF built with the
more recent tools results in this:

$ xscale-elf-readelf -h h.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 61 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            ARM
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           ARM
  Version:                           0x1
  Entry point address:               0x80100000
  Start of program headers:          52 (bytes into file)
  Start of section headers:          448508 (bytes into file)
  Flags:                             0x602, has entry point, GNU EABI, software FP, VFP
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         1
  Size of section headers:           40 (bytes)
  Number of section headers:         25
  Section header string table index: 22


The ELF built with the older (faster) tools results in this:

$ arm-elf-readelf -h h.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 61 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            ARM
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           ARM
  Version:                           0x1
  Entry point address:               0x80100000
  Start of program headers:          52 (bytes into file)
  Start of section headers:          411484 (bytes into file)
  Flags:                             0x402, has entry point, GNU EABI, VFP
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         1
  Size of section headers:           40 (bytes)
  Number of section headers:         26
  Section header string table index: 23

The relevant change is in the Flags: field. The new tools set
"software FP" (0x602 versus 0x402, a difference of bit 0x200); the old
tools don't.

Now, the processor doesn't have hardware floating point, yet the code
runs in both cases, so some kind of software floating point code is
being emitted.
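
The two Flags: values can be decoded bit by bit. This is a minimal
sketch, assuming the bit values that, as far as I know, binutils
defines in include/elf/arm.h (EF_ARM_HASENTRY, EF_ARM_SOFT_FLOAT,
EF_ARM_VFP_FLOAT, EF_ARM_EABIMASK). It covers only the bits that occur
in the two headers above, and the describe_flags helper is made up for
illustration.

```python
# Assumed values from binutils' include/elf/arm.h; only the bits seen
# in the two readelf outputs are handled.
EF_ARM_HASENTRY = 0x002
EF_ARM_SOFT_FLOAT = 0x200
EF_ARM_VFP_FLOAT = 0x400
EF_ARM_EABIMASK = 0xFF000000  # EABI version lives in the top byte


def describe_flags(e_flags: int) -> list:
    """Return readelf-style names for the known bits of e_flags."""
    names = []
    if e_flags & EF_ARM_HASENTRY:
        names.append("has entry point")
    if (e_flags & EF_ARM_EABIMASK) == 0:
        names.append("GNU EABI")  # EABI version field is 0
    if e_flags & EF_ARM_SOFT_FLOAT:
        names.append("software FP")
    if e_flags & EF_ARM_VFP_FLOAT:
        names.append("VFP")
    return names


if __name__ == "__main__":
    new_tools, old_tools = 0x602, 0x402
    print(hex(new_tools), describe_flags(new_tools))
    print(hex(old_tools), describe_flags(old_tools))
    print("differing bits:", hex(new_tools ^ old_tools))
```

The XOR of the two values is 0x200, which under the assumption above
is exactly the software FP bit; everything else in the flags word is
identical between the two builds.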

TIA,
Rick

(original post below)


> I've been building tools targeting the Marvell Xscale processor a
> lot lately. A set of tools I built a few months ago seems to generate
> much faster code on our target hardware than tools I built more
> recently. There were some significant differences in the way the
> tools were built, but it doesn't seem like that's enough to explain
> the difference. Unfortunately, I don't remember exactly how I built
> the older toolchain, so I'm hoping someone can help me determine
> what it was by looking at the build result.
>
> Old tools:
>
> $ arm-elf-gcc -v
> Using built-in specs.
> Target: arm-elf
> Configured with: ../configure --prefix=/usr/local/arm3 --target=arm-elf
> --with-newlib --with-cpu=xscale --enable-languages=c,c++
> Thread model: single
> gcc version 4.2.1
>
> $ arm-elf-ld --version
> GNU ld (GNU Binutils) 2.18
>
> How do I tell what version of newlib is installed (I think it's 1.15)?
>
> Built using a multistep process, where I first built binutils, then
> gcc, then newlib (I don't recall if I did a stage 1 GCC build first,
> but somehow I got it all working).
>
>
> The latest tools are slightly different, and built with a combined
> tree build:
>
> gcc-4.2.2
> binutils-2.17
> newlib-1.15
>
> $ xscale-elf-gcc -v
> Using built-in specs.
> Target: xscale-elf
> Configured with: ../combined/configure --target=xscale-elf --disable-nls
> --with-newlib --prefix=/usr/local/gcc-xscale-elf
> --disable-newlib-supplied-syscalls
> Thread model: single
> gcc version 4.2.2
>
>
>
> I'm sorry I can't provide better information, but I'd really like to
> figure this out. The code doesn't call into the standard C library,
> but does make use of a lot of floating point code. Is it possible
> that this code is better with the other tools (either built more
> optimized, or generally different)? I don't know; I'm just
> speculating. It is C++ code (bouncing balls on a screen, the balls
> are object instances).
>
> Thanks for any help!
>
> -- 
> Rick
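
The two "Configured with:" lines in the quoted post differ in several
options, and a set comparison makes that explicit. This is a small
sketch; the option strings are copied from the gcc -v output above
with the wrapped words rejoined (which is an assumption about how the
mail was wrapped), and parse_options and compare are hypothetical
helper names. It only lists the differences and says nothing about
which of them matters for the floating point code.

```python
import shlex

OLD = ("../configure --prefix=/usr/local/arm3 --target=arm-elf "
       "--with-newlib --with-cpu=xscale --enable-languages=c,c++")
NEW = ("../combined/configure --target=xscale-elf --disable-nls "
       "--with-newlib --prefix=/usr/local/gcc-xscale-elf "
       "--disable-newlib-supplied-syscalls")


def parse_options(command_line: str) -> dict:
    """Map each --option to its value ('' when it has none)."""
    options = {}
    for word in shlex.split(command_line)[1:]:  # skip the configure path
        name, _, value = word.partition("=")
        options[name] = value
    return options


def compare(old: str, new: str) -> dict:
    """Report options unique to each side and options whose value changed."""
    a, b = parse_options(old), parse_options(new)
    return {
        "only_old": sorted(set(a) - set(b)),
        "only_new": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }


if __name__ == "__main__":
    for key, names in compare(OLD, NEW).items():
        print(key, names)
```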






More information about the crossgcc mailing list