ARM floating point differences
Rick Mann
rmann@latencyzero.com
Wed Jan 16 01:29:00 GMT 2008
I was posting about this problem to gcc-help, and then gnuarm, but
I've not gotten many responses lately on gnuarm, so I thought I'd try
here. The original messages are appended at the bottom.
Basically, an older set of tools I built is
generating much faster floating point code. A new set of tools I built
does not have such fast FP code, and I'd like to figure out how to
rebuild it so that it does.
I've compared some of the floating point code in the disassembly of
our code. In one example, __addsf3, the code from one toolsuite is
markedly different from the other. I've included the disassembly below.
Clearly, the floating point code in the fast case is highly optimized.
It doesn't use the stack, it doesn't branch to other routines, etc.
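For anyone wanting to reproduce the comparison, a small test file along the following lines should be enough. This is only a sketch, not code from our project: the file name and function name are made up for illustration. On a target with no hardware floating point, each float addition in the loop is expected to become a call to __addsf3. That routine normally comes from the toolchain's libgcc rather than from the compiled source, so linking the file with each toolchain and disassembling the result should show which version of the routine that toolchain supplies.

```c
/* fpadd_test.c: hypothetical file name, used only for this illustration. */

/* Sum an array of floats. With software floating point, each '+' on
   float operands is expected to compile to a call to __addsf3. */
float sum_floats(const float *values, int count)
{
    float total = 0.0f;
    int i;

    for (i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}
```

Calling sum_floats() from the program's startup code and linking with each toolchain in turn gives two ELF files whose __addsf3 bodies can be compared side by side.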
Is there a configuration option I missed when I built the toolchain?
Or something else? The "slow" toolchain is built from the same or more
recent versions of the tools. In practice, we probably won't use any
floating point code, but it makes me wonder what other code lacks
optimization.
Any help would be greatly appreciated.
TIA,
Rick
The slow code generated is:
80101a30 <__addsf3>:
80101a30: e92d4030 stmdb sp!, {r4, r5, lr}
80101a34: e24dd038 sub sp, sp, #56 ; 0x38
80101a38: e28d5020 add r5, sp, #32 ; 0x20
80101a3c: e58d0034 str r0, [sp, #52]
80101a40: e58d1030 str r1, [sp, #48]
80101a44: e28d0034 add r0, sp, #52 ; 0x34
80101a48: e1a01005 mov r1, r5
80101a4c: e28d4010 add r4, sp, #16 ; 0x10
80101a50: eb0001b1 bl 8010211c <__unpack_f>
80101a54: e28d0030 add r0, sp, #48 ; 0x30
80101a58: e1a01004 mov r1, r4
80101a5c: eb0001ae bl 8010211c <__unpack_f>
80101a60: e1a01004 mov r1, r4
80101a64: e1a0200d mov r2, sp
80101a68: e1a00005 mov r0, r5
80101a6c: ebffff55 bl 801017c8 <_fpadd_parts>
80101a70: eb00014e bl 80101fb0 <__pack_f>
80101a74: e28dd038 add sp, sp, #56 ; 0x38
80101a78: e8bd8030 ldmia sp!, {r4, r5, pc}
While the fast code (despite being much longer) is:
80101750 <__addsf3>:
80101750: e1b02080 lsls r2, r0, #1
80101754: 11b03081 lslsne r3, r1, #1
80101758: 11320003 teqne r2, r3
8010175c: 11f0cc42 mvnsne ip, r2, asr #24
80101760: 11f0cc43 mvnsne ip, r3, asr #24
80101764: 0a00003c beq 8010185c <__addsf3+0x10c>
80101768: e1a02c22 lsr r2, r2, #24
8010176c: e0723c23 rsbs r3, r2, r3, lsr #24
80101770: c0822003 addgt r2, r2, r3
80101774: c0201001 eorgt r1, r0, r1
80101778: c0210000 eorgt r0, r1, r0
8010177c: c0201001 eorgt r1, r0, r1
80101780: b2633000 rsblt r3, r3, #0 ; 0x0
80101784: e3530019 cmp r3, #25 ; 0x19
80101788: 812fff1e bxhi lr
8010178c: e3100102 tst r0, #-2147483648 ; 0x80000000
80101790: e3800502 orr r0, r0, #8388608 ; 0x800000
80101794: e3c004ff bic r0, r0, #-16777216 ; 0xff000000
80101798: 12600000 rsbne r0, r0, #0 ; 0x0
8010179c: e3110102 tst r1, #-2147483648 ; 0x80000000
801017a0: e3811502 orr r1, r1, #8388608 ; 0x800000
801017a4: e3c114ff bic r1, r1, #-16777216 ; 0xff000000
801017a8: 12611000 rsbne r1, r1, #0 ; 0x0
801017ac: e1320003 teq r2, r3
801017b0: 0a000023 beq 80101844 <__addsf3+0xf4>
801017b4: e2422001 sub r2, r2, #1 ; 0x1
801017b8: e0900351 adds r0, r0, r1, asr r3
801017bc: e2633020 rsb r3, r3, #32 ; 0x20
801017c0: e1a01311 lsl r1, r1, r3
801017c4: e2003102 and r3, r0, #-2147483648 ; 0x80000000
801017c8: 5a000001 bpl 801017d4 <__addsf3+0x84>
801017cc: e2711000 rsbs r1, r1, #0 ; 0x0
801017d0: e2e00000 rsc r0, r0, #0 ; 0x0
801017d4: e3500502 cmp r0, #8388608 ; 0x800000
801017d8: 3a00000b bcc 8010180c <__addsf3+0xbc>
801017dc: e3500401 cmp r0, #16777216 ; 0x1000000
801017e0: 3a000004 bcc 801017f8 <__addsf3+0xa8>
801017e4: e1b000a0 lsrs r0, r0, #1
801017e8: e1a01061 rrx r1, r1
801017ec: e2822001 add r2, r2, #1 ; 0x1
801017f0: e35200fe cmp r2, #254 ; 0xfe
801017f4: 2a00002d bcs 801018b0 <__addsf3+0x160>
801017f8: e3510102 cmp r1, #-2147483648 ; 0x80000000
801017fc: e0a00b82 adc r0, r0, r2, lsl #23
80101800: 03c00001 biceq r0, r0, #1 ; 0x1
80101804: e1800003 orr r0, r0, r3
80101808: e12fff1e bx lr
8010180c: e1b01081 lsls r1, r1, #1
80101810: e0a00000 adc r0, r0, r0
80101814: e3100502 tst r0, #8388608 ; 0x800000
80101818: e2422001 sub r2, r2, #1 ; 0x1
8010181c: 1afffff5 bne 801017f8 <__addsf3+0xa8>
80101820: e16fcf10 clz ip, r0
80101824: e24cc008 sub ip, ip, #8 ; 0x8
80101828: e052200c subs r2, r2, ip
8010182c: e1a00c10 lsl r0, r0, ip
80101830: a0800b82 addge r0, r0, r2, lsl #23
80101834: b2622000 rsblt r2, r2, #0 ; 0x0
80101838: a1800003 orrge r0, r0, r3
8010183c: b1830230 orrlt r0, r3, r0, lsr r2
80101840: e12fff1e bx lr
80101844: e3320000 teq r2, #0 ; 0x0
80101848: e2211502 eor r1, r1, #8388608 ; 0x800000
8010184c: 02200502 eoreq r0, r0, #8388608 ; 0x800000
80101850: 02822001 addeq r2, r2, #1 ; 0x1
80101854: 12433001 subne r3, r3, #1 ; 0x1
80101858: eaffffd5 b 801017b4 <__addsf3+0x64>
8010185c: e1a03081 lsl r3, r1, #1
80101860: e1f0cc42 mvns ip, r2, asr #24
80101864: 11f0cc43 mvnsne ip, r3, asr #24
80101868: 0a000013 beq 801018bc <__addsf3+0x16c>
8010186c: e1320003 teq r2, r3
80101870: 0a000002 beq 80101880 <__addsf3+0x130>
80101874: e3320000 teq r2, #0 ; 0x0
80101878: 01a00001 moveq r0, r1
8010187c: e12fff1e bx lr
80101880: e1300001 teq r0, r1
80101884: 13a00000 movne r0, #0 ; 0x0
80101888: 112fff1e bxne lr
8010188c: e31204ff tst r2, #-16777216 ; 0xff000000
80101890: 1a000002 bne 801018a0 <__addsf3+0x150>
80101894: e1b00080 lsls r0, r0, #1
80101898: 23800102 orrcs r0, r0, #-2147483648 ; 0x80000000
8010189c: e12fff1e bx lr
801018a0: e2922402 adds r2, r2, #33554432 ; 0x2000000
801018a4: 32800502 addcc r0, r0, #8388608 ; 0x800000
801018a8: 312fff1e bxcc lr
801018ac: e2003102 and r3, r0, #-2147483648 ; 0x80000000
801018b0: e383047f orr r0, r3, #2130706432 ; 0x7f000000
801018b4: e3800502 orr r0, r0, #8388608 ; 0x800000
801018b8: e12fff1e bx lr
801018bc: e1f02c42 mvns r2, r2, asr #24
801018c0: 11a00001 movne r0, r1
801018c4: 01f03c43 mvnseq r3, r3, asr #24
801018c8: 11a01000 movne r1, r0
801018cc: e1b02480 lsls r2, r0, #9
801018d0: 01b03481 lslseq r3, r1, #9
801018d4: 01300001 teqeq r0, r1
801018d8: 13800501 orrne r0, r0, #4194304 ; 0x400000
801018dc: e12fff1e bx lr
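To make the fast listing easier to follow, here is a rough C rendering of the field extraction it performs inline on each operand (the lsls/lsr pair that isolates the exponent, and the orr/bic pair that builds the mantissa). The struct and function names are invented for this sketch, and it only covers normal numbers. Zeros, denormals, infinities and NaNs are handled on separate paths in the listing, and the listing also negates the mantissa when the sign bit is set, which this sketch leaves out.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a single-precision float. */
typedef struct {
    uint32_t sign;      /* 1 if the value is negative */
    uint32_t exponent;  /* biased 8-bit exponent */
    uint32_t mantissa;  /* 24-bit significand with the implicit bit set */
} sf_fields;

/* Reinterpret a float as its raw 32-bit pattern without aliasing problems. */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Split a normal float into fields, mirroring the shifts and masks in the
   fast __addsf3 listing. */
sf_fields split_float(float f)
{
    uint32_t bits = float_bits(f);
    sf_fields out;

    out.sign = bits >> 31;                               /* tst rN, #0x80000000 */
    out.exponent = (bits << 1) >> 24;                    /* lsls #1, then lsr #24 */
    out.mantissa = (bits | 0x00800000u) & ~0xff000000u;  /* orr #0x800000, bic #0xff000000 */
    return out;
}
```

The slow listing appears to do equivalent work through calls to __unpack_f and __pack_f, going through the stack each time, which would account for a good part of the difference in speed.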
A little more information: there seems to be a difference in the
floating point flags recorded in the resulting binary's ELF header
(which would go a long way to explaining what I'm seeing). The ELF
built with the more recent tools results in this:
$ xscale-elf-readelf -h h.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 61 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: ARM
ABI Version: 0
Type: EXEC (Executable file)
Machine: ARM
Version: 0x1
Entry point address: 0x80100000
Start of program headers: 52 (bytes into file)
Start of section headers: 448508 (bytes into file)
Flags: 0x602, has entry point, GNU EABI, software FP, VFP
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 1
Size of section headers: 40 (bytes)
Number of section headers: 25
Section header string table index: 22
The ELF built with the older (faster) tools results in this:
$ arm-elf-readelf -h h.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 61 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: ARM
ABI Version: 0
Type: EXEC (Executable file)
Machine: ARM
Version: 0x1
Entry point address: 0x80100000
Start of program headers: 52 (bytes into file)
Start of section headers: 411484 (bytes into file)
Flags: 0x402, has entry point, GNU EABI, VFP
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 1
Size of section headers: 40 (bytes)
Number of section headers: 26
Section header string table index: 23
The relevant change is in the Flags: field. The new tools include
"software FP"; the old tools don't.
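As a cross-check on the two Flags: values, here is a small sketch that decodes them. The bit values are the legacy ARM e_flags definitions as I understand them from binutils' include/elf/arm.h. The helper function names are my own and are not part of any tool.

```c
#include <stdint.h>

/* Legacy ARM e_flags bits (values as given in binutils' include/elf/arm.h). */
#define EF_ARM_HASENTRY   0x002u  /* readelf: "has entry point" */
#define EF_ARM_SOFT_FLOAT 0x200u  /* readelf: "software FP" */
#define EF_ARM_VFP_FLOAT  0x400u  /* readelf: "VFP" */

/* Hypothetical helpers: report whether a given flag bit is set. */
int has_soft_float(uint32_t e_flags)
{
    return (e_flags & EF_ARM_SOFT_FLOAT) != 0;
}

int has_vfp_float(uint32_t e_flags)
{
    return (e_flags & EF_ARM_VFP_FLOAT) != 0;
}

/* Return the bits that differ between two e_flags values. */
uint32_t flags_diff(uint32_t a, uint32_t b)
{
    return a ^ b;
}
```

With the two values above, flags_diff(0x602, 0x402) leaves only the 0x200 bit, which matches the extra "software FP" that readelf prints for the newer build.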
Now, the processor doesn't have hardware floating point, yet the code
runs in both cases, so some kind of software floating point code is
being emitted.
TIA,
Rick
(original post below)
> I've been building tools targeting the Marvell Xscale processor a
> lot lately. A set of tools I built a few months ago seems to generate
> much faster code on our target hardware than tools I built more
> recently. There were some significant differences in the way the
> tools were built, but it doesn't seem like that's enough to explain
> the difference. Unfortunately, I don't remember exactly how I built
> the older toolchain, so I'm hoping someone can help me determine
> what it was by looking at the build result.
>
> Old tools:
>
> $ arm-elf-gcc -v
> Using built-in specs.
> Target: arm-elf
> Configured with: ../configure --prefix=/usr/local/arm3 --target=arm-elf --with-newlib --with-cpu=xscale --enable-languages=c,c++
> Thread model: single
> gcc version 4.2.1
>
> $ arm-elf-ld --version
> GNU ld (GNU Binutils) 2.18
>
> How do I tell what version of newlib is installed (I think it's 1.15)?
>
> Built using a multistep process, where I first built binutils, then
> gcc, then newlib (I don't recall if I did a stage 1 GCC build first,
> but somehow I got it all working).
>
>
> The latest tools are slightly different, and built with a combined
> tree build:
>
> gcc-4.2.2
> binutils-2.17
> newlib-1.15
>
> $ xscale-elf-gcc -v
> Using built-in specs.
> Target: xscale-elf
> Configured with: ../combined/configure --target=xscale-elf --disable-nls --with-newlib --prefix=/usr/local/gcc-xscale-elf --disable-newlib-supplied-syscalls
> Thread model: single
> gcc version 4.2.2
>
>
>
> I'm sorry I can't provide better information, but I'd really like to
> figure this out. The code doesn't call into the standard C library,
> but does make use of a lot of floating point code. Is it possible
> that this code is better with the other tools (either built more
> optimized, or generally different)? I don't know; I'm just
> speculating. It is C++ code (bouncing balls on a screen, the balls
> are object instances).
>
> Thanks for any help!
>
> --
> Rick