Bug 24226 - Need advise on the binutils problem that generating wrong instruction like lw a3,-2048(a5) on RISC-V backend
Summary: Need advise on the binutils problem that generating wrong instruction like lw...
Status: UNCONFIRMED
Alias: None
Product: binutils
Classification: Unclassified
Component: ld (show other bugs)
Version: unspecified
: P2 critical
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-18 06:44 UTC by GraceLiu
Modified: 2022-09-06 05:55 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
test case (21.24 KB, text/plain)
2019-02-18 06:44 UTC, GraceLiu
Details

Note You need to log in before you can comment on or make changes to this bug.
Description GraceLiu 2019-02-18 06:44:37 UTC
Created attachment 11610 [details]
test case

Hello experts,

We are having the similar problem described in below: https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/KnziiZtEJNo/Jg4YBJ56CgAJ 

 


In our testcase, we are using the option  –mcmodel = medlow  and   -mexplicit_relocs, but we got the similar problem that binutils will generate wrong instruction like  lw       a3,-2048(a5).

When compiled to .S file, the result is correct:
                lui           a5,%hi(g_3030)
                lw           a4,%lo(g_3030)(a5)
                srli          a4,a4,8
                lw           a3,%lo(g_3030+4)(a5)



After link, the result is not correct:
   1db2a:               0002a7b7             lui           a5,0x2c
   1db2e:               7fc7a703              lw           a4,2044(a5) # 2c7fc <g_3030>
   1db32:               8321                       srli          a4,a4,0x8
   1db34:               8007a683             lw           a3,-2048(a5)


The result of %lo(g_3030+4) should be 2044 + 4 = 2048, but the value 2048 overflowed the 12 bit signed value, turn to -2048 and cause problem.

The workaround to not use –mexplicit_relocs (discussed in https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/KnziiZtEJNo/Jg4YBJ56CgAJ )would not work for us. It enlarges the code size for 10-20%. Is there any advice to fix this problem without bring too much effect on the code size?


Best wishes,

Grace
Comment 1 Jim Wilson 2019-02-18 23:55:30 UTC
The medany explicit-relocs problem is different than the one here.  This requires an auipc to trigger, and there is no auipc here.

This looks like a compiler bug if this is compiler generated code, or a programmer error if this is hand written assembly code.  It is not safe to use
    lw a3,%lo(g_3030+4)(a5)
unless g_3030 has 8-byte alignment, and if it does, then the %lo can not overflow.

The current linker sources will give an error if it detects an auipc/lw overflow, but I hadn't considered this case with an incorrect lui/lw instruction pair.  Even though this is compiler/user error, it would still be useful if the linker gave an error for it instead of silently producing incorrect code, as finding this incorrect code after the fact is likely to be hard.
Comment 2 Jim Wilson 2019-02-19 01:03:19 UTC
Another possibility here is a broken linker script that isn't respecting section alignment.
Comment 3 GraceLiu 2019-02-19 01:22:58 UTC
(In reply to Jim Wilson from comment #1)
> The medany explicit-relocs problem is different than the one here.  This
> requires an auipc to trigger, and there is no auipc here.
> 
> This looks like a compiler bug if this is compiler generated code, or a
> programmer error if this is hand written assembly code.  It is not safe to
> use
>     lw a3,%lo(g_3030+4)(a5)
> unless g_3030 has 8-byte alignment, and if it does, then the %lo can not
> overflow.
> 
> The current linker sources will give an error if it detects an auipc/lw
> overflow, but I hadn't considered this case with an incorrect lui/lw
> instruction pair.  Even though this is compiler/user error, it would still
> be useful if the linker gave an error for it instead of silently producing
> incorrect code, as finding this incorrect code after the fact is likely to
> be hard.


Thanks Jim for your comments. 
We are not using any hand written assembly code or any customized link script. The code is generated by compiler.
The struct of g_3030 is 
#pragma pack(push)
#pragma pack(1)
struct S0 {
   signed f0 : 4;
   const volatile int64_t  f1;
   volatile signed f2 : 1;
   signed f3 : 31;
   unsigned f4 : 8;
   signed f5 : 20;
   unsigned f6 : 5;
};
#pragma pack(pop)

static const struct S0 g_3030 = {0,-9L,-0,-22553,7,-841,1};/* VOLATILE GLOBAL g_3030 */

we tried to print the value of -9L in g_3030 but the value is wrong. I have attached  the testcase.
Comment 4 Jim Wilson 2019-02-19 19:24:05 UTC
Yes, I'd call this a compiler bug.  It is triggered when we have a long long inside a packed structure compiled for a 32-bit target, where the long long must be partially contained in the first word of the struct, in which case the long long has a 1 in 1K chance of getting an address that will generate an overflow when resolving relocations.  That would explain why I haven't seen it before.  Too many low chance conditions to trigger easily.  The same problem could occur for a 64-bit target with a 128-bit type in a packed struct, but that is probably even more rare.

There is still a linker issue here, in that the linker should generate an error when the reloc overflows and computes the wrong address.

You can work around the compiler bug by forcing the variable to have 8-byte alignment.  This can be done with an attribute
struct S0 g_3030 __attribute__ ((aligned(8))) = {0,-9L,-0,-22553,7,-841,1};
But that may be impractical if you have no easy way to identify the variables that need to be "fixed" to avoid the compiler problem.

The compiler bug needs to be reported into the gcc bugzilla so it can be fixed there.
Comment 5 GraceLiu 2019-02-20 01:28:51 UTC
(In reply to Jim Wilson from comment #4)
> Yes, I'd call this a compiler bug.  It is triggered when we have a long long
> inside a packed structure compiled for a 32-bit target, where the long long
> must be partially contained in the first word of the struct, in which case
> the long long has a 1 in 1K chance of getting an address that will generate
> an overflow when resolving relocations.  That would explain why I haven't
> seen it before.  Too many low chance conditions to trigger easily.  The same
> problem could occur for a 64-bit target with a 128-bit type in a packed
> struct, but that is probably even more rare.
> 
> There is still a linker issue here, in that the linker should generate an
> error when the reloc overflows and computes the wrong address.
> 
> You can work around the compiler bug by forcing the variable to have 8-byte
> alignment.  This can be done with an attribute
> struct S0 g_3030 __attribute__ ((aligned(8))) = {0,-9L,-0,-22553,7,-841,1};
> But that may be impractical if you have no easy way to identify the
> variables that need to be "fixed" to avoid the compiler problem.
> 
> The compiler bug needs to be reported into the gcc bugzilla so it can be
> fixed there.


Thanks Jim for the advice. 
I will try to file a bug to gcc bugzilla.
Comment 6 Nelson Chu 2022-09-06 05:55:05 UTC
(In reply to Jim Wilson from comment #4)
> There is still a linker issue here, in that the linker should generate an
> error when the reloc overflows and computes the wrong address.

Seems hard to check this in linker since the lui and lw are not chained together.

Consider the following reduced test case,

% cat tmp.s
lui a5, %hi(g_0x2c7fc)
lw  a4, %lo(g_0x2c7fc)(a5)
lui a5, %hi(g_0x2c7fc+4)
lw  a4, %lo(g_0x2c7fc+4)(a5)
lui a5, %hi(g_0x2c800)
lw  a4, %lo(g_0x2c800)(a5)
% riscv64-unknown-elf-as tmp.s -o tmp.o
% riscv64-unknown-elf-ld -defsym g_0x2c7fc=0x2c7fc -defsym g_0x2c800=0x2c800 tmp.o
/Users/nelsonc/work/build-upstream/build-elf64-upstream/build-install/bin/riscv64-unknown-elf-ld: warning: cannot find entry symbol _start; defaulting to 00000000000100b0
% riscv64-unknown-elf-objdump -d a.out 

a.out:     file format elf64-littleriscv


Disassembly of section .text:

00000000000100b0 <__BSS_END__-0x1018>:
   100b0:	0002c7b7          	lui	a5,0x2c
   100b4:	7fc7a703          	lw	a4,2044(a5) # 2c7fc <g_0x2c7fc>
   100b8:	0002d7b7          	lui	a5,0x2d
   100bc:	8007a703          	lw	a4,-2048(a5) # 2c800 <g_0x2c800>
   100c0:	0002d7b7          	lui	a5,0x2d
   100c4:	8007a703          	lw	a4,-2048(a5) # 2c800 <g_0x2c800>

We don't chain the low instruction to it's lui, so we don't even know if the lui is missing or cannot be shared (changed from 0x2c to 0x2d).  Compared to auipc, since the low instructions are chained to quips, so we can check these in details, https://github.com/bminor/binutils-gdb/blob/master/bfd/elfnn-riscv.c#L1884