This is the mail archive of the
gdb@sourceware.org
mailing list for the GDB project.
Debugging return.exp on ARM
- From: Simon Marchi <simon dot marchi at ericsson dot com>
- To: <gdb at sourceware dot org>
- Cc: Yao Qi <qiyaoltc at gmail dot com>
- Date: Thu, 26 May 2016 11:15:08 -0400
- Subject: Debugging return.exp on ARM
Hi everyone,
In an attempt to fix flaky tests on ARM, I started looking at gdb.base/return.exp.
The last test, which tests the "return" command on a function that returns a double,
fails randomly on our ODroid XU-4 board. We have another board, a Firefly RK3288,
which fails the same way (and even more frequently). I have the feeling that there's
a race somewhere in the kernel/cache/memory/something.
I isolated a minimal reproducer from the test case, that goes like this:
double
func3 ()
{
  return -5.0;
}

double tmp3;

int main ()
{
  tmp3 = func3 ();
  return 0;
}
Built with:
$ arm-linux-gnueabihf-gcc -g3 -O0 return.c -o return
And here is the gdb script to run:
file ~/return
b func3
run
return 2.0
n
print tmp3
quit tmp3 != 2
I simply run gdb like this:
$ ./gdb -nx -batch -x run.gdb
What the test does is run to the beginning of func3, then issues the command
"return 2.0", which makes the function artificially return with the value 2.0.
It then does a "next" to complete the assignment to tmp3, and then prints the
value of tmp3. Most of the time, we see the expected value, 2.0. Once in a
while, we get 0.
When doing the return, GDB writes 2.0 in the d0 register, which is the place where
a return value of type "double" should be (and writes other registers including pc and
sp to actually pop the stack frame). I added debug traces to confirm that the
right value is written in d0 through ptrace by GDB (even in failure cases). So when we
resume the thread (when doing the "next" command), it should have the right value in
its d0 register. When doing the next, these are the exact instructions it executes (also
confirmed by infrun debug):
    83e4:  eeb0 7b40   vmov.f64 d7, d0
    83e8:  f241 0330   movw     r3, #4144   ; 0x1030
    83ec:  f2c0 0301   movt     r3, #1
    83f0:  ed83 7b00   vstr     d7, [r3]
In other words, move d0 to d7 and then store it to tmp3's address (0x11030). I
don't see anything that can go wrong with these instructions... if d0 contains
the right value at the time the thread is resumed, then tmp3 should contain the
right value at the end. However, as I said earlier, we get the wrong value once
in a while. So it sounds like somehow the value didn't make it to the d0
register in time when the thread was resumed, or GDB reads the value of tmp3 before
the effect of the vstr is visible...
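
For anyone who wants to double-check the address: movw sets the low 16 bits of
r3 and movt sets the high 16 bits, so the two immediates from the disassembly
combine as sketched below. The variable names are only for illustration.

```shell
# movw writes the low halfword of r3, movt the high halfword.
low=4144    # immediate of "movw r3, #4144" (0x1030)
high=1      # immediate of "movt r3, #1"

# Combine both halves into the 32-bit address held in r3.
addr=$(( (high << 16) | low ))
printf '0x%x\n' "$addr"    # prints 0x11030
```

This matches the address of tmp3 that the vstr stores to.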
Given that we give the right input to the kernel, even in the cases that
fail, I assume that the problem must be something like wrong cache invalidation
or memory barrier/sequencing.
I ran this test in a loop and got these results:
ODroid XU-4:
263 fails
737 successes
Firefly RK3288:
336 fails
163 successes
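
For those who want to try this on their own boards, a loop along these lines
should produce the same kind of counts. It is only a sketch, not necessarily
the exact loop used for the numbers above: count_failures is a hypothetical
helper name, and it relies on the "quit tmp3 != 2" line of the script making
gdb exit with a non-zero status when the value is wrong.

```shell
#!/bin/sh
# Run a command repeatedly and count how often it exits non-zero.
# Usage: count_failures RUNS COMMAND [ARGS...]
count_failures() {
    runs=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only its exit status matters here.
        if ! "$@" </dev/null >/dev/null 2>&1; then
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails fails, $((runs - fails)) successes"
}

# Example invocation (assumes ./gdb and run.gdb are in the current directory):
#   count_failures 1000 ./gdb -nx -batch -x run.gdb
```

Note that any other non-zero exit from gdb (for example, the board being
unreachable) is counted as a failure too, so the output is worth a look if
the numbers seem off.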
First, is anybody able to reproduce the problem on other boards? Then, does anybody
have an idea what could cause this?
Thanks!
Simon