Bug 29721 - [gdb, record, aarch64] FAIL: gdb.reverse/solib-precsave.exp: reverse-next third shr1
Summary: [gdb, record, aarch64] FAIL: gdb.reverse/solib-precsave.exp: reverse-next thi...
Status: NEW
Alias: None
Product: gdb
Classification: Unclassified
Component: record (show other bugs)
Version: HEAD
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-10-25 13:30 UTC by Tom de Vries
Modified: 2023-01-20 08:57 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
gdb.log on aarch64-linux, openSUSE Leap 15.4 (2.27 KB, text/x-log)
2022-10-26 14:16 UTC, Tom de Vries
Details
gdb.log on aarch64-linux, openSUSE Leap 15.4 (gdb.base/unwind-on-each-insn.exp) (2.04 KB, text/x-log)
2022-10-26 14:43 UTC, Tom de Vries
Details
Reproducer patch, using Leap 15.4 .s files (2.54 KB, patch)
2022-11-10 14:18 UTC, Tom de Vries
Details | Diff
Log failing to reproduce (2.30 KB, text/x-log)
2022-12-13 14:57 UTC, Guinevere Larsen
Details
Reproducer patch, using Leap 15.4 .s files (2.18 KB, patch)
2023-01-19 13:47 UTC, Tom de Vries
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tom de Vries 2022-10-25 13:30:58 UTC
With current trunk on aarch64-linux I'm running into:
...
FAIL: gdb.reverse/solib-precsave.exp: reverse-next third shr1
FAIL: gdb.reverse/solib-precsave.exp: reverse-next second shr1
FAIL: gdb.reverse/solib-precsave.exp: reverse-next first shr1
FAIL: gdb.reverse/solib-precsave.exp: reverse-next generic
FAIL: gdb.reverse/solib-precsave.exp: reverse-step into solib function one
FAIL: gdb.reverse/solib-precsave.exp: reverse-step within solib function one
FAIL: gdb.reverse/solib-precsave.exp: reverse-step back to main one
...

This seems to be a regression since gdb 12.1.

I'll try to bisect.
Comment 1 Tom de Vries 2022-10-25 16:20:02 UTC
1f3e37e057e876b37db49dbd8ed5ca22c33f6772 is the first bad commit
commit 1f3e37e057e876b37db49dbd8ed5ca22c33f6772
Author: Bruno Larsen <blarsen@redhat.com>
Date:   Wed May 25 15:02:47 2022 -0300

    gdb/reverse: Fix stepping over recursive functions
Comment 2 Guinevere Larsen 2022-10-26 11:56:21 UTC
Running this test on Fedora 36 in aarch64, I only get 5 fails. Looking at the log:

reverse-next^M
40        b[0] = 6;   b[1] = 9;         /* generic statement, end part two */^M
(gdb) PASS: gdb.reverse/solib-precsave.exp: reverse-next generic
reverse-step^M
40        b[0] = 6;   b[1] = 9;         /* generic statement, end part two */^M
(gdb) FAIL: gdb.reverse/solib-precsave.exp: reverse-step into solib function one
reverse-step^M
38        b[1] = shr2(17);              /* middle part two */^M
(gdb) FAIL: gdb.reverse/solib-precsave.exp: reverse-step within solib function one
reverse-step^M
shr2 (x=17) at /root/bld/gdb/testsuite/../../../binutils-gdb/gdb/testsuite/gdb.reverse/shr2.c:23^M
23      }^M
(gdb) FAIL: gdb.reverse/solib-precsave.exp: reverse-step back to main one
reverse-step^M
22        return 2*x;^M
(gdb) FAIL: gdb.reverse/solib-precsave.exp: reverse-step into solib function two
reverse-step^M
main () at /root/bld/gdb/testsuite/../../../binutils-gdb/gdb/testsuite/gdb.reverse/solib-reverse.c:38^M
38        b[1] = shr2(17);              /* middle part two */^M
(gdb) FAIL: gdb.reverse/solib-precsave.exp: reverse-step within solib function two

It seems that aarch64 is also experiencing https://sourceware.org/bugzilla/show_bug.cgi?id=28426 now, and x86 stopped experiencing it. While I agree that this should be fixed, I'm not sure 1f3e37e057e876b37db49dbd8ed5ca22c33f6772 is the actual culprit. I'll add this to my backlog, but it is pretty low priority, since testing in aarch64 is cumbersome for me, unfortunately.

Also, if you are still seeing the failures with third, second and first shr1 with the cureent trunk, please post the relevant log here :-)
Comment 3 Tom de Vries 2022-10-26 14:16:38 UTC
Created attachment 14416 [details]
gdb.log on aarch64-linux, openSUSE Leap 15.4
Comment 4 Guinevere Larsen 2022-10-26 14:36:24 UTC
This looks like a completely different issue to what I found. Looking at

until 46
main () at /home/tdevries/gdb/src/gdb/testsuite/gdb.reverse/solib-reverse.c:46
46	  return 0;			/* end part one */
(gdb) PASS: gdb.reverse/solib-precsave.exp: forward to end part one
reverse-next

No more reverse-execution history.
main () at /home/tdevries/gdb/src/gdb/testsuite/gdb.reverse/solib-reverse.c:27
27	  char* cptr = "String 1";

It looks to be the reason I made my patch that changes frame-id calculations for amd64 epilogue unwinder. Essentially, an incorrect frame-id was set as the owner of the step-resume breakpoint used by reverse-next, GDB never "hit" that breakpoint and went back to the start of the execution history.

Is gdb.base/unwind-on-each-insn.exp passing for you?
Comment 5 Tom de Vries 2022-10-26 14:43:55 UTC
Created attachment 14417 [details]
gdb.log on aarch64-linux, openSUSE Leap 15.4 (gdb.base/unwind-on-each-insn.exp)

(In reply to B. Larsen from comment #4)
> Is gdb.base/unwind-on-each-insn.exp passing for you?

Yes.
Comment 6 Tom de Vries 2022-10-26 14:47:35 UTC
(In reply to B. Larsen from comment #4)
> This looks like a completely different issue to what I found. 

Well, it looks like you ran into PR28426, which requires column info to reproduce.

This PR is using gcc 7.5.0, which does not have column info by default.

So, you could try -gno-column-info with your compiler, and perhaps that'll allow you to get past PR28426 and run into this PR.
Comment 7 Luis Machado 2022-10-27 20:15:01 UTC
I can test on aarch64 and armhf quite easily if you want.
Comment 8 Tom de Vries 2022-11-02 14:05:34 UTC
(In reply to Tom de Vries from comment #6)
> (In reply to B. Larsen from comment #4)
> > This looks like a completely different issue to what I found. 
> 
> Well, it looks like you ran into PR28426, which requires column info to
> reproduce.
> 
> This PR is using gcc 7.5.0, which does not have column info by default.
> 
> So, you could try -gno-column-info with your compiler, and perhaps that'll
> allow you to get past PR28426 and run into this PR.

Bruno, have you been able to reproduce ?
Comment 9 Guinevere Larsen 2022-11-02 15:42:59 UTC
> Bruno, have you been able to reproduce ?

Haven't gotten around to it yet. I'm working on the regression that Simon found on gdb.reverse/step-reverse.exp using native-gdbserver. I hope to be able to get to this bug next week, but since that bug happens because of a corrupted stack somewhere, it might take me a little while.
Comment 10 Guinevere Larsen 2022-11-08 16:08:49 UTC
Hi Tom,

I am unable to reproduce the bug using gcc 12.2.1. Adding -gno-column-info only removed those 5 errors.

I'm still suspecting that a frame id is being calculated incorrectly somewhere, I'll try to build gcc 7.5.0 and reproduce
Comment 11 Guinevere Larsen 2022-11-10 12:07:51 UTC
The oldest gcc I could find was 8.5.0, and it isn't reproducing the issue. I don't think I can help with this regression... sorry
Comment 12 Luis Machado 2022-11-10 12:40:22 UTC
Are we sure this isn't the known issue of reverse-stepping through a non-contiguous block of PC addresses? The one we stop halfway through line because we have a couple distinct entries for the same line?
Comment 13 Tom de Vries 2022-11-10 14:18:51 UTC
Created attachment 14450 [details]
Reproducer patch, using Leap 15.4 .s files
Comment 14 Tom de Vries 2022-11-10 14:20:32 UTC
(In reply to B. Larsen from comment #11)
> The oldest gcc I could find was 8.5.0, and it isn't reproducing the issue. I
> don't think I can help with this regression... sorry

I created a patch that uses .s files for the two files that use debug info, so perhaps using those you should be able to reproduce.

I understand if you consider this too much of a security risk...
Comment 15 Guinevere Larsen 2022-12-13 14:57:56 UTC
Created attachment 14516 [details]
Log failing to reproduce

I've just attempted to reproduce using the .s files you provided, and I still don't see the same failures you have, tom. The failures that show up in this log are probably because some .s file had the path from your machine.

I've attached the execution log, but I'm not sure how useful it will be...

Again, I'm very sorry, I don't think I can help with this one.
Comment 16 Tom de Vries 2023-01-16 10:43:39 UTC
First sign of trouble:
...
    [infrun] bpstat_check_breakpoint_conditions: incorrect frame {stack=0xfffffffff320,code=0x0000fffff7f9d594,!special} not {stack=0xfffffffff330,code=0x0000fffff7f9d594,!special}, not stopping
...

Using additional_flags=-fasynchronous-unwind-tables works around this.

On x86_64, using additional_flags=-fno-asynchronous-unwind-tables to drop the cfi info, the problem doesn't trigger.

With x86_64, we have:
...
000000000000050a <shr1>:
 50a:   55                      push   %rbp
 50b:   48 89 e5                mov    %rsp,%rbp
 50e:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
 512:   90                      nop
 513:   5d                      pop    %rbp
 514:   c3                      ret
...
and at the first insn:
...
1: /x $pc = 0x7ffff7ff150a
2: /x $sp = 0x7fffffffdad8
3: /x $rbp = 0x7fffffffdaf0
(gdb) info frame
Stack level 0, frame at 0x7fffffffdae0:
 rip = 0x7ffff7ff150a in shr1; saved rip = 0x4005df
 called by frame at 0x7fffffffdb00
 Arglist at 0x7fffffffdad0, args: 
 Locals at 0x7fffffffdad0, Previous frame's sp is 0x7fffffffdae0
 Saved registers:
  rip at 0x7fffffffdad8
...
and at the last insn:
...
1: /x $pc = 0x7ffff7ff1514
2: /x $sp = 0x7fffffffdad8
3: /x $rbp = 0x7fffffffdaf0
(gdb) info frame
Stack level 0, frame at 0x7fffffffdae0:
 rip = 0x7ffff7ff1514 in shr1; saved rip = 0x4005df
 called by frame at 0x7fffffffdb00
 Arglist at 0x7fffffffdaf0, args: 
 Locals at 0x7fffffffdaf0, Previous frame's sp is 0x7fffffffdae0
 Saved registers:
  rip at 0x7fffffffdad8
...
the same frame address (as well as the same $sp and $fp).

On aarch64 (with additional_flags=-fno-omit-frame-pointer additional_flags=-mno-omit-leaf-frame-pointer added, to have more comparable code), we have:
...
0000000000000594 <shr1>:
 594:   a9be7bfd        stp     x29, x30, [sp, #-32]!
 598:   910003fd        mov     x29, sp
 59c:   f9000fa0        str     x0, [x29, #24]
 5a0:   d503201f        nop
 5a4:   a8c27bfd        ldp     x29, x30, [sp], #32
 5a8:   d65f03c0        ret
...
and at the first insn:
...
1: /x $pc = 0xfffff7f9d594
2: /x $sp = 0xfffffffff320
3: /x $x29 = 0xfffffffff320
(gdb) info frame
Stack level 0, frame at 0xfffffffff320:
 pc = 0xfffff7f9d594 in shr1; saved pc = 0x400718
 called by frame at 0xfffffffff340
 Arglist at 0xfffffffff320, args: 
 Locals at 0xfffffffff320, Previous frame's sp is 0xfffffffff320
...
and at the last insn:
...
1: /x $pc = 0xfffff7f9d5a8
2: /x $sp = 0xfffffffff320
3: /x $x29 = 0xfffffffff320
(gdb) info frame
Stack level 0, frame at 0xfffffffff340:
 pc = 0xfffff7f9d5a8 in shr1; saved pc = 0xfffff7d68fa0
 called by frame at 0xfffffffff490
 Arglist at 0xfffffffff320, args: 
 Locals at 0xfffffffff320, Previous frame's sp is 0xfffffffff340
 Saved registers:
  x29 at 0xfffffffff320, x30 at 0xfffffffff328
...
so a different frame address (while having the same $sp and $fp).
Comment 17 Tom de Vries 2023-01-16 15:58:45 UTC
(In reply to Tom de Vries from comment #16)
> On aarch64 (with additional_flags=-fno-omit-frame-pointer
> additional_flags=-mno-omit-leaf-frame-pointer added, to have more comparable
> code), we have:

And for completeness-sake, without those flags (so, exercising the original failure):
...
0000000000000594 <shr1>:
 594:   d10043ff        sub     sp, sp, #0x10
 598:   f90007e0        str     x0, [sp, #8]
 59c:   d503201f        nop
 5a0:   910043ff        add     sp, sp, #0x10
 5a4:   d65f03c0        ret
...
and at first insn:
...
1: /x $pc = 0xfffff7f9d594
2: /x $sp = 0xfffffffff320
3: /x $x29 = 0xfffffffff320
(gdb) info frame
Stack level 0, frame at 0xfffffffff320:
 pc = 0xfffff7f9d594 in shr1; saved pc = 0x400718
 called by frame at 0xfffffffff340
 Arglist at 0xfffffffff320, args: 
 Locals at 0xfffffffff320, Previous frame's sp is 0xfffffffff320
...
and at last insn:
...
1: /x $pc = 0xfffff7f9d5a4
2: /x $sp = 0xfffffffff320
3: /x $x29 = 0xfffffffff320
(gdb) info frame
Stack level 0, frame at 0xfffffffff330:
 pc = 0xfffff7f9d5a4 in shr1; saved pc = 0x400718
 called by frame at 0xfffffffff340
 Arglist at 0xfffffffff320, args: 
 Locals at 0xfffffffff320, Previous frame's sp is 0xfffffffff330
 Saved registers:
  x0 at 0xfffffffff328
...
Comment 18 Tom de Vries 2023-01-16 16:40:56 UTC
Tentative patch:
...
diff --git a/gdb/aarch64-tdep.c b/gdb/aarch64-tdep.c
index b576d3b9d99..06349353716 100644
--- a/gdb/aarch64-tdep.c
+++ b/gdb/aarch64-tdep.c
@@ -996,7 +996,11 @@ aarch64_make_prologue_cache_1 (frame_info_ptr this_frame,
   if (unwound_fp == 0)
     return;
 
-  cache->prev_sp = unwound_fp + cache->framesize;
+  if (cache->framereg == AARCH64_SP_REGNUM
+      && get_frame_register_unsigned (this_frame, AARCH64_FP_REGNUM) == unwound_fp)
+    cache->prev_sp = unwound_fp;
+  else
+    cache->prev_sp = unwound_fp + cache->framesize;
 
   /* Calculate actual addresses of saved registers using offsets
      determined by aarch64_analyze_prologue.  */
...
Comment 19 Luis Machado 2023-01-16 18:59:52 UTC
Isn't this a failure of gdb to interpret the line table changes properly? If so, why the target-specific change to aarch64's prologue cache?
Comment 20 Tom de Vries 2023-01-16 19:39:15 UTC
(In reply to Luis Machado from comment #19)
> Isn't this a failure of gdb to interpret the line table changes properly? If
> so, why the target-specific change to aarch64's prologue cache?

The problem happens in libshr1.sl, which is compiled without debug info, so there are no line tables...
Comment 21 Luis Machado 2023-01-16 19:53:45 UTC
Ok. So is this a completely different issue compared to the line table one?
Comment 22 Tom de Vries 2023-01-16 19:57:54 UTC
(In reply to Luis Machado from comment #21)
> Ok. So is this a completely different issue compared to the line table one?

I'm not sure yet if there is one.
Comment 23 Tom de Vries 2023-01-17 16:47:31 UTC
Filed spinoff PRs:
- PR30010 - [gdb/tdep, aarch64] Incorrect frame address for last insn
  (non-leaf case)
  https://sourceware.org/bugzilla/show_bug.cgi?id=30010
- PR30011 - [gdb/tdep, aarch64] Incorrect frame address for last insn
  (leaf case) 
  https://sourceware.org/bugzilla/show_bug.cgi?id=30011
Comment 24 Tom de Vries 2023-01-19 13:42:37 UTC
(In reply to B. Larsen from comment #15)
> I've attached the execution log, but I'm not sure how useful it will be...

In the log I see this:
...
warning: Can not parse XML target description; XML support was disabled at compile time
...
I'd advise you to always go and fix this if you run into this, some targets don't function properly without.
Comment 25 Tom de Vries 2023-01-19 13:46:05 UTC
(In reply to B. Larsen from comment #15)
> I've just attempted to reproduce using the .s files you provided, and I
> still don't see the same failures you have, tom. 

OK, I'll attach a patch that uses the .s file for all three files involved, hopefully that'll allow reproduction.
Comment 26 Tom de Vries 2023-01-19 13:47:46 UTC
Created attachment 14609 [details]
Reproducer patch, using Leap 15.4 .s files
Comment 27 Luis Machado 2023-01-19 13:56:31 UTC
(In reply to Tom de Vries from comment #24)
> (In reply to B. Larsen from comment #15)
> > I've attached the execution log, but I'm not sure how useful it will be...
> 
> In the log I see this:
> ...
> warning: Can not parse XML target description; XML support was disabled at
> compile time
> ...
> I'd advise you to always go and fix this if you run into this, some targets
> don't function properly without.

We should probably start erroring out if XML is not supported in some targets. Assuming targets descriptions when XML is not available is very error-prone.
Comment 28 Tom de Vries 2023-01-19 14:00:57 UTC
(In reply to Tom de Vries from comment #26)
> Created attachment 14609 [details]
> Reproducer patch, using Leap 15.4 .s files

And if there is somehow a problem with running the test-case, try the minimal:
...
$ cat gdb.in
set trace-commands on
start
record
b 43
continue
reverse-next
$ gdb -q -batch outputs/gdb.reverse/solib-reverse/solib-reverse -x gdb.in
+start
Temporary breakpoint 1 at 0x40069c: file /home/tdevries/gdb/src/gdb/testsuite/gdb.reverse/solib-reverse.c, line 27.

Temporary breakpoint 1, main () at /home/tdevries/gdb/src/gdb/testsuite/gdb.reverse/solib-reverse.c:27
27        char* cptr = "String 1";
+record
+target record-full
+b 43
Breakpoint 2 at 0x400700: file /home/tdevries/gdb/src/gdb/testsuite/gdb.reverse/solib-reverse.c, line 43.
+continue

Breakpoint 2, main () at /home/tdevries/gdb/src/gdb/testsuite/gdb.reverse/solib-reverse.c:43
43        shr1 ("message 2\n");         /* shr1 two */
+reverse-next
+next

No more reverse-execution history.
main () at /home/tdevries/gdb/src/gdb/testsuite/gdb.reverse/solib-reverse.c:27
27        char* cptr = "String 1";
...
Comment 29 Tom de Vries 2023-01-19 14:14:42 UTC
I've submitted a patch series here ( https://sourceware.org/pipermail/gdb-patches/2023-January/195920.html ) that contains a fix for PR30011. With that fix, this PR no longer triggers.

The remaining question for me is whether PR30011 and this PR are duplicates.

Put differently, is there something we can do in gdb to deal with the situation that the frame id for first and last insn in a function are not necessarily equal.

Say we assume that they are equal if the target contains an epilogue unwinder, but if not, we're handling things more conservatively.  This kind of fix could be backported to fix the regression that was introduced.

Bruno, any comment?
Comment 30 Guinevere Larsen 2023-01-19 14:28:35 UTC
> Say we assume that they are equal if the target contains an epilogue unwinder, but if not, we're handling things more conservatively.  This kind of fix could be backported to fix the regression that was introduced.

This sounds ok, but I wonder if "handling things more conservatively" will regress [record/16678](https://sourceware.org/bugzilla/show_bug.cgi?id=16678). To fix that bug I assumed that we knew the frame ID would be the same throughout the whole function. If we can't rely on that, how are we supposed to detect a breakpoint being hit in a recursive call?
Comment 31 Tom de Vries 2023-01-19 17:30:31 UTC
(In reply to B. Larsen from comment #30)
> > Say we assume that they are equal if the target contains an epilogue unwinder, but if not, we're handling things more conservatively.  This kind of fix could be backported to fix the regression that was introduced.
> 
> This sounds ok, but I wonder if "handling things more conservatively" will
> regress
> [record/16678](https://sourceware.org/bugzilla/show_bug.cgi?id=16678). To
> fix that bug I assumed that we knew the frame ID would be the same
> throughout the whole function. If we can't rely on that, how are we supposed
> to detect a breakpoint being hit in a recursive call?

Well, you've fixed that problem for a recursive function with debug info, and for x86_64 without debug info (meaning, also using -fno-asynchronous-unwind-tables to make sure there's no cfi directives).

On aarch64 though, I see:
...
$ cat test.c
extern int factorial (int x);
int v;
int main() {
  v = 1;
  factorial(5);
  return 0;
}
$ cat test2.c
int factorial(int x) {
  if (x == 1) return 1;
  int result = x * factorial(x-1);
  return result;
}
$ gcc test2.c -c ; gcc test.c test2.o -g
$ gdb a.out
Reading symbols from a.out...
(gdb) start
Temporary breakpoint 1 at 0x40056c: file test.c, line 4.
Starting program: a.out 

Temporary breakpoint 1, main () at test.c:4
4         v = 1;
(gdb) record
(gdb) next
5         factorial(5);
(gdb) next
6         return 0;
(gdb) reverse-next
0x00000000004005d4 in factorial ()
(gdb) 
...

This is essentially the same problem as in this PR ... and I'm proposing that we fix it, in a target-independent fashion.

I think the following cases can be distinguished:
- cfi info present, assume frame ID's accurate, no special handling needed
- cfi info not present, but target has an epilogue unwinder , assume frame
  ID's accurate, no special handling needed
- cfi info not present, no epilogue unwinder. Special handling needed,
  but no need to worry about breaking handling of recursive function,
  because it's already broken, and the special handling intends to fix it.

My naive idea on how to approach this was to instead of doing a step-resume breakpoint at the last insn and resume, do first a single step back, and then insert the step-resume breakpoint and resume.  This will already fix the aarch64 case.  There may be cases where this gives the wrong answer, but AFAIU those cases are broken anyway.
Comment 32 Guinevere Larsen 2023-01-20 08:57:12 UTC
> I think the following cases can be distinguished:
> - cfi info present, assume frame ID's accurate, no special handling needed
> - cfi info not present, but target has an epilogue unwinder , assume frame
>   ID's accurate, no special handling needed
> - cfi info not present, no epilogue unwinder. Special handling needed,
>   but no need to worry about breaking handling of recursive function,
>   because it's already broken, and the special handling intends to fix it.
> 
> My naive idea on how to approach this was to instead of doing a step-resume
> breakpoint at the last insn and resume, do first a single step back, and
> then insert the step-resume breakpoint and resume.  This will already fix
> the aarch64 case.  There may be cases where this gives the wrong answer, but
> AFAIU those cases are broken anyway.

Ah ok, sounds like a good naive approach. My thought was avoiding the previous naive solution!