[PATCH] Fix inline frame unwinding breakage

Andrew Burgess andrew.burgess@embecosm.com
Thu Jun 18 18:31:42 GMT 2020


* Luis Machado <luis.machado@linaro.org> [2020-06-18 15:19:56 -0300]:

> On 6/18/20 2:40 PM, Andrew Burgess wrote:
> > * Andrew Burgess <andrew.burgess@embecosm.com> [2020-06-18 18:29:22 +0100]:
> > 
> > > * Andrew Burgess <andrew.burgess@embecosm.com> [2020-06-18 17:58:55 +0100]:
> > > 
> > > > * Luis Machado via Gdb-patches <gdb-patches@sourceware.org> [2020-04-14 18:38:36 -0300]:
> > > > 
> > > > > *** re-sending due to the poor choice of characters for the backtrace
> > > > > annotations. GIT swallowed parts of it.
> > > > > 
> > > > > There has been some breakage for aarch64-linux, arm-linux and s390-linux in
> > > > > terms of inline frame unwinding. There may be other targets, but these are
> > > > > the ones i'm aware of.
> > > > > 
> > > > > The following testcases started to show numerous failures and trigger internal
> > > > > errors in GDB after commit 1009d92fc621bc4d017029b90a5bfab16e17fde5,
> > > > > "Find tailcall frames before inline frames".
> > > > > 
> > > > > gdb.opt/inline-break.exp
> > > > > gdb.opt/inline-cmds.exp
> > > > > gdb.python/py-frame-inline.exp
> > > > > gdb.reverse/insn-reverse.exp
> > > > > 
> > > > > The internal errors were of this kind:
> > > > > 
> > > > > binutils-gdb/gdb/frame.c:579: internal-error: frame_id get_frame_id(frame_info*): Assertion `fi->level == 0' failed.
> > > > > 
> > > > > After a lengthy investigation to try and find the cause of these assertions,
> > > > > it seems we're dealing with some fragile/poorly documented code to handle inline
> > > > > frames and we are attempting to unwind from this fragile section of code.
> > > > > 
> > > > > Before commit 1009d92fc621bc4d017029b90a5bfab16e17fde5, the tailcall sniffer
> > > > > was invoked from dwarf2_frame_prev_register. By the time we invoke the
> > > > > dwarf2_frame_prev_register function, we've probably already calculated the
> > > > > frame id (via compute_frame_id).
> > > > > 
> > > > > After said commit, the call to dwarf2_tailcall_sniffer_first was moved to
> > > > > dwarf2_frame_cache. This is very early in a frame creation process, and
> > > > > we're still calculating the frame ID (so compute_frame_id is in the call
> > > > > stack).
> > > > > 
> > > > > This would be fine for regular frames, but the above testcases all deal
> > > > > with some inline frames.
> > > > > 
> > > > > The particularity of inline frames is that their frame ID's depend on
> > > > > the previous frame's ID, and the previous frame's ID relies in the inline
> > > > > frame's registers. So it is a bit of a messy situation.
> > > > > 
> > > > > We have comments in various parts of the code warning about some of these
> > > > > particularities.
> > > > > 
> > > > > In the case of dwarf2_tailcall_sniffer_first, we attempt to unwind the PC,
> > > > > which goes through various functions until we eventually invoke
> > > > > frame_unwind_got_register. This function will eventually attempt to create
> > > > > a lazy value for a particular register, and this lazy value will require
> > > > > a valid frame ID.  Since the inline frame doesn't have a valid frame ID
> > > > > yet (remember we're still calculating the previous frame's ID so we can tell
> > > > > what the inline frame ID is) we will call compute_frame_id for the inline
> > > > > frame (level 0).
> > > > > 
> > > > > We'll eventually hit the assertion above, inside get_frame_id:
> > > > > 
> > > > > --
> > > > >        /* If we haven't computed the frame id yet, then it must be that
> > > > >           this is the current frame.  Compute it now, and stash the
> > > > >           result.  The IDs of other frames are computed as soon as
> > > > >           they're created, in order to detect cycles.  See
> > > > >           get_prev_frame_if_no_cycle.  */
> > > > >        gdb_assert (fi->level == 0);
> > > > > --
> > > > > 
> > > > > It seems to me we shouldn't have reached this assertion without having the
> > > > > inline frame ID already calculated. In fact, it seems we even start recursing
> > > > > a bit when we invoke get_prev_frame_always within inline_frame_this_id. But
> > > > > a check makes us quit the recursion and proceed to compute the id.
> > > > > 
> > > > > Here's the call stack for context:
> > > > > 
> > > > > #0  get_prev_frame_always_1 (this_frame=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:2109
> > > > > RECURSION - #1  0x0000aaaaaae1d098 in get_prev_frame_always (this_frame=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:2124
> > > > > #2  0x0000aaaaaae95768 in inline_frame_this_id (this_frame=0xaaaaab85a670, this_cache=0xaaaaab85a688, this_id=0xaaaaab85a6d0)
> > > > >      at ../../../repos/binutils-gdb/gdb/inline-frame.c:165
> > > > > #3  0x0000aaaaaae1916c in compute_frame_id (fi=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:550
> > > > > #4  0x0000aaaaaae19318 in get_frame_id (fi=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:582
> > > > > #5  0x0000aaaaaae13480 in value_of_register_lazy (frame=0xaaaaab85a730, regnum=30) at ../../../repos/binutils-gdb/gdb/findvar.c:296
> > > > > #6  0x0000aaaaaae16c00 in frame_unwind_got_register (frame=0xaaaaab85a730, regnum=30, new_regnum=30) at ../../../repos/binutils-gdb/gdb/frame-unwind.c:268
> > > > > #7  0x0000aaaaaad52604 in dwarf2_frame_prev_register (this_frame=0xaaaaab85a730, this_cache=0xaaaaab85a748, regnum=30)
> > > > >      at ../../../repos/binutils-gdb/gdb/dwarf2/frame.c:1296
> > > > > #8  0x0000aaaaaae1ae68 in frame_unwind_register_value (next_frame=0xaaaaab85a730, regnum=30) at ../../../repos/binutils-gdb/gdb/frame.c:1229
> > > > > #9  0x0000aaaaaae1b304 in frame_unwind_register_unsigned (next_frame=0xaaaaab85a730, regnum=30) at ../../../repos/binutils-gdb/gdb/frame.c:1320
> > > > > #10 0x0000aaaaaab76574 in aarch64_dwarf2_prev_register (this_frame=0xaaaaab85a730, this_cache=0xaaaaab85a748, regnum=32)
> > > > >      at ../../../repos/binutils-gdb/gdb/aarch64-tdep.c:1114
> > > > > #11 0x0000aaaaaad52724 in dwarf2_frame_prev_register (this_frame=0xaaaaab85a730, this_cache=0xaaaaab85a748, regnum=32)
> > > > >      at ../../../repos/binutils-gdb/gdb/dwarf2/frame.c:1316
> > > > > #12 0x0000aaaaaae1ae68 in frame_unwind_register_value (next_frame=0xaaaaab85a730, regnum=32) at ../../../repos/binutils-gdb/gdb/frame.c:1229
> > > > > #13 0x0000aaaaaae1b304 in frame_unwind_register_unsigned (next_frame=0xaaaaab85a730, regnum=32) at ../../../repos/binutils-gdb/gdb/frame.c:1320
> > > > > #14 0x0000aaaaaae16a84 in default_unwind_pc (gdbarch=0xaaaaab81edc0, next_frame=0xaaaaab85a730) at ../../../repos/binutils-gdb/gdb/frame-unwind.c:223
> > > > > #15 0x0000aaaaaae32124 in gdbarch_unwind_pc (gdbarch=0xaaaaab81edc0, next_frame=0xaaaaab85a730) at ../../../repos/binutils-gdb/gdb/gdbarch.c:3074
> > > > > #16 0x0000aaaaaad4f15c in dwarf2_tailcall_sniffer_first (this_frame=0xaaaaab85a730, tailcall_cachep=0xaaaaab85a830, entry_cfa_sp_offsetp=0x0)
> > > > >      at ../../../repos/binutils-gdb/gdb/dwarf2/frame-tailcall.c:388
> > > > > #17 0x0000aaaaaad520c0 in dwarf2_frame_cache (this_frame=0xaaaaab85a730, this_cache=0xaaaaab85a748) at ../../../repos/binutils-gdb/gdb/dwarf2/frame.c:1190
> > > > > #18 0x0000aaaaaad52204 in dwarf2_frame_this_id (this_frame=0xaaaaab85a730, this_cache=0xaaaaab85a748, this_id=0xaaaaab85a790)
> > > > >      at ../../../repos/binutils-gdb/gdb/dwarf2/frame.c:1218
> > > > > #19 0x0000aaaaaae1916c in compute_frame_id (fi=0xaaaaab85a730) at ../../../repos/binutils-gdb/gdb/frame.c:550
> > > > > #20 0x0000aaaaaae1c958 in get_prev_frame_if_no_cycle (this_frame=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:1927
> > > > > #21 0x0000aaaaaae1cc44 in get_prev_frame_always_1 (this_frame=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:2006
> > > > > FIRST CALL - #22 0x0000aaaaaae1d098 in get_prev_frame_always (this_frame=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:2124
> > > > > #23 0x0000aaaaaae18f68 in skip_artificial_frames (frame=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:495
> > > > > #24 0x0000aaaaaae193e8 in get_stack_frame_id (next_frame=0xaaaaab85a670) at ../../../repos/binutils-gdb/gdb/frame.c:596
> > > > > #25 0x0000aaaaaae87a54 in process_event_stop_test (ecs=0xffffffffefc8) at ../../../repos/binutils-gdb/gdb/infrun.c:6857
> > > > > #26 0x0000aaaaaae86bdc in handle_signal_stop (ecs=0xffffffffefc8) at ../../../repos/binutils-gdb/gdb/infrun.c:6381
> > > > > #27 0x0000aaaaaae84fd0 in handle_inferior_event (ecs=0xffffffffefc8) at ../../../repos/binutils-gdb/gdb/infrun.c:5578
> > > > > #28 0x0000aaaaaae81588 in fetch_inferior_event (client_data=0x0) at ../../../repos/binutils-gdb/gdb/infrun.c:4020
> > > > > #29 0x0000aaaaaae5f7fc in inferior_event_handler (event_type=INF_REG_EVENT, client_data=0x0) at ../../../repos/binutils-gdb/gdb/inf-loop.c:43
> > > > > #30 0x0000aaaaaae8d768 in infrun_async_inferior_event_handler (data=0x0) at ../../../repos/binutils-gdb/gdb/infrun.c:9377
> > > > > #31 0x0000aaaaaabff970 in check_async_event_handlers () at ../../../repos/binutils-gdb/gdb/async-event.c:291
> > > > > #32 0x0000aaaaab27cbec in gdb_do_one_event () at ../../../repos/binutils-gdb/gdbsupport/event-loop.cc:194
> > > > > #33 0x0000aaaaaaef1894 in start_event_loop () at ../../../repos/binutils-gdb/gdb/main.c:356
> > > > > #34 0x0000aaaaaaef1a04 in captured_command_loop () at ../../../repos/binutils-gdb/gdb/main.c:416
> > > > > #35 0x0000aaaaaaef3338 in captured_main (data=0xfffffffff1f0) at ../../../repos/binutils-gdb/gdb/main.c:1254
> > > > > #36 0x0000aaaaaaef33a0 in gdb_main (args=0xfffffffff1f0) at ../../../repos/binutils-gdb/gdb/main.c:1269
> > > > > #37 0x0000aaaaaab6e0dc in main (argc=6, argv=0xfffffffff348) at ../../../repos/binutils-gdb/gdb/gdb.c:32
> > > > > 
> > > > > The following patch addresses this by using a function that unwinds the PC
> > > > > from the next (inline) frame directly as opposed to creating a lazy value
> > > > > that is bound to the next frame's ID (still not computed).
> > > > > 
> > > > > I've validated this for aarch64-linux and x86_64-linux by running the
> > > > > testsuite.
> > > > > 
> > > > > Tromey, would you mind checking if this suits your problematic core file
> > > > > tailcall scenario?
> > > > > 
> > > > > gdb/ChangeLog:
> > > > > 
> > > > > 2020-04-14  Luis Machado  <luis.machado@linaro.org>
> > > > > 
> > > > > 	* dwarf2/frame-tailcall.c (dwarf2_tailcall_sniffer_first): Use
> > > > > 	get_frame_register instead of gdbarch_unwind_pc.
> > > > > ---
> > > > >   gdb/dwarf2/frame-tailcall.c | 4 +++-
> > > > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/gdb/dwarf2/frame-tailcall.c b/gdb/dwarf2/frame-tailcall.c
> > > > > index 2d219f13f9..01bb134a5c 100644
> > > > > --- a/gdb/dwarf2/frame-tailcall.c
> > > > > +++ b/gdb/dwarf2/frame-tailcall.c
> > > > > @@ -385,7 +385,9 @@ dwarf2_tailcall_sniffer_first (struct frame_info *this_frame,
> > > > >         prev_gdbarch = frame_unwind_arch (this_frame);
> > > > >         /* Simulate frame_unwind_pc without setting this_frame->prev_pc.p.  */
> > > > > -      prev_pc = gdbarch_unwind_pc (prev_gdbarch, this_frame);
> > > > > +      get_frame_register (this_frame, gdbarch_pc_regnum (prev_gdbarch),
> > > > > +			  (gdb_byte *) &prev_pc);
> > > > > +      prev_pc = gdbarch_addr_bits_remove (prev_gdbarch, prev_pc);
> > > > >         /* call_site_find_chain can throw an exception.  */
> > > > >         chain = call_site_find_chain (prev_gdbarch, prev_pc, this_pc);
> > > > 
> > > > I'm now no longer convinced that this patch is correct, and I'd like
> > > > to reopen the discussion.
> > > > 
> > > > Here's what concerns me, we used to make the following call-chain:
> > > > 
> > > >    gdbarch_unwind_pc --> frame_unwind_register_unsigned --> frame_unwind_register_value
> > > > 
> > > > Now we do this:
> > > > 
> > > >    get_frame_register --> frame_unwind_register --> frame_register_unwind --> frame_unwind_register_value
> > > > 
> > > > The problem is that gdbarch_unwind_pc' takes an argument 'next_frame',
> > > > while, get_frame_register takes an argument called frame', but is
> > > > really 'this_frame', it then passes 'frame->next' to
> > > > 'frame_unwind_register'.
> > > > 
> > > > What this means is that if we have a call stack like this:
> > > > 
> > > >    #3 --> #2 --> #1 --> #0
> > > > 
> > > > And we invoke the tail-call sniffer in frame #1, previously we figured
> > > > out the $pc value in frame #2, while now we figure out the $pc value
> > > > in frame #1.
> > > > 
> > > > I'm even more convinced that this is an error based on the fix patch
> > > > you applied later:
> > > > 
> > > >    commit 991a3e2e9944a4b3a27bd989ac03c18285bd545d
> > > >    Date:   Sat Apr 25 00:32:44 2020 -0300
> > > > 
> > > >        Fix remaining inline/tailcall unwinding breakage for x86_64
> > > > 
> > 
> > After this commit the code looks like this:
> > 
> >        /* Simulate frame_unwind_pc without setting this_frame->prev_pc.p.  */
> >        if (next_frame != nullptr && get_frame_type (next_frame) == INLINE_FRAME
> > 	  && !frame_id_computed_p (next_frame))
> > 	{
> > 	  /* The next frame is an inline frame and its frame id has not been
> > 	     computed yet.  */
> > 	  get_frame_register (this_frame, gdbarch_pc_regnum (prev_gdbarch),
> > 			      (gdb_byte *) &prev_pc);
> > 	  prev_pc = gdbarch_addr_bits_remove (prev_gdbarch, prev_pc);
> > 	}
> >        else
> > 	prev_pc = gdbarch_unwind_pc (prev_gdbarch, this_frame);
> > 
> > Now the point of this function is to decide if the frame we are in
> > _right_now_ was tail called into by it's "apparent" caller.  That is
> > when we unwind we have:
> > 
> >    main --> foo --> bar
> > 
> > And in foo we ask, can we build a tailcall chain that gets us from
> > main to foo.  So in this case, we get the answer yes, and the chain
> > returned represents:
> > 
> >    main --> test_func --> foo
> > 
> > However, my thinking is that we know we're not in the position of
> > 'foo' (that is reached by a tail call) if, foo is an inline frame -
> > right?
> > 
> > So, I wonder if it's as simple as saying:
> > 
> >    /* We know that THIS_FRAME was not reached by a tail call if
> >       THIS_FRAME is an inline frame.  */
> >    if (get_frame_type (this_frame) == INLINE_FRAME)
> >      return;
> > 
> > That's totally untested, just a random thought....  I wonder if such a
> > change would fix the original failures you saw?
> > 
> > Thanks,
> > Andrew
> > 
> 
> I can attest your testcase fails for aarch64. But your proposed early return
> runs into the known frame level assertion, inside get_frame_id:
> 
> gdb_assert (fi->level == 0);
> 
> This happens because when we're unwinding the frame for foo, the current
> frame is no longer an inline frame. So we proceed to try and fetch PC from
> the next (inline) frame. And then we run into the same problem with the lack
> of a frame id.
> 
> When we're unwinding an inline frame whose next frame is the sentinel one,
> we're ok. We always have register access through the sentinel frame.
> 
> I wonder how likely it is to break if there are inline frames at other
> positions in the call chain. So your suggestion may fix those cases. But it
> won't fix the remaining cases that attempt to unwind registers from the
> inline frame.
> 
> Does that make sense?

I think so.

The more I stare at this code the more I'm convinced that trying to
run the tail-call detection (not going to call it sniffing) from an
actual frame sniffer, is just a bad idea.

Think about it, we have

  test_func --> foo

we're sniffing for foo, which means we don't yet know what type of
frame foo is, and how to unwind it; yet to do tail call detection we
need to ask foo to unwind its $pc.

I think the real fix here is to move tail call detection later in the
process, hopefully Tom will offer some more details of the original
bug that motivated moving tail call detection.

Thanks,
Andrew


More information about the Gdb-patches mailing list