Bug 19288 - need a way to see if an address is covered by some existing object
Summary: need a way to see if an address is covered by some existing object
Status: NEW
Alias: None
Product: gdb
Classification: Unclassified
Component: python
Version: unknown
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Blocks: 19923
Reported: 2015-11-24 19:22 UTC by Tom Tromey
Modified: 2016-11-23 11:03 UTC

Description Tom Tromey 2015-11-24 19:22:29 UTC
I'm writing an unwinder in Python.

From the docs it seems that Python unwinders are run first.
But, I don't want that -- I only want my unwinder to be considered
when a PC is not covered by some existing objfile.

So, I tried to implement an early exit.  This turns out to be
hard to do.

solib_name does not work for things in the main executable.
(Maybe fixing this one is the way to go.)

block_for_pc does not work for libraries without debuginfo.

find_pc_line doesn't differentiate between not-covered and no-debuginfo.

Even the desperation try fails:

    return not gdb.execute('info symbol 0x%x' % pc, to_string = True).startswith('No symbol matches')

...causing a crash in value_of_register_lazy.
Comment 1 Tom Tromey 2015-11-24 21:38:54 UTC
This worked ok for me, but it's a big hack.

static PyObject *
gdbpy_text_address_claimed (PyObject *self, PyObject *args)
{
  gdb_py_ulongest pc;
  struct objfile *objfile;
  struct obj_section *osect;

  if (!PyArg_ParseTuple (args, GDB_PY_LLU_ARG, &pc))
    return NULL;

  ALL_OBJSECTIONS (objfile, osect)
  {
    /* Only process each object file once, even if there's a separate
       debug file.  */
    if (objfile->separate_debug_objfile_backlink)
      continue;

    if (obj_section_addr (osect) <= pc && pc < obj_section_endaddr (osect))
      {
	Py_RETURN_TRUE;
      }
  }

  Py_RETURN_FALSE;
}
Comment 2 Pedro Alves 2015-11-25 10:35:21 UTC
Did you consider exposing the list of Objfile sections to Python?
Something like Objfile.sections() to get the section list, and then each Section object could expose the addr/endaddr range.
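
For illustration, here is a minimal sketch of how an unwinder might use such an interface. Note that Objfile.sections() and the addr/endaddr attributes are only the proposal from this comment, not an existing gdb Python API; the Section tuple, address_is_covered and EXAMPLE below are hypothetical stand-ins so that the logic can be run outside gdb.

```python
from collections import namedtuple

# Stand-in for the proposed (not existing) Section object covering the
# half-open address range [addr, endaddr).
Section = namedtuple("Section", ["name", "addr", "endaddr"])


def address_is_covered(pc, objfile_sections):
    """Return True if pc falls inside any section of any objfile.

    objfile_sections is an iterable of section lists, one list per
    objfile, as a hypothetical Objfile.sections() might return them.
    """
    for sections in objfile_sections:
        for sec in sections:
            if sec.addr <= pc < sec.endaddr:
                return True
    return False


# Made-up layout: a main executable and one shared library.
EXAMPLE = [
    [Section(".text", 0x400000, 0x401000),
     Section(".data", 0x601000, 0x601800)],
    [Section(".text", 0x7F0000000000, 0x7F0000020000)],
]
```

An unwinder could then return None early whenever address_is_covered() is true for the frame's PC, which is the same test the C hack in comment 1 performs over all object sections.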
Comment 3 Tom Tromey 2015-11-25 18:13:00 UTC
(In reply to Pedro Alves from comment #2)
> Did you consider exposing the list of Objfile sections to Python?
> Something like Objfile.sections() to get the section list, and then each
> Section object could expose the addr/endaddr range.

Yeah, that would work nicely.

I also thought about just having my python code parse /proc/$/maps :)
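
A rough sketch of what that parsing could look like, assuming a live local Linux process whose maps file can be read as text. The helper names parse_maps and is_file_backed and the SAMPLE contents are made up for illustration; treating "path starts with /" as "covered by an object file" is a heuristic, not something gdb itself does.

```python
def parse_maps(text):
    """Parse the text of a Linux maps file into (start, end, perms, path)."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        # Anonymous mappings have no sixth field at all.
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append((start, end, fields[1], path))
    return mappings


def is_file_backed(pc, mappings):
    """True if pc lies in a mapping whose path looks like a real file."""
    for start, end, _perms, path in mappings:
        if start <= pc < end:
            # "[stack]", "[heap]" and anonymous regions do not start with "/".
            return path.startswith("/")
    return False


# Invented sample: an executable, an anonymous JIT-like region, the stack.
SAMPLE = """\
00400000-0040c000 r-xp 00000000 08:01 131104                             /usr/bin/js
7f3c10000000-7f3c10021000 rwxp 00000000 00:00 0
7ffd5a1e0000-7ffd5a201000 rw-p 00000000 00:00 0                          [stack]
"""
```

With this, an address inside the anonymous rwxp region is reported as not file-backed, which is the case a JIT unwinder would want to handle.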
Comment 4 Tom Tromey 2015-12-03 18:21:01 UTC
I was hoping to parse "info proc maps" because that would work with
an unpatched gdb 7.10.  However, that fails from an unwinder
because info_proc_cmd_1 calls get_current_arch, which needs a frame.

I may try "remote get /proc/.../maps" instead, eww.
Or just require local debugging.
Comment 5 Tom Tromey 2016-04-08 13:57:20 UTC
This turns out to not work great in practice, because people here 
want to do core file debugging and also use 'rr' (and hence target remote).
We're accumulating hacks for the moment, but it would be great to
have an upstream fix.
Comment 6 Andrew Dinn 2016-04-08 16:01:17 UTC
This is also an issue for the OpenJDK unwinder, and it is probably also the case that running the Python unwinder after all the other unwinders have had a go would resolve the problem.

I also observed gdb crashes and also Python stack overflows in my initial implementation.

The stack overflow only happens for some unwind attempts; in particular, it appears when the base frame for the pending frame has level -1. When the unwinder calls back into certain gdb routines, gdb tries to re-establish the frame stack, so the unwinder gets called again and makes the same callout, and so on.

The crash seems to be able to happen with or without the recursion, and again only happens for certain starting frames and only when the unwinder calls back into certain gdb routines. The problem is that in these circumstances the gdb routine decides to call reinit_frame_stack. Even if the recursion is avoided, this still ends up freeing the block that contains the base frame which backs the pending_frame passed to the unwinder's __call__ method, i.e. the code which invoked __call__ is left with a dangling pointer. By the time the unwinder returns, the block has usually been reallocated and scribbled over. Invariably, gdb asserts on return from __call__.

I bypassed the recursion problem by detecting recursive calls to my unwinder's __call__ method (using an association list keyed by thread to detect an outstanding call) and backing out of any call which results in recursion to return None. This appears to bypass the check of the invalid base frame and leaves the problematic frames for some other sniffer to deal with.
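
The guard described above might be sketched roughly as follows. This is not the actual OpenJDK unwinder code: ReentrancyGuard is a hypothetical name, a set stands in for the association list, and threading.get_ident is only a placeholder key so the sketch runs outside gdb (inside gdb the key would presumably identify the gdb thread being unwound).

```python
import threading


class ReentrancyGuard:
    """Wrap an unwinder-like callable; nested calls on a thread return None."""

    def __init__(self, unwinder, thread_key=threading.get_ident):
        self._unwinder = unwinder
        self._thread_key = thread_key  # placeholder; see note above
        self._active = set()           # keys with an outstanding call

    def __call__(self, pending_frame):
        key = self._thread_key()
        if key in self._active:
            # Recursive entry: decline so that some other sniffer can
            # deal with the frame.
            return None
        self._active.add(key)
        try:
            return self._unwinder(pending_frame)
        finally:
            # Always clear the marker, even if the unwinder raises.
            self._active.discard(key)
```

Note that this only stops the recursion; it does nothing about a frame stack reinit freeing the base frame underneath the caller.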

I bypassed cases where a crash was caused without recursive entry simply by tweaking my python code until it went away.

This is clearly a bit inadequate, especially the latter 'fix', which is really a time bomb. It would be better if the gdb Python API could i) detect recursive entry into an unwinder and ii) detect that reinit_frame_stack (and, indeed, any other code which might inadvertently leave a dangling pointer) has been called below a frame sniffer. It should really throw an exception so the sniffer can back out, rather than leaving potential for a gdb crash.
Comment 7 Pedro Alves 2016-04-08 17:40:15 UTC
Thanks Andrew.  Filed Bug 19927 for this recursion issue.
Comment 8 Pedro Alves 2016-04-13 22:29:33 UTC
> when the base frame for the pending frame has level -1.

level -1 sounds like the sentinel frame.  FYI, this is used as a starting point for creating the innermost frame from the thread's current registers.  See gdb/sentinel-frame.c.
Comment 9 Pedro Alves 2016-04-13 22:36:41 UTC
On the unwinder sniffer ordering, we could maybe try only the Python unwinders just before falling back to the heuristic/prologue-parsing-based arch unwinders.  So if we have DWARF debug/unwind info for the current PC, we never consult the Python unwinders at all.  Would there be a use case where we'd want Python unwinders to override DWARF debug/unwind info?

Even then, I assume you'd still want to be able to determine whether a PC is within the objfile you care about, so that your Python unwinder sniffer can say "nope, not mine", and then gdb falls back to the arch prologue unwinder.
I.e., we'd still need "a way to see if an address is covered by some existing object" even if the order in which the frame sniffers are interrogated is changed.

Is that a good assumption?
Comment 10 Andrew Dinn 2016-04-14 08:24:48 UTC
(In reply to Pedro Alves from comment #9)
> On the unwinder sniffer ordering, we could maybe try only the Python
> unwinders just before falling back to the heuristic/prologue-parsing-based
> arch unwinders.  So if we have DWARF debug/unwind info for the current PC,
> we never consult the Python unwinders at all.  Would there be a use case
> where we'd want Python unwinders to override DWARF debug/unwind info?
> 
> Even then, I assume you'd still want to be able to determine whether a PC is
> within the objfile you care about, so that your Python unwinder sniffer can
> say "nope, not mine", and then gdb falls back to the arch prologue unwinder.
> I.e., we'd still need "a way to see if an address is covered by some
> existing object" even if the order of the frame sniffers is interrogated is
> changed.
> 
> Is that a good assumption?

First off, I think the problem with JITted code is that it is not associated with an object file. JITted code gets written into an anonymously mapped data segment. So, I don't think we can rely on any association between unwinders and objfiles to limit when our python unwinders get employed.

I think Tom and I are both assuming (well, at least, I am) that putting our python unwinders further down the chain will mean

  i) other standard unwinders will grab the problematic frames (such as level == -1 frames)

  ii) frames with JITted code pcs will be recognised as such by those standard unwinders and ignored -- so will eventually trickle down to the python unwinders

That's perhaps a questionable hypothesis.

a) It may be that one or more of the standard unwinders operates a fine sieve matching policy which might catch some JITted pcs. Those unwinders would have to follow the python unwinders.

b) If we are really unlucky then we might also find that those are the unwinders which handle the problematic frames.

So, it is probably worth trying to re-order the unwinders and see what happens (one quick way to test this would be to give all sniffers a hard-wired priority and keep the sniffer chains priority-ordered). If we hit problem b) then the unwinders will have to become smarter and will probably also need access to more details of the underlying frame (e.g. its level) in order to do so.
Comment 11 Pedro Alves 2016-04-14 10:09:23 UTC
Ah, I missed Tom's "I only want my unwinder to be considered when a PC is NOT covered by some existing objfile." (emphasis mine).
Comment 12 Pedro Alves 2016-04-14 10:14:39 UTC
> i) other standard unwinders will grab the problematic frames (such as level == -1 frames)

I'm not sure how you get that.  I tried "b pyuw_sniffer if this_frame->level == -1" (with no actual Python unwinder installed, alas) and it doesn't trigger.  Sounds like consequences from the recursion/reinit_frame_cache and then bad things happening.  AFAICS, you should never get the sentinel frame as a PendingFrame.
Comment 13 Pedro Alves 2016-04-14 10:30:30 UTC
> So, it is probably worth trying to re-order the unwinders and see what happens 
> (one quick way to test this would be to give all sniffers a hard-wired 
> priority and keep the sniffer chains priority-ordered).

Agreed, we need something like that.

Seems like the order should then be:

- The "Accurate unwinders"

  These would be the DWARF / x64 SEH based ones.

- JIT unwinders 

  Python/Guile unwind API unwinders, and also the C JIT-reader API unwinder, in jit.c.

- Fallback prologue unwinders

Grepping for frame_unwind_prepend_unwinder / frame_unwind_append_unwinder may find some odd case where some arch may want the prologue unwinder over the dwarf one.  But hopefully not.
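
To make the proposed ordering concrete, here is a small pure-Python model of a priority-ordered sniffer chain. It is only an illustration of the idea, not how gdb's frame-unwind tables are implemented; SnifferChain, the three priority constants and the address ranges in demo_chain are all invented for the example.

```python
import bisect

# Hypothetical hard-wired priorities, lowest value tried first.
ACCURATE, JIT, FALLBACK = 0, 1, 2


class SnifferChain:
    """Keep sniffers sorted by priority; the first one to accept wins."""

    def __init__(self):
        self._entries = []  # (priority, sequence, name, sniffer)
        self._seq = 0       # preserves registration order within a priority

    def register(self, priority, name, sniffer):
        bisect.insort(self._entries, (priority, self._seq, name, sniffer))
        self._seq += 1

    def sniff(self, pc):
        for _prio, _seq, name, sniffer in self._entries:
            if sniffer(pc):
                return name
        return None


def demo_chain():
    chain = SnifferChain()
    # Registered in the "wrong" order on purpose; priority fixes it up.
    chain.register(FALLBACK, "prologue", lambda pc: True)
    chain.register(JIT, "python-jit", lambda pc: 0x7000 <= pc < 0x8000)
    chain.register(ACCURATE, "dwarf", lambda pc: 0x1000 <= pc < 0x2000)
    return chain
```

In this model the fallback sniffer accepts everything, so a JIT sniffer only gets a chance at a frame if it sits above the fallback in the chain.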
Comment 14 Tom Tromey 2016-04-14 18:12:55 UTC
(In reply to Pedro Alves from comment #13)
> > So, it is probably worth trying to re-order the unwinders and see what happens 
> > (one quick way to test this would be to give all sniffers a hard-wired 
> > priority and keep the sniffer chains priority-ordered).
> 
> Agreed, we need something like that.
> 
> Seems like the order should then be:
> 
> - The "Accurate unwinders"
> 
>   These would be the DWARF / x64 SEH based ones.
> 
> - JIT unwinders 
> 
>   Python/Guile unwind API unwinders, and also the C JIT-reader API unwinder,
> in jit.c.
> 
> - Fallback prologue unwinders

There are two scenarios worth considering here.

One is, suppose there is an objfile without debuginfo.  In this case,
it seems that the frame will be presented to my Python unwinder.  However,
I already know my unwinder can't deal with this.  So I would still
appreciate some way of finding out whether a given PC is in some objfile.
Simply moving the Python unwinder lower won't work because the prologue
unwinder might "make sense" of a JIT frame, even though it really can't.

The other scenario is speculative: if someone wrote a caching JIT that
wrote out object code and dlopened it in a later invocation.  But I think
the above ordering suffices for that.
Comment 15 Pedro Alves 2016-04-14 18:22:45 UTC
> One is, suppose there is an objfile without debuginfo.  In this case,
> it seems that the frame will be presented to my Python unwinder.

Yes.

> However, I already know my unwinder can't deal with this.  So I would still
> appreciate some way of finding out whether a given PC is in some objfile.
> Simply moving the Python unwinder lower won't work because the prologue
> unwinder might "make sense" of a JIT frame, even though it really can't.

But wouldn't it be more robust to check whether it's an address you know you 
can unwind (by consulting whatever tables the JIT uses internally), instead of 
checking whether the program stopped at an address you can't unwind?

E.g., the PC may jump out to some other mmaped code (or some wild address) that
falls outside of any objfile and is unrelated to your JIT's code (maybe another
unrelated JIT is loaded in the process).  So I'm thinking that that would be an
optimization, rather than a requirement.  Is that a correct view?  Just trying to
understand things, not pushing back on the idea.

> The other scenario is speculative: if someone wrote a caching JIT that
> wrote out object code and dlopened it in a later invocation.  But I think
> the above ordering suffices for that.

Yeah, I'd assume that if you wanted those objects to be unwound with the python unwinder, it'd be because you're not going to include dwarf info in them.
Comment 16 Tom Tromey 2016-04-14 18:34:11 UTC
(In reply to Pedro Alves from comment #15)

> But wouldn't it be more robust to check whether it's an address you know you
> can unwind (by consulting whatever tables the JIT uses internally), instead of
> checking whether the program stopped at an address you can't unwind?

For sure, but we ran into other problems when trying to do this:
https://bugzilla.mozilla.org/show_bug.cgi?id=1259867
(I couldn't reproduce this in the same way to debug it, instead I
got an infinite recursion in gdb -- basically doing anything in
an unwinder is extremely fiddly and difficult to get right)
Comment 17 Pedro Alves 2016-04-14 19:16:04 UTC
> Starting program: /home/nicolas/mozilla/_build/js/bugzil.la/1258397/commit/x64/gcc48/dbg/js/src/js ./foo.js
> linux-thread-db.c:1675: internal-error: find_new_threads_once: Assertion 
> `!target_has_execution || thread_db_use_events ()' failed.
> A problem internal to GDB has been detected,
> further debugging may prove unreliable.

FYI, this bug is fixed.  Bug 19676.
Comment 18 Tom Tromey 2016-04-14 19:29:27 UTC
Yeah, for me it failed in a different way:

https://bugzilla.mozilla.org/show_bug.cgi?id=1259867#c7

... maybe because we're lazily looking up symbols (and evaluating
some of them) in the unwinder; which in turn is a workaround
for some other problem I didn't diagnose.

It's kind of a long thread to pull to make this all work the
correct way.  I'm in favor of it all but it's not my main job
so I can't put much effort into it.
Comment 19 Andrew Dinn 2016-04-15 09:36:06 UTC
(In reply to Tom Tromey from comment #16)
> (In reply to Pedro Alves from comment #15)
> 
> > But wouldn't it be more robust to check whether it's an address you know you
> > can unwind (by consulting whatever tables the JIT uses internally), instead of
> > checking whether the program stopped at an address you can't unwind?
> 
> For sure, but we ran into other problems when trying to do this:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1259867
> (I couldn't reproduce this in the same way to debug it, instead I
> got an infinite recursion in gdb -- basically doing anything in
> an unwinder is extremely fiddly and difficult to get right)

Indeed, my unwinder also ran into problems, albeit different ones. The first thing my OpenJDK unwinder tries to do is range check the PC it is presented. That sounds like the solution, right? Only, in order to do this the unwinder needs to look up and cache some JVM state -- i.e. the start and end address of the region into which JITted code is generated. It has to do this one-off init lazily from the unwinder itself because this data is dynamically assigned so is not available when the unwinder code gets loaded.

The result is that my unwinder gets called with some decidedly dodgy frames before lazy init has provided the data needed to range check and reject those frames. Catch 22! If the unwinder naively executes the Python to do the lazy init then -- precisely because these early frames are dodgy -- the stack blows up (worse, a frame reinit trashes the pending frame, leaving a pending gdb crash).

So, I found a hack to detect the potential blow-up and back out. That means I can now get to the point where I can do the lazy init with a safe frame. Clearly it would be better if:

  i) things did not blow up

and/or

  ii) the unwinder didn't have to see the dodgy frames

As Tom said elsewhere there is a cluster of different problems here that are all conniving to make it hard for unwinders to /get/ to the position where they can know what to do. I'm very happy now that I have found a way of ensuring /my/ unwinder bypasses the bad inputs. My concern now is not that it cannot be done but rather the fragility of the mechanism I and other implementors are relying on. It's hard to say whether this is just a question of fixing a few bugs or a problem with the current design until it becomes clearer what is actually breaking and why.
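
As a rough illustration of the lazy range check discussed in this comment, here is a sketch that declines frames until the bounds can be obtained. It is not the OpenJDK unwinder's code: JitCodeRange and lookup_bounds are hypothetical names, and lookup_bounds stands in for whatever gdb calls would read the start and end of the JIT code region from the JVM.

```python
class JitCodeRange:
    """Range check a PC against JIT code bounds that are looked up lazily."""

    def __init__(self, lookup_bounds):
        self._lookup_bounds = lookup_bounds  # hypothetical callable -> (low, high)
        self._bounds = None
        self._initializing = False

    def contains(self, pc):
        if self._bounds is None:
            if self._initializing:
                # Re-entered while the lookup is still running: decline.
                return False
            self._initializing = True
            try:
                self._bounds = self._lookup_bounds()
            except Exception:
                # State not available yet; try again on a later frame.
                return False
            finally:
                self._initializing = False
        low, high = self._bounds
        return low <= pc < high
```

An unwinder built on this would return None whenever contains() is false. The sketch only covers the "not initialised yet" and re-entry cases; it does not protect against the lookup itself triggering a frame reinit inside gdb.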
Comment 20 Pedro Alves 2016-04-15 09:43:32 UTC
Is there a dead-simple recipe somewhere one could use to see these issues trigger?  Some hack done to gdb's gdb/testsuite/gdb.python/py-unwind.py, perhaps?
Comment 21 Andrew Dinn 2016-04-15 09:58:18 UTC
(In reply to Pedro Alves from comment #20)
> Is there a dead-simple recipe somewhere one could use to see these issues
> trigger?  Some hack done to gdb's gdb/testsuite/gdb.python/py-unwind.py,
> perhaps?

If you make a minor change to my OpenJDK unwinder (commenting out a few lines) then you should be able to reliably trigger the stack blow up and frame reinit problems when you use it to debug an OpenJDK install. Do you want instructions here or offline?
Comment 22 Pedro Alves 2016-04-15 10:04:28 UTC
Maybe post them to Bug 19288 ?
Comment 23 Pedro Alves 2016-04-15 10:05:29 UTC
Bah, I meant Bug 19927 ...
Comment 24 Pedro Alves 2016-11-23 11:03:13 UTC
@Tromey, I'm wondering whether the recursion issues you saw that prevented the "Replace /proc/maps by TLS lookup of Jit informations" [1] strategy from being the default [2] still exist with GDB master.

[1] - https://bugzilla.mozilla.org/show_bug.cgi?id=1259867#c7
[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1261426