Bug 17837

Summary: python-injected silent breakpoints broken since 1a853c52
Product: gdb
Reporter: Jan Kiszka <jan.kiszka>
Component: python
Assignee: Not yet assigned to anyone <unassigned>
Status: NEW
Severity: critical
CC: pedro
Priority: P2
Version: HEAD
Target Milestone: ---
Host:
Target:
Build:
Last reconfirmed:

Description Jan Kiszka 2015-01-13 14:43:03 UTC
I've stumbled over a regression in gdb since commit 1a853c52 (make "permanent breakpoints" per location and disableable). My gdb Python scripts [1], which load Linux kernel module symbols as the target loads the modules, now fail.

The involved command is lx-symbols [2]. It installs a silent breakpoint on a kernel function that is called when a module is loaded. Before 1a853c52, the Python callback was invoked normally and the target continued to run. Since af48d08f (1a853c52 is not testable), the int3 instruction (I'm testing with x86) is left in the target, and garbage instructions are executed, causing a kernel oops. The breakpoint is apparently not properly skipped (remove, single-step, re-insert) when resuming the target on return from LoadModuleBreakpoint.stop().
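
As a rough illustration of the remove / single-step / re-insert sequence mentioned above, here is a toy Python model. It is not GDB code: ToyTarget and step_over_breakpoint are made-up names, and each byte stands for one whole instruction, which is a simplifying assumption.

```python
INT3 = 0xCC  # x86 breakpoint opcode


class ToyTarget:
    """Hypothetical target: flat memory, a pc, one byte per instruction."""

    def __init__(self, code):
        self.memory = bytearray(code)
        self.pc = 0
        self.executed = []  # opcodes the toy CPU has run, in order

    def single_step(self):
        # "Execute" whatever byte is at pc, then advance.
        self.executed.append(self.memory[self.pc])
        self.pc += 1


def step_over_breakpoint(target, address, original_byte):
    """Resume from a breakpoint at `address` without executing the trap."""
    target.memory[address] = original_byte  # 1. remove the trap
    target.single_step()                    # 2. run the real instruction
    target.memory[address] = INT3           # 3. re-insert the trap


target = ToyTarget(bytes([0x55, 0x90, 0xC3]))
saved = target.memory[0]
target.memory[0] = INT3  # debugger inserts a breakpoint at address 0
step_over_breakpoint(target, 0, saved)
```

After the call, the toy CPU has executed the original 0x55 byte rather than the trap, and the trap is back in place for the next hit. If step 1 is skipped, the trap byte itself stays in memory while the target resumes, which matches the symptom described here.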

I can provide more details on how to set up a reproduction case, but I would only gather them if desired, as that is not straightforward.

[1] https://lkml.org/lkml/2014/11/20/531
[2] http://git.kiszka.org/?p=linux.git;a=blob;f=scripts/gdb/linux/symbols.py;h=bf05e451c58666add299061046bf1ceb9e82f4ef;hb=d92098e7cf60d31ccd025e56d20c23917ccd0819
Comment 1 Pedro Alves 2015-01-13 15:22:35 UTC
Sounds like GDB is now considering your breakpoint a permanent breakpoint?  Is an 'int3' already in memory (at the address of the breakpoint) when your python script creates/installs the breakpoint?  Does "set debug infrun 1" show that GDB decides to skip a permanent breakpoint?
Comment 2 Jan Kiszka 2015-01-13 15:41:08 UTC
After running lx-symbols, which installs the breakpoint, the target memory still contains the original code. After resuming the target, int3 gets written.

Permanent breakpoint seems to be the right track:

...
infrun: target_wait (-1, status) =
infrun:   42000 [Thread 1],
infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
infrun: TARGET_WAITKIND_STOPPED
infrun: stop_pc = 0xffffffff81512813
loading @0xffffffffa05eb000: /data/linux/build-dbg/crypto/xor.ko
infrun: BPSTAT_WHAT_SINGLE
infrun: no stepping, continue
infrun: resume (step=0, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 1] at 0xffffffff81512813
infrun: resume: skipping permanent breakpoint
infrun: prepare_to_wait
Comment 3 Pedro Alves 2015-01-13 16:07:33 UTC
That code in infrun.c is guarded by:

  if (breakpoint_here_p (aspace, pc) == permanent_breakpoint_here)
    {

I don't see anything obviously broken in breakpoint_here_p.

It all looks as if GDB does see an 'int3' in memory.  Could you debug GDB and put a breakpoint in add_location_to_breakpoint, on the line that does 'loc->permanent = 1'?

I'm going to guess that your python stop hook loads new symbols into GDB, which triggers a breakpoint re-set.  When GDB recreates that breakpoint location (in response to the new symbols), breakpoint_xfer_memory fails to hide the breakpoint from higher layers that read memory, and thus add_location_to_breakpoint believes the int3 is a permanent program breakpoint, while it is really a breakpoint gdb itself has inserted.
Comment 4 Jan Kiszka 2015-01-13 16:13:38 UTC
Such a breakpoint is hit when my breakpoint fires. Anything I should look up in bp_loc_is_permanent as well?
Comment 5 Pedro Alves 2015-01-13 16:21:27 UTC
> Anything I should look up in bp_loc_is_permanent as well?

bp_loc_is_permanent is the function that reads memory off the target, and checks if a breakpoint is already there.  If there's one, but it's one that GDB itself has inserted, then target_read_memory -> breakpoint_xfer_memory is supposed to mask it off.  But for some reason, it sounds like that goes wrong, as if the previous instance of the gdb breakpoint location had been wiped from gdb's tables, but left behind installed on the target.
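
To make the suspected failure mode concrete, here is a toy Python model of the masking idea. It is only a sketch under stated assumptions: ToyLocation, ToyDebugger and their methods are hypothetical names, not GDB's data structures, and the real code is C that works on byte ranges rather than single bytes.

```python
INT3 = 0xCC  # x86 breakpoint opcode


class ToyLocation:
    """Hypothetical stand-in for a breakpoint location record."""

    def __init__(self, address, shadow):
        self.address = address
        self.shadow = shadow  # original byte saved when the trap was written
        self.inserted = True


class ToyDebugger:
    def __init__(self, code):
        self.target_memory = bytearray(code)
        self.locations = []

    def insert_breakpoint(self, address):
        loc = ToyLocation(address, self.target_memory[address])
        self.target_memory[address] = INT3
        self.locations.append(loc)
        return loc

    def read_memory(self, address):
        """Read one byte, masking traps the debugger itself inserted."""
        for loc in self.locations:
            if loc.inserted and loc.address == address:
                return loc.shadow
        return self.target_memory[address]

    def looks_permanent(self, address):
        """True if the (masked) read still shows a trap instruction."""
        return self.read_memory(address) == INT3


dbg = ToyDebugger(bytes([0x55, 0x90, 0xC3]))
loc = dbg.insert_breakpoint(0)
masked_ok = dbg.looks_permanent(0)  # bookkeeping intact: trap is hidden

loc.inserted = False                # record cleared, trap left on the target
misdetected = dbg.looks_permanent(0)
```

In this model masked_ok is False and misdetected is True: once the bookkeeping forgets the trap while the byte is still on the target, a later read sees the debugger's own int3 and takes it for a permanent breakpoint.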
Comment 6 Jan Kiszka 2015-01-14 07:19:32 UTC
Symbol loading definitely plays a role: I've just commented out the part of my stop handler that invokes add-symbol-file, and the breakpoint is no longer handled as permanent.
Comment 7 Jan Kiszka 2015-01-26 18:15:44 UTC
Any news on this? Anything I can do (with my limited knowledge of gdb internals)? Just let me know.
Comment 8 Pedro Alves 2015-01-30 13:40:01 UTC
Sorry, I'm travelling.

We need to figure out why bp_loc_is_permanent sees a trap instruction in memory.

I tried quickly playing with "set breakpoint always-inserted on", and "add-symbol-file", and I couldn't see anything wrong going on.  But then again, that was very limited testing.

We have:

static void
disable_breakpoints_in_unloaded_shlib (struct so_list *solib)
{
...
	/* At this point, we cannot rely on remove_breakpoint
	   succeeding so we must mark the breakpoint as not inserted
	   to prevent future errors occurring in remove_breakpoints.  */
	loc->inserted = 0;


and I could see that causing the issue at hand, given the breakpoint at that point might well be still in memory, but then breakpoint_xfer_memory would not mask it off (as it is marked as not inserted).

But I don't think add-symbol-file/remove-symbol-file ends up in that function; instead it goes through disable_breakpoints_in_freed_objfile, which does not have that issue.

But it may be that some other code in gdb is doing something similar (clearing the inserted flag when it shouldn't).
Comment 9 Pedro Alves 2015-02-03 13:06:21 UTC
I think it should be possible to see the same bug when debugging a userspace program, as long as the script does the same things.  E.g., have gdb debug a program that just does something like:

 void do_init_module(void)
 {
 }

 int main ()
 {
    do_init_module ();
    return 0;
 }

instead of debugging the kernel, and tweak/hack the lx-symbols command to set the breakpoint at that do_init_module, and have it do add-symbol-file/remove-symbol-file to emulate module loading.

Could you do that?  An easy reproducer like that would be very useful.
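
A minimal sketch of what such a hacked-up script might look like, to be sourced from inside gdb.  The file path, load address and the names FakeLoadModuleBreakpoint, add_symbol_file_cmd and remove_symbol_file_cmd are placeholders invented for illustration; only gdb.Breakpoint, its stop() hook, the silent attribute and gdb.execute are existing API.

```python
try:
    import gdb  # only available when this file is sourced from inside gdb
except ImportError:
    gdb = None

# Hypothetical values, chosen only for illustration.
FAKE_MODULE = "/tmp/fake_module.o"
FAKE_ADDRESS = 0x10000000


def add_symbol_file_cmd(path, address):
    """Build the add-symbol-file command line for a given load address."""
    return "add-symbol-file {} {:#x}".format(path, address)


def remove_symbol_file_cmd(path):
    """Build the matching remove-symbol-file command line."""
    return "remove-symbol-file {}".format(path)


if gdb is not None:

    class FakeLoadModuleBreakpoint(gdb.Breakpoint):
        """Silent breakpoint that reloads symbols each time it is hit."""

        def __init__(self, spec):
            super(FakeLoadModuleBreakpoint, self).__init__(spec, internal=True)
            self.silent = True
            self.loaded = False

        def stop(self):
            # Emulate an unload followed by a load of the same "module".
            if self.loaded:
                gdb.execute(remove_symbol_file_cmd(FAKE_MODULE), to_string=True)
            gdb.execute(add_symbol_file_cmd(FAKE_MODULE, FAKE_ADDRESS),
                        to_string=True)
            self.loaded = True
            return False  # never stop; let the program continue

    FakeLoadModuleBreakpoint("do_init_module")
```

Outside gdb the guarded import leaves gdb as None, so only the two command-building helpers are defined; inside gdb the breakpoint is installed as soon as the file is sourced.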
Comment 10 Jan Kiszka 2015-02-04 16:22:52 UTC
Tried this path but haven't succeeded so far. While I was able to shrink the kernel-based reproduction case down to basically

    class LoadModuleBreakpoint(gdb.Breakpoint):
        ....
        def stop(self):
            cmdline = "add-symbol-file /path/module.ko 0xffffffffa05f0000"
            gdb.execute(cmdline, to_string=True)
            return False

and still get the bug, using a normal userspace application with the same binary added on breakpoint hit does not trigger it.

Could it be that the vmlinux layout makes the difference here?