I've stumbled over a regression in gdb since commit 1a853c52 (make "permanent breakpoints" per location and disableable). My gdb Python scripts that load Linux kernel module symbols as the target loads the modules now fail.
The command involved is lx-symbols. It installs a silent breakpoint on a kernel function that is called when a module is loaded. Before 1a853c52, the Python callback was invoked normally and the target continued to run. Since af48d08f (1a853c52 is not testable), the int3 instruction (I'm testing on x86) is left in the target, and garbage instructions are executed, causing a kernel oops. The breakpoint is apparently not properly stepped over (remove, single-step, re-insert) when resuming the target on return from LoadModuleBreakpoint.stop().
I can provide more details on how to set up a reproduction case, but I would only gather them if desired, as that is not straightforward.
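To make the symptom concrete, here is a minimal toy model (plain Python, not GDB code) of the two ways a target can be resumed from a breakpoint address. All names (`Target`, `insert_breakpoint`, `resume_stepping_over`, `resume_skipping_permanent`) and the byte sequence are hypothetical and chosen for illustration only. It assumes the permanent-breakpoint path simply moves the PC past the one-byte int3.

```python
INT3 = 0xCC  # x86 breakpoint opcode, one byte long


class Target:
    """Toy target: a byte array plus a program counter."""

    def __init__(self, code, pc=0):
        self.mem = bytearray(code)
        self.pc = pc


def insert_breakpoint(target, addr):
    """Overwrite one byte with int3 and return the saved original byte."""
    shadow = target.mem[addr]
    target.mem[addr] = INT3
    return shadow


def resume_stepping_over(target, addr, shadow, insn_len):
    """Normal path: remove, single-step the original instruction, re-insert."""
    target.mem[addr] = shadow                            # remove
    executed = bytes(target.mem[addr:addr + insn_len])   # what the CPU runs
    target.pc = addr + insn_len                          # single-step
    target.mem[addr] = INT3                              # re-insert
    return executed


def resume_skipping_permanent(target, addr):
    """Permanent path: leave memory untouched, move the PC past the int3."""
    target.pc = addr + 1
    return target.pc


# Hypothetical code: a 5-byte call instruction followed by a ret.
CODE = bytes([0xE8, 0x10, 0x00, 0x00, 0x00, 0xC3])
```

With `CODE` above, the normal path executes the original five bytes and ends with the PC at offset 5. The permanent path ends with the PC at offset 1, in the middle of the original instruction and with int3 still in memory, which in this model matches the "garbage instructions" observation.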
Sounds like GDB is now considering your breakpoint a permanent breakpoint? Is an 'int3' already in memory (at the address of the breakpoint) when your python script creates/installs the breakpoint? Does "set debug infrun 1" show that GDB decides to skip a permanent breakpoint?
After running lx-symbols, which installs the breakpoint, the target memory still contains the original code. After resuming the target, int3 gets written.
Permanent breakpoint seems to be the right track:
infrun: target_wait (-1, status) =
infrun: 42000 [Thread 1],
infrun: status->kind = stopped, signal = GDB_SIGNAL_TRAP
infrun: stop_pc = 0xffffffff81512813
loading @0xffffffffa05eb000: /data/linux/build-dbg/crypto/xor.ko
infrun: no stepping, continue
infrun: resume (step=0, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 1] at 0xffffffff81512813
infrun: resume: skipping permanent breakpoint
That code in infrun.c is guarded by:
if (breakpoint_here_p (aspace, pc) == permanent_breakpoint_here)
I don't see anything obviously broken in breakpoint_here_p.
It all looks as if GDB does see an 'int3' in memory. Could you debug GDB and put a breakpoint in add_location_to_breakpoint, on the line that does 'loc->permanent = 1'?
My guess: your Python stop hook loads new symbols into GDB, which triggers a breakpoint re-set. When GDB then recreates that breakpoint location (in response to the new symbols), breakpoint_xfer_memory fails to hide the inserted breakpoint from the higher layers that read memory, and thus add_location_to_breakpoint believes the int3 is a permanent program breakpoint, while it is really a breakpoint GDB itself has inserted.
That breakpoint (on 'loc->permanent = 1') is hit when my breakpoint fires. Anything I should look up in bp_loc_is_permanent as well?
> Anything I should look up in bp_loc_is_permanent as well?
bp_loc_is_permanent is the function that reads memory off the target, and checks if a breakpoint is already there. If there's one, but it's one that GDB itself has inserted, then target_read_memory -> breakpoint_xfer_memory is supposed to mask it off. But for some reason, it sounds like that goes wrong, as if the previous instance of the gdb breakpoint location had been wiped from gdb's tables, but left behind installed on the target.
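The masking described above can be sketched as a small toy model in plain Python. This is not GDB's implementation: `Location`, `insert`, `read_memory` and `looks_permanent` are hypothetical stand-ins for the roles played by the breakpoint location, breakpoint_xfer_memory and bp_loc_is_permanent, and the model assumes a one-byte int3 breakpoint.

```python
INT3 = 0xCC  # x86 breakpoint opcode


class Location:
    """Toy stand-in for a breakpoint location (not GDB's bp_location)."""

    def __init__(self, addr):
        self.addr = addr
        self.inserted = False
        self.shadow = None  # original byte saved at insertion time


def insert(mem, loc):
    """Write int3 into memory and remember the byte it replaced."""
    loc.shadow = mem[loc.addr]
    mem[loc.addr] = INT3
    loc.inserted = True


def read_memory(mem, addr, locations):
    """Read one byte, hiding the int3 of any location marked as inserted."""
    for loc in locations:
        if loc.inserted and loc.addr == addr:
            return loc.shadow
    return mem[addr]


def looks_permanent(mem, addr, locations):
    """A location looks permanent if the masked read still shows an int3."""
    return read_memory(mem, addr, locations) == INT3
```

In this model, a location that is inserted and known is masked, so `looks_permanent` is false. If the `inserted` flag is cleared, or the location is dropped from the list, while the int3 is still in memory, the same address looks permanent. That is the failure mode suspected here, though the model does not show which GDB code path causes it.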
Symbol loading definitely plays a role: I've just commented out the part of my stop handler that invokes add-symbol-file, and the breakpoint is no longer handled as permanent.
Any news on this? Anything I can do (with my limited knowledge of gdb internals)? Just let me know.
Sorry, I'm travelling.
We need to figure out why bp_loc_is_permanent sees a breakpoint trap in memory.
I tried quickly playing with "set breakpoint always-inserted on", and "add-symbol-file", and I couldn't see anything wrong going on. But then again, that was very limited testing.
There is this code in breakpoint.c, though:
disable_breakpoints_in_unloaded_shlib (struct so_list *solib)
    /* At this point, we cannot rely on remove_breakpoint
       succeeding so we must mark the breakpoint as not inserted
       to prevent future errors occurring in remove_breakpoints.  */
    loc->inserted = 0;
and I could see that causing the issue at hand, given the breakpoint at that point might well be still in memory, but then breakpoint_xfer_memory would not mask it off (as it is marked as not inserted).
But I don't think add-symbol-file/remove-symbol-file ends up in that function; instead it goes through disable_breakpoints_in_freed_objfile, which does not have that issue.
But it may be that some other code in gdb is doing something similar (clearing the inserted flag when it shouldn't).
I think it should be possible to see the same bug when debugging a userspace program, as long as the script does the same things. E.g., have gdb debug a program that just does something like:
int main ()
instead of debugging the kernel, and tweak/hack the lx-symbols command to set the breakpoint at that do_init_module, and have it do add-symbol-file/remove-symbol-file to emulate module loading.
Could you do that? An easy reproducer like that would be very useful.
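A stripped-down script along those lines might look roughly like the sketch below. It is not the real lx-symbols code: the module path and load address are placeholders, the helper names `add_symbol_cmd` and `remove_symbol_cmd` are made up for illustration, and the program being debugged is assumed to define a function named do_init_module. Whether this triggers the bug is not established.

```python
try:
    import gdb  # only importable inside GDB's embedded Python
except ImportError:
    gdb = None  # allows the helpers below to be used outside GDB

# Placeholder values for illustration, not taken from a real setup.
MODULE_PATH = "/path/module.ko"
MODULE_ADDR = 0xffffffffa05f0000


def add_symbol_cmd(path, addr):
    """Build the add-symbol-file command line for a module."""
    return "add-symbol-file {} {:#x}".format(path, addr)


def remove_symbol_cmd(addr):
    """Build the matching remove-symbol-file command line."""
    return "remove-symbol-file -a {:#x}".format(addr)


if gdb is not None:
    class LoadModuleBreakpoint(gdb.Breakpoint):
        """Silent internal breakpoint that reloads symbols on every hit."""

        def __init__(self, spec):
            super(LoadModuleBreakpoint, self).__init__(spec, internal=True)
            self.silent = True
            self.loaded = False

        def stop(self):
            # Emulate module loading: drop the old symbols, add them again.
            if self.loaded:
                gdb.execute(remove_symbol_cmd(MODULE_ADDR), to_string=True)
            gdb.execute(add_symbol_cmd(MODULE_PATH, MODULE_ADDR),
                        to_string=True)
            self.loaded = True
            return False  # do not stop; let the target continue

    LoadModuleBreakpoint("do_init_module")
```

Sourcing the file inside gdb installs the breakpoint. Running with "set debug infrun 1" would then show whether the "skipping permanent breakpoint" path is taken on resume.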
Tried this path but didn't succeed so far. While I was able to shrink the kernel-based reproduction case down to basically
cmdline = "add-symbol-file /path/module.ko 0xffffffffa05f0000"
and still get the bug, using a normal userspace application with the same binary added on breakpoint hit does not trigger it.
Could it be that the vmlinux layout makes the difference here?