Created attachment 10865 [details] test-case Hello. This bug is very illusive, so I spent quite some time creating this fully automated test case. Unpack tst.tar.gz. There is a Makefile with testing targets. "make pure" will compile and run the test-case w/o gdb. You should see the following: --- ./tst TEST PASSED! size=512 --- Now run "make gdb_good": --- Function "DynAlloc" not defined. Breakpoint 1 (DynAlloc) pending. warning: Corrupted shared library list: 0x613e90 != 0x7ffff7ffd990 Breakpoint 1, 0x00007ffff6caaff0 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so Breakpoint 1, DynAlloc (what=0x7ffff6cf6a9f "dosobj", num=1, size=512) at dyninit.cc:59 59 if (size != 512) TEST PASSED! size=512 [Inferior 1 (process 31269) exited normally] --- It will use the .gdbinit script that does not trigger the bug. It is needed to show what arguments the function "DynAlloc" is called with. "size=512" is a correct value. Now run "make gdb_bad": --- Function "DynAlloc" not defined. Breakpoint 1 (DynAlloc) pending. warning: Corrupted shared library list: 0x613e90 != 0x7ffff7ffd990 Breakpoint 1, 0x00007ffff6caaff0 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so #1 0x00007ffff6cf4560 in dosobj_init () at dosobj.cc:16 16 __FAR(void)fa = DynAlloc("dosobj", 1, size); $1 = 512 0x00007ffff6caaff6 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so Breakpoint 1, DynAlloc (what=0x7ffff6cf6a9f "dosobj", num=1, size=4140478454) at dyninit.cc:59 59 if (size != 512) TEST FAILED! size=-154488842 [Inferior 1 (process 31420) exited normally] --- This shows that the size argument is corrupted. There is a custom .gdbinit script here that is triggering the race condition in gdb. What it does is sets the breakpoint to DynAlloc, triggers this breakpoint, waits one second, then does "c". The argument that resides in $rcx, gets corrupted during this "wait one second" thing. The "good" gdbinit script just doesn't wait one second, so corruption does not happen. The test-case arms the SIGALRM handler which should come while gdb is waiting. Then the $rcx corruption happens. Note that the signal handler is empty, so by itself it can't corrupt $rcx. Also the corruption does not happen if the PLT for DynAlloc is disabled via visibility=hidden. So really it is something very tricky: PLT+signal+breakpoint - all come into the game.
Also it is very interesting to see the problem in interactive mode. Remove any .gdbinit, then do: --- $ make $ gdb ./tst ... Reading symbols from ./tst...done. (gdb) b DynAlloc Function "DynAlloc" not defined. Make breakpoint pending on future shared library load? (y or [n]) y Breakpoint 1 (DynAlloc) pending. (gdb) r Starting program: /home/stas/tests/fdpp/tst warning: Corrupted shared library list: 0x613e90 != 0x7ffff7ffd990 Breakpoint 1, 0x00007ffff6caaff0 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so Missing separate debuginfos, use: dnf debuginfo-install libgcc-7.2.1-2.fc27.x86_64 libstdc++-7.2.1-2.fc27.x86_64 (gdb) p /x $rcx $1 = 0x200 (gdb) disas Dump of assembler code for function _Z8DynAllocPKcjj@plt: => 0x00007ffff6caaff0 <+0>: jmpq *0x24d62a(%rip) # 0x7ffff6ef8620 0x00007ffff6caaff6 <+6>: pushq $0xc1 0x00007ffff6caaffb <+11>: jmpq 0x7ffff6caa3d0 End of assembler dump. (gdb) si 0x00007ffff6caaff6 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so (gdb) disas Dump of assembler code for function _Z8DynAllocPKcjj@plt: 0x00007ffff6caaff0 <+0>: jmpq *0x24d62a(%rip) # 0x7ffff6ef8620 => 0x00007ffff6caaff6 <+6>: pushq $0xc1 0x00007ffff6caaffb <+11>: jmpq 0x7ffff6caa3d0 End of assembler dump. (gdb) p /x $rcx $2 = 0x7ffff6caaff6 (gdb) --- In the above example I did "si" inside PLT, which just followed the "jmp" instruction. As can be seen, $rcx is valid before that "si" and is corrupted after. This only happens if SIGALRM comes in between.
Additionally, SIGALRM can be suppressed with "handle SIGALRM nopass". The corruption will happen nevertheless, meaning that it is exactly gdb who corrupts the register, not the signal handler. If the SIGALRM is not armed by the test-case - no corruption. But if it is armed but suppressed by gdb - corruption. So for corruption to happen, gdb should "see" the signal, but it doesn't matter whether the signal handler is called or not.
I know this is a long shot on such an old bug, but the tar file includes a pre-compiled libfdpp.so library, but not the source to reproduce it. Could that source be made available?
(In reply to Andrew Burgess from comment #3) > I know this is a long shot on such an old bug, but the tar file includes a > pre-compiled libfdpp.so library, but not the source to reproduce it. Could > that source be made available? Sources of libfdpp are of course available on github, but they have changed a lot in 5 years, plus I think I added an extra hacks there for the sake of this test-suit. All that is involved from libfdpp here, is a plt entry. Do you really think the sources will help? If you want, I can try to reduce this test-case even more, use the custom lib instead of libfdpp and so on. But will this help, if all we need is plt? Is there something particular you want to see from the source of libfdpp? The wrapper that sets up SIGALRM and does dlopen(), is included in a source form and I was quite confident this is all we need, no?
Created attachment 14677 [details] Minimal reproducer Decompiled the shared library in the original reproducer, and created a new, minimal example of this issue, this time in source only. To use: tar -xf pr-22921.tar.xz cd pr-22921 make make debug The test should report FAILED. Then: make run The test, run outside GDB, should report PASSED. I haven't completely discounted that this might be a program error, but if it is, I can't see it yet. Also, when run under GDB the cmd_apb.gdb script is used. You'll notice inside it this line: ##delete which deletes all breakpoints. if this line is un-commented then the test, when run under GDB starts working again. So it does appear that the very presence of the software breakpoint causes the test to fail. I've done no additional debuggging of the issue so far.
Thanks! Your test-case doesn't fail here though: +file tst2 +set breakpoint pending on +break apb_magic_func(void*, char const*, int, unsigned int)@plt Function "apb_magic_func(void*, char const*, int, unsigned int)@plt" not defined. Breakpoint 1 (apb_magic_func(void*, char const*, int, unsigned int)@plt) pending. +run This GDB supports auto-downloading debuginfo from the following URLs: https://debuginfod.ubuntu.com Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal] Debuginfod has been disabled. To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit. [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". ALARM ALARM Breakpoint 1, 0x00007ffff7fbb0f0 in apb_magic_func(void*, char const*, int, unsigned int)@plt () from ./libapb.so +shell sleep 1 +continue ALARM PASSED I run "make debug", is this how its supposed to run?
Created attachment 14678 [details] test case I checkouted the old sources from git and did the minimal reproducer because yours doesn't work for me. Unfortunately I couldn't get rid of the large C++ header, because if that func returns something other than the object it does return now, the reproducer stops working.
Also if lib is built with g++ instead of clang++, reproducer stops working. So in my new makefile it is built with clang++.
Created attachment 14679 [details] Updated minimal reproducer I understand the problem better now. The issue as originally reported, and as I had it in the first reproducer was very timing dependent. The bug is for a signal arriving during a displaced step. As such it's not surprising that Stas was unable to reproduce the issue with my first example. However, now I know what the issue is, I've updated the reproducer. Instead of using a timer to deliver the signal I wait until I'm in the PLT, then send a signal to the inferior using the shell. After that I continue, GDB sets up the displaced step, resumes the inferior, and immediately takes the signal. Some consequence of the displaced step means that the inferior's $rcx value becomes corrupted. We might be tempted to try using 'queue-signal' from the GDB prompt, rather than sending the signal via the shell, however, this will not work. The signal _HAS_ to arrive during the displaced step. Using queue-signal means that GDB doesn't perform the displaced step, and instead immediately sends the inferior the signal. To reproduce this bug the signal _must_ arrive during the displaced step. Finally, here's my GDB session showing the bug. I've got some debug turned on, and we can see where the corruption is coming from thanks to my annotated lines (look for "<======="): Breakpoint 2, 0x00007ffff7fc5070 in apb_magic_func(void*, char const*, int, unsigned int)@plt () from ./libapb.so 1: x/i $pc => 0x7ffff7fc5070 <_Z14apb_magic_funcPvPKcij@plt>: jmp *0x2fc2(%rip) # 0x7ffff7fc8038 <_Z14apb_magic_funcPvPKcij@got.plt> 2: /x $rcx = 0x200 +python cmd = "shell kill -s SIGALRM %d" % gdb.selected_inferior().pid +python gdb.execute(cmd) ++shell kill -s SIGALRM 2799375 +continue [displaced] displaced_step_prepare_throw: displaced-stepping 2799375.2799375.0 now [displaced] prepare: selected buffer at 0x401072 [displaced] prepare: saved 0x401072: 1e fa 31 ed 49 89 d1 5e 48 89 e2 48 83 e4 f0 50 [displaced] fixup_riprel: %rip-relative addressing used. [displaced] fixup_riprel: using temp reg 2, old value 0x200, new value 0x7ffff7fc5076 <============================== Note the "new" value here. [displaced] amd64_displaced_step_copy_insn: copy 0x7ffff7fc5070->0x401072: ff a1 c2 2f 00 00 68 04 00 00 00 e9 a0 ff ff ff [displaced] displaced_step_prepare_throw: prepared successfully thread=2799375.2799375.0, original_pc=0x7ffff7fc5070, displaced_pc=0x401072 [displaced] resume_1: run 0x401072: ff a1 c2 2f [displaced] finish: restored 2799375.2799375.0 0x401072 Program received signal SIGALRM, Alarm clock. [displaced] displaced_step_prepare_throw: displaced-stepping 2799375.2799375.0 now [displaced] prepare: selected buffer at 0x401072 [displaced] prepare: saved 0x401072: 1e fa 31 ed 49 89 d1 5e 48 89 e2 48 83 e4 f0 50 [displaced] fixup_riprel: %rip-relative addressing used. [displaced] fixup_riprel: using temp reg 2, old value 0x7ffff7fc5076, new value 0x7ffff7fc5076 [displaced] amd64_displaced_step_copy_insn: copy 0x7ffff7fc5070->0x401072: ff a1 c2 2f 00 00 68 04 00 00 00 e9 a0 ff ff ff [displaced] displaced_step_prepare_throw: prepared successfully thread=2799375.2799375.0, original_pc=0x7ffff7fc5070, displaced_pc=0x401072 [displaced] resume_1: run 0x401072: ff a1 c2 2f [displaced] finish: restored 2799375.2799375.0 0x401072 [displaced] amd64_displaced_step_fixup: fixup (0x7ffff7fc5070, 0x401072), insn = 0xff 0xa1 ... [displaced] amd64_displaced_step_fixup: restoring reg 2 to 0x7ffff7fc5076 FAILED: 4160508022 (0xf7fc5076) <============================== Look familiar? It's the lower 32-bits of the "new" value!
Comment on attachment 14677 [details] Minimal reproducer I've added a better reproducer now.
The problem is arising by GDB taking this code path: https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/displaced-stepping.c;h=06b32a80f6a0ec54284aff2ddd456ea1e2625ecf;hb=HEAD#l270 As a result of taking this route we don't perform any arch specific cleanup, which means that the $rcx value is left with its new temporary state.
So you are really at it, that's cool! :) Not that I can use your new test-case, as it now works like this: Program received signal SIGALRM, Alarm clock. [displaced] displaced_step_prepare_throw: displaced-stepping 2910780.2910780.0 now [displaced] prepare: selected buffer at 0x5555555550c2 [displaced] prepare: saved 0x5555555550c2: 1e fa 31 ed 49 89 d1 5e 48 89 e2 48 83 e4 f0 50 [displaced] amd64_displaced_step_copy_insn: copy 0x7ffff7fbb0d0->0x5555555550c2: f3 0f 1e fa f2 ff 25 45 2f 00 00 0f 1f 44 00 00 [displaced] displaced_step_prepare_throw: prepared successfully thread=2910780.2910780.0, original_pc=0x7ffff7fbb0d0, displaced_pc=0x5555555550c2 [displaced] resume_1: run 0x5555555550c2: f3 0f 1e fa [displaced] finish: restored 2910780.2910780.0 0x5555555550c2 [displaced] amd64_displaced_step_fixup: fixup (0x7ffff7fbb0d0, 0x5555555550c2), insn = 0xf3 0x0f ... [displaced] amd64_displaced_step_fixup: relocated %rip from 0x5555555550c6 to 0x7ffff7fbb0d4 PASSED Breakpoint 1, breakpt () at tst2.cc:18 18 } 1: x/i $pc => 0x5555555551de <_Z7breakptv+8>: nop 2: /x $rcx = 0x7ffff7d0cd94 (gdb) And falls inside the gdb prompt, with "PASSED" already written... But I guess I'll just try my own test-case on an already fixed gdb some time later. :)
I posted a patch for this bug: https://sourceware.org/pipermail/gdb-patches/2023-February/197316.html
The master branch has been updated by Andrew Burgess <aburgess@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=cf141dd8ccd36efe833aae3ccdb060b517cc1112 commit cf141dd8ccd36efe833aae3ccdb060b517cc1112 Author: Andrew Burgess <aburgess@redhat.com> Date: Wed Feb 22 12:15:34 2023 +0000 gdb: fix reg corruption from displaced stepping on amd64 This commit aims to address a problem that exists with the current approach to displaced stepping, and was identified in PR gdb/22921. Displaced stepping is currently supported on AArch64, ARM, amd64, i386, rs6000 (ppc), and s390. Of these, I believe there is a problem with the current approach which will impact amd64 and ARM, and can lead to random register corruption when the inferior makes use of asynchronous signals and GDB is using displaced stepping. The problem can be found in displaced_step_buffers::finish in displaced-stepping.c, and is this; after GDB tries to perform a displaced step, and the inferior stops, GDB classifies the stop into one of two states, either the displaced step succeeded, or the displaced step failed. If the displaced step succeeded then gdbarch_displaced_step_fixup is called, which has the job of fixing up the state of the current inferior as if the step had not been performed in a displaced manner. This all seems just fine. However, if the displaced step is considered to have not completed then GDB doesn't call gdbarch_displaced_step_fixup, instead GDB remains in displaced_step_buffers::finish and just performs a minimal fixup which involves adjusting the program counter back to its original value. The problem here is that for amd64 and ARM setting up for a displaced step can involve changing the values in some temporary registers. If the displaced step succeeds then this is fine; after the step the temporary registers are restored to their original values in the architecture specific code. But if the displaced step does not succeed then the temporary registers are never restored, and they retain their modified values. In this context a temporary register is simply any register that is not otherwise used by the instruction being stepped that the architecture specific code considers safe to borrow for the lifetime of the instruction being stepped. In the bug PR gdb/22921, the amd64 instruction being stepped is an rip-relative instruction like this: jmp *0x2fe2(%rip) When we displaced step this instruction we borrow a register, and modify the instruction to something like: jmp *0x2fe2(%rcx) with %rcx having its value adjusted to contain the original %rip value. Now if the displaced step does not succeed, then %rcx will be left with a corrupted value. Obviously corrupting any register is bad; in the bug report this problem was spotted because %rcx is used as a function argument register. And finally, why might a displaced step not succeed? Asynchronous signals provides one reason. GDB sets up for the displaced step and, at that precise moment, the OS delivers a signal (SIGALRM in the bug report), the signal stops the inferior at the address of the displaced instruction. GDB cancels the displaced instruction, handles the signal, and then tries again with the displaced step. But it is that first cancellation of the displaced step that causes the problem; in that case GDB (correctly) sees the displaced step as having not completed, and so does not perform the architecture specific fixup, leaving the register corrupted. The reason why I think AArch64, rs600, i386, and s390 are not effected by this problem is that I don't believe these architectures make use of any temporary registers, so when a displaced step is not completed successfully, the minimal fix up is sufficient. On amd64 we use at most one temporary register. On ARM, looking at arm_displaced_step_copy_insn_closure, we could modify up to 16 temporary registers, and the instruction being displaced stepped could be expanded to multiple replacement instructions, which increases the chances of this bug triggering. This commit only aims to address the issue on amd64 for now, though I believe that the approach I'm proposing here might be applicable for ARM too. What I propose is that we always call gdbarch_displaced_step_fixup. We will now pass an extra argument to gdbarch_displaced_step_fixup, this a boolean that indicates whether GDB thinks the displaced step completed successfully or not. When this flag is false this indicates that the displaced step halted for some "other" reason. On ARM GDB can potentially read the inferior's program counter in order figure out how far through the sequence of replacement instructions we got, and from that GDB can figure out what fixup needs to be performed. On targets like amd64 the problem is slightly easier as displaced stepping only uses a single replacement instruction. If the displaced step didn't complete the GDB knows that the single instruction didn't execute. The point is that by always calling gdbarch_displaced_step_fixup, each architecture can now ensure that the inferior state is fixed up correctly in all cases, not just the success case. On amd64 this ensures that we always restore the temporary register value, and so bug PR gdb/22921 is resolved. In order to move all architectures to this new API, I have moved the minimal roll-back version of the code inside the architecture specific fixup functions for AArch64, rs600, s390, and ARM. For all of these except ARM I think this is good enough, as no temporaries are used all that's needed is the program counter restore anyway. For ARM the minimal code is no worse than what we had before, though I do consider this architecture's displaced-stepping broken. I've updated the gdb.arch/amd64-disp-step.exp test to cover the 'jmpq*' instruction that was causing problems in the original bug, and also added support for testing the displaced step in the presence of asynchronous signal delivery. I've also added two new tests (for amd64 and i386) that check that GDB can correctly handle displaced stepping over a single instruction that branches to itself. I added these tests after a first version of this patch relied too much on checking the program-counter value in order to see if the displaced instruction had executed. This works fine in almost all cases, but when an instruction branches to itself a pure program counter check is not sufficient. The new tests expose this problem. Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=22921 Approved-By: Pedro Alves <pedro@palves.net>
This issue should now be resolved with current HEAD of master branch.