22921 – breakpoint in PLT corrupts function argument in $rcx

Bug 22921 - breakpoint in PLT corrupts function argument in $rcx

Summary: breakpoint in PLT corrupts function argument in $rcx

Status:	RESOLVED FIXED

Alias:	None

Product:	gdb
Classification:	Unclassified
Component:	breakpoints (show other bugs)
Version:	8.0.1

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-03-03 23:56 UTC by Stas Sergeev
Modified:	2023-04-06 13:29 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:

Attachments
test-case (1.04 MB, application/gzip) 2018-03-03 23:56 UTC, Stas Sergeev	Details
Minimal reproducer (1.05 KB, application/x-xz) 2023-02-12 08:10 UTC, Andrew Burgess	Details
test case (4.66 KB, application/gzip) 2023-02-12 10:17 UTC, Stas Sergeev	Details
Updated minimal reproducer (1.11 KB, application/x-xz) 2023-02-12 15:26 UTC, Andrew Burgess	Details
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Stas Sergeev 2018-03-03 23:56:08 UTC

Created attachment 10865 [details]
test-case

Hello.

This bug is very illusive, so I spent quite some
time creating this fully automated test case.

Unpack tst.tar.gz.
There is a Makefile with testing targets.
"make pure" will compile and run the test-case w/o gdb.
You should see the following:
---
./tst
TEST PASSED! size=512
---

Now run "make gdb_good":
---
Function "DynAlloc" not defined.
Breakpoint 1 (DynAlloc) pending.
warning: Corrupted shared library list: 0x613e90 != 0x7ffff7ffd990

Breakpoint 1, 0x00007ffff6caaff0 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so

Breakpoint 1, DynAlloc (what=0x7ffff6cf6a9f "dosobj", num=1, size=512) at dyninit.cc:59
59	  if (size != 512)
TEST PASSED! size=512
[Inferior 1 (process 31269) exited normally]
---

It will use the .gdbinit script that does not trigger
the bug. It is needed to show what arguments the function
"DynAlloc" is called with. "size=512" is a correct value.

Now run "make gdb_bad":
---
Function "DynAlloc" not defined.
Breakpoint 1 (DynAlloc) pending.
warning: Corrupted shared library list: 0x613e90 != 0x7ffff7ffd990

Breakpoint 1, 0x00007ffff6caaff0 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so
#1  0x00007ffff6cf4560 in dosobj_init () at dosobj.cc:16
16	    __FAR(void)fa = DynAlloc("dosobj", 1, size);
$1 = 512
0x00007ffff6caaff6 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so

Breakpoint 1, DynAlloc (what=0x7ffff6cf6a9f "dosobj", num=1, size=4140478454) at dyninit.cc:59
59	  if (size != 512)
TEST FAILED! size=-154488842
[Inferior 1 (process 31420) exited normally]
---

This shows that the size argument is corrupted.
There is a custom .gdbinit script here that is triggering
the race condition in gdb. What it does is sets the
breakpoint to DynAlloc, triggers this breakpoint, waits
one second, then does "c".
The argument that resides in $rcx, gets corrupted during
this "wait one second" thing. The "good" gdbinit script
just doesn't wait one second, so corruption does not happen.
The test-case arms the SIGALRM handler which should come
while gdb is waiting. Then the $rcx corruption happens.
Note that the signal handler is empty, so by itself it
can't corrupt $rcx. Also the corruption does not happen
if the PLT for DynAlloc is disabled via visibility=hidden.
So really it is something very tricky: PLT+signal+breakpoint -
all come into the game.

Comment 1 Stas Sergeev 2018-03-04 00:27:38 UTC

Also it is very interesting to see the problem
in interactive mode. Remove any .gdbinit, then do:

---
$ make
$ gdb ./tst
...
Reading symbols from ./tst...done.
(gdb) b DynAlloc
Function "DynAlloc" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (DynAlloc) pending.
(gdb) r
Starting program: /home/stas/tests/fdpp/tst 
warning: Corrupted shared library list: 0x613e90 != 0x7ffff7ffd990

Breakpoint 1, 0x00007ffff6caaff0 in DynAlloc(char const*, unsigned int, unsigned int)@plt () from ./libfdpp.so
Missing separate debuginfos, use: dnf debuginfo-install libgcc-7.2.1-2.fc27.x86_64 libstdc++-7.2.1-2.fc27.x86_64
(gdb) p /x $rcx
$1 = 0x200
(gdb) disas
Dump of assembler code for function _Z8DynAllocPKcjj@plt:
=> 0x00007ffff6caaff0 <+0>:	jmpq   *0x24d62a(%rip)        # 0x7ffff6ef8620
   0x00007ffff6caaff6 <+6>:	pushq  $0xc1
   0x00007ffff6caaffb <+11>:	jmpq   0x7ffff6caa3d0
End of assembler dump.
(gdb) si
0x00007ffff6caaff6 in DynAlloc(char const*, unsigned int, unsigned int)@plt ()
   from ./libfdpp.so
(gdb) disas
Dump of assembler code for function _Z8DynAllocPKcjj@plt:
   0x00007ffff6caaff0 <+0>:	jmpq   *0x24d62a(%rip)        # 0x7ffff6ef8620
=> 0x00007ffff6caaff6 <+6>:	pushq  $0xc1
   0x00007ffff6caaffb <+11>:	jmpq   0x7ffff6caa3d0
End of assembler dump.
(gdb) p /x $rcx
$2 = 0x7ffff6caaff6
(gdb) 
---

In the above example I did "si" inside PLT, which
just followed the "jmp" instruction. As can be seen,
$rcx is valid before that "si" and is corrupted after.
This only happens if SIGALRM comes in between.

Comment 2 Stas Sergeev 2018-03-04 00:43:38 UTC

Additionally, SIGALRM can be suppressed with "handle SIGALRM nopass".
The corruption will happen nevertheless, meaning that it is
exactly gdb who corrupts the register, not the signal handler.
If the SIGALRM is not armed by the test-case - no corruption.
But if it is armed but suppressed by gdb - corruption.
So for corruption to happen, gdb should "see" the signal,
but it doesn't matter whether the signal handler is called or not.

Comment 3 Andrew Burgess 2023-02-12 06:57:06 UTC

I know this is a long shot on such an old bug, but the tar file includes a pre-compiled libfdpp.so library, but not the source to reproduce it.  Could that source be made available?

Comment 4 Stas Sergeev 2023-02-12 07:35:15 UTC

(In reply to Andrew Burgess from comment #3)
> I know this is a long shot on such an old bug, but the tar file includes a
> pre-compiled libfdpp.so library, but not the source to reproduce it.  Could
> that source be made available?

Sources of libfdpp are of course available on github,
but they have changed a lot in 5 years, plus I think
I added an extra hacks there for the sake of this test-suit.
All that is involved from libfdpp here, is a plt entry.
Do you really think the sources will help?
If you want, I can try to reduce this test-case even
more, use the custom lib instead of libfdpp and so on.
But will this help, if all we need is plt?
Is there something particular you want to see from
the source of libfdpp?
The wrapper that sets up SIGALRM and does dlopen(), is
included in a source form and I was quite confident this
is all we need, no?

Comment 5 Andrew Burgess 2023-02-12 08:10:00 UTC

Created attachment 14677 [details]
Minimal reproducer

Decompiled the shared library in the original reproducer, and created a new, minimal example of this issue, this time in source only.

To use:

  tar -xf pr-22921.tar.xz
  cd pr-22921
  make
  make debug

The test should report FAILED.  Then:

  make run

The test, run outside GDB, should report PASSED.

I haven't completely discounted that this might be a program error, but if it is, I can't see it yet.

Also, when run under GDB the cmd_apb.gdb script is used.  You'll notice inside it this line:

  ##delete

which deletes all breakpoints.  if this line is un-commented then the test, when run under GDB starts working again.  So it does appear that the very presence of the software breakpoint causes the test to fail.  I've done no additional debuggging of the issue so far.

Comment 6 Stas Sergeev 2023-02-12 08:17:48 UTC

Thanks!
Your test-case doesn't fail here though:

+file tst2
+set breakpoint pending on
+break apb_magic_func(void*, char const*, int, unsigned int)@plt
Function "apb_magic_func(void*, char const*, int, unsigned int)@plt" not defined.
Breakpoint 1 (apb_magic_func(void*, char const*, int, unsigned int)@plt) pending.
+run

This GDB supports auto-downloading debuginfo from the following URLs:
https://debuginfod.ubuntu.com 
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
ALARM

ALARM


Breakpoint 1, 0x00007ffff7fbb0f0 in apb_magic_func(void*, char const*, int, unsigned int)@plt () from ./libapb.so
+shell sleep 1
+continue
ALARM

PASSED


I run "make debug", is this how its supposed to run?

Comment 7 Stas Sergeev 2023-02-12 10:17:08 UTC

Created attachment 14678 [details]
test case

I checkouted the old sources from git and did the minimal
reproducer because yours doesn't work for me.
Unfortunately I couldn't get rid of the large C++ header,
because if that func returns something other than the object
it does return now, the reproducer stops working.

Comment 8 Stas Sergeev 2023-02-12 10:20:47 UTC

Also if lib is built with g++ instead of clang++,
reproducer stops working. So in my new makefile it
is built with clang++.

Comment 9 Andrew Burgess 2023-02-12 15:26:28 UTC

Created attachment 14679 [details]
Updated minimal reproducer

I understand the problem better now.  The issue as originally reported, and as I had it in the first reproducer was very timing dependent.  The bug is for a signal arriving during a displaced step.  As such it's not surprising that Stas was unable to reproduce the issue with my first example.

However, now I know what the issue is, I've updated the reproducer.  Instead of using a timer to deliver the signal I wait until I'm in the PLT, then send a signal to the inferior using the shell.  After that I continue, GDB sets up the displaced step, resumes the inferior, and immediately takes the signal.

Some consequence of the displaced step means that the inferior's $rcx value becomes corrupted.

We might be tempted to try using 'queue-signal' from the GDB prompt, rather than sending the signal via the shell, however, this will not work.  The signal _HAS_ to arrive during the displaced step.  Using queue-signal means that GDB doesn't perform the displaced step, and instead immediately sends the inferior the signal.  To reproduce this bug the signal _must_ arrive during the displaced step.

Finally, here's my GDB session showing the bug.  I've got some debug turned on, and we can see where the corruption is coming from thanks to my annotated lines (look for "<======="):

Breakpoint 2, 0x00007ffff7fc5070 in apb_magic_func(void*, char const*, int, unsigned int)@plt () from ./libapb.so
1: x/i $pc
=> 0x7ffff7fc5070 <_Z14apb_magic_funcPvPKcij@plt>:	
    jmp    *0x2fc2(%rip)        # 0x7ffff7fc8038 <_Z14apb_magic_funcPvPKcij@got.plt>
2: /x $rcx = 0x200
+python cmd = "shell kill -s SIGALRM %d" % gdb.selected_inferior().pid
+python gdb.execute(cmd)
++shell kill -s SIGALRM 2799375
+continue
[displaced] displaced_step_prepare_throw: displaced-stepping 2799375.2799375.0 now
[displaced] prepare: selected buffer at 0x401072
[displaced] prepare: saved 0x401072: 1e fa 31 ed 49 89 d1 5e 48 89 e2 48 83 e4 f0 50
[displaced] fixup_riprel: %rip-relative addressing used.
[displaced] fixup_riprel: using temp reg 2, old value 0x200, new value 0x7ffff7fc5076   <============================== Note the "new" value here.
[displaced] amd64_displaced_step_copy_insn: copy 0x7ffff7fc5070->0x401072: ff a1 c2 2f 00 00 68 04 00 00 00 e9 a0 ff ff ff
[displaced] displaced_step_prepare_throw: prepared successfully thread=2799375.2799375.0, original_pc=0x7ffff7fc5070, displaced_pc=0x401072
[displaced] resume_1: run 0x401072: ff a1 c2 2f
[displaced] finish: restored 2799375.2799375.0 0x401072

Program received signal SIGALRM, Alarm clock.
[displaced] displaced_step_prepare_throw: displaced-stepping 2799375.2799375.0 now
[displaced] prepare: selected buffer at 0x401072
[displaced] prepare: saved 0x401072: 1e fa 31 ed 49 89 d1 5e 48 89 e2 48 83 e4 f0 50
[displaced] fixup_riprel: %rip-relative addressing used.
[displaced] fixup_riprel: using temp reg 2, old value 0x7ffff7fc5076, new value 0x7ffff7fc5076
[displaced] amd64_displaced_step_copy_insn: copy 0x7ffff7fc5070->0x401072: ff a1 c2 2f 00 00 68 04 00 00 00 e9 a0 ff ff ff
[displaced] displaced_step_prepare_throw: prepared successfully thread=2799375.2799375.0, original_pc=0x7ffff7fc5070, displaced_pc=0x401072
[displaced] resume_1: run 0x401072: ff a1 c2 2f
[displaced] finish: restored 2799375.2799375.0 0x401072
[displaced] amd64_displaced_step_fixup: fixup (0x7ffff7fc5070, 0x401072), insn = 0xff 0xa1 ...
[displaced] amd64_displaced_step_fixup: restoring reg 2 to 0x7ffff7fc5076
FAILED: 4160508022 (0xf7fc5076)  <============================== Look familiar?  It's the lower 32-bits of the "new" value!

Comment 10 Andrew Burgess 2023-02-12 15:27:43 UTC

Comment on attachment 14677 [details]
Minimal reproducer

I've added a better reproducer now.

Comment 11 Andrew Burgess 2023-02-12 15:32:27 UTC

The problem is arising by GDB taking this code path:

  https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/displaced-stepping.c;h=06b32a80f6a0ec54284aff2ddd456ea1e2625ecf;hb=HEAD#l270

As a result of taking this route we don't perform any arch specific cleanup, which means that the $rcx value is left with its new temporary state.

Comment 12 Stas Sergeev 2023-02-12 15:38:49 UTC

So you are really at it, that's cool! :)
Not that I can use your new test-case, as it now works like this:

Program received signal SIGALRM, Alarm clock.
[displaced] displaced_step_prepare_throw: displaced-stepping 2910780.2910780.0 now
[displaced] prepare: selected buffer at 0x5555555550c2
[displaced] prepare: saved 0x5555555550c2: 1e fa 31 ed 49 89 d1 5e 48 89 e2 48 83 e4 f0 50
[displaced] amd64_displaced_step_copy_insn: copy 0x7ffff7fbb0d0->0x5555555550c2: f3 0f 1e fa f2 ff 25 45 2f 00 00 0f 1f 44 00 00
[displaced] displaced_step_prepare_throw: prepared successfully thread=2910780.2910780.0, original_pc=0x7ffff7fbb0d0, displaced_pc=0x5555555550c2
[displaced] resume_1: run 0x5555555550c2: f3 0f 1e fa
[displaced] finish: restored 2910780.2910780.0 0x5555555550c2
[displaced] amd64_displaced_step_fixup: fixup (0x7ffff7fbb0d0, 0x5555555550c2), insn = 0xf3 0x0f ...
[displaced] amd64_displaced_step_fixup: relocated %rip from 0x5555555550c6 to 0x7ffff7fbb0d4
PASSED

Breakpoint 1, breakpt () at tst2.cc:18
18	}
1: x/i $pc
=> 0x5555555551de <_Z7breakptv+8>:	nop
2: /x $rcx = 0x7ffff7d0cd94
(gdb) 

And falls inside the gdb prompt, with "PASSED" already written...
But I guess I'll just try my own test-case on an already fixed
gdb some time later. :)

Comment 13 Andrew Burgess 2023-02-28 15:03:38 UTC

I posted a patch for this bug:

  https://sourceware.org/pipermail/gdb-patches/2023-February/197316.html

Comment 14 Sourceware Commits 2023-04-06 13:27:30 UTC

The master branch has been updated by Andrew Burgess <aburgess@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=cf141dd8ccd36efe833aae3ccdb060b517cc1112

commit cf141dd8ccd36efe833aae3ccdb060b517cc1112
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Wed Feb 22 12:15:34 2023 +0000

    gdb: fix reg corruption from displaced stepping on amd64
    
    This commit aims to address a problem that exists with the current
    approach to displaced stepping, and was identified in PR gdb/22921.
    
    Displaced stepping is currently supported on AArch64, ARM, amd64,
    i386, rs6000 (ppc), and s390.  Of these, I believe there is a problem
    with the current approach which will impact amd64 and ARM, and can
    lead to random register corruption when the inferior makes use of
    asynchronous signals and GDB is using displaced stepping.
    
    The problem can be found in displaced_step_buffers::finish in
    displaced-stepping.c, and is this; after GDB tries to perform a
    displaced step, and the inferior stops, GDB classifies the stop into
    one of two states, either the displaced step succeeded, or the
    displaced step failed.
    
    If the displaced step succeeded then gdbarch_displaced_step_fixup is
    called, which has the job of fixing up the state of the current
    inferior as if the step had not been performed in a displaced manner.
    This all seems just fine.
    
    However, if the displaced step is considered to have not completed
    then GDB doesn't call gdbarch_displaced_step_fixup, instead GDB
    remains in displaced_step_buffers::finish and just performs a minimal
    fixup which involves adjusting the program counter back to its
    original value.
    
    The problem here is that for amd64 and ARM setting up for a displaced
    step can involve changing the values in some temporary registers.  If
    the displaced step succeeds then this is fine; after the step the
    temporary registers are restored to their original values in the
    architecture specific code.
    
    But if the displaced step does not succeed then the temporary
    registers are never restored, and they retain their modified values.
    
    In this context a temporary register is simply any register that is
    not otherwise used by the instruction being stepped that the
    architecture specific code considers safe to borrow for the lifetime
    of the instruction being stepped.
    
    In the bug PR gdb/22921, the amd64 instruction being stepped is
    an rip-relative instruction like this:
    
      jmp    *0x2fe2(%rip)
    
    When we displaced step this instruction we borrow a register, and
    modify the instruction to something like:
    
      jmp    *0x2fe2(%rcx)
    
    with %rcx having its value adjusted to contain the original %rip
    value.
    
    Now if the displaced step does not succeed, then %rcx will be left
    with a corrupted value.  Obviously corrupting any register is bad; in
    the bug report this problem was spotted because %rcx is used as a
    function argument register.
    
    And finally, why might a displaced step not succeed?  Asynchronous
    signals provides one reason.  GDB sets up for the displaced step and,
    at that precise moment, the OS delivers a signal (SIGALRM in the bug
    report), the signal stops the inferior at the address of the displaced
    instruction.  GDB cancels the displaced instruction, handles the
    signal, and then tries again with the displaced step.  But it is that
    first cancellation of the displaced step that causes the problem; in
    that case GDB (correctly) sees the displaced step as having not
    completed, and so does not perform the architecture specific fixup,
    leaving the register corrupted.
    
    The reason why I think AArch64, rs600, i386, and s390 are not effected
    by this problem is that I don't believe these architectures make use
    of any temporary registers, so when a displaced step is not completed
    successfully, the minimal fix up is sufficient.
    
    On amd64 we use at most one temporary register.
    
    On ARM, looking at arm_displaced_step_copy_insn_closure, we could
    modify up to 16 temporary registers, and the instruction being
    displaced stepped could be expanded to multiple replacement
    instructions, which increases the chances of this bug triggering.
    
    This commit only aims to address the issue on amd64 for now, though I
    believe that the approach I'm proposing here might be applicable for
    ARM too.
    
    What I propose is that we always call gdbarch_displaced_step_fixup.
    
    We will now pass an extra argument to gdbarch_displaced_step_fixup,
    this a boolean that indicates whether GDB thinks the displaced step
    completed successfully or not.
    
    When this flag is false this indicates that the displaced step halted
    for some "other" reason.  On ARM GDB can potentially read the
    inferior's program counter in order figure out how far through the
    sequence of replacement instructions we got, and from that GDB can
    figure out what fixup needs to be performed.
    
    On targets like amd64 the problem is slightly easier as displaced
    stepping only uses a single replacement instruction.  If the displaced
    step didn't complete the GDB knows that the single instruction didn't
    execute.
    
    The point is that by always calling gdbarch_displaced_step_fixup, each
    architecture can now ensure that the inferior state is fixed up
    correctly in all cases, not just the success case.
    
    On amd64 this ensures that we always restore the temporary register
    value, and so bug PR gdb/22921 is resolved.
    
    In order to move all architectures to this new API, I have moved the
    minimal roll-back version of the code inside the architecture specific
    fixup functions for AArch64, rs600, s390, and ARM.  For all of these
    except ARM I think this is good enough, as no temporaries are used all
    that's needed is the program counter restore anyway.
    
    For ARM the minimal code is no worse than what we had before, though I
    do consider this architecture's displaced-stepping broken.
    
    I've updated the gdb.arch/amd64-disp-step.exp test to cover the
    'jmpq*' instruction that was causing problems in the original bug, and
    also added support for testing the displaced step in the presence of
    asynchronous signal delivery.
    
    I've also added two new tests (for amd64 and i386) that check that GDB
    can correctly handle displaced stepping over a single instruction that
    branches to itself.  I added these tests after a first version of this
    patch relied too much on checking the program-counter value in order
    to see if the displaced instruction had executed.  This works fine in
    almost all cases, but when an instruction branches to itself a pure
    program counter check is not sufficient.  The new tests expose this
    problem.
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=22921
    
    Approved-By: Pedro Alves <pedro@palves.net>

Comment 15 Andrew Burgess 2023-04-06 13:29:22 UTC

This issue should now be resolved with current HEAD of master branch.