Created attachment 14848 [details]
This app can produce a gdbserver assert error on the arm platform

Hi,

The attached gdbserver-test-app.tar can reproduce a gdbserver assert error on the arm platform. This issue happens on every gdb version up to the current git master branch. The version and error message are below. I will add the reproduction steps in the next comment.

Best Regards,
Zhiyong

root@xilinx-zynq:~# gdb -v
GNU gdb (GDB) 14.0.50.20230421-git
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
root@xilinx-zynq:~# gdbserver --version
GNU gdbserver (GDB) 14.0.50.20230421-git
Copyright (C) 2023 Free Software Foundation, Inc.
gdbserver is free software, covered by the GNU General Public License.
This gdbserver was configured as "arm-wrs-linux-gnueabi"

2389.242477 [threads] resume_one_lwp_throw: Resuming lwp 448 (continue, signal 0, stop not expected)
../../git/gdbserver/linux-low.cc:2448: A problem internal to GDBserver has been detected.
maybe_hw_step: Assertion `has_single_step_breakpoints (thread)' failed.
Aborted
Reproduction steps:

(1) tar xvf gdbserver-test-app.tar on a host that can cross-compile for arm.
(2) In osm.service, modify the ExecStart path according to your running environment.
(3) make
(4) Refer to "make install" to install the osm systemd service on the target board.

[On target board]
systemctl daemon-reload
systemctl start osm
gdbserver --debug --debug-format=all --remote-debug --event-loop-debug --once --attach :1234 $(pgrep osm)

[On pc host]
your-arm-gdb ./osm -x ~/gdbx2
(./osm is the test app built as above)

gdbx2 can be found in the attachment; please modify the target-remote command in gdbx2 to point to your target board's gdbserver. When gdb executes gdbx2, gdbserver will assert on the target board.
Created attachment 14849 [details] gdb test script
Created attachment 14850 [details] patch file for gdbserver/linux-low.cc
After applying 0001-arm-Install-single-step-software-breakpoing.patch, gdbserver doesn't assert any more.
Thanks for the report. Let me try to reproduce this one.
Confirmed. I managed to reproduce this. Might be related to targets with software single stepping and forks happening during stepping. Sometimes you can step a few more times than the reproducer. Other times it happens just like the reproducer. Would you mind sending the patch upstream (gdb-patches@sourceware.org) for further discussion?
Hi Luis,

I already sent the patch file to https://savannah.gnu.org/patch/index.php?10337. Do you mean I must send the patch to gdb-patches@sourceware.org by mail? What information should I provide in the mail?

Best Regards,
Zhiyong
Yes, patches should be sent by e-mail to the gdb-patches@sourceware.org mailing list, where they will go through review / discussion / approval.
Created attachment 14862 [details]
patch review mail

The patch review mail is attached.
Hi Luis,

I have sent the patch file and a debug log with a brief analysis to gdb-patches@sourceware.org. I also attached the mail to this PR.

Best Regards,
Zhiyong
Hi Luis,

I sent the patch to gdb-patches@sourceware.org 5 days ago, but I have not received a reply. Do you know how it is going?

Best Regards,
Zhiyong
Hi. There isn't an ETA for when it will get reviewed, so a little patience is required. Given it is a change to generic code, the global maintainers need to go through it to make sure it is suitable. I'm still giving this a try to understand if the patch is the right way to go. I haven't forgotten about it.
I wonder if this also reproduces on x86. If so, that would make a better case for urgency of fixing this.
(In reply to Luis Machado from comment #13)
> I wonder if this also reproduces on x86. If so, that would make a better
> case for urgency of fixing this.

This issue can't be reproduced on x86. I think it is because x86 supports hardware breakpoints. If gdbserver is made to use software breakpoints on x86, this issue can be reproduced.

I am not in a hurry to merge the patch upstream; I am just trying to move the patch through the formal process, as our customer hopes the fix can be carried in an official gdb release.
(In reply to Yan, Zhiyong from comment #14)
> This issue can't be produced on x86. I think it is because x86 supports
> hardware breakpoint. If make gdbserver use software breakpoint on x86, this
> issue can be produced.

I think you mean hardware single stepping, not hardware breakpoints. GDB uses software breakpoints for x86 as well as ARM. But GDB uses hardware single stepping for x86, and software single stepping for ARM (meaning it inserts a breakpoint at the next instruction(s) and resumes when it wants to single step).
(In reply to Simon Marchi from comment #15)
> I think you mean hardware single stepping, not hardware breakpoint. GDB
> uses software breakpoints for x86 as well as ARM. But GDB uses hardware
> single stepping for x86, and software single stepping for ARM (meaning it
> inserts a breakpoint at the next instruction(s) and resumes when it wants to
> single step).

Yes, I meant supports_hardware_single_step, not hardware breakpoints. I made a mistake.
Hi guys,

What's the status of this PR? And do you confirm that this PR is blocking for a GDB 14.1 release? If yes, can you explain the rationale (impact, whether it is a regression, etc.)?

Thank you!
Hi, The internal error blocks further debugging. It doesn't always happen, but happens reliably enough that we should fix it. Yan Zhiyong sent the patch to gdb-patches and is waiting for feedback. It is a change to generic code, so it would require one of the global maintainers to OK it.
Thanks for explaining the situation, Luis. So let's indeed leave things as they are, then.
The most recent patch & gdb test case can be found here: https://sourceware.org/pipermail/gdb-patches/2023-August/201314.html I plan to wait several more days for comments prior to pushing it.
Hi,

Thanks for the patch. A few comments.

When running this test on a 32-bit docker instance on 64-bit hardware, against the native-gdbserver board, I'm seeing FAILs:

# of expected passes            3074
# of unexpected failures        13

The FAILs are of this kind:

FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=fork: target-non-stop=on: non-stop=off: displaced-stepping=off: i=0: next to other line

They fail for both a patched and an unpatched gdb. But I see 12 unexpected core files for the unpatched gdb. It passes for both patched and unpatched native gdb though, so it might not be exercising the bug there.

Also, I gave the testcase a try, and I noticed it takes a reasonably long time to run, both on 32-bit and 64-bit. Is it a timing issue of some kind that we need a lot of iterations to get it to show up?

From the testcase description, it looks like this is mostly a software single-step issue on gdbserver. Should we isolate the tests to gdbserver and known targets using software single-step, instead of making all targets run the test only to potentially PASS every time?

So, in summary:

* For native gdb (32-bit or 64-bit), this doesn't seem to be exercising the bug.
* For native-gdbserver on 64-bit, it doesn't seem to exercise the bug.
* For native-gdbserver on 32-bit, I see unexpected core files, which indicates it does exercise the bug, but I also see unexpected failures, which may be something off with the testcase patterns.
(In reply to Luis Machado from comment #21)

Hi Luis,

Regarding the length of time it takes to run this test case: I've seen some instances where it took in excess of 100 iterations to observe the software single-step related failure. The number of iterations is currently set to 200, but we might drop it to 150, which should decrease the time needed to run the test by about 25%. We could also eliminate one or more of the outer loops, but this would decrease coverage of (e.g.) the target-non-stop=on case. Doing so, however, would likely eliminate the (other) failures that you've observed.

Regarding whether or not to run the test for native targets: It is true that this test was written after finding a bug in the software single-step related code in gdbserver. However, it seems to me that it'd be possible to introduce code causing buggy behavior for native gdb, whether it requires software single-step or not. Therefore, I'd prefer to run it for all targets.

Regarding unexpected failures for a patched gdbserver: I see these too, and do not know why they occur. I've only seen them for the "target-non-stop=on" case(s), and they do seem to be somewhat racy in nature. I recall one test run where there were no failures, but I usually see 2 when testing on a Raspberry Pi. I think it's likely that there's another bug that needs fixing. We could set up a KFAIL for that particular case. Or, if we were to eliminate testing target-non-stop=on, those failures would also go away. (I didn't set up a KFAIL for it because leaving them as FAIL means that someone is more likely to try to fix the bug...)
Kevin, Thanks for the info. As long as the tests are covering/testing possible breakages, that sounds fine to me then. My take on it is that non-stop mode causes outputs to be somewhat chaotic in their order, so it is hard to account for those in a deterministic way. Well, at least with the current infrastructure.
I did some testing and found that with the loop count (for the 'next' commands) set to 200, it would take my Raspberry Pi, using --target_board=native-gdbserver, nearly seven minutes to run the new test, gdb.threads/next-fork-exec-other-thread.exp. My macbook running F38 would take nearly a minute and a half, while an x86-64 VM also running F38 would run the test in a little under a minute.

After a bunch of testing, I settled on changing that loop count to 30. This would still reliably reproduce the bug that Zhiyong had reported, but also finished considerably more quickly. The Raspberry Pi would finish in under a minute and a half, while the macbook and the x86-64 VM would finish in around 15 seconds. Native testing for these targets completes in less than 10 seconds. Therefore, in the interest of not causing overall testing to slow down too much, I've reduced the loop count from 200 to 30.

My complete findings are below - stop here for the TLDR version!...

Command used for native-gdbserver:

time make check RUNTESTFLAGS="--target_board=native-gdbserver" TESTS=gdb.threads/next-fork-exec-other-thread.exp

Command used for native:

time make check RUNTESTFLAGS="" TESTS=gdb.threads/next-fork-exec-other-thread.exp

200 iterations:
rpi, unpatched, native-gdbserver:     2m51.976s (14 failures reported)
rpi, patched, native-gdbserver:       6m48.753s (2 failures reported)
rpi, unpatched, native:               2m11.673s
rpi, patched, native:                 2m9.436s
macbook, unpatched, native-gdbserver: 1m24.372s
macbook, patched, native-gdbserver:   1m26.793s
macbook, unpatched, native:           0m17.314s
macbook, patched, native:             0m18.017s
f38-1, unpatched, native-gdbserver:   0m55.265s
f38-1, patched, native-gdbserver:     0m52.767s
f38-1, unpatched, native:             0m23.419s
f38-1, patched, native:               0m23.119s

150 iterations:
rpi, unpatched, native-gdbserver:     2m31.826s (13 failures reported)
rpi, patched, native-gdbserver:       5m3.467s (2 failures reported)
rpi, unpatched, native:               1m43.856s
rpi, patched, native:                 1m41.656s
macbook, unpatched, native-gdbserver: 1m9.472s
macbook, patched, native-gdbserver:   1m13.066s
macbook, unpatched, native:           0m13.646s
macbook, patched, native:             0m13.460s
f38-1, unpatched, native-gdbserver:   0m42.931s
f38-1, patched, native-gdbserver:     0m42.484s
f38-1, unpatched, native:             0m20.117s
f38-1, patched, native:               0m19.720s

100 iterations:
rpi, unpatched, native-gdbserver:     2m4.937s (13 failures reported)
rpi, patched, native-gdbserver:       3m56.154s (1 failure reported)
rpi, unpatched, native:               1m16.185s
rpi, patched, native:                 1m14.161s
macbook, unpatched, native-gdbserver: 0m44.304s
macbook, patched, native-gdbserver:   0m41.998s
macbook, unpatched, native:           0m9.782s
macbook, patched, native:             0m10.400s
f38-1, unpatched, native-gdbserver:   0m30.188s
f38-1, patched, native-gdbserver:     0m30.122s
f38-1, unpatched, native:             0m15.375s
f38-1, patched, native:               0m15.306s

50 iterations:
rpi, unpatched, native-gdbserver:     1m22.541s (13 failures reported)
rpi, patched, native-gdbserver:       1m4.468s (1 failure reported)
rpi, unpatched, native:               0m48.767s
rpi, patched, native:                 0m46.831s
macbook, unpatched, native-gdbserver: 0m25.266s
macbook, patched, native-gdbserver:   0m24.684s
macbook, unpatched, native:           0m6.302s
macbook, patched, native:             0m6.542s
f38-1, unpatched, native-gdbserver:   0m19.392s
f38-1, patched, native-gdbserver:     0m19.449s
f38-1, unpatched, native:             0m11.191s
f38-1, patched, native:               0m11.409s

30 iterations:
rpi, unpatched, native-gdbserver:     1m3.633s (12 failures reported)
rpi, patched, native-gdbserver:       1m27.072s (0 failures reported!)
rpi, unpatched, native:               0m37.846s
rpi, patched, native:                 0m35.950s
macbook, unpatched, native-gdbserver: 0m14.870s
macbook, patched, native-gdbserver:   0m14.537s
macbook, unpatched, native:           0m4.605s
macbook, patched, native:             0m4.770s
f38-1, unpatched, native-gdbserver:   0m15.249s
f38-1, patched, native-gdbserver:     0m14.830s
f38-1, unpatched, native:             0m9.762s
f38-1, patched, native:               0m9.674s

20 iterations:
rpi, unpatched, native-gdbserver:     0m53.582s (11 failures reported)
rpi, patched, native-gdbserver:       1m5.585s (0 failures reported)
rpi, unpatched, native:               0m32.360s
rpi, patched, native:                 0m30.554s
macbook, unpatched, native-gdbserver: 0m10.432s
macbook, patched, native-gdbserver:   0m10.776s
macbook, unpatched, native:           0m4.029s
macbook, patched, native:             0m4.189s
f38-1, unpatched, native-gdbserver:   0m12.492s
f38-1, patched, native-gdbserver:     0m12.477s
f38-1, unpatched, native:             0m8.801s
f38-1, patched, native:               0m8.729s

Back to 30 iterations, only rpi w/ native-gdbserver, multiple runs:
rpi, unpatched, native-gdbserver:
  0m51.597s : 13 failures, 12 core files
  0m54.998s : 13 failures, 12 core files
  1m0.335s  : 12 failures, 12 core files
  0m54.722s : 12 failures, 11 core files
  0m55.992s : 12 failures, 12 core files
rpi, patched, native-gdbserver:
  1m27.186s : no failures
  1m27.660s : no failures
  1m28.207s : no failures
  1m26.833s : no failures
  1m27.291s : no failures

But note that "no failures" above doesn't mean that there isn't a problem! We just haven't iterated through enough GDB 'next' commands to see it.
The master branch has been updated by Kevin Buettner <kevinb@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=b6d8d612d30dcdfc8ba8edfb15b4cd1753b0b8a2

commit b6d8d612d30dcdfc8ba8edfb15b4cd1753b0b8a2
Author: Kevin Buettner <kevinb@redhat.com>
Date:   Tue Aug 1 13:33:24 2023 -0700

gdbserver: Reinstall software single-step breakpoints in resume_stopped_resumed_lwps

At the moment, while performing a software single-step, gdbserver fails to reinsert software single-step breakpoints for a LWP when interrupted by a signal in another thread. This commit fixes this problem by reinstalling software single-step breakpoints in linux_process_target::resume_stopped_resumed_lwps in gdbserver/linux-low.cc.

This bug was discovered due to a failing assert in maybe_hw_step() in gdbserver/linux-low.cc. Looking at the backtrace revealed that the caller was linux_process_target::resume_stopped_resumed_lwps. I was uncertain whether the assert should still be valid when called from that method, so I tried hoisting the assert from maybe_hw_step to all callers except resume_stopped_resumed_lwps. But running the new test case, described below, showed that merely eliminating the assert for this case was NOT a good fix - a study of the log file for the test showed that the single-step operation failed to occur. Instead GDB (via gdbserver) stopped at the next breakpoint that was hit.

Zhiyong Yan had proposed a fix which reinserted software single-step breakpoints, albeit at a different location in linux-low.cc. Testing revealed that, while running gdb.threads/pending-fork-event-detach, the executable associated with that test would die due to a SIGTRAP after the test program was detached. Examination of the core file(s) showed that a breakpoint instruction had been left in program memory. Test results were otherwise very good, so Zhiyong was definitely on the right track!
This commit causes software single-step breakpoint(s) to be inserted before the call to maybe_hw_step in resume_stopped_resumed_lwps. This will cause 'has_single_step_breakpoints (thread)' to be true, so that the assert in maybe_hw_step...

    /* GDBserver must insert single-step breakpoint for software
       single step.  */
    gdb_assert (has_single_step_breakpoints (thread));

...will no longer fail. And better still, the single-step breakpoints are reinstalled, so that stepping will actually work, even when interrupted.

The C code for the test case was loosely adapted from the reproducer provided in Zhiyong's bug report for this problem. The .exp file was copied from next-fork-other-thread.exp and then tweaked slightly. As noted in a comment in next-fork-exec-other-thread.exp, I had to remove "on" from the loop for non-stop as it was failing on all architectures (including x86-64) that I tested. I have a feeling that it ought to work, but this can be investigated separately and (re)enabled once it works.

I also increased the number of iterations for the loop running the "next" commands. I've had some test runs which don't show the bug until the loop counter exceeded 100 iterations. The C file for the new test uses shorter delays than next-fork-other-thread.c though, so it doesn't take overly long (IMO) to run this new test.
Running the new test on a Raspberry Pi w/ a 32-bit (Arm) kernel and userland, using a gdbserver build without the fix in this commit, shows the following results:

FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=fork: target-non-stop=auto: non-stop=off: displaced-stepping=auto: i=12: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=fork: target-non-stop=auto: non-stop=off: displaced-stepping=on: i=9: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=fork: target-non-stop=auto: non-stop=off: displaced-stepping=off: i=18: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=fork: target-non-stop=off: non-stop=off: displaced-stepping=auto: i=3: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=fork: target-non-stop=off: non-stop=off: displaced-stepping=on: i=11: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=fork: target-non-stop=off: non-stop=off: displaced-stepping=off: i=1: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=auto: non-stop=off: displaced-stepping=auto: i=1: next to break here
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=auto: non-stop=off: displaced-stepping=on: i=3: next to break here
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=auto: non-stop=off: displaced-stepping=off: i=1: next to break here
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=on: non-stop=off: displaced-stepping=auto: i=47: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=on: non-stop=off: displaced-stepping=on: i=57: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=off: non-stop=off: displaced-stepping=auto: i=1: next to break here
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=off: non-stop=off: displaced-stepping=on: i=10: next to break here
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=off: non-stop=off: displaced-stepping=off: i=1: next to break here

		=== gdb Summary ===

# of unexpected core files      12
# of expected passes            3011
# of unexpected failures        14

Each of the 12 core files were caused by the failed assertion in maybe_hw_step in linux-low.c. These correspond to 12 of the unexpected failures.

When the tests are run using a gdbserver build which includes the fix in this commit, the results are significantly better, but not perfect:

FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=on: non-stop=off: displaced-stepping=auto: i=143: next to other line
FAIL: gdb.threads/next-fork-exec-other-thread.exp: fork_func=vfork: target-non-stop=on: non-stop=off: displaced-stepping=on: i=25: next to other line

		=== gdb Summary ===

# of expected passes            10178
# of unexpected failures        2

I think that the two remaining failures are due to some different problem. They are also racy - I've seen runs with no failures or only one failure, but never more than two. Also, those runs were conducted with the loop count in next-fork-exec-other-thread.exp set to 200.

During his testing of this fix and the new test case, Luis Machado found that this test was taking a long time and asked about ways to speed it up. I then conducted additional tests in which I gradually reduced the loop count, timing each one, also noting the number of failures. With the loop count set to 30, I found that I could still reliably reproduce the failures that Zhiyong reported (in which, with the proper settings, core files are created). But, with the loop count set to 30, the other failures noted above were much less likely to show up. Anyone wishing to investigate those other failures should set the loop count back up to 200.
Running the new test on x86-64 and aarch64, both native and native-gdbserver, shows no failures. Also, I see no regressions when running the entire test suite for armv7l-unknown-linux-gnueabihf (i.e. the Raspberry Pi w/ 32-bit kernel+userland) with --target_board=native-gdbserver. Additionally, using --target_board=native-gdbserver, I also see no regressions for the entire test suite on x86-64 and aarch64 running Fedora 38.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30387
Co-Authored-By: Zhiyong Yan <zhiyong.yan@windriver.com>
Tested-By: Zhiyong Yan <zhiyong.yan@windriver.com>
Tested-By: Luis Machado <luis.machado@arm.com>
It should be fixed now, so I'm closing this bug.