Bug 31713 - race conditions in target simulation
Summary: race conditions in target simulation
Status: UNCONFIRMED
Alias: None
Product: gdb
Classification: Unclassified
Component: gdb
Version: HEAD
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 31751
 
Reported: 2024-05-09 05:31 UTC by Bernd Edlinger
Modified: 2024-05-18 07:18 UTC
CC List: 1 user

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
tsan output from gdb.dwarf2/fission-multi-cu.exp (2.89 KB, text/plain)
2024-05-09 05:42 UTC, Bernd Edlinger
tsan output from gdb.dwarf2/loclists-multiple-cus.exp (3.13 KB, text/plain)
2024-05-09 05:47 UTC, Bernd Edlinger
random data race in bfd_reinit (3.02 KB, text/plain)
2024-05-11 06:03 UTC, Bernd Edlinger
random data race in bfd_section_init (3.00 KB, text/plain)
2024-05-11 06:05 UTC, Bernd Edlinger
random signal-unsafe call inside of a signal (1.55 KB, text/plain)
2024-05-11 06:19 UTC, Bernd Edlinger

Description Bernd Edlinger 2024-05-09 05:31:18 UTC
I have built a gdb/sim for riscv, and it is mostly working now,
but when I ran the tests with the thread sanitizer it reported race conditions
that are not there for a target=host build:

with gdb.dwarf2/fission-multi-cu.exp I see
SUMMARY: ThreadSanitizer: data race ../../binutils-gdb/bfd/section.c:849 in bfd_section_init

and with gdb.dwarf2/loclists-multiple-cus.exp I see twice:
SUMMARY: ThreadSanitizer: data race ../../binutils-gdb/bfd/cache.c:283 in bfd_cache_lookup_worker

Likewise with gdb.dwarf2/loclists-start-end.exp.
Comment 1 Bernd Edlinger 2024-05-09 05:42:59 UTC
Created attachment 15499 [details]
tsan output from gdb.dwarf2/fission-multi-cu.exp
Comment 2 Bernd Edlinger 2024-05-09 05:47:47 UTC
Created attachment 15500 [details]
tsan output from gdb.dwarf2/loclists-multiple-cus.exp

This race condition happens twice in gdb.dwarf2/loclists-multiple-cus.exp
and, almost completely identically, twice in gdb.dwarf2/loclists-start-end.exp,
so I only upload one tsan error report.
Comment 3 Bernd Edlinger 2024-05-11 05:57:12 UTC
I have experimented with the following patch:

--- a/gdb/dwarf2/cooked-index.c
+++ b/gdb/dwarf2/cooked-index.c
@@ -644,8 +644,6 @@ cooked_index::set_contents (vec_type &&vec, deferred_warnings *warn,
   gdb_assert (m_vector.empty ());
   m_vector = std::move (vec);
 
-  m_state->set (cooked_state::MAIN_AVAILABLE);
-
   /* This is run after finalization is done -- but not before.  If
      this task were submitted earlier, it would have to wait for
      finalization.  However, that would take a slot in the global
@@ -653,6 +651,7 @@ cooked_index::set_contents (vec_type &&vec, deferred_warnings *warn,
      would cause a livelock.  */
   gdb::task_group finalizers ([=] ()
   {
+    m_state->set (cooked_state::MAIN_AVAILABLE);
     m_state->set (cooked_state::FINALIZED);
     m_state->write_to_cache (index_for_writing (), warn);
     m_state->set (cooked_state::CACHE_DONE);

The results are quite interesting.
As it stands it fixes all issues from #31715 and both
failures with the riscv simulation from this PR.
It does, however, not fix #31716.
BUT there are new failures with the riscv simulation target,
and I am not able to pinpoint a single test case that triggers
them; they are not easy to reproduce and happen randomly.
Comment 4 Bernd Edlinger 2024-05-11 06:03:36 UTC
Created attachment 15511 [details]
random data race in bfd_reinit

This happens randomly somewhere near gdb.dwarf2/fission*.exp
Comment 5 Bernd Edlinger 2024-05-11 06:05:54 UTC
Created attachment 15512 [details]
random data race in bfd_section_init

This also happens randomly near gdb.dwarf2/fission*.exp.
Comment 6 Bernd Edlinger 2024-05-11 06:19:07 UTC
Created attachment 15513 [details]
random signal-unsafe call inside of a signal

This happens rarely in gdb.base/*.exp, and it is actually two errors at once:
first there is a signal-unsafe malloc/free happening in the signal handler,
but the reason is a segmentation fault in the "target sim" command.
I think this can only happen because something in the bfd parser is not
stable, due to the underlying threading issue, which can make the simulator
crash; at least that is my theory.
The attachment contains only the TSAN output, but the gdb/testsuite/gdb.log
around the TSAN report reveals the reason why the signal was sent:

(gdb) target sim


[...many tsan errors reported here...]
Fatal signal: Segmentation fault
---- Backtrace -----
0x5608f46ad811 gdb_internal_backtrace_1
        ../../binutils-gdb/gdb/bt-utils.c:121
0x5608f46ad811 _Z22gdb_internal_backtracev
        ../../binutils-gdb/gdb/bt-utils.c:167
0x5608f48604e6 handle_fatal_signal
        ../../binutils-gdb/gdb/event-top.c:916
0x5608f486072c handle_sigsegv
        ../../binutils-gdb/gdb/event-top.c:989
0x7f84e843b3dc CallUserSignalHandler
        ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:2005
0x7f84e843b837 _Z10sighandleriPN11__sanitizer19__sanitizer_siginfoEPv
        ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:2085
0x7f84e7e5b04f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x5608f4c6b1e3 initialize_cpu
        ../../binutils-gdb/sim/riscv/sim-main.c:1584
0x5608f4c23227 sim_open
        ../../binutils-gdb/sim/riscv/interp.c:129
0x5608f4a226a7 gdbsim_target_open
        ../../binutils-gdb/gdb/remote-sim.c:744
0x5608f4b54744 open_target
        ../../binutils-gdb/gdb/target.c:837
0x5608f47000db _Z8cmd_funcP16cmd_list_elementPKci
        ../../binutils-gdb/gdb/cli/cli-decode.c:2741
0x5608f4b6f1ee _Z15execute_commandPKci
        ../../binutils-gdb/gdb/top.c:569
0x5608f4861e6a _Z15command_handlerPKc
        ../../binutils-gdb/gdb/event-top.c:579
0x5608f4863ed4 _Z20command_line_handlerOSt10unique_ptrIcN3gdb13xfree_deleterIcEEE
        ../../binutils-gdb/gdb/event-top.c:815
0x5608f486172c gdb_rl_callback_handler
        ../../binutils-gdb/gdb/event-top.c:271
0x5608f4cba39b rl_callback_read_char
        ../../../binutils-gdb/readline/readline/callback.c:290
0x5608f48610dd gdb_rl_callback_read_char_wrapper_noexcept
        ../../binutils-gdb/gdb/event-top.c:196
0x5608f4861453 gdb_rl_callback_read_char_wrapper
        ../../binutils-gdb/gdb/event-top.c:235
0x5608f4b9767f stdin_event_handler
        ../../binutils-gdb/gdb/ui.c:154
0x5608f4e4a545 handle_file_event
        ../../binutils-gdb/gdbsupport/event-loop.cc:551
0x5608f4e4ac4a gdb_wait_for_event
        ../../binutils-gdb/gdbsupport/event-loop.cc:672
0x5608f4e4b66b _Z16gdb_do_one_eventi
        ../../binutils-gdb/gdbsupport/event-loop.cc:263
0x5608f4960069 start_event_loop
        ../../binutils-gdb/gdb/main.c:401
0x5608f4960069 captured_command_loop
        ../../binutils-gdb/gdb/main.c:465
0x5608f4963f04 captured_main
        ../../binutils-gdb/gdb/main.c:1339
0x5608f4963f04 _Z8gdb_mainP18captured_main_args
        ../../binutils-gdb/gdb/main.c:1358
0x5608f45abab4 main
        ../../binutils-gdb/gdb/gdb.c:38
---------------------
[...more identical tsan errors reported here...]
A fatal error internal to GDB has been detected, further
debugging is not possible.  GDB will now terminate.

  For instructions, see:
<https://www.gnu.org/software/gdb/bugs/>.

ERROR: : spawn id exp9 not open
    while executing
"expect {
-i exp9 -timeout 2400
        -re ".*$gdb_prompt $" {
            if {$verbose > 1} {
                send_user "Loaded $arg into $GDB\n"
            }
            return 0
        }
        -re "..."
    ("uplevel" body line 1)
    invoked from within
"uplevel $body" NONE : spawn id exp9 not open
Couldn't send delete breakpoints to GDB.
UNRESOLVED: gdb.base/utf8-identifiers.exp: delete all breakpoints, watchpoints, tracepoints, and catchpoints in delete_breakpoints
Comment 7 Bernd Edlinger 2024-05-11 06:24:31 UTC
Note that in the above call stack I used a couple of patches
to make the riscv simulator run at all, so especially
the line number sim-main.c:1584 may depend on my patches.
In my source code it is this line here:

void
initialize_cpu (SIM_DESC sd, SIM_CPU *cpu, int mhartid)
{
  struct riscv_sim_cpu *riscv_cpu = RISCV_SIM_CPU (cpu);
  const char *extensions;
  int i;

  memset (riscv_cpu->regs, 0, sizeof (riscv_cpu->regs)); //<-- line 1584

  CPU_PC_FETCH (cpu) = pc_get;
  CPU_PC_STORE (cpu) = pc_set;
  CPU_REG_FETCH (cpu) = reg_fetch;
  CPU_REG_STORE (cpu) = reg_store;
Comment 8 Tom Tromey 2024-05-17 15:21:18 UTC
Can you try the dwarf synchronous change for this one?
https://sourceware.org/pipermail/gdb-patches/2024-May/209267.html
Comment 9 Bernd Edlinger 2024-05-18 07:18:27 UTC
Confirmed. The races are no longer reproducible with your patch.

However, the signal-unsafe call is potentially still there.
I debugged the signal handler, and I think this issue
is triggered by these statements:
  if (bt_on_fatal_signal)
    {
      sig_write ("\n\n");
      sig_write (_("Fatal signal: "));
      sig_write (strsignal (sig));
      sig_write ("\n");

the _() is not signal-safe, but that is hard to detect,
because it does not trigger when LANG=C is used,
and the test runs probably do this most of the time, to be
able to parse the output.
But when LANG is something else, even "en_US.UTF-8",
the signal-unsafe warnings do occur:
just start gdb with tsan instrumentation and
do kill -11 <pid> in a different terminal.
Do the same with LANG=C and the issue is not there.

This _() expands to a call to gettext; note that this
function calls things like pthread_rwlock_rdlock,
which is not detected by tsan, but is obviously
not a good idea in a signal handler either.

But even strsignal is not thread-safe, and uses _()
to translate the signal name.  The strsignal from libiberty
is not used; instead the glibc variant is,
which calls _() to translate known signal names and uses
malloc for signals from SIGRTMIN..SIGRTMAX.
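
For comparison, here is a minimal sketch of a handler that stays
async-signal-safe (this is not the actual GDB handler and the names are made
up); the idea is to use only write() with fixed, untranslated strings and to
format the signal number by hand rather than going through _() or strsignal():

#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Write a string with the async-signal-safe write(); no stdio,
   no malloc, no gettext.  */
static void
sig_write (const char *msg)
{
  size_t len = 0;
  while (msg[len] != '\0')
    ++len;
  (void) write (STDERR_FILENO, msg, len);
}

static void
handle_fatal_signal (int sig)
{
  sig_write ("\n\nFatal signal: ");   /* fixed, untranslated string */

  /* Format the signal number by hand instead of calling strsignal (),
     which may go through gettext and malloc.  */
  char buf[16];
  int pos = (int) sizeof (buf);
  int n = sig;
  buf[--pos] = '\n';
  do
    {
      buf[--pos] = (char) ('0' + n % 10);
      n /= 10;
    }
  while (n > 0 && pos > 1);
  (void) write (STDERR_FILENO, buf + pos, sizeof (buf) - pos);

  /* Restore the default action and re-raise, so the process still
     terminates with the original signal.  */
  signal (sig, SIG_DFL);
  raise (sig);
}

int
main ()
{
  signal (SIGSEGV, handle_fatal_signal);
  /* Test from another terminal with: kill -11 <pid>  */
  pause ();
  return 0;
}

With something like this, kill -11 <pid> under tsan should no longer produce
the signal-unsafe-call report, since write(), signal() and raise() are all on
the async-signal-safe list.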