Bug 31275

Summary: [gdb/dap, gdb.dap/pause.exp] ERROR: timeout reading json header
Product: gdb Reporter: Tom de Vries <vries>
Component: dapAssignee: Tom Tromey <tromey>
Status: RESOLVED FIXED    
Severity: normal CC: tromey
Priority: P2    
Version: HEAD   
Target Milestone: 15.1   
Host: Target:
Build: Last reconfirmed:
Attachments: gdb.log
dap.log.1

Description Tom de Vries 2024-01-22 19:05:14 UTC
Created attachment 15318 [details]
gdb.log

On powerpc64le-linux I run into:
...
{"type": "event", "event": "output", "body": {"category": "stdout", "output": "KeyboardInterrupt\n"}, "seq": 50}ERROR: timeout reading json header
...
Comment 1 Tom de Vries 2024-01-22 19:07:22 UTC
FTR, this is at commit:
...
commit 46bdb641642d20307521f5db2e1f1c7dcc49d5a7
Author: Tom Tromey <tromey@adacore.com>
Date:   Wed Jan 10 12:51:10 2024 -0700

    Handle EOF more gracefully in DAP
...
on cfarm120.
Comment 2 Tom de Vries 2024-01-23 11:55:05 UTC
Created attachment 15324 [details]
dap.log.1
Comment 3 Tom Tromey 2024-01-23 16:42:17 UTC
I wonder what's going on here:

"During handling of the above exception, another exception occurred:\n"

What I think should happen is that the work on the main
thread should be interrupted by a KeyboardInterrupt; then
this should be forwarded to the server thread (via the result queue).
Then finally the server thread should post the "cancelled" response.

From the stack trace this seems to be partly correct but,
for some reason, we get this "another exception" error -- but
without details really.
Comment 4 Tom de Vries 2024-02-08 16:06:31 UTC
I investigated this by aborting at exception, and investigation the coredump, using:
...
diff --git a/gdb/python/lib/gdb/dap/startup.py b/gdb/python/lib/gdb/dap/startup.py
index 7cd1ef2fba1..54bc3d00950 100644
--- a/gdb/python/lib/gdb/dap/startup.py
+++ b/gdb/python/lib/gdb/dap/startup.py
@@ -21,6 +21,7 @@ import queue
 import threading
 import traceback
 import sys
+import os
 
 from enum import IntEnum, auto
 
@@ -179,6 +180,10 @@ def log_stack(level=LogLevel.DEFAULT):
     if dap_log.log_file is not None and level <= _log_level.value:
         traceback.print_exc(file=dap_log.log_file)
 
+def dap_excepthook(exc_type, exc_value, exc_traceback):
+    os.abort()
+
+sys.excepthook = dap_excepthook
 
 @in_gdb_thread
 def exec_and_log(cmd):
...

The coredump is not too usable due to PR31352, but by using fprintf in sig_write we get a usable one:
(gdb) bt
#0  0x00007f7ca054e76b in raise () from /lib64/libpthread.so.0
#1  0x00000000007a3695 in handle_fatal_signal (sig=6) at /data/vries/gdb/src/gdb/event-top.c:926
#2  <signal handler called>
#3  0x00007f7c9d853d2b in raise () from /lib64/libc.so.6
#4  0x00007f7c9d8553e5 in abort () from /lib64/libc.so.6
#5  0x00007f7c9f7717e9 in os_abort_impl (module=<optimized out>) at ./Modules/posixmodule.c:10430
#6  os_abort (module=<optimized out>, _unused_ignored=<optimized out>) at ./Modules/clinic/posixmodule.c.h:5000
#7  0x00007f7c9f6b90ef in _PyCFunction_FastCallDict (
    func_obj=<built-in method abort of module object at remote 0x7f7c9f46cea8>, args=0x7f7c9d5afc30, nargs=0, 
    kwargs=<optimized out>) at Objects/methodobject.c:192
#8  0x00007f7c9f72aa72 in call_function (pp_stack=pp_stack@entry=0x7ffe3802e4f8, oparg=<optimized out>, 
    kwnames=kwnames@entry=0x0) at Python/ceval.c:4851
#9  0x00007f7c9f72eed5 in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3335
#10 0x00007f7c9f72a67f in _PyFunction_FastCall (co=co@entry=0x7f7c9e8e09c0, args=<optimized out>, 
    args@entry=0x7ffe3802e6c0, nargs=nargs@entry=3, 
    globals=globals@entry={'__name__': 'gdb.dap.startup', '__doc__': None, '__package__': 'gdb.dap', '__loader__': <SourceFileLoader(name='gdb.dap.startup', path='/data/vries/gdb/leap-15-4/build/gdb/data-directory/python/gdb/dap/startup.py') at remote 0x7f7c9e8e6710>, '__spec__': <ModuleSpec(name='gdb.dap.startup', loader=<...>, origin='/data/vries/gdb/leap-15-4/build/gdb/data-directory/python/gdb/dap/startup.py', loader_state=None, submodule_search_locations=None, _set_fileattr=True, _cached='/data/vries/gdb/leap-15-4/build/gdb/data-directory/python/gdb/dap/__pycache__/startup.cpython-36.pyc', _initializing=False) at remote 0x7f7c9e8e6748>, '__file__': '/data/vries/gdb/leap-15-4/build/gdb/data-directory/python/gdb/dap/startup.py', '__cached__': '/data/vries/gdb/leap-15-4/build/gdb/data-directory/python/gdb/dap/__pycache__/startup.cpython-36.pyc', '__builtins__': {'__name__': 'builtins', '__doc__': "Built-in functions, exceptions, and other objects.\n\nNoteworthy: None is the `nil' object; Ellipsis represents `...' in slices.", '__pac...(truncated)) at Python/ceval.c:4933
#11 0x00007f7c9f732870 in _PyFunction_FastCallDict (func=func@entry=<function at remote 0x7f7c9e8ed840>, 
    args=args@entry=0x7ffe3802e6c0, nargs=3, kwargs=kwargs@entry=0x0) at Python/ceval.c:5035
#12 0x00007f7c9f670b7e in _PyObject_FastCallDict (func=<function at remote 0x7f7c9e8ed840>, 
    args=args@entry=0x7ffe3802e6c0, nargs=nargs@entry=3, kwargs=kwargs@entry=0x0) at Objects/abstract.c:2310
#13 0x00007f7c9f75537c in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:669
#14 0x0000000000aaabd3 in gdbpy_print_stack () at /data/vries/gdb/src/gdb/python/python.c:1541
#15 0x0000000000aada09 in gdbpy_event::operator() (this=0x7f7c68004a60)
    at /data/vries/gdb/src/gdb/python/python.c:1063
#16 0x0000000000aadfda in std::_Function_handler<void (), gdbpy_event>::_M_invoke(std::_Any_data const&) (
    __functor=...) at /usr/include/c++/7/bits/std_function.h:316
#17 0x0000000000622a48 in std::function<void ()>::operator()() const (this=0x7f7c68004940)
    at /usr/include/c++/7/bits/std_function.h:706
#18 0x0000000000b6159a in run_events (error=0, client_data=0x0) at /data/vries/gdb/src/gdb/run-on-main-thread.c:76
#19 0x00000000014f9eec in handle_file_event (file_ptr=0x2c936e0, ready_mask=1)
    at /data/vries/gdb/src/gdbsupport/event-loop.cc:573
#20 0x00000000014fa482 in gdb_wait_for_event (block=1) at /data/vries/gdb/src/gdbsupport/event-loop.cc:694
#21 0x00000000014f9365 in gdb_do_one_event (mstimeout=-1) at /data/vries/gdb/src/gdbsupport/event-loop.cc:264
#22 0x00000000009364a2 in start_event_loop () at /data/vries/gdb/src/gdb/main.c:401
#23 0x0000000000936603 in captured_command_loop () at /data/vries/gdb/src/gdb/main.c:465
#24 0x0000000000937eb4 in captured_main (data=0x7ffe3802eae0) at /data/vries/gdb/src/gdb/main.c:1335
#25 0x0000000000937f4e in gdb_main (args=0x7ffe3802eae0) at /data/vries/gdb/src/gdb/main.c:1354
#26 0x000000000041a1c3 in main (argc=14, argv=0x7ffe3802ebf8) at /data/vries/gdb/src/gdb/gdb.c:39
...

I can reliably make the test-case fail by adding a time.sleep(1) either before or after the try-catch in here:
...
@@ -226,6 +232,7 @@ def send_gdb_with_response(fn):
     result_q = DAPQueue()
 
     def message():
+        time.sleep(1)
         try:
             val = fn()
             result_q.put(val)
...
and observe the same backtrace in the corefile.

So, my theory is that there's a race between one thread calling PyErr_SetInterrupt, and another one calling message.

If the interrupt is handled while "val = fn()" is executing, cancellation is successful and a KeyboardInterrupt is thrown which is caught by the try-catch.

But if the interrupt is handled before or after the try-catch, it escapes and is printed by the gdbpy_event::operator.

I suppose a partial fix could be that you'd block interrupts before calling into python, and unblock them only once you're in the try-catch.

That would take care of delaying too-early interrupts.

I suppose interrupts may be dealt with by ignoring them, if we can identify that the request that it was supposed to cancel already has finished.
Comment 5 Tom de Vries 2024-02-08 17:00:10 UTC
The test-case in the current state actually does not use the python script (see PR31354), but sources the gdb script twice.
Comment 6 Tom de Vries 2024-02-19 15:21:09 UTC
FTR, this still reproduces with https://sourceware.org/pipermail/gdb-patches/2024-February/206626.html .
Comment 8 Sourceware Commits 2024-02-29 20:29:27 UTC
The master branch has been updated by Tom de Vries <vries@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=032d23a6db097840b7969867df0e2ba035f03f6f

commit 032d23a6db097840b7969867df0e2ba035f03f6f
Author: Tom de Vries <tdevries@suse.de>
Date:   Thu Feb 29 21:29:34 2024 +0100

    [gdb/dap] Fix stray KeyboardInterrupt after cancel
    
    When running test-case gdb.dap/pause.exp 100 times in a loop, it passes
    100/100.
    
    But if we remove the two "sleep 0.2" from the test-case, we run into
    (copied from dap.log and edited for readability):
    ...
    Traceback (most recent call last):
      File "startup.py", line 251, in message
        def message():
    
    KeyboardInterrupt
    Quit
    ...
    
    This happens as follows.
    
    CancellationHandler.cancel calls gdb.interrupt to cancel a request in flight.
    
    The idea is that this interrupt triggers while in fn here in message (a nested
    function of send_gdb_with_response):
    ...
        def message():
            try:
                val = fn()
                result_q.put(val)
            except (Exception, KeyboardInterrupt) as e:
                result_q.put(e)
    ...
    but instead it triggers outside the try/except.
    
    Fix this by:
    - in CancellationHandler, renaming variable in_flight to in_flight_dap_thread,
      and adding a variable in_flight_gdb_thread to be able to distinguish when
      a request is in flight in the dap thread or the gdb thread.
    - adding a wrapper Cancellable to to deal with cancelling the wrapped
      event
    - using Cancellable in send_gdb and send_gdb_with_response to wrap the posted
      event
    - in CancellationHandler.cancel, only call gdb.interrupt if
      req == self.in_flight_gdb_thread.
    
    This makes the test-case pass 100/100, also when adding the extra stressor of
    "taskset -c 0", which makes the fail more likely without the patch.
    
    Tested on aarch64-linux.
    
    Approved-By: Tom Tromey <tom@tromey.com>
    
    PR dap/31275
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31275
Comment 9 Tom de Vries 2024-02-29 20:30:27 UTC
Fixed.