In OBS, on openSUSE Tumbleweed for armv7hl, I run into:
...
(gdb) ^M
show mi-async^M
&"show mi-async\n"^M
~"Whether MI is in asynchronous mode is off.\n"^M
^done^M
(gdb) ^M
104-environment-directory -r^M
104^done,source-path="$cdir:$cwd"^M
(gdb) ERROR: Dir reinitialization failed (timeout)
...
I see similar failures for openSUSE Tumbleweed aarch64:
...
103-break-list^M
103^done,BreakpointTable={nr_rows="0",nr_cols="6",hdr=[{width="7",alignment="-1",col_name="number",colhdr="Num"},{width="14",alignment="-1",col_name="type",colhdr="Type"},{width="4",alignment="-1",col_name="disp",colhdr="Disp"},{width="3",alignment="-1",col_name="enabled",colhdr="Enb"},{width="10",alignment="-1",col_name="addr",colhdr="Address"},{width="40",alignment="2",col_name="what",colhdr="What"}],body=[]}^M
(gdb) ERROR: -break-list (timeout)
^M
...
AFAIU, gdb produces the required output, but the test-suite reports a timeout right at the end of the MI prompt, just before the terminating '\r\n'. This could be a system or system-load problem of some sort, but I find it curious that the timeout always happens at the exact same location, so I thought it was worth filing.
Created attachment 13766: log for first error
Created attachment 13767: log for second error
Do you know what kind of aarch64 system this is? Is the failure mode a random missing newline by any chance?
(In reply to Luis Machado from comment #3)
> Do you know what kind of aarch64 system this is?

Well, it's OBS, so for the attached logs I cannot say. The most recent one I ran into was on aarch64 openSUSE Leap 15.3:
...
(gdb) set height 0^M
(gdb) set width 0^M
(gdb) set build-id-verbose 0^M
(gdb) builtin_spawn -pty^M
new-ui mi /dev/pts/5^M
New UI allocated^M
(gdb) =thread-group-added,id="i1"^M
(gdb) ERROR: MI channel failed
warning: Error detected on fd 11^M
thread 1.1^M
Unknown thread 1.1.^M
(gdb) UNRESOLVED: gdb.mi/user-selected-context-sync.exp: mode=non-stop: test_cli_inferior: reset selection to thread 1.1
frame 0^M
No registers.^M
(gdb) PASS: gdb.mi/user-selected-context-sync.exp: mode=non-stop: test_cli_inferior: reset selection to frame 0
...
The cpuinfo is:
...
processor : 0^M
BogoMIPS : 50.00^M
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs^M
CPU implementer : 0x41^M
CPU architecture: 8^M
CPU variant : 0x3^M
CPU part : 0xd0c^M
CPU revision : 1^M
...

> Is the failure mode a random missing newline by any chance?

The failure mode is that expect times out trying to read a gdb MI prompt, which appears to have been output apart from the terminating newline.
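To illustrate the mechanics outside DejaGnu, here is a minimal Python/pexpect sketch (not the actual testsuite code; the /bin/sh stand-in, pattern, and timeout value are all made up for the example). A child that emits the prompt but never the trailing newline makes expect time out even though almost everything has already arrived:

import pexpect

# Stand-in for gdb: print the MI prompt but never the terminating newline.
child = pexpect.spawn("/bin/sh", ["-c", "printf '(gdb) '; sleep 30"],
                      encoding="utf-8")
try:
    # Like the testsuite, wait for the prompt followed by a newline.
    child.expect(r"\(gdb\) \r?\n", timeout=5)
    print("matched")
except pexpect.TIMEOUT:
    # The prompt itself did arrive; only the newline is missing.
    print("timed out; received so far:", repr(child.before))
finally:
    child.close(force=True)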
Ok. That seems very similar to a strange behavior I've been chasing for a while, and it doesn't affect only MI prompts, but regular GDB prompts as well. Randomly a newline won't be output or won't be read by expect, and things time out even though the data is there. It comes and goes depending on load, happening more often when the machine is under heavy load.
Seeing the same issue in the GNU Toolchain buildbot (but for now only on armhf):
https://builder.sourceware.org/buildbot/#/builders/gdb-debian-armhf

It doesn't happen all the time, but it has failed twice this week:
https://builder.sourceware.org/buildbot/#/builders/72/builds/183/steps/5/logs/gdb_log
https://builder.sourceware.org/buildbot/#/builders/72/builds/181/steps/5/logs/gdb_log
This bug isn't so easy to reproduce. I've been trying for the past few days on aarch64 machines, and I was able to observe it on these systems:

1. KVM guest running openSUSE Leap 15.3
2. Container with Ubuntu 20.04 userspace inside a KVM guest with openSUSE Leap 15.3
3. Container with Ubuntu 22.04 userspace inside a KVM guest with openSUSE Leap 15.3

I was not able to reproduce the problem on these systems:

4. KVM guest running Ubuntu 20.04
5. KVM guest running Ubuntu 22.04
6. KVM guest running Debian 10 (buster)
7. Bare metal Ubuntu 20.04
8. Container with Ubuntu 22.04 userspace on bare metal Ubuntu 20.04
9. Container with openSUSE Leap 15.3 userspace on bare metal Ubuntu 22.04

To me, this indicates a kernel issue: three kernels were tested and only one of them reproduced the problem, while all three userspaces reproduced the problem when used with the problematic kernel and none of them did when used with the other two kernels.

Kernel I tested where the bug happens:
- openSUSE Leap 15.3 5.3.18-150300.59.68-default

Kernels I tested where the bug doesn't happen:
- Ubuntu 20.04 5.4.0-110-generic
- Ubuntu 22.04 5.15.0-30-generic
- Debian 10 5.10.0-0.bpo.12-arm64

My hypothesis is that some kernel patch fixing the bug was backported to the Ubuntu and Debian kernels but not to the openSUSE Leap kernel. I will try some upstream kernels to see if I can test that hypothesis and narrow down upstream kernel versions.

The bug reproduces only after running the affected testcases for hundreds of iterations. It was easiest to hit with gdb.mi/user-selected-context-sync.exp; I also saw it a couple of times with gdb.mi/mi-break.exp. I did see gdb.gdb/unittest.exp (the test that fails in the GNU Toolchain buildbot) fail on an Ubuntu 22.04 KVM guest, so perhaps my "it's the kernel" theory is wrong. But that failure is an outlier in the general pattern I'm seeing, since that testcase fails far more rarely (once every several thousand iterations rather than every several hundred), so I'll keep pursuing the kernel route before digging deeper into what's going on with gdb.gdb/unittest.exp.
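For reference, the looping itself is nothing fancier than re-running the testcase until gdb.sum reports a failure. A rough sketch of such a driver (the build-tree path is a placeholder, the iteration count is arbitrary, and in practice running several copies in separate build trees helps keep the machine loaded):

import subprocess

BUILD_DIR = "/path/to/gdb-build/gdb"   # placeholder: a configured GDB build tree
TEST = "gdb.mi/user-selected-context-sync.exp"

for i in range(1, 1001):
    # Run just the affected testcase; discard the (large) console output.
    subprocess.run(["make", "check", "RUNTESTFLAGS=" + TEST],
                   cwd=BUILD_DIR,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                   check=False)
    # The lost-newline timeouts surface as FAIL/UNRESOLVED results in gdb.sum.
    with open(BUILD_DIR + "/testsuite/gdb.sum") as f:
        summary = f.read()
    if "\nFAIL:" in summary or "\nUNRESOLVED:" in summary:
        print("reproduced on iteration", i)
        break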
Note that we are also seeing testsuite failures on the new opensuse-leap container:
https://builder.sourceware.org/buildbot/#/builders/gdb-opensuseleap-x86_64

I am not sure they are related though. The opensuse-tumbleweed container seems fine:
https://builder.sourceware.org/buildbot/#/builders/gdb-opensusetw-x86_64

Both run in a VM running Linux 5.17.5-300.fc36.x86_64.
(In reply to Mark Wielaard from comment #8)
> Note that we are also seeing testsuite failures on the new opensuse-leap
> container:
> https://builder.sourceware.org/buildbot/#/builders/gdb-opensuseleap-x86_64
>
> I am not sure they are related though.

Filed that one as PR29202.
From past experience, the easiest way to reproduce this is to have one of the high-output-volume testcases executing in a loop, for example "gdb.base/all-architectures*.exp". If you run those in parallel on all the cores, it will eventually reproduce. It doesn't reproduce on some machines though, notably the Taishans. I theorized this was some subtle kernel issue, but the fact that I was able to see clean results for the same kernel on different hardware made me suspicious of that reasoning. The Ampere Altra-based systems are highly parallel though, so it could be a problem with a high core count or some NUMA configuration.