28561 – [gdb/testsuite] Error due to not reading \r\n at end of mi prompt

Status:	NEW

Alias:	None

Product:	gdb
Classification:	Unclassified
Component:	testsuite
Version:	11.1

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

Reported:	2021-11-08 12:54 UTC by Tom de Vries
Modified:	2022-05-31 09:52 UTC
CC List:	4 users

Attachments
log for first error (1.69 KB, text/x-log) 2021-11-08 13:05 UTC, Tom de Vries
log for second error (5.80 KB, text/x-log) 2021-11-08 13:09 UTC, Tom de Vries

Description Tom de Vries 2021-11-08 12:54:00 UTC

In OBS, on openSUSE Tumbleweed for armv7hl, I run into:
...
(gdb) ^M
show mi-async^M
&"show mi-async\n"^M
~"Whether MI is in asynchronous mode is off.\n"^M
^done^M
(gdb) ^M
104-environment-directory -r^M
104^done,source-path="$cdir:$cwd"^M
(gdb) ERROR: Dir reinitialization failed (timeout)
...

I see similar fails for openSUSE Tumbleweed aarch64:
...
103-break-list^M
103^done,BreakpointTable={nr_rows="0",nr_cols="6",hdr=[{width="7",alignment="-1",col_name="number",colhdr="Num"},{width="14",alignment="-1",col_name="type",colhdr="Type"},{width="4",alignment="-1",col_name="disp",colhdr="Disp"},{width="3",alignment="-1",col_name="enabled",colhdr="Enb"},{width="10",alignment="-1",col_name="addr",colhdr="Address"},{width="40",alignment="2",col_name="what",colhdr="What"}],body=[]}^M
(gdb) ERROR: -break-list (timeout)
^M
...

AFAIU, gdb produces the required output, but the testsuite times out just before the end of the MI prompt, that is, just before the terminating '\r\n'.

This could be a system or system load problem of some sort.  I do find it curious though that the timeout always happens at the exact same location, so I thought it worth filing.
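
To illustrate the observation above, here is a minimal sketch of why a prompt that is visible in the log can still fail to match. It assumes the testsuite only accepts an MI prompt once "(gdb) " is followed by "\r\n"; the pattern and the helper name prompt_complete are written for this example and are not copied from the testsuite sources.

```python
import re

# Assumed pattern: the MI prompt only counts as complete once CR LF follows it.
MI_PROMPT = re.compile(r"\(gdb\) \r\n")


def prompt_complete(buffer: str) -> bool:
    """Return True if the buffer contains a fully terminated MI prompt."""
    return MI_PROMPT.search(buffer) is not None


# Output as gdb is expected to send it (taken from the first log above).
complete = '104^done,source-path="$cdir:$cwd"\r\n(gdb) \r\n'

# The same output with the last two bytes not (yet) read.
truncated = complete[:-2]

print(prompt_complete(complete))
print(prompt_complete(truncated))
```

With the trailing two bytes missing, the prompt text is all there but the pattern cannot match, which is consistent with a timeout reported right after "(gdb) ".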

Comment 1 Tom de Vries 2021-11-08 13:05:46 UTC

Created attachment 13766
log for first error

Comment 2 Tom de Vries 2021-11-08 13:09:25 UTC

Created attachment 13767
log for second error

Comment 3 Luis Machado 2022-05-09 07:18:59 UTC

Do you know what kind of aarch64 system this is? Is the failure mode a random missing newline by any chance?

Comment 4 Tom de Vries 2022-05-09 08:36:57 UTC

(In reply to Luis Machado from comment #3)
> Do you know what kind of aarch64 system this is?

Well, it's OBS, so for the attached logs I cannot say.

The most recent one I ran into was on aarch64 openSUSE Leap 15.3:
...
(gdb) set height 0^M
(gdb) set width 0^M
(gdb) set build-id-verbose 0^M
(gdb) builtin_spawn -pty^M
new-ui mi /dev/pts/5^M
New UI allocated^M
(gdb) =thread-group-added,id="i1"^M
(gdb) ERROR: MI channel failed
warning: Error detected on fd 11^M
thread 1.1^M
Unknown thread 1.1.^M
(gdb) UNRESOLVED: gdb.mi/user-selected-context-sync.exp: mode=non-stop: test_cli_inferior: reset selection to thread 1.1
frame 0^M
No registers.^M
(gdb) PASS: gdb.mi/user-selected-context-sync.exp: mode=non-stop: test_cli_inferior: reset selection to frame 0
...
the cpuinfo is:
...
processor       : 0^M
BogoMIPS        : 50.00^M
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs^M
CPU implementer : 0x41^M
CPU architecture: 8^M
CPU variant     : 0x3^M
CPU part        : 0xd0c^M
CPU revision    : 1^M
...

> Is the failure mode a
> random missing newline by any chance?

The failure mode is that expect times out trying to read a gdb mi prompt, which seems to be available apart from the terminating newline.
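
A rough way to picture this failure mode, as a sketch only: the reader below waits on a file descriptor until an assumed prompt pattern shows up or a timeout expires. It uses a plain POSIX pipe rather than a pty, and wait_for_prompt is a hypothetical helper, not part of expect or the gdb testsuite.

```python
import os
import re
import select
import time

# Assumed prompt pattern: "(gdb) " followed by the terminating CR LF.
MI_PROMPT = re.compile(rb"\(gdb\) \r\n")


def wait_for_prompt(fd, timeout, buf=b""):
    """Read from fd until MI_PROMPT is in the buffer or the timeout expires.

    Returns (matched, buffer) so the caller can see what was actually read.
    """
    deadline = time.monotonic() + timeout
    while not MI_PROMPT.search(buf):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False, buf
        readable, _, _ = select.select([fd], [], [], remaining)
        if not readable:
            return False, buf  # timed out: nothing more arrived
        chunk = os.read(fd, 4096)
        if not chunk:
            return False, buf  # writer closed its end
        buf += chunk
    return True, buf
```

If the writer has sent everything up to "(gdb) " but the final two bytes never reach the reader, wait_for_prompt returns False with a buffer that already ends in the prompt text, matching what the logs show.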

Comment 5 Luis Machado 2022-05-09 10:15:41 UTC

Ok. That seems very similar to a strange behavior I've been chasing for a while, and it doesn't affect only MI prompts, but regular GDB prompts as well. Randomly, a newline won't be output or won't be read by expect, and things will time out even though the input is there.

It does come and go depending on load. It happens more often when the machine is under heavy load.

Comment 6 Mark Wielaard 2022-05-16 09:25:05 UTC

Seeing the same issue in the GNU Toolchain buildbot (but for now only on armhf):
https://builder.sourceware.org/buildbot/#/builders/gdb-debian-armhf

It doesn't happen all the time, but it has failed twice this week:
https://builder.sourceware.org/buildbot/#/builders/72/builds/183/steps/5/logs/gdb_log
https://builder.sourceware.org/buildbot/#/builders/72/builds/181/steps/5/logs/gdb_log

Comment 7 Thiago Jung Bauermann 2022-05-30 16:18:49 UTC

This bug isn't so easy to reproduce. I've been trying for the past few days
on aarch64 machines, and I was able to observe it on these systems:

1. KVM guest running openSUSE Leap 15.3
2. Container with Ubuntu 20.04 userspace inside KVM guest with openSUSE Leap 15.3
3. Container with Ubuntu 22.04 userspace inside KVM guest with openSUSE Leap 15.3

I was not able to reproduce the problem on these systems:

4. KVM guest running Ubuntu 20.04
5. KVM guest running Ubuntu 22.04
6. KVM guest running Debian 10 (buster)
7. Bare metal Ubuntu 20.04
8. Container with Ubuntu 22.04 userspace on bare metal Ubuntu 20.04
9. Container with openSUSE Leap 15.3 userspace on bare metal Ubuntu 22.04

To me, this indicates that it's a kernel issue since there were three
kernels tested and only one of them reproduced the problem. And of the three
userspaces tested, they all reproduced the problem when used with the
problematic kernel, and not when used with the other two kernels.

Kernel I tested where the bug happens:
- openSUSE Leap 15.3 5.3.18-150300.59.68-default

Kernels I tested where the bug doesn't happen:
- Ubuntu 20.04 5.4.0-110-generic
- Ubuntu 22.04 5.15.0-30-generic
- Debian 10 5.10.0-0.bpo.12-arm64
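
The kernel-versus-userspace reasoning can be checked mechanically. This is a minimal sketch, assuming that a container runs on the kernel of its host (so systems 2 and 3 use the Leap 15.3 kernel, and systems 8 and 9 use the kernel of the bare-metal host) while a KVM guest runs its own kernel. OBSERVATIONS and outcomes_by are names made up for this illustration.

```python
from collections import defaultdict

# (system number, kernel in use, userspace, bug reproduced?)
# Transcribed from the numbered lists above; the kernel column follows the
# assumption that containers share the kernel of the system hosting them.
OBSERVATIONS = [
    (1, "openSUSE Leap 15.3", "openSUSE Leap 15.3", True),
    (2, "openSUSE Leap 15.3", "Ubuntu 20.04", True),
    (3, "openSUSE Leap 15.3", "Ubuntu 22.04", True),
    (4, "Ubuntu 20.04", "Ubuntu 20.04", False),
    (5, "Ubuntu 22.04", "Ubuntu 22.04", False),
    (6, "Debian 10", "Debian 10", False),
    (7, "Ubuntu 20.04", "Ubuntu 20.04", False),
    (8, "Ubuntu 20.04", "Ubuntu 22.04", False),
    (9, "Ubuntu 22.04", "openSUSE Leap 15.3", False),
]


def outcomes_by(field):
    """Group the set of observed outcomes by 'kernel' or by 'userspace'."""
    index = {"kernel": 1, "userspace": 2}[field]
    grouped = defaultdict(set)
    for row in OBSERVATIONS:
        grouped[row[index]].add(row[3])
    return dict(grouped)


by_kernel = outcomes_by("kernel")
by_userspace = outcomes_by("userspace")
```

Grouped by kernel, every entry has a single outcome; grouped by userspace, openSUSE Leap 15.3, Ubuntu 20.04 and Ubuntu 22.04 each show both outcomes. That is the pattern that points at the kernel rather than the userspace.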

My hypothesis is that there was some kernel patch fixing the bug that was
backported to the Ubuntu and Debian kernels but not to the openSUSE Leap
kernel.

I will try testing some upstream kernels to see if I can test the hypothesis
and narrow down upstream kernel versions.
 
The bug reproduces only after running the affected testcases for hundreds of
iterations. It was easiest to hit with gdb.mi/user-selected-context-sync.exp.
I also saw the problem happen with gdb.mi/mi-break.exp a couple of times.

I did see gdb.gdb/unittest.exp (which is the test that fails in the GNU
Toolchain buildbot) fail on an Ubuntu 22.04 KVM guest, so perhaps my “it's
the kernel” theory is wrong. But this failure is an outlier in the general
pattern I'm seeing, since it's much rarer to see this testcase fail (it fails
once every several thousand iterations rather than once every several
hundred), so I'll keep pursuing the kernel route before digging deeper into
what's going on with gdb.gdb/unittest.exp.
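
For anyone who wants to repeat the "run it for hundreds of iterations" approach, a loop along these lines may help. It is only a sketch: the command to run and the log path are supplied by the caller, and find_failures and run_until_failure are hypothetical helpers that simply look for the ERROR: and UNRESOLVED: markers seen in the logs in this report.

```python
import subprocess

# Markers taken from the log excerpts in this report.
FAILURE_MARKERS = ("ERROR:", "UNRESOLVED:")


def find_failures(log_text):
    """Return the log lines that contain one of the failure markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


def run_until_failure(command, log_path, max_iterations):
    """Re-run command until its log shows a failure.

    Returns (iteration, failing_lines), or (None, []) if no run failed.
    """
    for iteration in range(1, max_iterations + 1):
        subprocess.run(command, check=False)
        with open(log_path, errors="replace") as log:
            failures = find_failures(log.read())
        if failures:
            return iteration, failures
    return None, []
```

A possible invocation, assuming a gdb build tree where the testsuite is driven by make and writes gdb.log in the current directory, would be run_until_failure(["make", "check", "TESTS=gdb.mi/user-selected-context-sync.exp"], "gdb.log", 1000). Both the command line and the log location should be adjusted to the local setup.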

Comment 8 Mark Wielaard 2022-05-31 08:40:56 UTC

Note that we are also seeing testsuite failures on the new opensuse-leap container:
https://builder.sourceware.org/buildbot/#/builders/gdb-opensuseleap-x86_64

I am not sure they are related though.

The opensuse-tumbleweed container seems fine though:
https://builder.sourceware.org/buildbot/#/builders/gdb-opensusetw-x86_64

Both are running in a VM running Linux 5.17.5-300.fc36.x86_64

Comment 9 Tom de Vries 2022-05-31 09:07:03 UTC

(In reply to Mark Wielaard from comment #8)
> Note that we are also seeing testsuite failures on the new opensuse-leap
> container:
> https://builder.sourceware.org/buildbot/#/builders/gdb-opensuseleap-x86_64
> 
> I am not sure they are related though.
> 

Filed that one as PR29202.

Comment 10 Luis Machado 2022-05-31 09:52:07 UTC

From past experience, the easiest way to reproduce this is to have one of the high-output-volume testcases executing in a loop. For example, "gdb.base/all-architectures*.exp".

If you run those in parallel on all the cores, it will eventually reproduce. It doesn't reproduce on some machines though, notably the Taishans.
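
As a sketch of the "in parallel on all the cores" part: the helper below starts a list of commands concurrently and collects their exit codes. run_in_parallel is a hypothetical name, and the sketch assumes each command writes to its own output directory so that concurrent runs do not overwrite each other's logs.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(commands):
    """Run every command concurrently and return the exit codes in order."""
    if not commands:
        return []

    def run_one(command):
        return subprocess.run(command, check=False).returncode

    # One worker thread per command, so all of them are running at once.
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run_one, commands))
```

To load every core, the caller could build os.cpu_count() copies of a test-loop command and pass them in as one list.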

I theorized this was some subtle kernel issue, but the fact I was able to see clean results for the same kernel on different hardware made me suspicious of that reasoning.

The Ampere Altra-based systems are highly parallel though, so it could be a problem with a high core count or some NUMA configuration.