Bug 30618 - warning: while parsing threads: not well-formed (invalid token) - in non-stop + remote mode
Status: RESOLVED FIXED
Alias: None
Product: gdb
Classification: Unclassified
Component: remote
Version: unknown
Importance: P2 normal
Target Milestone: 15.1
Assignee: Not yet assigned to anyone
Reported: 2023-07-05 18:52 UTC by Jonah Graham
Modified: 2023-11-15 13:53 UTC
Last reconfirmed: 2023-07-06 00:00:00


Description Jonah Graham 2023-07-05 18:52:40 UTC
Create a program with an empty main function in a source file whose name contains non-ASCII (Unicode) characters, and compile it with gcc. Start gdbserver on the resulting binary and connect to it with gdb in non-stop mode. The connection sequence fails (full log below):

(gdb) set non-stop on
(gdb) target remote :3333
Remote debugging using :3333
warning: while parsing threads: not well-formed (invalid token)
The target is not running (try extended-remote?)


With remote debugging output enabled, this is the log (run in MI mode because the characters are escaped better):

&"  [remote] Sending packet: $QNonStop:1#8d\n"
&"  [remote] Packet received: OK\n"
&"  [remote] Sending packet: $qXfer:threads:read::0,1000#92\n"
&"  [remote] Packet received: l<threads>\\n<thread id=\"p10883.10883\" core=\"8\" name=\"issue-275-\\346\\265\\213\\350\\257\"/>\\n</threads>\\n\n"
&"warning: while parsing threads: not well-formed (invalid token)\n"
&"  [remote] Sending packet: $qTStatus#49\n"
&"  [remote] Packet received: T0;tnotrun:0;tframes:0;tcreated:0;tfree:500000;tsize:500000;circular:0;disconn:0;starttime:0;stoptime:0;username:;notes::\n"
&"  [remote] packet_ok: Packet qTStatus (trace-status) is supported\n"
&"  [remote] Sending packet: $qTfV#81\n"
&"  [remote] Packet received: 1:0:1:74726163655f74696d657374616d70\n"
&"  [remote] Sending packet: $qTsV#8e\n"
&"  [remote] Packet received: l\n"
=tsv-created,name="trace_timestamp",initial="0"
&"  [remote] Sending packet: $?#3f\n"
&"  [remote] Packet received: T0506:0000000000000000;07:90daffffff7f0000;10:b032fef7ff7f0000;thread:p10883.10883;core:8;\n"
&"  [remote] Sending packet: $vStopped#55\n"
&"  [remote] Packet received: OK\n"
&"[remote] start_remote_1: exit\n"


Here is the source and versions I am using:

$ cat src/integration-tests/test-programs/issue-275-测试.c 
int main(int argc, char *argv[])
{
    return 0;
}
$ gcc -o src/integration-tests/test-programs/issue-275-测试 -g src/integration-tests/test-programs/issue-275-测试.c
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

In case Bugzilla corrupts the encoding: 测试 means "test" (https://translate.google.ca/?sl=auto&tl=en&text=%E6%B5%8B%E8%AF%95&op=translate) and is encoded in UTF-8 as \xe6\xb5\x8b\xe8\xaf\x95 (hex) or \346\265\213\350\257\225 (octal).
Comment 1 Tom Tromey 2023-07-06 22:22:50 UTC
I debugged this a little, and the issue is that the Linux kernel
truncates the 'comm' file at 16 bytes.  This truncates the final
character in the name -- yielding an invalid UTF-8 sequence, which
gdbserver dutifully passes back to gdb.

I am not sure how to handle this.

One idea is to convert all non-ASCII characters to hex.
Or just drop them.
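
To make the truncation concrete, here is a minimal sketch in C, not gdb or gdbserver code. The helper name utf8_valid_prefix_len is made up for illustration, and the check is deliberately simplified: it only looks at lead and continuation bytes. It reports how much of a byte string is well-formed UTF-8, so it shows where the name from the log above stops being valid once only 15 bytes survive.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only.  Returns the length of the
   longest prefix of s[0..len) that is well-formed UTF-8.  Simplified:
   it checks lead and continuation bytes but does not reject overlong
   3- or 4-byte forms or surrogates.  */
size_t utf8_valid_prefix_len(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *) s;
    size_t i = 0;

    while (i < len) {
        unsigned char c = p[i];
        size_t need;   /* continuation bytes expected after the lead byte */
        size_t k;

        if (c < 0x80)
            need = 0;
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;
        else if (c >= 0xE0 && c <= 0xEF)
            need = 2;
        else if (c >= 0xF0 && c <= 0xF4)
            need = 3;
        else
            break;     /* stray continuation byte or invalid lead byte */

        if (need > len - i - 1)
            break;     /* sequence is cut off by the end of the buffer */

        for (k = 1; k <= need; k++)
            if ((p[i + k] & 0xC0) != 0x80)
                break;
        if (k <= need)
            break;     /* a continuation byte was missing */

        i += need + 1;
    }
    return i;
}
```

For the 16-byte name "issue-275-\xe6\xb5\x8b\xe8\xaf\x95" the whole string is valid. Cut to the 15 bytes seen in the packet log, the valid prefix is only 13 bytes: the trailing \xe8\xaf is an incomplete three-byte sequence.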
Comment 2 Tom Tromey 2023-07-13 16:20:40 UTC
Since this is Linux-specific we could probably just rely
directly on iconv here -- iconv the 'comm' contents to
UTF-8 and drop / substitute anything that gives an error.
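
A rough sketch of that iconv idea in C, under stated assumptions: the function name sanitize_to_utf8 and its signature are hypothetical, the source encoding is passed in by the caller, and '?' is used as the substitute. It is not the code that was committed.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not gdbserver code): convert IN from FROM_CODESET
   to UTF-8, writing '?' in place of any byte iconv rejects.  OUT is
   NUL-terminated whenever OUT_SIZE is nonzero.  Returns 0 on success,
   -1 if no converter exists or OUT is too small.  */
int sanitize_to_utf8(const char *from_codeset, const char *in,
                     char *out, size_t out_size)
{
    if (out_size == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", from_codeset);
    if (cd == (iconv_t) -1) {
        out[0] = '\0';
        return -1;
    }

    char *inp = (char *) in;         /* iconv's prototype is not const-correct */
    size_t in_left = strlen(in);
    char *outp = out;
    size_t out_left = out_size - 1;  /* reserve room for the NUL */
    int status = 0;

    while (in_left > 0) {
        if (iconv(cd, &inp, &in_left, &outp, &out_left) != (size_t) -1)
            break;                   /* all remaining input was converted */

        if ((errno == EILSEQ || errno == EINVAL) && out_left > 0) {
            /* Invalid or incomplete sequence: substitute and skip a byte.  */
            *outp++ = '?';
            out_left--;
            inp++;
            in_left--;
            iconv(cd, NULL, NULL, NULL, NULL);  /* reset conversion state */
        } else {
            status = -1;             /* E2BIG or an unexpected error */
            break;
        }
    }

    *outp = '\0';
    iconv_close(cd);
    return status;
}
```

Called as sanitize_to_utf8("UTF-8", name, buf, sizeof buf) on the truncated name from the log, the valid part "issue-275-测" should pass through unchanged and the leftover bytes should become '?'. Dropping instead of substituting would just mean not writing the '?'.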
Comment 3 Tom Tromey 2023-07-13 21:26:57 UTC
One other issue here is knowing the correct encoding to use.
gdb itself can pass in target_charset().
I guess gdbserver could use the prevailing encoding from the locale.

I wonder if we even care about non-ASCII characters here.
What if we substitute '?' for those instead?
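
The ASCII-only fallback would be very small. A sketch, assuming a NUL-terminated name that can be modified in place; the name mask_non_ascii is invented for this example:

```c
/* Hypothetical helper for illustration: replace every byte outside
   7-bit ASCII with '?', in place.  No knowledge of the encoding is
   needed, at the cost of losing all non-ASCII characters.  */
void mask_non_ascii(char *s)
{
    for (; *s != '\0'; s++)
        if ((unsigned char) *s >= 0x80)
            *s = '?';
}
```

Applied to the truncated name from the log, this gives "issue-275-?????", one '?' per non-ASCII byte, which is always safe to embed in the XML but loses the valid 测 as well.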
Comment 5 Jonah Graham 2023-07-19 17:40:56 UTC
> This truncates the final
> character in the name -- yielding an invalid UTF-8 sequence, which
> gdbserver dutifully passes back to gdb.

Thanks Tom. With this explanation I was able to craft my test in cdt-gdb-adapter, where I am trying to improve Unicode support, so that it avoids this bug: https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/pull/276
Comment 6 Sourceware Commits 2023-11-14 16:14:21 UTC
The master branch has been updated by Tom Tromey <tromey@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=07b3255c3bae7126a0d679f957788560351eb236

commit 07b3255c3bae7126a0d679f957788560351eb236
Author: Tom Tromey <tom@tromey.com>
Date:   Thu Jul 13 17:28:48 2023 -0600

    Filter invalid encodings from Linux thread names
    
    On Linux, a thread name can only be 16 bytes (including the trailing \0).
    A user sent in a test case where this causes a truncated UTF-8
    sequence, causing gdbserver to create invalid XML.
    
    I went back and forth about different ways to solve this, and in the
    end decided to fix it in gdbserver, with the reason being that it
    seems important to generate correct XML for the <thread> response.
    
    I am not totally sure whether the call to setlocale could have
    unplanned consequences.  This is needed, though, for nl_langinfo to
    return the correct result.
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30618
Comment 7 Tom Tromey 2023-11-15 13:53:24 UTC
Fixed.