Create an empty main function in a file whose name contains Unicode characters and compile it with gcc. Start gdbserver on the result and connect to it with gdb in non-stop mode, and the connection sequence fails (full log below):

(gdb) set non-stop on
(gdb) target remote :3333
Remote debugging using :3333
warning: while parsing threads: not well-formed (invalid token)
The target is not running (try extended-remote?)

With remote debugging on, this is the output (run in MI mode because the characters are escaped better):

&" [remote] Sending packet: $QNonStop:1#8d\n"
&" [remote] Packet received: OK\n"
&" [remote] Sending packet: $qXfer:threads:read::0,1000#92\n"
&" [remote] Packet received: l<threads>\\n<thread id=\"p10883.10883\" core=\"8\" name=\"issue-275-\\346\\265\\213\\350\\257\"/>\\n</threads>\\n\n"
&"warning: while parsing threads: not well-formed (invalid token)\n"
&" [remote] Sending packet: $qTStatus#49\n"
&" [remote] Packet received: T0;tnotrun:0;tframes:0;tcreated:0;tfree:500000;tsize:500000;circular:0;disconn:0;starttime:0;stoptime:0;username:;notes::\n"
&" [remote] packet_ok: Packet qTStatus (trace-status) is supported\n"
&" [remote] Sending packet: $qTfV#81\n"
&" [remote] Packet received: 1:0:1:74726163655f74696d657374616d70\n"
&" [remote] Sending packet: $qTsV#8e\n"
&" [remote] Packet received: l\n"
=tsv-created,name="trace_timestamp",initial="0"
&" [remote] Sending packet: $?#3f\n"
&" [remote] Packet received: T0506:0000000000000000;07:90daffffff7f0000;10:b032fef7ff7f0000;thread:p10883.10883;core:8;\n"
&" [remote] Sending packet: $vStopped#55\n"
&" [remote] Packet received: OK\n"
&"[remote] start_remote_1: exit\n"

Here is the source and versions I am using:

$ cat src/integration-tests/test-programs/issue-275-测试.c
int main(int argc, char *argv[]) { return 0; }
$ gcc -o src/integration-tests/test-programs/issue-275-测试 -g src/integration-tests/test-programs/issue-275-测试.c
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

In case the encoding in bugzilla corrupts it, 测试 means "test" (https://translate.google.ca/?sl=auto&tl=en&text=%E6%B5%8B%E8%AF%95&op=translate) and is encoded in UTF-8 as \xe6\xb5\x8b\xe8\xaf\x95 (hex) or \346\265\213\350\257\225 (octal).
I debugged this a little, and the issue is that the Linux kernel truncates the 'comm' file at 16 bytes. This truncates the final character in the name -- yielding an invalid UTF-8 sequence, which gdbserver dutifully passes back to gdb. I am not sure how to handle this. One idea is to convert all non-ASCII characters to hex. Or just drop them.
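To make the truncation concrete, here is a minimal sketch in C. It assumes the 16-byte 'comm' limit described above (15 bytes of name plus the trailing NUL). The helper names truncate_to_comm and utf8_is_well_formed are hypothetical and are not gdb or kernel code. The check is structural only (lead byte followed by the right number of continuation bytes), which is enough to show the problem.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define COMM_LEN 16 /* bytes, including the trailing NUL */

/* Hypothetical stand-in for the kernel's behaviour: keep at most
   COMM_LEN - 1 bytes of NAME, cutting at a byte boundary with no
   regard for character boundaries.  */
void truncate_to_comm(const char *name, char out[COMM_LEN])
{
    strncpy(out, name, COMM_LEN - 1);
    out[COMM_LEN - 1] = '\0';
}

/* Structural UTF-8 check: every lead byte must be followed by the
   number of continuation bytes it announces.  Overlong forms and
   surrogates are not rejected; this is only an illustration.  */
bool utf8_is_well_formed(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != 0) {
        int extra;

        if (*p < 0x80)
            extra = 0;
        else if ((*p & 0xE0) == 0xC0)
            extra = 1;
        else if ((*p & 0xF0) == 0xE0)
            extra = 2;
        else if ((*p & 0xF8) == 0xF0)
            extra = 3;
        else
            return false; /* stray continuation or invalid lead byte */
        p++;

        while (extra-- > 0) {
            /* A NUL here also fails the test, so we never read past
               the end of the string.  */
            if ((*p & 0xC0) != 0x80)
                return false;
            p++;
        }
    }
    return true;
}
```

The program name from the report, "issue-275-" followed by the six bytes of 测试, is exactly 16 bytes long, so the last byte (\x95) is cut off and the final character is left incomplete.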
Since this is Linux-specific we could probably just rely directly on iconv here -- iconv the 'comm' contents to UTF-8 and drop / substitute anything that gives an error.
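A rough sketch of the iconv idea, using only the POSIX iconv interface. The function name sanitize_utf8_name is hypothetical, the input is assumed to be UTF-8 already (a real implementation would have to choose the source encoding), and the policy here (substitute '?' for an invalid byte, drop an incomplete sequence at the end) is just one possible choice, not what gdbserver does.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy SRC (assumed UTF-8) into DST, keeping only
   well-formed sequences.  Returns 0 on success, -1 if DST has no room
   or the converter cannot be opened.  */
int sanitize_utf8_name(const char *src, char *dst, size_t dst_size)
{
    if (dst_size == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)src; /* iconv takes char **, but does not modify the input */
    size_t in_left = strlen(src);
    char *out = dst;
    size_t out_left = dst_size - 1; /* reserve room for the NUL */

    while (in_left > 0) {
        if (iconv(cd, &in, &in_left, &out, &out_left) != (size_t)-1)
            break; /* everything was converted */

        if (errno == EILSEQ && out_left > 0) {
            /* Invalid byte in the middle: substitute and skip it.  */
            *out++ = '?';
            out_left--;
            in++;
            in_left--;
        } else {
            /* EINVAL: incomplete sequence at the end, drop it.
               E2BIG: output buffer full, stop.  */
            break;
        }
    }

    *out = '\0';
    iconv_close(cd);
    return 0;
}
```

With the truncated name from the log (the 15 bytes ending in \350\257), this should leave "issue-275-测", since the incomplete final character is reported as EINVAL and dropped.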
One other issue here is knowing the correct encoding to use. gdb itself can pass in target_charset(). I guess gdbserver could use the prevailing encoding from the locale. I wonder if we even care about non-ASCII characters here. What if we substitute '?' for those instead?
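The '?' substitution would be simple and needs no knowledge of the encoding. A minimal sketch, where replace_non_ascii is a hypothetical name and not an existing gdb function:

```c
/* Hypothetical helper: overwrite every byte outside the ASCII range
   with '?', in place.  Each byte of a multi-byte character becomes
   its own '?', so the length of the string does not change.  */
void replace_non_ascii(char *name)
{
    for (unsigned char *p = (unsigned char *)name; *p != 0; p++) {
        if (*p >= 0x80)
            *p = '?';
    }
}
```

The cost is that names that were valid to begin with lose their non-ASCII characters too: the truncated name from the log would come out as "issue-275-?????".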
https://sourceware.org/pipermail/gdb-patches/2023-July/200971.html
> This truncates the final
> character in the name -- yielding an invalid UTF-8 sequence, which
> gdbserver dutifully passes back to gdb.

Thanks Tom - with this explanation I was able to craft my test in cdt-gdb-adapter to avoid this bug, where I am trying to improve unicode support: https://github.com/eclipse-cdt-cloud/cdt-gdb-adapter/pull/276.
The master branch has been updated by Tom Tromey <tromey@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=07b3255c3bae7126a0d679f957788560351eb236

commit 07b3255c3bae7126a0d679f957788560351eb236
Author: Tom Tromey <tom@tromey.com>
Date:   Thu Jul 13 17:28:48 2023 -0600

    Filter invalid encodings from Linux thread names

    On Linux, a thread can only be 16 bytes (including the trailing \0).
    A user sent in a test case where this causes a truncated UTF-8
    sequence, causing gdbserver to create invalid XML.

    I went back and forth about different ways to solve this, and in the
    end decided to fix it in gdbserver, with the reason being that it
    seems important to generate correct XML for the <thread> response.

    I am not totally sure whether the call to setlocale could have
    unplanned consequences.  This is needed, though, for nl_langinfo to
    return the correct result.

    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30618
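On the setlocale remark in the commit message: a C program starts in the "C" locale, so nl_langinfo (CODESET) only reflects the user's environment after setlocale has been called. A small sketch of that dependency, where current_codeset is a hypothetical name and this is not the code from the commit:

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: return the name of the character encoding of
   the environment's locale.  Without the setlocale call the process
   stays in the "C" locale and CODESET describes that instead.  */
const char *current_codeset(void)
{
    /* An empty string selects the locale from the environment
       (LC_ALL, LC_CTYPE, LANG).  If that fails, the locale is
       left unchanged.  */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

The returned name depends on the environment, for example "UTF-8" under a UTF-8 locale, and can be used as the source encoding for a conversion.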
Fixed.