[PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting
Andrew Burgess
aburgess@redhat.com
Wed Jan 19 17:44:59 GMT 2022
Simon,
I wonder if you have any thoughts on this latest iteration?
Thanks,
Andrew
* Andrew Burgess <aburgess@redhat.com> [2022-01-12 15:59:38 +0000]:
> * Tom Tromey <tom@tromey.com> [2022-01-12 08:32:18 -0700]:
>
> > >>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:
> >
> > Andrew> - return highlight(contents, lexer, formatter)
> > Andrew> + return highlight(contents, lexer, formatter).encode(
> > Andrew> + _gdb.host_charset(), "backslashreplace"
> >
> > I think the rest of this file doesn't use the '_gdb.' prefix, because
> > there's a star import of _gdb.
> >
> > This approach isn't super (it defeats Python analyzers like flake8) but
> > it seems more clear to be consistent.
> >
> > Andrew> + In an ideal world we should be able to use the host_charset here, but
> > Andrew> + the host_charset is most likely to represent the capabilities of the
> > Andrew> + users terminal, rather than the encoding of a particular source file.
> >
> > This comment mentions not using host_charset, but then the code does use
> > it.
> >
> > I think host_charset really does represent both the expected encoding of
> > things -- file names, file contents, etc -- and also the terminal
> > capabilities. At least, that's my mental model of how the locale system
> > works. It's not a great setup of course, since in reality it's probably
> > pretty normal to have some files in some encoding and other files in
> > other ones.
> >
> > Andrew> + Another possibility would be to use Python's ability to escape
> > Andrew> + unknown characters when converting the bytes into unicode. However,
> >
> > This also seems to be done, but the comment makes it sound as though it
> > is not.
> >
> > The code changes seem fine to me.
>
> Thanks for your feedback. An updated version of the patch is
> included below. I'm hoping Simon will take a look at this too, as
> he's had some really constructive feedback on Python/unicode stuff.
>
> Thanks,
> Andrew
>
> ---
>
> commit 099d6a5168b43fbfa767ee0992a6975d9e3f651e
> Author: Andrew Burgess <aburgess@redhat.com>
> Date: Fri Nov 26 13:15:28 2021 +0000
>
> gdb/python: handle non utf-8 characters when source highlighting
>
> This commit adds support for source files that contain non utf-8
> characters when performing source styling using the Python pygments
> package. This does not change the behaviour of GDB when the GNU
> Source Highlight library is used.
>
> For the following problem description, assume that either GDB is built
> without GNU Source Highlight support, of that this has been disabled
> using 'maintenance set gnu-source-highlight enabled off'.
>
> The initial problem reported was that a source file containing non
> utf-8 characters would cause GDB to print a Python exception, and then
> display the source without styling, e.g.:
>
> Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
> /* Source code here, without styling... */
>
> Further, as the user steps through different source files, each time
> the problematic source file was evicted from the source cache, and
> then later reloaded, the exception would be printed again.
>
> Finally, this problem is only present when using Python 3, this issue
> is not present for Python 2.
>
> What makes this especially frustrating is that GDB can clearly print
> the source file contents, they're right there... If we disable
> styling completely, or make use of the GNU Source Highlight library,
> then everything is fine. So why is there an error when we try to
> apply styling using Python?
>
> The problem is the use of PyString_FromString (which is an alias for
> PyUnicode_FromString in Python 3), this function converts a C string
> into a either a Unicode object (Py3) or a str object (Py2). For
> Python 2 there is no unicode encoding performed during this function
> call, but for Python 3 the input is assumed to be a uft-8 encoding
> string for the purpose of the conversion. And here of course, is the
> problem, if the source file contains non utf-8 characters, then it
> should not be treated as utf-8, but that's what we do, and that's why
> we get an error.
>
> My first thought when looking at this was to spot when the
> PyString_FromString call failed with a UnicodeDecodeError and silently
> ignore the error. This would mean that GDB would print the source
> without styling, but would also avoid the annoying exception message.
>
> However, I also make use of `pygmentize`, a command line wrapper
> around the Python pygments module, which I use to apply syntax
> highlighting in the output of `less`. And this command line wrapper
> is quite happy to syntax highlight my source file that contains non
> utf-8 characters, so it feels like the problem should be solvable.
>
> It turns out that inside the pygments module there is already support
> for guessing the encoding of the incoming file content, if the
> incoming content is not already a Unicode string. This is what
> happens for Python 2 where the incoming content is of `str` type.
>
> We could try and make GDB smarter when it comes to converting C
> strings into Python Unicode objects; this would probably require us to
> just try a couple of different encoding schemes rather than just
> giving up after utf-8.
>
> However, I figure, why bother? The pygments module already does this
> for us, and the colorize API is not part of the documented external
> API of GDB. So, why not just change the colorize API, instead of the
> content being a Unicode string (for Python 3), lets just make the
> content be a bytes object. The pygments module can then take
> responsibility for guessing the encoding.
>
> So, currently, the colorize API receives a unicode object, and returns
> a unicode object. I propose that the colorize API receive a bytes
> object, and return a bytes object.
>
> diff --git a/gdb/python/lib/gdb/__init__.py b/gdb/python/lib/gdb/__init__.py
> index 11a1b444bfd..7880ed7564e 100644
> --- a/gdb/python/lib/gdb/__init__.py
> +++ b/gdb/python/lib/gdb/__init__.py
> @@ -239,10 +239,13 @@ try:
> try:
> lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
> formatter = formatters.TerminalFormatter()
> - return highlight(contents, lexer, formatter)
> + return highlight(contents, lexer, formatter).encode(
> + host_charset(), "backslashreplace"
> + )
> except:
> return None
>
> +
> except:
>
> def colorize(filename, contents):
> diff --git a/gdb/python/python.c b/gdb/python/python.c
> index 4dcda53d9ab..b4fa882ab78 100644
> --- a/gdb/python/python.c
> +++ b/gdb/python/python.c
> @@ -1157,13 +1157,24 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
> gdbpy_print_stack ();
> return {};
> }
> - gdbpy_ref<> contents_arg (PyString_FromString (contents.c_str ()));
> +
> + /* The pygments library, which is what we currently use for applying
> + styling, is happy to take input as a bytes object, and to figure out
> + the encoding for itself. This removes the need for us to figure out
> + (guess?) at how the content is encoded, which is probably a good
> + thing. */
> + gdbpy_ref<> contents_arg (PyBytes_FromStringAndSize (contents.c_str (),
> + contents.size ()));
> if (contents_arg == nullptr)
> {
> gdbpy_print_stack ();
> return {};
> }
>
> + /* Calling gdb.colorize passing in the filename (a string), and the file
> + contents (a bytes object). This function should return either a bytes
> + object, the same contents with styling applied, or None to indicate
> + that no styling should be performed. */
> gdbpy_ref<> result (PyObject_CallFunctionObjArgs (hook.get (),
> fname_arg.get (),
> contents_arg.get (),
> @@ -1174,25 +1185,17 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
> return {};
> }
>
> - if (!gdbpy_is_string (result.get ()))
> + if (result == Py_None)
> return {};
> -
> - gdbpy_ref<> unic = python_string_to_unicode (result.get ());
> - if (unic == nullptr)
> - {
> - gdbpy_print_stack ();
> - return {};
> - }
> - gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
> - host_charset (),
> - nullptr));
> - if (host_str == nullptr)
> + else if (!PyBytes_Check (result.get ()))
> {
> + PyErr_SetString (PyExc_TypeError,
> + _("Return value from gdb.colorize should be a bytes object or None."));
> gdbpy_print_stack ();
> return {};
> }
>
> - return std::string (PyBytes_AsString (host_str.get ()));
> + return std::string (PyBytes_AsString (result.get ()));
> }
>
>
> diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
> new file mode 100644
> index 00000000000..6a1a9d59a8c
> --- /dev/null
> +++ b/gdb/testsuite/gdb.python/py-source-styling.c
> @@ -0,0 +1,29 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> + Copyright 2022 Free Software Foundation, Inc.
> +
> + This program is free software; you can redistribute it and/or modify
> + it under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 3 of the License, or
> + (at your option) any later version.
> +
> + This program is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + GNU General Public License for more details.
> +
> + You should have received a copy of the GNU General Public License
> + along with this program. If not, see <http://www.gnu.org/licenses/>. */
> +
> +int
> +main ()
> +{
> + int some_variable = 1234;
> +
> + /* The following line contains a character that is non-utf-8. This is a
> + critical part of the test as Python 3 can't convert this into a string
> + using its default mechanism. */
> + char c = '�'; /* List this line. */
> +
> + return 0;
> +}
> diff --git a/gdb/testsuite/gdb.python/py-source-styling.exp b/gdb/testsuite/gdb.python/py-source-styling.exp
> new file mode 100644
> index 00000000000..68bbc9f9758
> --- /dev/null
> +++ b/gdb/testsuite/gdb.python/py-source-styling.exp
> @@ -0,0 +1,64 @@
> +# Copyright (C) 2022 Free Software Foundation, Inc.
> +
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program. If not, see <http://www.gnu.org/licenses/>.
> +
> +# This file is part of the GDB testsuite. It checks for memory leaks
> +# associated with allocating and deallocation gdb.Inferior objects.
> +
> +load_lib gdb-python.exp
> +
> +standard_testfile
> +
> +save_vars { env(TERM) } {
> + # We need an ANSI-capable terminal to get the output, additionally
> + # we need to set LC_ALL so GDB knows the terminal is UTF-8
> + # capable, otherwise we'll get a UnicodeEncodeError trying to
> + # encode the output.
> + setenv TERM ansi
> +
> + if { [prepare_for_testing "failed to prepare" ${testfile} ${srcfile}] } {
> + return -1
> + }
> +
> + if { [skip_python_tests] } { continue }
> +
> + if { ![gdb_py_module_available "pygments"] } {
> + unsupported "pygments module not available"
> + return -1
> + }
> +
> + if ![runto_main] {
> + return
> + }
> +
> + gdb_test_no_output "maint set gnu-source-highlight enabled off"
> +
> + gdb_test "maint flush source-cache" "Source cache flushed\\."
> +
> + set seen_style_escape false
> + set line_number [gdb_get_line_number "List this line."]
> + gdb_test_multiple "list ${line_number}" "" {
> + -re "Python Exception.*" {
> + fail $gdb_test_name
> + }
> + -re "\033" {
> + set seen_style_escape true
> + exp_continue
> + }
> + -re "$gdb_prompt $" {
> + gdb_assert { $seen_style_escape }
> + pass $gdb_test_name
> + }
> + }
> +}
More information about the Gdb-patches
mailing list