[PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting

Wed Jan 19 17:44:59 GMT 2022

Simon,

I wonder if you have any thoughts on this latest iteration?

Thanks,
Andrew

* Andrew Burgess <aburgess@redhat.com> [2022-01-12 15:59:38 +0000]:

> * Tom Tromey <tom@tromey.com> [2022-01-12 08:32:18 -0700]:
> 
> > >>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:
> > 
> > Andrew> -            return highlight(contents, lexer, formatter)
> > Andrew> +            return highlight(contents, lexer, formatter).encode(
> > Andrew> +                _gdb.host_charset(), "backslashreplace"
> > 
> > I think the rest of this file doesn't use the '_gdb.' prefix, because
> > there's a star import of _gdb.
> > 
> > This approach isn't super (it defeats Python analyzers like flake8) but
> > it seems more clear to be consistent.
> > 
> > Andrew> +     In an ideal world we should be able to use the host_charset here, but
> > Andrew> +     the host_charset is most likely to represent the capabilities of the
> > Andrew> +     users terminal, rather than the encoding of a particular source file.
> > 
> > This comment mentions not using host_charset, but then the code does use
> > it.
> > 
> > I think host_charset really does represent both the expected encoding of
> > things -- file names, file contents, etc -- and also the terminal
> > capabilities.  At least, that's my mental model of how the locale system
> > works.  It's not a great setup of course, since in reality it's probably
> > pretty normal to have some files in some encoding and other files in
> > other ones.
> > 
> > Andrew> +     Another possibility would be to use Python's ability to escape
> > Andrew> +     unknown characters when converting the bytes into unicode.  However,
> > 
> > This also seems to be done, but the comment makes it sound as though it
> > is not.
> > 
> > The code changes seem fine to me.
> 
> Thanks for your feedback.  An updated version of the patch is
> included below.  I'm hoping Simon will take a look at this too, as
> he's had some really constructive feedback on Python/unicode stuff.
> 
> Thanks,
> Andrew
> 
> ---
> 
> commit 099d6a5168b43fbfa767ee0992a6975d9e3f651e
> Author: Andrew Burgess <aburgess@redhat.com>
> Date:   Fri Nov 26 13:15:28 2021 +0000
> 
>     gdb/python: handle non utf-8 characters when source highlighting
>     
>     This commit adds support for source files that contain non utf-8
>     characters when performing source styling using the Python pygments
>     package.  This does not change the behaviour of GDB when the GNU
>     Source Highlight library is used.
>     
>     For the following problem description, assume that either GDB is built
>     without GNU Source Highlight support, of that this has been disabled
>     using 'maintenance set gnu-source-highlight enabled off'.
>     
>     The initial problem reported was that a source file containing non
>     utf-8 characters would cause GDB to print a Python exception, and then
>     display the source without styling, e.g.:
>     
>       Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
>       /* Source code here, without styling...  */
>     
>     Further, as the user steps through different source files, each time
>     the problematic source file was evicted from the source cache, and
>     then later reloaded, the exception would be printed again.
>     
>     Finally, this problem is only present when using Python 3, this issue
>     is not present for Python 2.
>     
>     What makes this especially frustrating is that GDB can clearly print
>     the source file contents, they're right there...  If we disable
>     styling completely, or make use of the GNU Source Highlight library,
>     then everything is fine.  So why is there an error when we try to
>     apply styling using Python?
>     
>     The problem is the use of PyString_FromString (which is an alias for
>     PyUnicode_FromString in Python 3), this function converts a C string
>     into a either a Unicode object (Py3) or a str object (Py2).  For
>     Python 2 there is no unicode encoding performed during this function
>     call, but for Python 3 the input is assumed to be a uft-8 encoding
>     string for the purpose of the conversion.  And here of course, is the
>     problem, if the source file contains non utf-8 characters, then it
>     should not be treated as utf-8, but that's what we do, and that's why
>     we get an error.
>     
>     My first thought when looking at this was to spot when the
>     PyString_FromString call failed with a UnicodeDecodeError and silently
>     ignore the error.  This would mean that GDB would print the source
>     without styling, but would also avoid the annoying exception message.
>     
>     However, I also make use of `pygmentize`, a command line wrapper
>     around the Python pygments module, which I use to apply syntax
>     highlighting in the output of `less`.  And this command line wrapper
>     is quite happy to syntax highlight my source file that contains non
>     utf-8 characters, so it feels like the problem should be solvable.
>     
>     It turns out that inside the pygments module there is already support
>     for guessing the encoding of the incoming file content, if the
>     incoming content is not already a Unicode string.  This is what
>     happens for Python 2 where the incoming content is of `str` type.
>     
>     We could try and make GDB smarter when it comes to converting C
>     strings into Python Unicode objects; this would probably require us to
>     just try a couple of different encoding schemes rather than just
>     giving up after utf-8.
>     
>     However, I figure, why bother?  The pygments module already does this
>     for us, and the colorize API is not part of the documented external
>     API of GDB.  So, why not just change the colorize API, instead of the
>     content being a Unicode string (for Python 3), lets just make the
>     content be a bytes object.  The pygments module can then take
>     responsibility for guessing the encoding.
>     
>     So, currently, the colorize API receives a unicode object, and returns
>     a unicode object.  I propose that the colorize API receive a bytes
>     object, and return a bytes object.
> 
> diff --git a/gdb/python/lib/gdb/__init__.py b/gdb/python/lib/gdb/__init__.py
> index 11a1b444bfd..7880ed7564e 100644
> --- a/gdb/python/lib/gdb/__init__.py
> +++ b/gdb/python/lib/gdb/__init__.py
> @@ -239,10 +239,13 @@ try:
>          try:
>              lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
>              formatter = formatters.TerminalFormatter()
> -            return highlight(contents, lexer, formatter)
> +            return highlight(contents, lexer, formatter).encode(
> +                host_charset(), "backslashreplace"
> +            )
>          except:
>              return None
>  
> +
>  except:
>  
>      def colorize(filename, contents):
> diff --git a/gdb/python/python.c b/gdb/python/python.c
> index 4dcda53d9ab..b4fa882ab78 100644
> --- a/gdb/python/python.c
> +++ b/gdb/python/python.c
> @@ -1157,13 +1157,24 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
>        gdbpy_print_stack ();
>        return {};
>      }
> -  gdbpy_ref<> contents_arg (PyString_FromString (contents.c_str ()));
> +
> +  /* The pygments library, which is what we currently use for applying
> +     styling, is happy to take input as a bytes object, and to figure out
> +     the encoding for itself.  This removes the need for us to figure out
> +     (guess?) at how the content is encoded, which is probably a good
> +     thing.  */
> +  gdbpy_ref<> contents_arg (PyBytes_FromStringAndSize (contents.c_str (),
> +						       contents.size ()));
>    if (contents_arg == nullptr)
>      {
>        gdbpy_print_stack ();
>        return {};
>      }
>  
> +  /* Calling gdb.colorize passing in the filename (a string), and the file
> +     contents (a bytes object).  This function should return either a bytes
> +     object, the same contents with styling applied, or None to indicate
> +     that no styling should be performed.  */
>    gdbpy_ref<> result (PyObject_CallFunctionObjArgs (hook.get (),
>  						    fname_arg.get (),
>  						    contents_arg.get (),
> @@ -1174,25 +1185,17 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
>        return {};
>      }
>  
> -  if (!gdbpy_is_string (result.get ()))
> +  if (result == Py_None)
>      return {};
> -
> -  gdbpy_ref<> unic = python_string_to_unicode (result.get ());
> -  if (unic == nullptr)
> -    {
> -      gdbpy_print_stack ();
> -      return {};
> -    }
> -  gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
> -						   host_charset (),
> -						   nullptr));
> -  if (host_str == nullptr)
> +  else if (!PyBytes_Check (result.get ()))
>      {
> +      PyErr_SetString (PyExc_TypeError,
> +		       _("Return value from gdb.colorize should be a bytes object or None."));
>        gdbpy_print_stack ();
>        return {};
>      }
>  
> -  return std::string (PyBytes_AsString (host_str.get ()));
> +  return std::string (PyBytes_AsString (result.get ()));
>  }
>  
>  

> diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
> new file mode 100644
> index 00000000000..6a1a9d59a8c
> --- /dev/null
> +++ b/gdb/testsuite/gdb.python/py-source-styling.c
> @@ -0,0 +1,29 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2022 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +int
> +main ()
> +{
> +  int some_variable = 1234;
> +
> +  /* The following line contains a character that is non-utf-8.  This is a
> +     critical part of the test as Python 3 can't convert this into a string
> +     using its default mechanism.  */
> +  char c = '�';		/* List this line.  */
> +
> +  return 0;
> +}
> diff --git a/gdb/testsuite/gdb.python/py-source-styling.exp b/gdb/testsuite/gdb.python/py-source-styling.exp
> new file mode 100644
> index 00000000000..68bbc9f9758
> --- /dev/null
> +++ b/gdb/testsuite/gdb.python/py-source-styling.exp
> @@ -0,0 +1,64 @@
> +# Copyright (C) 2022 Free Software Foundation, Inc.
> +
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +# This file is part of the GDB testsuite.  It checks for memory leaks
> +# associated with allocating and deallocation gdb.Inferior objects.
> +
> +load_lib gdb-python.exp
> +
> +standard_testfile
> +
> +save_vars { env(TERM) } {
> +    # We need an ANSI-capable terminal to get the output, additionally
> +    # we need to set LC_ALL so GDB knows the terminal is UTF-8
> +    # capable, otherwise we'll get a UnicodeEncodeError trying to
> +    # encode the output.
> +    setenv TERM ansi
> +
> +    if { [prepare_for_testing "failed to prepare" ${testfile} ${srcfile}] } {
> +	return -1
> +    }
> +
> +    if { [skip_python_tests] } { continue }
> +
> +    if { ![gdb_py_module_available "pygments"] } {
> +	unsupported "pygments module not available"
> +	return -1
> +    }
> +
> +    if ![runto_main] {
> +	return
> +    }
> +
> +    gdb_test_no_output "maint set gnu-source-highlight enabled off"
> +
> +    gdb_test "maint flush source-cache" "Source cache flushed\\."
> +
> +    set seen_style_escape false
> +    set line_number [gdb_get_line_number "List this line."]
> +    gdb_test_multiple "list ${line_number}" "" {
> +	-re "Python Exception.*" {
> +	    fail $gdb_test_name
> +	}
> +	-re "\033" {
> +	    set seen_style_escape true
> +	    exp_continue
> +	}
> +	-re "$gdb_prompt $" {
> +	    gdb_assert { $seen_style_escape }
> +	    pass $gdb_test_name
> +	}
> +    }
> +}