[PATCH] gdb/python: make more use of host_string_to_python_string

Thu Dec 23 02:19:41 GMT 2021

On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
> We have a function host_string_to_python_string, which is a wrapper
> around PyString_Decode, which, on Python 3, is an alias for
> PyUnicode_Decode.
> 
> However, there are a few places where we still call PyUnicode_Decode
> directly.
> 
> This commit replaces all uses of PyUnicode_Decode with calls to
> host_string_to_python_string instead.
> 
> To make the use of host_string_to_python_string easier, I've added a
> couple of overloads for this function in python-internal.h, these all
> just forward their calls onto the single base implementation.  The
> signatures of all host_string_to_python_string overloads are:
> 
>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
>   gdbpy_ref<> host_string_to_python_string (const char *str)
>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> 
> For Python 3 PyString_Decode is setup (in python-internal.h) as an
> alias for PyUnicode_Decode, so there should certainly be no user
> visible changes in this case.
> 
> For Python 2 this commit will change the behaviour.  Previously by
> calling PyUnicode_Decode we would have been returning a Unicode
> object.  Now, after calling host_string_to_python_string, we will have
> a str object.
> 
> I've checked the GDB documentation, and, as far as I can tell, the
> methods I've touched were all documented as returning a string, or in
> the gdb.Command case, take a string as an argument, and my
> understanding is that for Python 2, string generally means str.  So I
> think the new behaviour would be more expected.
> 
> A different solution, that would also make things more consistent in
> the Python 2 world, would be to have host_string_to_python_string
> always return a Unicode object.  However, I've been reading this page:
> 
>   https://pythonhosted.org/kitchen/unicode-frustrations.html
> 
> item #3 recommends that unicode strings should be converted to str
> objects before being printed (in Python 2).  That would mean that
> users should have been adding .encode() calls to the output of the
> routines I've changed in this commit (if they wanted to print that
> output), which is not something I think is made clear from the GDB
> documentation.

I don't think we can (nor want) to change things currently returning
unicode objects to return str objects.  It changes significantly what
the user code receives.  If a given method/function has been returning a
unicode object since the dawn of time and we now return a string object,
user code expecting a unicode object will break.

Also, changing a PyUnicode_Decode call to host_string_to_python_string
(and therefore PyString_Decode) means that if the string contains
something not encodable in ascii, things will now fail  So that sounds
like a regression to me: where we could handle UTF-8 strings before, we
don't now.  See below for examples.

So my suggestion is to leave things as-is, avoid changing the behavior
for Python 2 until we finally remove Python 2 support (for which anybody
willing to do it has my automatic +1).

> ---
>  gdb/python/py-cmd.c          |  9 +++------
>  gdb/python/py-frame.c        |  5 ++---
>  gdb/python/py-type.c         |  3 +--
>  gdb/python/py-utils.c        |  5 ++---
>  gdb/python/py-value.c        |  4 ++--
>  gdb/python/python-internal.h | 11 ++++++++++-
>  gdb/python/python.c          |  4 ++--
>  7 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> index 94608e0bbcf..b2cafeba320 100644
> --- a/gdb/python/py-cmd.c
> +++ b/gdb/python/py-cmd.c
> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
>  
>    if (! args)
>      args = "";
> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> -					NULL));
> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
>    if (argobj == NULL)
>      {
>        gdbpy_print_stack ();

Try with this in "script.py", where a user correctly called encode (as
you mentioned in the commit log) before printing args, a 'unicode'
object:

---
class MyCmd(gdb.Command):
    def __init__(self):
        super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)

    def invoke(self, args, from_tty):
        print(args.encode('utf-8'))

MyCmd()
---

$ ./gdb-old -q -nx --data-directory=data-directory -x script.py -ex 'my-cmd lél' -batch
lél
$ ./gdb-new -q -nx --data-directory=data-directory -x script.py -ex 'my-cmd lél' -batch
Python Exception <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Could not convert arguments to Python string.

> @@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>        return NULL;
>      }
>  
> -  gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_charset (),
> -					 NULL));
> +  gdbpy_ref<> textobj = host_string_to_python_string (text);
>    if (textobj == NULL)
>      error (_("Could not convert argument to Python string."));
>  
> @@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>      }
>    else
>      {
> -      wordobj.reset (PyUnicode_Decode (word, strlen (word), host_charset (),
> -				       NULL));
> +      wordobj = host_string_to_python_string (word);
>        if (wordobj == NULL)
>  	error (_("Could not convert argument to Python string."));
>      }
> diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c
> index ee57eb10576..b507ff0794f 100644
> --- a/gdb/python/py-frame.c
> +++ b/gdb/python/py-frame.c
> @@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args)
>  
>    if (name)
>      {
> -      result = PyUnicode_Decode (name.get (), strlen (name.get ()),
> -				 host_charset (), NULL);
> +      result = host_string_to_python_string (name.get ()).release ();

Same here, you can have a frame with a non-ascii name, that would now
break.

>      }
>    else
>      {
> @@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyObject *args)
>      }
>  
>    str = unwind_stop_reason_to_string ((enum unwind_stop_reason) reason);
> -  return PyUnicode_Decode (str, strlen (str), host_charset (), NULL);
> +  return host_string_to_python_string (str).release ();

Stop reasons in english probably don't contain non-ascii characters, but
they are translated, so could, in theory, contain non-ascii characters.
So that would break here.

	$ ./gdb-old -nx -q --data-directory=data-directory a.out -ex start -ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))'
	Reading symbols from a.out...
	Temporary breakpoint 1 at 0x1124: file test.cpp, line 7.
	Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out 

	Temporary breakpoint 1, main () at test.cpp:7
	7         lél();
	lél () at test.cpp:3
	3       }
	lél
	(gdb) 

	$ ./gdb-new -nx -q --data-directory=data-directory a.out -ex start -ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))'
	Reading symbols from a.out...
	Temporary breakpoint 1 at 0x1124: file test.cpp, line 7.
	Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out 

	Temporary breakpoint 1, main () at test.cpp:7
	7         lél();
	lél () at test.cpp:3
	3       }
	Traceback (most recent call last):
	  File "<string>", line 1, in <module>
	UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
	Error while executing Python code.

>  }
>  
>  /* Implements the equality comparison for Frame objects.
> diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c
> index 8b17b70fbe3..a178c6a4ab2 100644
> --- a/gdb/python/py-type.c
> +++ b/gdb/python/py-type.c
> @@ -1033,8 +1033,7 @@ typy_str (PyObject *self)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (thetype.c_str (), thetype.size (),
> -			   host_charset (), NULL);
> +  return host_string_to_python_string (thetype).release ();

Here, if the type name contains some non ascii characters, old GDB
would still break, as Python will try to call str() on the unicode
object, which (I guess) will try to encode it as ascii.  With new GDB,
the same will happen, but earlier, inside host_string_to_python_string.
If somebody called Type.__str__ directly, they would see the difference
though.

>  }
>  
>  /* Implement the richcompare method.  */
> diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c
> index 10c4173efcd..2eb1ed2a09e 100644
> --- a/gdb/python/py-utils.c
> +++ b/gdb/python/py-utils.c
> @@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj)
>  /* Convert a host string to a python string.  */
>  
>  gdbpy_ref<>
> -host_string_to_python_string (const char *str)
> +host_string_to_python_string (const char *str, size_t len)
>  {
> -  return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset (),
> -				       NULL));
> +  return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
>  }
>  
>  /* Return true if OBJ is a Python string or unicode object, false
> diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c
> index c843c2c3072..8bd30729454 100644
> --- a/gdb/python/py-value.c
> +++ b/gdb/python/py-value.c
> @@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args, PyObject *kw)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> +  return host_string_to_python_string (stb).release ();

I guess that a value here could easily be a string with non-ascii
characters.  Also, note what the comment of the function says:

  /* Implementation of gdb.Value.format_string (...) -> string.
     Return Unicode string with value contents formatted using the
     keyword-only arguments.  */

Simon