[PATCH] gdb/python: make more use of host_string_to_python_string

Fri Dec 24 12:20:42 GMT 2021

* Simon Marchi <simon.marchi@polymtl.ca> [2021-12-23 22:10:00 -0500]:

> 
> 
> On 2021-12-23 06:07, Andrew Burgess wrote:
> > * Simon Marchi <simon.marchi@polymtl.ca> [2021-12-22 21:19:41 -0500]:
> > 
> >>
> >>
> >> On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
> >>> We have a function host_string_to_python_string, which is a wrapper
> >>> around PyString_Decode, which, on Python 3, is an alias for
> >>> PyUnicode_Decode.
> >>>
> >>> However, there are a few places where we still call PyUnicode_Decode
> >>> directly.
> >>>
> >>> This commit replaces all uses of PyUnicode_Decode with calls to
> >>> host_string_to_python_string instead.
> >>>
> >>> To make the use of host_string_to_python_string easier, I've added a
> >>> couple of overloads for this function in python-internal.h, these all
> >>> just forward their calls onto the single base implementation.  The
> >>> signatures of all host_string_to_python_string overloads are:
> >>>
> >>>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
> >>>   gdbpy_ref<> host_string_to_python_string (const char *str)
> >>>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> >>>
> >>> For Python 3 PyString_Decode is setup (in python-internal.h) as an
> >>> alias for PyUnicode_Decode, so there should certainly be no user
> >>> visible changes in this case.
> >>>
> >>> For Python 2 this commit will change the behaviour.  Previously by
> >>> calling PyUnicode_Decode we would have been returning a Unicode
> >>> object.  Now, after calling host_string_to_python_string, we will have
> >>> a str object.
> >>>
> >>> I've checked the GDB documentation, and, as far as I can tell, the
> >>> methods I've touched were all documented as returning a string, or in
> >>> the gdb.Command case, take a string as an argument, and my
> >>> understanding is that for Python 2, string generally means str.  So I
> >>> think the new behaviour would be more expected.
> >>>
> >>> A different solution, that would also make things more consistent in
> >>> the Python 2 world, would be to have host_string_to_python_string
> >>> always return a Unicode object.  However, I've been reading this page:
> >>>
> >>>   https://pythonhosted.org/kitchen/unicode-frustrations.html
> >>>
> >>> item #3 recommends that unicode strings should be converted to str
> >>> objects before being printed (in Python 2).  That would mean that
> >>> users should have been adding .encode() calls to the output of the
> >>> routines I've changed in this commit (if they wanted to print that
> >>> output), which is not something I think is made clear from the GDB
> >>> documentation.
> >>
> >> I don't think we can (nor want) to change things currently returning
> >> unicode objects to return str objects.  It changes significantly what
> >> the user code receives.  If a given method/function has been returning a
> >> unicode object since the dawn of time and we now return a string object,
> >> user code expecting a unicode object will break.
> > 
> > ACK.  I think I disagree about the significance of the change, but I
> > understand your concerns.
> 
> I fail to see how you don't find it significant: this is a public API,
> we change the type of the returned value, doesn't it break existing
> users?

I have two reasons, (1) the behaviour doesn't match the documented
API, which I think gives some scope for fixing either the
implementation or the documentation, and (2) in reality, for many
purposes, a unicode can be used just like a str, e.g. I guess most
folk are not calling encode before printing the unicode object, they
just pass them to print.

Please do remember though, I say the above just to give you insight to
my thinking, I've already ACK'd the above as a blocker for the patch I
proposed.

>
> >> Also, changing a PyUnicode_Decode call to host_string_to_python_string
> >> (and therefore PyString_Decode) means that if the string contains
> >> something not encodable in ascii, things will now fail  So that sounds
> >> like a regression to me: where we could handle UTF-8 strings before, we
> >> don't now.  See below for examples.
> > 
> > Yeah, you're absolutely correct.  I don't understand why we would want
> > to call PyString_Decode here for Py2 at all, maybe you understand why
> > that's a good thing?
> > 
> > If I rewrite the function like this:
> > 
> >   /* Convert a host string to a python string.  */
> > 
> >   gdbpy_ref<>
> >   host_string_to_python_string (const char *str, size_t len)
> >   {
> >   #ifdef IS_PY3K
> >     return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
> >   #else
> >     return gdbpy_ref<> (PyString_FromStringAndSize(str, len));
> >   #endif
> >   }
> > 
> > Then I think this would achieve everything I actually intended this
> > patch to achieve.
> 
> Can you clarify what the patch is trying to achieve?  I don't think I
> understand properly.

I started working on a slightly related issue, and as I started
digging around I was running into places where the API docs said
function XXX returns a string, which isn't the case on Py2.  That got
me looking the places where we called PyUnicode_Decode directly rather
than passing through host_string_to_python_string, and, well, here we
are...

> 
> Given that Python 2's string type is equivalent to Python 3's bytes
> type, and Python 2's unicode type is equivalent to Python 3's string
> type, I find it odd to have host_string_to_python_string (and some
> Python API functions) return a "string" both in Python 2 and Python 3,
> as they are both fundamentally different types.

I understand you position.

>                                                   Returning a unicode in
> Python 2 and a string in Python 3 makes sense to me, they are basically
> the same  thing.

I thought in Py3 unicode and str were the same, so its less
"basically the same thing", and more "are the same thing", right?

> 
> > Obviously, this still doesn't address your concern
> > about the unicode -> str change (for Py2), so I doubt you'd now find
> > the patch acceptable.
> > 
> > That said, I suspect changing host_string_to_python_string as in the
> > above would probably be a good thing, right?  The above function is
> > used, so all I need to do is inject some non ASCII characters into a
> > code path that calls the above function and the existing GDB will
> > break, but the above change would allow things to work correctly.
> 
> Code paths that do use this function already get a "str" in both Python
> 2 and 3 (which I think is wrong, as explained above, but that's what we
> have to deal with) and would still receive a "str" after, so the change
> is is safe from that point of view.

I understand why, given the view that host_string_to_python_string is
basically wrong, adding any additional calls to it would be considered
wrong.  Maybe we should rename it to
deprecated_host_string_to_python_string, and add a new function,
host_string_to_python_unicode.

If/when Py2 support is dropped then user of the old function could be
changed to use the new *_unicode function?

>                                       Today, a non-ASCII character in the
> input argument just causes an exception with Python 2.  With the change
> you propose, it would return an "str" containing whatever bytes came in,
> in the host system's encoding.  So, the new behavior seems like an
> improvement at first sight.

OK, thanks for the feedback.  I see what I can put together in the new
year.

> 
> >> So my suggestion is to leave things as-is, avoid changing the behavior
> >> for Python 2 until we finally remove Python 2 support (for which anybody
> >> willing to do it has my automatic +1).
> > 
> > I certainly wouldn't want to stop someone removing Py2 support.
> 
> I'll take a look at that, many people have suggested it, and this is an
> example of how keeping Python 2 support is a burden.
> 
> >>> ---
> >>>  gdb/python/py-cmd.c          |  9 +++------
> >>>  gdb/python/py-frame.c        |  5 ++---
> >>>  gdb/python/py-type.c         |  3 +--
> >>>  gdb/python/py-utils.c        |  5 ++---
> >>>  gdb/python/py-value.c        |  4 ++--
> >>>  gdb/python/python-internal.h | 11 ++++++++++-
> >>>  gdb/python/python.c          |  4 ++--
> >>>  7 files changed, 22 insertions(+), 19 deletions(-)
> >>>
> >>> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> >>> index 94608e0bbcf..b2cafeba320 100644
> >>> --- a/gdb/python/py-cmd.c
> >>> +++ b/gdb/python/py-cmd.c
> >>> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
> >>>  
> >>>    if (! args)
> >>>      args = "";
> >>> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> >>> -					NULL));
> >>> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
> >>>    if (argobj == NULL)
> >>>      {
> >>>        gdbpy_print_stack ();
> >>
> >> Try with this in "script.py", where a user correctly called encode (as
> >> you mentioned in the commit log) before printing args, a 'unicode'
> >> object:
> >>
> >> ---
> >> class MyCmd(gdb.Command):
> >>     def __init__(self):
> >>         super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)
> >>
> >>     def invoke(self, args, from_tty):
> >>         print(args.encode('utf-8'))
> > 
> > Really this should be:
> > 
> >   def invoke(self, args, from_tty):
> >       print(args.encode(gdb.host_charset ()))
> > 
> > Except we don't have a gdb.host_charset method, otherwise it should be
> > possible for this code to go wrong.
> 
> True.  Although somebody could still use .encode('utf-8') and just use
> that script on machines where UTF-8 is the locale (which is just the
> norm today).

I don't understand what you mean here.  If the user is running on a
machine with non utf-8 locale, then (if I understand how this all
works), the bytes read by GDB would be in the machines (host_charset)
local, these bytes would be sent over to Python, which would then
convert them to a unicode object in the host_charset locale.

Now if the user wants to get the bytes back they need to know the
correct value to pass to .encode, right?

> 
> > I'll write a patch to add that.
> > 
> > I assume you'll not object if I propose updating the documentation for
> > all the functions I tried to change here to document the actual
> > behaviour?
> 
> Sure.  Although if we end up removing Python 2 support (which is not a
> given), it might be unnecessary.

Based on the above discussion, shouldn't every API that includes a
unicode object also indicate what the encoding of that unicode object
is?  I mean, sure, users can probably figure it out in most cases,
values from the inferior, target_charset, values from the user,
host_charset, but surely a well documented API should be explicit
about these things?

Thanks,
Andrew