[PATCH] gdb/python: make more use of host_string_to_python_string
Andrew Burgess
aburgess@redhat.com
Fri Dec 24 12:20:42 GMT 2021
* Simon Marchi <simon.marchi@polymtl.ca> [2021-12-23 22:10:00 -0500]:
>
>
> On 2021-12-23 06:07, Andrew Burgess wrote:
> > * Simon Marchi <simon.marchi@polymtl.ca> [2021-12-22 21:19:41 -0500]:
> >
> >>
> >>
> >> On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
> >>> We have a function host_string_to_python_string, which is a wrapper
> >>> around PyString_Decode, which, on Python 3, is an alias for
> >>> PyUnicode_Decode.
> >>>
> >>> However, there are a few places where we still call PyUnicode_Decode
> >>> directly.
> >>>
> >>> This commit replaces all uses of PyUnicode_Decode with calls to
> >>> host_string_to_python_string instead.
> >>>
> >>> To make the use of host_string_to_python_string easier, I've added a
> >>> couple of overloads for this function in python-internal.h, these all
> >>> just forward their calls onto the single base implementation. The
> >>> signatures of all host_string_to_python_string overloads are:
> >>>
> >>> gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
> >>> gdbpy_ref<> host_string_to_python_string (const char *str)
> >>> gdbpy_ref<> host_string_to_python_string (const string_file &str)
> >>>
> >>> For Python 3 PyString_Decode is setup (in python-internal.h) as an
> >>> alias for PyUnicode_Decode, so there should certainly be no user
> >>> visible changes in this case.
> >>>
> >>> For Python 2 this commit will change the behaviour. Previously by
> >>> calling PyUnicode_Decode we would have been returning a Unicode
> >>> object. Now, after calling host_string_to_python_string, we will have
> >>> a str object.
> >>>
> >>> I've checked the GDB documentation, and, as far as I can tell, the
> >>> methods I've touched were all documented as returning a string, or in
> >>> the gdb.Command case, take a string as an argument, and my
> >>> understanding is that for Python 2, string generally means str. So I
> >>> think the new behaviour would be more expected.
> >>>
> >>> A different solution, that would also make things more consistent in
> >>> the Python 2 world, would be to have host_string_to_python_string
> >>> always return a Unicode object. However, I've been reading this page:
> >>>
> >>> https://pythonhosted.org/kitchen/unicode-frustrations.html
> >>>
> >>> item #3 recommends that unicode strings should be converted to str
> >>> objects before being printed (in Python 2). That would mean that
> >>> users should have been adding .encode() calls to the output of the
> >>> routines I've changed in this commit (if they wanted to print that
> >>> output), which is not something I think is made clear from the GDB
> >>> documentation.
> >>
> >> I don't think we can (nor want) to change things currently returning
> >> unicode objects to return str objects. It changes significantly what
> >> the user code receives. If a given method/function has been returning a
> >> unicode object since the dawn of time and we now return a string object,
> >> user code expecting a unicode object will break.
> >
> > ACK. I think I disagree about the significance of the change, but I
> > understand your concerns.
>
> I fail to see how you don't find it significant: this is a public API,
> we change the type of the returned value, doesn't it break existing
> users?
I have two reasons, (1) the behaviour doesn't match the documented
API, which I think gives some scope for fixing either the
implementation or the documentation, and (2) in reality, for many
purposes, a unicode can be used just like a str, e.g. I guess most
folk are not calling encode before printing the unicode object, they
just pass them to print.
Please do remember though, I say the above just to give you insight to
my thinking, I've already ACK'd the above as a blocker for the patch I
proposed.
>
> >> Also, changing a PyUnicode_Decode call to host_string_to_python_string
> >> (and therefore PyString_Decode) means that if the string contains
> >> something not encodable in ascii, things will now fail So that sounds
> >> like a regression to me: where we could handle UTF-8 strings before, we
> >> don't now. See below for examples.
> >
> > Yeah, you're absolutely correct. I don't understand why we would want
> > to call PyString_Decode here for Py2 at all, maybe you understand why
> > that's a good thing?
> >
> > If I rewrite the function like this:
> >
> > /* Convert a host string to a python string. */
> >
> > gdbpy_ref<>
> > host_string_to_python_string (const char *str, size_t len)
> > {
> > #ifdef IS_PY3K
> > return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
> > #else
> > return gdbpy_ref<> (PyString_FromStringAndSize(str, len));
> > #endif
> > }
> >
> > Then I think this would achieve everything I actually intended this
> > patch to achieve.
>
> Can you clarify what the patch is trying to achieve? I don't think I
> understand properly.
I started working on a slightly related issue, and as I started
digging around I was running into places where the API docs said
function XXX returns a string, which isn't the case on Py2. That got
me looking the places where we called PyUnicode_Decode directly rather
than passing through host_string_to_python_string, and, well, here we
are...
>
> Given that Python 2's string type is equivalent to Python 3's bytes
> type, and Python 2's unicode type is equivalent to Python 3's string
> type, I find it odd to have host_string_to_python_string (and some
> Python API functions) return a "string" both in Python 2 and Python 3,
> as they are both fundamentally different types.
I understand you position.
> Returning a unicode in
> Python 2 and a string in Python 3 makes sense to me, they are basically
> the same thing.
I thought in Py3 unicode and str were the same, so its less
"basically the same thing", and more "are the same thing", right?
>
> > Obviously, this still doesn't address your concern
> > about the unicode -> str change (for Py2), so I doubt you'd now find
> > the patch acceptable.
> >
> > That said, I suspect changing host_string_to_python_string as in the
> > above would probably be a good thing, right? The above function is
> > used, so all I need to do is inject some non ASCII characters into a
> > code path that calls the above function and the existing GDB will
> > break, but the above change would allow things to work correctly.
>
> Code paths that do use this function already get a "str" in both Python
> 2 and 3 (which I think is wrong, as explained above, but that's what we
> have to deal with) and would still receive a "str" after, so the change
> is is safe from that point of view.
I understand why, given the view that host_string_to_python_string is
basically wrong, adding any additional calls to it would be considered
wrong. Maybe we should rename it to
deprecated_host_string_to_python_string, and add a new function,
host_string_to_python_unicode.
If/when Py2 support is dropped then user of the old function could be
changed to use the new *_unicode function?
> Today, a non-ASCII character in the
> input argument just causes an exception with Python 2. With the change
> you propose, it would return an "str" containing whatever bytes came in,
> in the host system's encoding. So, the new behavior seems like an
> improvement at first sight.
OK, thanks for the feedback. I see what I can put together in the new
year.
>
> >> So my suggestion is to leave things as-is, avoid changing the behavior
> >> for Python 2 until we finally remove Python 2 support (for which anybody
> >> willing to do it has my automatic +1).
> >
> > I certainly wouldn't want to stop someone removing Py2 support.
>
> I'll take a look at that, many people have suggested it, and this is an
> example of how keeping Python 2 support is a burden.
>
> >>> ---
> >>> gdb/python/py-cmd.c | 9 +++------
> >>> gdb/python/py-frame.c | 5 ++---
> >>> gdb/python/py-type.c | 3 +--
> >>> gdb/python/py-utils.c | 5 ++---
> >>> gdb/python/py-value.c | 4 ++--
> >>> gdb/python/python-internal.h | 11 ++++++++++-
> >>> gdb/python/python.c | 4 ++--
> >>> 7 files changed, 22 insertions(+), 19 deletions(-)
> >>>
> >>> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> >>> index 94608e0bbcf..b2cafeba320 100644
> >>> --- a/gdb/python/py-cmd.c
> >>> +++ b/gdb/python/py-cmd.c
> >>> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
> >>>
> >>> if (! args)
> >>> args = "";
> >>> - gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> >>> - NULL));
> >>> + gdbpy_ref<> argobj = host_string_to_python_string (args);
> >>> if (argobj == NULL)
> >>> {
> >>> gdbpy_print_stack ();
> >>
> >> Try with this in "script.py", where a user correctly called encode (as
> >> you mentioned in the commit log) before printing args, a 'unicode'
> >> object:
> >>
> >> ---
> >> class MyCmd(gdb.Command):
> >> def __init__(self):
> >> super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)
> >>
> >> def invoke(self, args, from_tty):
> >> print(args.encode('utf-8'))
> >
> > Really this should be:
> >
> > def invoke(self, args, from_tty):
> > print(args.encode(gdb.host_charset ()))
> >
> > Except we don't have a gdb.host_charset method, otherwise it should be
> > possible for this code to go wrong.
>
> True. Although somebody could still use .encode('utf-8') and just use
> that script on machines where UTF-8 is the locale (which is just the
> norm today).
I don't understand what you mean here. If the user is running on a
machine with non utf-8 locale, then (if I understand how this all
works), the bytes read by GDB would be in the machines (host_charset)
local, these bytes would be sent over to Python, which would then
convert them to a unicode object in the host_charset locale.
Now if the user wants to get the bytes back they need to know the
correct value to pass to .encode, right?
>
> > I'll write a patch to add that.
> >
> > I assume you'll not object if I propose updating the documentation for
> > all the functions I tried to change here to document the actual
> > behaviour?
>
> Sure. Although if we end up removing Python 2 support (which is not a
> given), it might be unnecessary.
Based on the above discussion, shouldn't every API that includes a
unicode object also indicate what the encoding of that unicode object
is? I mean, sure, users can probably figure it out in most cases,
values from the inferior, target_charset, values from the user,
host_charset, but surely a well documented API should be explicit
about these things?
Thanks,
Andrew
More information about the Gdb-patches
mailing list