[PATCH] gdb/python: make more use of host_string_to_python_string

Mon Dec 27 01:30:16 GMT 2021

>> Given that Python 2's string type is equivalent to Python 3's bytes
>> type, and Python 2's unicode type is equivalent to Python 3's string
>> type, I find it odd to have host_string_to_python_string (and some
>> Python API functions) return a "string" both in Python 2 and Python 3,
>> as they are both fundamentally different types.
> 
> I understand your position.
> 
>>                                                   Returning a unicode in
>> Python 2 and a string in Python 3 makes sense to me, they are basically
>> the same thing.
> 
> I thought in Py3 unicode and str were the same, so it's less
> "basically the same thing", and more "are the same thing", right?

Well, the "unicode" type doesn't exist in Py3.  But yes, my
understanding is that Py2's unicode has been renamed to str in Py3, so
we could say "they are the same thing".  But I'm not an expert in Python
internals, so there may be subtle differences I don't know about.  So I
went for the more careful formulation :).

>>> Obviously, this still doesn't address your concern
>>> about the unicode -> str change (for Py2), so I doubt you'd now find
>>> the patch acceptable.
>>>
>>> That said, I suspect changing host_string_to_python_string as in the
>>> above would probably be a good thing, right?  The above function is
>>> used, so all I need to do is inject some non ASCII characters into a
>>> code path that calls the above function and the existing GDB will
>>> break, but the above change would allow things to work correctly.
>>
>> Code paths that do use this function already get a "str" in both Python
>> 2 and 3 (which I think is wrong, as explained above, but that's what we
>> have to deal with) and would still receive a "str" after, so the change
>> is safe from that point of view.
> 
> I understand why, given the view that host_string_to_python_string is
> basically wrong, adding any additional calls to it would be considered
> wrong.  Maybe we should rename it to
> deprecated_host_string_to_python_string, and add a new function,
> host_string_to_python_unicode.
> 
> If/when Py2 support is dropped then users of the old function could be
> changed to use the new *_unicode function?

That sounds fine.

>>>
>>> Really this should be:
>>>
>>>   def invoke(self, args, from_tty):
>>>       print(args.encode(gdb.host_charset ()))
>>>
>>> Except we don't have a gdb.host_charset method, otherwise it should be
>>> possible for this code to go wrong.
>>
>> True.  Although somebody could still use .encode('utf-8') and just use
>> that script on machines where UTF-8 is the locale (which is just the
>> norm today).
> 
> I don't understand what you mean here.  If the user is running on a
> machine with non utf-8 locale, then (if I understand how this all
> works), the bytes read by GDB would be in the machine's (host_charset)
> locale; these bytes would be sent over to Python, which would then
> convert them to a unicode object in the host_charset locale.
> 
> Now if the user wants to get the bytes back they need to know the
> correct value to pass to .encode, right?

Yes, you're right.  What I meant is that given that all the machines I
use have a UTF-8 locale, I could use `.encode('utf-8')` in my scripts
and just not care about other charsets.  All of this to say that there
might be scripts out there that care if they receive a str or a
unicode.
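
A rough sketch of why the hard-coded variant only works by luck.  The
`host_charset` variable is a stand-in I made up for illustration (GDB
has no gdb.host_charset today), and the comments show Python 3 output:

```python
# Hypothetical stand-in for the host charset; GDB's Python API does not
# expose this value, so it is hard-coded here purely for illustration.
host_charset = "iso-8859-1"

# The text a script would receive as a unicode (Py2) / str (Py3) object.
text = u"caf\u00e9"

hardcoded = text.encode("utf-8")        # what a UTF-8-only script produces
host_bytes = text.encode(host_charset)  # the bytes this host would have used

print(hardcoded)                # b'caf\xc3\xa9'
print(host_bytes)               # b'caf\xe9'
print(hardcoded == host_bytes)  # False
```

On a host whose charset really is UTF-8 the two results are identical,
which is why such a script would appear to work.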

>>> I'll write a patch to add that.
>>>
>>> I assume you'll not object if I propose updating the documentation for
>>> all the functions I tried to change here to document the actual
>>> behaviour?
>>
>> Sure.  Although if we end up removing Python 2 support (which is not a
>> given), it might be unnecessary.
> 
> Based on the above discussion, shouldn't every API that includes a
> unicode object also indicate what the encoding of that unicode object
> is?  I mean, sure, users can probably figure it out in most cases,
> values from the inferior, target_charset, values from the user,
> host_charset, but surely a well documented API should be explicit
> about these things?

Not sure.  From the point of view of the user of a unicode object, a
unicode object isn't encoded using some particular encoding.  It's just
a sequence of unicode code points.  The user can then decide to
serialize these code points to the encoding of their choosing by calling
`.encode(...)`, for example 'utf-8' or 'utf-32'.
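
As a small illustration (plain Python, nothing GDB-specific), the same
single code point serializes to different byte sequences depending on
the encoding chosen, and both decode back to the same text:

```python
text = u"\u00e9"  # a single code point, U+00E9

as_utf8 = text.encode("utf-8")
# The "-le" variant is used so that no byte-order mark is prepended.
as_utf32 = text.encode("utf-32-le")

print(len(text))      # 1 code point
print(len(as_utf8))   # 2 bytes
print(len(as_utf32))  # 4 bytes

# Both byte sequences round-trip to the same unicode object.
round_trip_ok = (as_utf8.decode("utf-8") == text
                 and as_utf32.decode("utf-32-le") == text)
print(round_trip_ok)  # True
```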

When returning a Python 2 'str', or if we happened to return text as a
Python 3 'bytes' (which we don't), then the user just receives a
sequence of bytes.  So then it would be relevant to tell them what
encoding these are in.
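
For instance (again just a plain Python sketch, with a byte value picked
for illustration), a bare byte sequence cannot be interpreted without
knowing the encoding it was produced in:

```python
raw = b"\xe9"  # one byte, as a Python 2 'str' or Python 3 'bytes' carries it

# Interpreted as ISO-8859-1, the byte is the code point U+00E9.
latin1_ok = raw.decode("iso-8859-1") == u"\u00e9"
print(latin1_ok)  # True

# Interpreted as UTF-8, the same byte is an incomplete sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```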

Simon