This is the mail archive of the
gdb-patches@sourceware.org
mailing list for the GDB project.
Re: support C/C++ identifiers named with non-ASCII characters
> On May 21, 2018, at 12:12 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> From: <Paul.Koning@dell.com>
>> CC: <zjz@zjz.name>, <gdb-patches@sourceware.org>
>> Date: Mon, 21 May 2018 14:12:12 +0000
>>
>>> Given unlimited time, would the right solution be to use a lib to parse the
>>> string as utf-8, and reject strings that are not valid utf-8?
>>
>> This sounds like a scenario where "stringprep" is helpful (or necessary). It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not.
>
> Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
> can not include invalid UTF-8 sequences?
Encoding is an I/O question. "UTF-8" and "Unicode" are often mixed up, but they are distinct. Unicode is a character set, in which each character has a numeric identification. For example, 張 is Unicode character number 24373 (0x5f35).
UTF-8 is one of several ways to encode Unicode characters as a byte stream. The UTF-8 encoding of 張 is e5 bc b5.
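To make the distinction concrete, here is a minimal Python sketch (Python is used only for illustration, it is not part of the patch under discussion). It shows the code point of 張 and two different byte encodings of that same code point.

```python
# One character, one Unicode code point, several possible byte encodings.
ch = "張"

code_point = ord(ch)              # the character's number in Unicode
utf8 = ch.encode("utf-8")         # three bytes in UTF-8
utf16 = ch.encode("utf-16-be")    # two bytes in big-endian UTF-16

print(code_point, hex(code_point))  # 24373 0x5f35
print(utf8.hex(" "))                # e5 bc b5
print(utf16.hex(" "))               # 5f 35
```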
I don't know what the C/C++ standards say about non-ASCII identifiers. I assume they are stated to be Unicode, and presumably specific Unicode character classes. So there are some sequences of Unicode characters that are valid identifiers, while others are not -- exactly as "abc" is a valid ASCII identifier while "a@bc" is not.
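As a rough illustration of "valid by character class", the sketch below uses Python's own identifier rule, str.isidentifier(), which is based on Unicode character properties. This is only a stand-in: it is an assumption that the C/C++ rules are similar in spirit, and the exact set of allowed characters may well differ.

```python
# Python's identifier rule, used here only as a stand-in for whatever
# the C/C++ standards actually specify (an assumption, not a citation).
def looks_like_identifier(name: str) -> bool:
    return name.isidentifier()

print(looks_like_identifier("abc"))    # True
print(looks_like_identifier("a@bc"))   # False: '@' is punctuation
print(looks_like_identifier("張"))     # True: a letter, though not ASCII
```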
A separate question is the encoding of files. The encoding rule could be that UTF-8 is required -- or that the encoding is selectable. There also has to be an encoding in output files (debug data for example). And when strings are entered at the GDB user interface, they arrive in some encoding. For all these, UTF-8 is a logical answer.
Not all byte strings are valid UTF-8 strings. When a byte string is delivered from the outside, it makes sense to check that it is a valid encoding before using it. Or you can assume that inputs are valid and rely on "symbol not found" as the general way to handle anything that doesn't match. For gdb, that may be good enough.
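A validity check of this kind can be sketched in a few lines of Python by attempting a strict decode. The helper name is_valid_utf8 is made up for this example, and this is not how gdb does or would do it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")   # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b"\xe5\xbc\xb5"))  # True: the encoding of 張
print(is_valid_utf8(b"\xe5\xbc"))      # False: sequence is cut short
print(is_valid_utf8(b"\xff"))          # False: 0xff never appears in UTF-8
```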
Yet another issue: for many characters, there are multiple ways to represent them in Unicode. For example, ü (latin small letter u with dieresis) can be coded as the single Unicode character 0xfc, or as the pair 0x75 0x0308 (latin small letter u followed by combining dieresis; the combining mark comes after its base letter). These are supposed to be synonymous; when doing string matches, you'd want them to be taken as equivalent. The stringprep library helps with this by offering a conversion to a standard form, at which point memcmp will give the correct answer.
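The effect of converting to a standard form can be shown with Python's standard unicodedata module. This is a sketch only: it does not call the stringprep library, and the choice of NFC as the standard form is an assumption made for the example. The decomposed spelling below is the base letter u followed by the combining mark U+0308.

```python
import unicodedata

precomposed = "\u00fc"    # ü as a single code point
decomposed = "u\u0308"    # u followed by combining dieresis

def canonical(s: str) -> bytes:
    # Normalize to NFC (assumed standard form), then encode, so that a
    # plain byte comparison behaves the way memcmp would.
    return unicodedata.normalize("NFC", s).encode("utf-8")

print(precomposed == decomposed)                        # False
print(canonical(precomposed) == canonical(decomposed))  # True
```

Without the normalization step the two spellings compare unequal even though they display identically, which is the mismatch a symbol lookup would run into.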
paul