This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.
Re: support C/C++ identifiers named with non-ASCII characters
> On May 21, 2018, at 10:03 AM, Simon Marchi <simark@simark.ca> wrote:
>
> ...
> I am not a specialist in lexing and parsing C, so can you explain quickly why
> you think this is a good solution? Quickly, I understand that you change the
> identifier recognition algorithm to a blacklist of characters rather than
> a whitelist, so bytes that are not recognized (such as those that compose
> the utf-8 encoded characters) are not rejected.
>
> Given unlimited time, would the right solution be to use a lib to parse the
> string as utf-8, and reject strings that are not valid utf-8?
This sounds like a scenario where "stringprep" is helpful, or even necessary. It checks that a string is valid UTF-8, can check that it obeys certain rules (such as "word elements only", which rejects punctuation and the like), and can convert it to a canonical form so that equal strings match whether or not they are encoded the same way.
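To make those three steps concrete, here is a rough sketch in Python. It is not stringprep itself and not what GDB does: it uses the standard unicodedata module for normalization and Python's own str.isidentifier() as a stand-in for a "word elements only" rule. The function name canonical_identifier is made up for illustration, and Python's identifier rules differ from C's (for example, '$' is rejected here).

```python
import unicodedata


def canonical_identifier(raw: bytes):
    """Return a canonical form of an identifier, or None if it is rejected.

    Hypothetical helper for illustration only; the rules are Python's,
    not those of C, C++, or any stringprep profile.
    """
    # Step 1: reject byte sequences that are not valid UTF-8.
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None

    # Step 2: convert to a canonical form (NFC here, an assumption), so that
    # a precomposed "e with acute" and "e" + combining acute compare equal.
    text = unicodedata.normalize("NFC", text)

    # Step 3: apply a "word elements only" style rule. str.isidentifier()
    # rejects punctuation, whitespace, and the empty string.
    if not text.isidentifier():
        return None

    return text
```

With this, the precomposed and decomposed spellings of "café" map to the same string, while b"a-b" and a byte sequence such as b"\xff\xfe" are rejected.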
paul