support C/C++ identifiers named with non-ASCII characters

Mon May 21 19:25:00 GMT 2018

> On May 21, 2018, at 2:14 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: <Paul.Koning@dell.com>
>> CC: <simark@simark.ca>, <zjz@zjz.name>, <gdb-patches@sourceware.org>
>> Date: Mon, 21 May 2018 18:03:17 +0000
>> 
>>> Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
>>> cannot include invalid UTF-8 sequences?
>> 
>> Encoding is an I/O question.
> 
> Not necessarily.
> 
> I asked that question because scanning a string for certain ASCII
> characters using a 'char *' pointer will only work reliably if the
> string is in UTF-8 or in some single-byte encoding.  Otherwise, we
> might find false hits for the delimiters, which are actually parts of
> multibyte sequences.

I see your point.

The I/O encoding is tied to the internal encoding.  UTF-8 can be read into char[] and processed using C string primitives, and, as you note, so can single-byte encodings.  Encodings with wider code units cannot.  For example, if you have UTF-16 or UTF-32, you'd have to read it into a wide-character string of the correct character width (wchar_t, where its width matches) and use the wchar string functions.
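
To make the false-hit concern concrete, here is a minimal sketch in C.  The helper count_byte and both sample buffers are made up for illustration; none of this is GDB code.  It scans raw bytes for ':' the way a 'char *' loop would.  In UTF-8 every byte of a multibyte sequence is 0x80 or above, so only the real delimiter matches.  In UTF-16LE the low byte of U+3A00 is 0x3A, which matches as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count occurrences of the byte `delim` in
   buf[0..len), as a byte-wise delimiter scan would see them.  */
static size_t
count_byte (const unsigned char *buf, size_t len, unsigned char delim)
{
  size_t n = 0;

  for (size_t i = 0; i < len; i++)
    if (buf[i] == delim)
      n++;
  return n;
}

/* U+00E9 followed by ':' in UTF-8: C3 A9, then 3A.  */
static const unsigned char utf8_sample[] = { 0xC3, 0xA9, 0x3A };

/* U+3A00 followed by ':' in UTF-16LE: 00 3A, then 3A 00.  */
static const unsigned char utf16le_sample[] = { 0x00, 0x3A, 0x3A, 0x00 };

/* Each sample contains exactly one real ':' character.  */
static void
check_samples (void)
{
  /* UTF-8: the scan finds only the real delimiter.  */
  assert (count_byte (utf8_sample, sizeof utf8_sample, ':') == 1);

  /* UTF-16LE: the scan also hits the low byte of U+3A00.  */
  assert (count_byte (utf16le_sample, sizeof utf16le_sample, ':') == 2);
}
```

The same kind of false hit can occur in legacy multibyte encodings whose trailing bytes fall in the ASCII range.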

So there are two questions:

1. What are the valid characters?  (Unicode question, independent of encoding)
2. What encoding do we expect in I/O?  (UTF question, from which we conclude what processing functions we need)
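
One way to keep the two questions apart in code, as a rough sketch only: a decoder that knows about UTF-8 and nothing else, and a separate predicate that sees only code points.  All names here (utf8_decode, is_ident_char, is_identifier_utf8) are hypothetical, and the predicate's rule (ASCII letters, digits, underscore, plus any non-ASCII code point) is a placeholder assumption, not the C/C++ rule and not what GDB does.  It also ignores the rule that an identifier may not start with a digit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Question 2 (encoding): decode one UTF-8 sequence from s[0..len).
   Returns the number of bytes consumed, or 0 if the input is invalid
   (bad lead byte, truncated, overlong, surrogate, or out of range).  */
static size_t
utf8_decode (const unsigned char *s, size_t len, uint32_t *cp)
{
  size_t need;
  uint32_t c, min;

  if (len == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xE0) == 0xC0)
    { need = 1; c = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0)
    { need = 2; c = s[0] & 0x0F; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0)
    { need = 3; c = s[0] & 0x07; min = 0x10000; }
  else
    return 0;

  if (len < need + 1)
    return 0;
  for (size_t i = 1; i <= need; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *cp = c;
  return need + 1;
}

/* Question 1 (valid characters): sees code points only.
   PLACEHOLDER policy, assumed for illustration.  */
static int
is_ident_char (uint32_t cp)
{
  return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
         || (cp >= '0' && cp <= '9') || cp == '_' || cp >= 0x80;
}

/* Returns 1 if STR is non-empty, valid UTF-8, and every code point
   passes is_ident_char; otherwise 0.  */
static int
is_identifier_utf8 (const char *str)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t len = strlen (str), pos = 0;

  while (pos < len)
    {
      uint32_t cp;
      size_t used = utf8_decode (s + pos, len - pos, &cp);

      if (used == 0 || !is_ident_char (cp))
        return 0;
      pos += used;
    }
  return len > 0;
}
```

With this split, changing the expected I/O encoding would only mean swapping the decoder, while the character policy stays untouched.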

	paul