[RFC] iconvdata/, localedata/: Fix TSCII and document tests.

Mon Jun 22 17:10:42 GMT 2020

On 6/22/20 11:41 AM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> diff --git a/localedata/charmaps/TSCII b/localedata/charmaps/TSCII
>> index 9646f326cb..3d9ae1fb5e 100644
>> --- a/localedata/charmaps/TSCII
>> +++ b/localedata/charmaps/TSCII
>> @@ -2,8 +2,26 @@
>>  <comment_char> %
>>  <escape_char> /
>>  <mb_cur_min> 1
>> -<mb_cur_max> 1
>> -% based on TSCII version 1.7
>> +<mb_cur_max> 3
>> +
>> +% Tamil Script Code for Information Interchange
>> +%
>> +% Based on TSCII version 1.7
>> +%
>> +% The lower 128 code points are ASCII, but the upper code points are
>> +% TSCII characters that often map to multiple Unicode code points.  The
>> +% one-to-many mapping means that much of the character map is commented
>> +% out since we don't support many-to-one mappings in POSIX-compatible
>> +% character maps.  There are 179 such mappings where one encoded TSCII
>> +% character is mapped to more than one Unicode code point.
>> +%
>> +% Note that iconv is capable of and supports such conversions, but iconv
>> +% when run with character maps as from-encoding or to-encoding is unable
>> +% to support such conversions.
>> +%
>> +% For conversion reference:
>> +% https://www.unicode.org/notes/tn15/Tscii2Unicode2.pdf
> 
> Does this mean that after this change, glibc will no longer perform
> proper multi-byte to wide string conversion for single-byte characters
> such as 0x8c?  Or is the charmap file just for reference purposes, and
> conversion of 0x8c to U+0B95 U+0BCD U+0BB7 U+0BCD works as before?

We do not support what is written in the TSCII character map today,
and the character map is completely broken and will not compile.

The charmap parser cannot handle a sequence of several Unicode code
points mapping to a single encoded byte sequence. There is no such
support in glibc. It seems like TSCII's charmap was committed blind with
no real testing, and only serves as reference input for the iconv
converter.

For example, the following line is invalid and the parser rejects it:
<U0B95><U0BCD><U0BB7><U0BCD> /x8c         TAMIL GLYPH KSH

I don't know why it was ever committed in the first place. Probably
just to record the fact that this exists.

The current regexp in the test will exclude TSCII from testing because
of this line, and that means it has bit rotted and is useless.
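
To make the restriction concrete, here is a minimal sketch that counts
the <Uxxxx> symbols at the start of a charmap line. It is not glibc's
charmap parser and ignores range syntax; the name count_ucs_symbols is
made up for illustration. A line the parser can accept has exactly one
such symbol, while the KSH line above has four.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of glibc: count the consecutive
   <Uxxxx> symbols at the start of a charmap line.  */
static size_t
count_ucs_symbols (const char *line)
{
  size_t count = 0;

  while (line[0] == '<' && line[1] == 'U')
    {
      const char *p = line + 2;
      size_t digits = 0;

      while (isxdigit ((unsigned char) *p))
        {
          ++p;
          ++digits;
        }

      /* Accept 4 to 8 hex digits followed by '>'.  */
      if (digits < 4 || digits > 8 || *p != '>')
        break;

      ++count;
      line = p + 1;
    }

  return count;
}
```

Under these assumptions, any line for which the count is greater than
one is a one-to-many mapping of the kind the patch comments out.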

My goal is to make TSCII an actually usable charmap if you need it,
and to add a few more comments to the tests to help future readers.

Conversion of 0x8c is handled by a special case in iconv and is correct.

e.g.
iconvdata/tscii.c:
        else if (ch == 0x8c)                                                  \
          {                                                                   \
            /* Output <U0B95><U0BCD><U0BB7><U0BCD>, if we have room for       \
               four characters.  */                                           \
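
For readers without the source at hand, the shape of that special case
is roughly as follows. This is a simplified standalone sketch, not the
code in iconvdata/tscii.c: the function tscii_expand_ksh and its buffer
interface are invented here, and the real converter is a macro body
with its own state handling. What the real code does when there is not
enough room is not shown in the excerpt, so the sketch just returns 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the glibc implementation: expand the single
   TSCII byte 0x8c into its four Unicode code points.  Returns the
   number of code points written, or 0 if CH is not 0x8c or OUT has
   room for fewer than four.  */
static size_t
tscii_expand_ksh (unsigned char ch, uint32_t *out, size_t outlen)
{
  static const uint32_t ksh[4] = { 0x0B95, 0x0BCD, 0x0BB7, 0x0BCD };

  if (ch != 0x8c || outlen < 4)
    return 0;

  for (size_t i = 0; i < 4; ++i)
    out[i] = ksh[i];

  return 4;
}
```

The point is only that one input byte yields four output code points,
which a converter written in C can do but a charmap line cannot express.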

In summary:
- TSCII is completely broken today, and this patch makes it usable.
- TSCII can now be used in a locale, but is missing a lot of entries
  because of the POSIX charmap limitations.
- Conversions for TSCII are correctly handled only by iconv.

Does that answer your question?

-- 
Cheers,
Carlos.