This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: Unicode 3.2 support (6)
- From: James Su <suzhe at turbolinux dot com dot cn>
- To: Bruno Haible <bruno at clisp dot org>
- Cc: Anthony Fok <anthony at thizlinux dot com>, libc-alpha at sources dot redhat dot com, Roger So <roger dot so at sw-linux dot com>, Andrew Fung <atcfung at itsd dot gov dot hk>
- Date: Wed, 15 May 2002 00:21:53 +0800
- Subject: Re: Unicode 3.2 support (6)
- Organization: Turbolinux
- References: <15549.35579.721000.386597@honolulu.ilog.fr> <20020418024429.GA12198@sunrise> <15551.1788.723999.850228@honolulu.ilog.fr> <20020419015257.GA12294@sunrise> <15583.41528.852429.118723@honolulu.ilog.fr>
Hi,
I wrote a big5hkscs.c a long time ago. Maybe you want to look at it. You
can get it at http://www.turbolinux.com.cn/~suzhe/big5hkscs.c.gz
I also think it's better to use CP950 as the base of BIG5-HKSCS. Most
users use CP950 rather than plain BIG5, because Microsoft Windows
uses CP950. And since CP950 is a superset of BIG5, it should be OK to
replace BIG5 with CP950.
Regards
James Su
Bruno Haible wrote:
>Anthony Fok writes:
>
>>In our big5hkscs.c tables ("BIG5-HKSCS-1999"), the to_unicode function maps
>>to BMP+PUA, whereas from_unicode maps from BMP+PUA+CJK_ExtB back to Big5
>>(i.e. quite a few characters have two-to-one mappings in the from_unicode
>>direction). It would be best if these two-to-one mappings in the
>>from_unicode direction were kept in both BIG5-HKSCS-1999 and BIG5-HKSCS-2001.
>>
>
>I agree. For the user's convenience it is best if the from_unicode
>direction of both BIG5-HKSCS-1999 and BIG5-HKSCS-2001 is identical.
>In other words, each of the two from_unicode converters will then
>accept Unicode text that has been converted by either of the two
>to_unicode converters.
>
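
A minimal sketch of the shared from_unicode idea described above, written in
Python for brevity rather than the C used in glibc. The code points are
placeholders, not values taken from the HKSCS tables, and all names are
hypothetical.

```python
# Placeholder values for illustration only; they are NOT real HKSCS entries.
BIG5 = 0x9999     # hypothetical Big5-HKSCS code
PUA = 0xE999      # where a 1999-style to_unicode might put it (private use)
EXT_B = 0x20999   # where a 2001-style to_unicode might put it (CJK Ext. B)

TO_UNICODE_1999 = {BIG5: PUA}
TO_UNICODE_2001 = {BIG5: EXT_B}

# One from_unicode table shared by both converters: two-to-one on purpose.
FROM_UNICODE = {PUA: BIG5, EXT_B: BIG5}


def round_trip(to_table, big5):
    """Convert to Unicode with the given table, then back with the shared one."""
    return FROM_UNICODE[to_table[big5]]
```

With a shared table like this, text produced by either to_unicode direction
converts back to the same Big5 code.
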
>>Andrew Fung of ITSD explained that to me. big5cmp.txt is mainly for
>>compatibility with old documents using GCCS (Government Common
>>Character Set, 1995), which predates HKSCS.
>>
>>Here was my question:
>>
>> Andrew, I was reading the HKSCS Standard in more detail, and I was
>> wondering how the ITSD would like vendors to handle Annex I, i.e.
>> support for compatibility code points in GCCS but not in HKSCS,
>> especially the "unified" ones. How important is it to handle these
>> unified characters in the BIG5-HKSCS <-> Unicode tables? (Mandatory?
>> Recommended? Suggested?)
>>
>> For example, B5+ADC5 and the B5+FA5F variant, where a small portion
>> is written slightly differently, but ISO 10646 classifies these two as
>> "different glyphs but the same character":
>>
>> ,-------------------------+-----------------.
>> | Big5           ^  ADC5  |  EUDC  ^  FA5F  |
>> `----------------|--------+--------|--------'
>>            HKSCS |            GCCS |
>>                  |                 |
>> ,----------------|--------+--------|--------.
>> | Unicode CJK    v U+5029 | PUA    v U+E01F |
>> `-------------------------+-----------------'
>>
>> And which of the following would be the preferred behaviour in a
>> BIG5-HKSCS <-> Unicode table?
>>
>> 1. "Do nothing". Keep the FA5F <-> U+E01F mapping in both directions.
>> (For GCCS, at least there won't be data loss during conversion, but
>> the GCCS document won't be changed to a HKSCS one either.)
>>
>> 2. FA5F -> U+5029 (unidirectional).
>>
>> 3. U+E01F -> ADC5 (unidirectional).
>>
>> 4. Both 2 and 3. (B5+FA5F -> U+5029, U+E01F -> B5+ADC5)
>>
>>Andrew replied: ...
>>
>>So, based on Andrew's recommendation, and since GCCS is obsolete,
>>I think we should go with "Option 4" which would in effect normalize
>>documents with GCCS encoding to HKSCS encoding.
>>
>
>Thanks for the explanations. I'm implementing this suggestion in both
>converters.
>
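
A minimal sketch of what "Option 4" amounts to, using only the two code points
from the example above. It is written in Python for brevity; the table and
function names are hypothetical and are not those used in any big5hkscs.c.

```python
# Two-entry excerpt illustrating Option 4; real tables are far larger.
TO_UNICODE = {
    0xADC5: 0x5029,  # HKSCS code point -> CJK unified ideograph
    0xFA5F: 0x5029,  # GCCS compatibility code point, unified (option 2)
}
FROM_UNICODE = {
    0x5029: 0xADC5,  # regular reverse mapping
    0xE01F: 0xADC5,  # PUA value from older conversions (option 3)
}


def to_unicode(big5):
    return TO_UNICODE[big5]


def from_unicode(ucs):
    return FROM_UNICODE[ucs]


def normalize(big5):
    """Round-trip a Big5 code; GCCS code points come back as HKSCS ones."""
    return from_unicode(to_unicode(big5))
```

Under these assumptions, normalize(0xFA5F) gives 0xADC5, which is the
normalization from GCCS to HKSCS that Option 4 is meant to achieve.
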
>>Such big5hkscs.c tables have already been made by both James and me,
>>so you can use one or the other to save you some time.
>>
>
>I cannot take your converters as-is, because
>1) they contain redundant private area mappings, for example
> B5+8140 -> U+EEB8, which is not official anywhere.
>2) they are apparently based on CP950, not BIG5. For example they map
> B5+A1C5 to U+02CD, but according to page 108 of e_hkscs.pdf the
> character U+02CD is not part of BIG5-HKSCS.
>3) they contain additional stuff, like B5+8C40 -> U+503B, which
> is not found on page 109 of e_hkscs.pdf.
>
>>Or, for that matter, what is the CHARMAP for exactly? (I just know
>>that there is an ISO Technical Report final draft (14632 or
>>something like that?) about this and other locale stuff.)
>>
>
>The CHARMAP serves three purposes:
>1) Association between Unicode values and byte sequences, used when a
>locale is built by localedef.
>2) Documentation for the end users (that's why we have these long
>character names in every charmap).
>3) Verification of the corresponding iconv converter. Deviations are
>partially noted as %IRREVERSIBLE% in the charmap, partially in a file
>named iconvdata/$CODESET.irreversible.
>
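
A rough sketch of the third purpose, reading charmap lines so they can be
compared against a converter. It is in Python for brevity and assumes the usual
charmap line shape (<Uxxxx>, then /x-escaped bytes, then a name), with
%IRREVERSIBLE% as a line prefix. The sample entries and function names are
illustrative only and are not copied from any real charmap.

```python
import re

# Matches e.g. "<U5029>     /xad/xc5   <CJK>", optionally with the
# "%IRREVERSIBLE%" prefix in front.
LINE = re.compile(
    r"^(%IRREVERSIBLE%)?<U([0-9A-Fa-f]{4,8})>\s+((?:/x[0-9A-Fa-f]{2})+)"
)


def parse_charmap(text):
    """Return (reversible, irreversible) dicts of code point -> bytes."""
    reversible, irreversible = {}, {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip headers, comments, blank lines
        ucs = int(m.group(2), 16)
        raw = bytes(int(h, 16) for h in m.group(3).split("/x")[1:])
        (irreversible if m.group(1) else reversible)[ucs] = raw
    return reversible, irreversible


# Illustrative sample, not taken from a real charmap file.
SAMPLE = """\
<U5029>     /xad/xc5         <CJK>
%IRREVERSIBLE%<UE01F>     /xad/xc5         <CJK>
"""

reversible, irreversible = parse_charmap(SAMPLE)
```

A verification pass could then check that the converter agrees with
"reversible" in both directions and with "irreversible" in one direction only.
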
>Bruno
>