This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.
Guys,

>> > or perhaps make two versions of "BIG5-HKSCS" in glibc:
>> > say "BIG5-HKSCS-1999" which maps BIG5-HKSCS to ISO 10646-1:2000+PUA,

I would say ITSD want:

"Big5-HKSCS-2001" <--> ISO 10646-1:2000 + PUA
(as an interim before the rest of the system are ready for
ISO 10646-2:2001); and

"Big5-HKSCS-2001" <--> ISO 10646-1:2000 + ISO 10646-2:2001 + PUA
(when the rest of the system support ISO 10646-2:2001. Note: about 35
HKSCS-2001 characters are still put in the PUA as they are not included
in ISO 10646-1:2000 nor ISO 10646-2:2001.)

Yours,
Andrew Fung, APII ITSD

From: Anthony Fok <anthony@thizlinux.com> on 2002/04/19 09:52 AM
To: Bruno Haible <haible@ilog.fr>
cc: libc-alpha@sources.redhat.com, James Su <suzhe@turbolinux.com.cn>,
    Roger So <roger.so@sw-linux.com>, Andrew TC Fung/ITSD/HKSARG@ITSD
Subject: Re: Unicode 3.2 support (6)

On Thu, Apr 18, 2002 at 07:48:44PM +0200, Bruno Haible wrote:
> > So, in the interim, please consider using the following scheme for the
> > default BIG5-HKSCS charmap/converter:
> >
> >     BIG5-HKSCS --> ISO 10646-1:2000 + PUA
> >
> >     PUA + ISO 10646-1:2000 \___\ BIG5-HKSCS
> >           ISO 10646-2:2001 /   /
>
> This is not a migration plan. Real migration would be to convert like
> this:
>
>     BIG5-HKSCS --> ISO 10646-1:2000 + ISO 10646-2:2001
>
>     PUA + ISO 10646-1:2000 \___\ BIG5-HKSCS
>           ISO 10646-2:2001 /   /

Yes, you're right, of course. :-) The "ISO 10646-1:2000 + PUA" is the
old semantics. In 2003 or 2004, we can probably safely switch to
"ISO 10646-1:2000 + ISO 10646-2:2001". I hope other components on
GNU/Linux system will be ready by then. :-)

> > or perhaps make two versions of "BIG5-HKSCS" in glibc:
> > say "BIG5-HKSCS-1999" which maps BIG5-HKSCS to ISO 10646-1:2000+PUA,
>
> That sounds reasonable. I will provide a patch that adds
> BIG5-HKSCS-1999 with the old semantics, for use by people who have not
> upgraded their fonts to use the non-BMP planes.

Thanks for your help.
:-)

BTW, such big5hkscs.c tables have already been made by both James and
me, so you can use one or the other to save you some time. CHARMAP
stuff will probably need help from you though. :-) We are unsure how
glibc handles <Unassigned> and %IRREVERSIBLE stuff in CHARMAP files
yet. :-)

In our big5hkscs.c tables ("BIG5-HKSCS-1999"), the to_unicode function
maps to BMP+PUA, whereas from_unicode maps from BMP+PUA+CJK_ExtB back
to Big5 (i.e. quite a few characters have two-to-one mappings in the
from_unicode direction). It would be best if these two-to-one mappings
in the from_unicode direction were kept in both BIG5-HKSCS-2001.

(About HKSCS-2001 fonts, well, some major font vendors are still in
the process of making them, so most fonts on the market only conform
to HKSCS-1999 so far. :-)

> > There is another intricacy with BIG5-HKSCS with unified characters,
> > in big5cmp.txt. If you like, please take a look at:
> >
> >     http://www.thizlinux.com/~anthony/hkscs/
>
> I don't understand what this big5cmp.txt means for the converters. Can
> you explain in more detail, please?

Andrew Fung of ITSD explained that to me. big5cmp.txt is mainly for
compatibility with old documents using the GCCS (1995) (Government
Common Character Set), which predates HKSCS. Here was my question:

    Andrew,

    I was reading the HKSCS Standard in more detail, and I was
    wondering how the ITSD would like vendors to handle Annex I, i.e.
    support for compatibility code points in GCCS but not in HKSCS,
    especially the "unified" ones. How important is it to handle these
    unified characters in the BIG5-HKSCS <-> Unicode tables?
    (Mandatory? Recommended? Suggested?)

    For example, B5+ADC5 and the B5+FA5F variant, where a small
    portion is written slightly differently, but ISO 10646 classifies
    these two as "different glyphs but the same character":

        ,-------------------------+-----------------.
        | Big5           ^  ADC5  | EUDC   ^  FA5F  |
        `----------------|--------+--------|--------'
                   HKSCS |            GCCS |
                         |                 |
        ,----------------|--------+--------|--------.
        | Unicode CJK    v U+5029 | PUA    v U+E01F |
        `-------------------------+-----------------'

    And which of the following would be the preferred behaviour in a
    BIG5-HKSCS <-> Unicode table?

    1. "Do nothing". Keep the FA5F <-> U+E01F mapping in both
       directions. (For GCCS, at least there won't be data loss during
       conversion, but the GCCS document won't be changed to a HKSCS
       one either.)
    2. FA5F -> U+5029 (unidirectional).
    3. U+E01F -> ADC5 (unidirectional).
    4. Both 2 and 3. (B5+FA5F -> U+5029, U+E01F -> B5+ADC5)

Andrew replied:

    According to our HKSCS Document, compatibility points (CPs) are
    code points reserved for backward compatibility. In other words,
    we simply require users/vendors not to use these CPs to define
    characters. Also, fonts should contain the glyphs for the CPs for
    displaying old documents that may contain CPs.

    In the HKSCS Document, we described CPs by explaining several
    occasions where CPs exist. However, we do not post an explicit
    requirement on how CPs in Big-5 should be mapped to ISO 10646 or
    vice versa. Nevertheless, based on the descriptions on CPs in the
    HKSCS Document, vendors should be able to decide how to implement
    the mapping between Big-5 and ISO 10646.

    For example, a code converter between Big-5 and ISO 10646 should
    map CPs in Big-5 to the "correct" ISO 10646 code point and vice
    versa. In other words, option 4 in your mail should be
    implemented. However, if round trip conversion is important for
    the application, then option 1 in your mail should be implemented.

So, based on Andrew's recommendation, and since GCCS is obsolete, I
think we should go with "Option 4", which would in effect normalize
documents with GCCS encoding to HKSCS encoding. These are already
implemented in James' or my big5hkscs.c.

A question that we both had was: how do we reflect that in the CHARMAP?
:-) Or, for that matter, what is the CHARMAP for exactly? (I just know
that there is an ISO Technical Report final draft (14632 or something
like that?) about this and other locale stuff.)

Cheers,

Anthony

--
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>       http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/

(See attached file: att1.eml)
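
The "Option 4" behaviour described in the message can be illustrated with a
minimal Python sketch. It is not taken from big5hkscs.c: the dictionary and
function names are hypothetical, and only the four code points (B5+ADC5,
B5+FA5F, U+5029, U+E01F) come from the message. It shows why Option 4
normalizes GCCS data but gives up round-trip conversion for the
compatibility point.

```python
# Hypothetical two-entry tables for one character pair from the message.
# These names are illustrative only and do not mirror glibc's big5hkscs.c.

# Big5 -> Unicode: the GCCS compatibility point FA5F is folded onto U+5029.
TO_UNICODE = {
    0xADC5: 0x5029,
    0xFA5F: 0x5029,  # unidirectional (option 2)
}

# Unicode -> Big5: the old PUA code point U+E01F is folded onto ADC5.
FROM_UNICODE = {
    0x5029: 0xADC5,
    0xE01F: 0xADC5,  # unidirectional (option 3)
}


def big5_to_unicode(code):
    """Map a Big5 code to a Unicode code point using the sketch table."""
    return TO_UNICODE[code]


def unicode_to_big5(cp):
    """Map a Unicode code point back to Big5 using the sketch table."""
    return FROM_UNICODE[cp]


def round_trips(code):
    """True if Big5 -> Unicode -> Big5 returns the original Big5 code."""
    return unicode_to_big5(big5_to_unicode(code)) == code


if __name__ == "__main__":
    for code in (0xADC5, 0xFA5F):
        print(f"B5+{code:04X} -> U+{big5_to_unicode(code):04X}, "
              f"round trip: {round_trips(code)}")
```

Under these assumptions B5+ADC5 survives a round trip while B5+FA5F comes
back as B5+ADC5, which is the normalization from GCCS to HKSCS that Option 4
intends. Option 1 would instead keep FA5F <-> U+E01F in both tables.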
Attachment:
att1.eml
Description: Binary data