This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.

Re[2]: Unicode 3.2 support (6)

From: Andrew TC Fung <atcfung at itsd dot gov dot hk>
To: Anthony Fok <anthony at thizlinux dot com>, Bruno Haible <bruno at clisp dot org>
Cc: libc-alpha at sources dot redhat dot com, James Su <suzhe at turbolinux dot com dot cn>, Roger So <roger dot so at sw-linux dot com>
Date: Thu, 16 May 2002 19:02:26 +0800
Subject: Re[2]: Unicode 3.2 support (6)

Dear all,

>That's great!  Thank you!  :-)  BTW, I don't know if Andrew agrees with
>the naming of these two encodings, because he clarified the distinction
>between HKSCS-1999 and HKSCS-2001 on the mailing list after my posting.
>I guess calling it "BIG5-HKSCS-1999" is a misnomer.
>In big5-iso.txt, the two are called:
>
>    HKSCS-2001 in ISO/IEC 10646-1:2000
>and
>    HKSCS-2001 in ISO/IEC 10646-2:2001

I am not sure whether you guys are implementing "HKSCS-2001 in ISO/IEC
10646-1:2000" or "HKSCS (released in 1999) in ISO/IEC 10646-1:2000" using
the name "BIG5-HKSCS-1999".

As I mentioned in an earlier mail, we would definitely like to have
"HKSCS-2001 in ISO/IEC 10646-1:2000" rather than "HKSCS (released in 1999)
in ISO/IEC 10646-1:2000".  If you actually implement "HKSCS-2001 in
ISO/IEC 10646-1:2000" under the name "BIG5-HKSCS-1999", the name may
mislead people into thinking that "BIG5-HKSCS-1999" only supports the
characters defined by the HKSCS released in 1999.

The names "BIG5-HKSCS-1999" and "BIG5-HKSCS-2001" give no indication of
how HKSCS is mapped to UCS-4.  Is it possible to reflect the mapping
scheme in the name?

>So, practically, BIG5-HKSCS is CP950 + HKSCS, with the end result being
>(almost strictly):  Big5-1984 < CP950 < Big5-ETen < Big5-HKSCS
>
>Nevertheless, it would be best if we can get a clarification from
>Andrew on this.  (Many thanks, Andrew!  :-)

I think this understanding is correct.  And as Anthony and James have
suggested, implementing the mapping according to CP950 seems more
reasonable as this would enhance the data compatibility between Microsoft
platforms and Linux.

Rgds

From: Anthony Fok <anthony@thizlinux.com> on 2002/05/14 10:56 AM
To: Bruno Haible <bruno@clisp.org>
cc: libc-alpha@sources.redhat.com, James Su <suzhe@turbolinux.com.cn>,
      Roger So <roger.so@sw-linux.com>, Andrew TC Fung/ITSD/HKSARG@ITSD
Subject: Re: Unicode 3.2 support (6)

On Mon, May 13, 2002 at 01:23:36PM +0200, Bruno Haible wrote:
> Anthony Fok writes:
> > In our big5hkscs.c tables ("BIG5-HKSCS-1999"), the to_unicode function
> > maps to BMP+PUA, whereas from_unicode maps from BMP+PUA+CJK_ExtB back
> > to Big5 (i.e. quite a few characters have two-to-one mappings in the
> > from_unicode direction).  It would be best if these two-to-one mappings
> > in the from_unicode direction were kept in BIG5-HKSCS-2001 as well.
>
> I agree. For user's convenience it is best if the from_unicode
> direction of both BIG5-HKSCS-1999 and BIG5-HKSCS-2001 is identical.
> In other words, each of the two from_unicode converters will then
> accept Unicode text that has been converted by either one of two
> to_unicode converters.
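
To make the shared from_unicode idea concrete, here is a minimal sketch in
Python.  Everything in it is hypothetical: the byte pair, the two code
points and the table names are placeholders chosen for illustration, not
entries from the real HKSCS tables.  It assumes that one to_unicode table
targets the PUA and the other targets CJK Extension B.

```python
# Hypothetical values throughout: the byte pair and code points below are
# placeholders for illustration, not entries from the real HKSCS tables.
BIG5_CHAR = b"\x88\x40"   # hypothetical Big5-HKSCS byte pair
PUA_CP = 0xE000           # hypothetical Private Use Area code point
EXT_B_CP = 0x20000        # hypothetical CJK Extension B code point

# The two to_unicode tables differ: one maps the character into the PUA,
# the other (assumed here) maps it into CJK Extension B.
TO_UNICODE_PUA = {BIG5_CHAR: PUA_CP}
TO_UNICODE_EXT_B = {BIG5_CHAR: EXT_B_CP}

# A single from_unicode table shared by both converters: it is the union
# of both reverse mappings, so two code points map to one byte pair.
FROM_UNICODE = {}
for table in (TO_UNICODE_PUA, TO_UNICODE_EXT_B):
    for big5, code_point in table.items():
        FROM_UNICODE[code_point] = big5


def round_trip(big5, to_unicode):
    """Convert to Unicode with the given table, then back with the shared one."""
    return FROM_UNICODE[to_unicode[big5]]


if __name__ == "__main__":
    for name, table in (("PUA", TO_UNICODE_PUA), ("Ext-B", TO_UNICODE_EXT_B)):
        print(name, round_trip(BIG5_CHAR, table) == BIG5_CHAR)
```

Whichever to_unicode table produced the text, the shared from_unicode
table brings it back to the same Big5 bytes, which is the property
described above.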

That's great!  Thank you!  :-)  BTW, I don't know if Andrew agrees with
the naming of these two encodings, because he clarified the distinction
between HKSCS-1999 and HKSCS-2001 on the mailing list after my posting.
I guess calling it "BIG5-HKSCS-1999" is a misnomer.
In big5-iso.txt, the two are called:

     HKSCS-2001 in ISO/IEC 10646-1:2000
and
     HKSCS-2001 in ISO/IEC 10646-2:2001

I wonder how we should call them.  :-)

> > Andrew Fung of ITSD explained that to me.  big5cmp.txt is mainly for
> > compatibility with old documents using the GCCS (1995) (Government
> > Common Character Set), which predates HKSCS.
> >
> > Andrew replied: ...
> >
> > So, based on Andrew's recommendation, and since GCCS is obsolete,
> > I think we should go with "Option 4" which would in effect normalize
> > documents with GCCS encoding to HKSCS encoding.
>
> Thanks for the explanations. I'm implementing this suggestion in both
> converters.

Great!  Many thanks!  :-)

> > Such big5hkscs.c tables have already been made by both James and me,
> > so you can use one or the other to save you some time.
>
> I cannot take your converters as-is, because
> 1) they contain redundant private area mappings, for example
>    B5+8140 -> U+EEB8, which is nowhere official.

They are not redundant, as there is only one mapping to and from that
area.  They are defined in CP950.  Since Big5-ETen and BIG5-HKSCS are
both supposed to be supersets of CP950, they should also contain this
area.  (i.e. it would be best if this area is added to glibc's BIG5
table too.  I should bring up a discussion on CLE.)  :-)

This area (B5+8140 - B5+84FE) is part of UDA3 (B5+8140 - B5+8DFE) in
HKSCS-2001.  True, 8140-84FE is reserved for end users and will not be
used by future extensions of HKSCS-2001, and thus it is important that
there is still a mapping from 8140-84FE to an area in Unicode's PUA.
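
As a rough illustration of such a mapping, the sketch below assigns PUA
code points to 8140-84FE sequentially.  Only the starting point,
B5+8140 -> U+EEB8, comes from this thread.  The sequential layout, the
trail-byte ranges 40-7E and A1-FE, and the function name are assumptions
made for the example and should be checked against the CP950 table.

```python
PUA_BASE = 0xEEB8        # from this thread: B5+8140 -> U+EEB8
LOW_TRAILS = 63          # assumed trail bytes 0x40..0x7E
HIGH_TRAILS = 94         # assumed trail bytes 0xA1..0xFE
TRAILS_PER_LEAD = LOW_TRAILS + HIGH_TRAILS   # 157 cells per lead byte


def big5_uda_to_pua(lead, trail):
    """Map a byte pair in 8140-84FE to a PUA code point.

    Hypothetical helper: assumes the cells are numbered sequentially,
    row by row, starting at PUA_BASE.
    """
    if not 0x81 <= lead <= 0x84:
        raise ValueError("lead byte outside 0x81..0x84")
    if 0x40 <= trail <= 0x7E:
        column = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        column = trail - 0xA1 + LOW_TRAILS
    else:
        raise ValueError("invalid trail byte")
    return PUA_BASE + (lead - 0x81) * TRAILS_PER_LEAD + column


if __name__ == "__main__":
    for lead, trail in ((0x81, 0x40), (0x84, 0xFE)):
        print(f"B5+{lead:02X}{trail:02X} -> U+{big5_uda_to_pua(lead, trail):04X}")
```

Under these assumptions the area covers 4 x 157 = 628 code points, so it
fits comfortably inside the BMP's Private Use Area.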

The BIG5-HKSCS table only lists what HKSCS adds on top of "Big5".  The
ITSD doesn't define "Big5" explicitly, but we can safely assume it to be
CP950, because the ITSD provided the first implementations of HKSCS on
Windows, and those contain the 8140-84FE mapping.  Thus, it is
important for the HKSCS table in glibc to do the same.

So, practically, BIG5-HKSCS is CP950 + HKSCS, with the end result being
(almost strictly):  Big5-1984 < CP950 < Big5-ETen < Big5-HKSCS

Nevertheless, it would be best if we can get a clarification from
Andrew on this.  (Many thanks, Andrew!  :-)

> 2) they are apparently based on CP950, not BIG5. For example they map
>    B5+A1C5 to U+02CD, but according to page 108 of e_hkscs.pdf the
>    character U+02CD is not part of BIG5-HKSCS.

Try this (the input is the two Big5 bytes 0xA1 0xC5):

     printf '\241\305\n' | iconv -f big5 -t ucs2 | hexdump

This also gives U+02CD.  This means it is not extraneous, as it is in the
Big5 encoding submitted by the CLE too.  ;-)
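
The same check can be cross-run outside glibc.  The sketch below uses
Python's built-in "big5" and "cp950" codecs, which are separate
implementations with their own tables, so their answers say nothing
definitive about glibc's iconv.  The helper name is made up for this
example, and no expected output is shown on purpose.

```python
RAW = bytes([0xA1, 0xC5])   # the Big5 byte pair B5+A1C5 discussed above


def decode_code_point(codec):
    """Decode RAW with the named Python codec and return the code point."""
    text = RAW.decode(codec)
    if len(text) != 1:
        raise ValueError(f"{codec} did not yield exactly one character")
    return ord(text)


if __name__ == "__main__":
    # Python's codecs are independent of glibc; this is only a cross-check.
    for codec in ("big5", "cp950"):
        print(f"{codec}: B5+A1C5 -> U+{decode_code_point(codec):04X}")
```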

> 3) they contain additional stuff, like B5+8C40 -> U+503B, which
>    is not found in page 109 of e_hkscs.pdf.

They are in the official big5-iso.txt (HKSCS-2001 version) provided on
the HKSCS-2001 web site, and it is on page 3-116 (or, in Acrobat Reader,
page 176 of 287) of e_hkscs.pdf.  :-)

> > Or, for that matter, what is the CHARMAP for exactly?  (I just know
> > that there is an ISO Technical Report final draft (14632 or
> > something like that?) about this and other locale stuff.)
>
> The CHARMAP serves three purposes:
> 1) Association between Unicode values and byte sequences, used when a
> locale is built by localedef.
> 2) Documentation for the end users (that's why we have these long
> character names in every charmap).
> 3) Verification of the corresponding iconv converter. Deviations are
> partially noted as %IRREVERSIBLE% in the charmap, partially in a file
> named iconvdata/$CODESET.irreversible.
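
To show what the first and third purposes look like in practice, here is
a small Python sketch that parses one charmap entry.  It assumes the
usual glibc layout of "<Uxxxx>", a byte sequence written with "/x"
escapes, and a character name, with an optional "%IRREVERSIBLE%" prefix.
The sample lines are illustrative, not copied from a real charmap file,
and the function name is made up.

```python
import re

# Assumed entry layout: [%IRREVERSIBLE%]<Uxxxx>  /xNN/xNN  CHARACTER NAME
LINE_RE = re.compile(
    r"^(?P<irr>%IRREVERSIBLE%)?"
    r"<U(?P<cp>[0-9A-Fa-f]{4,8})>\s+"
    r"(?P<bytes>(?:/x[0-9A-Fa-f]{2})+)\s+"
    r"(?P<name>.*)$"
)


def parse_charmap_line(line):
    """Return (code_point, byte_sequence, name, reversible), or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    hex_bytes = re.findall(r"/x([0-9A-Fa-f]{2})", match["bytes"])
    raw = bytes(int(h, 16) for h in hex_bytes)
    return int(match["cp"], 16), raw, match["name"], match["irr"] is None


if __name__ == "__main__":
    # Illustrative entries only; the second line is a made-up example of
    # how a one-way mapping would be flagged.
    samples = (
        "<U02CD>     /xa1/xc5         MODIFIER LETTER LOW MACRON",
        "%IRREVERSIBLE%<U02CD>     /xa1/xc5         MODIFIER LETTER LOW MACRON",
    )
    for sample in samples:
        print(parse_charmap_line(sample))
```

A verification script could run every reversible entry through the
matching iconv converter in both directions and report any mismatch.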

Thank you very much for your explanations!  :-)  BTW, does glibc's
CHARMAP strictly follow DTR 14652 (eventually TR 14652 and ISO 14652)?
Are there any glibc-specific extensions, etc.?  :-)

     http://std.dkuug.dk/jtc1/sc22/wg20/docs/n897-14652w25.pdf

Thanks,

Best regards,

Anthony

--
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>
http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/