17563 – cmn_TW: add hanzi collation

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	unspecified

Importance:	P2 normal
Target Milestone:	2.27
Assignee:	Mike FABIAN

URL:
Keywords:

Depends on:	16905
Blocks:

Reported:	2014-11-07 10:24 UTC by Wei-Lun Chao
Modified:	2017-08-11 02:10 UTC
CC List:	2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Attachments
patch for hanzi collation (812 bytes, patch) 2014-11-07 10:24 UTC, Wei-Lun Chao
patch for hanzi collation (1.15 KB, patch) 2017-07-20 02:48 UTC, Wei-Lun Chao
patch for hanzi collation (1.04 KB, patch) 2017-08-08 10:02 UTC, Wei-Lun Chao

Description Wei-Lun Chao 2014-11-07 10:24:07 UTC

Created attachment 7912
patch for hanzi collation

Add the new hanzi collation file from bug 16905 to the localedata of cmn_TW.

Comment 1 Wei-Lun Chao 2014-11-07 10:25:38 UTC

Tested on Fedora 21 x86_64 beta.

Comment 2 Wei-Lun Chao 2016-08-10 09:38:23 UTC

Tested OK on Fedora 24.

Comment 3 Wei-Lun Chao 2017-07-20 02:48:49 UTC

Created attachment 10276
patch for hanzi collation

Patch updated.

Comment 4 Mike FABIAN 2017-08-08 09:32:17 UTC

(In reply to Wei-Lun Chao from comment #3)

Why does your patch remove “country_car”?

@@ -200,7 +208,6 @@ 
 % TWN
 country_ab3  "TWN"
 country_num  158
-country_car "RC"
 country_isbn 957
 % 漢語官話
 lang_name    "漢語官話"

According to

https://en.wikipedia.org/wiki/List_of_international_vehicle_registration_codes

“RC” seems to be correct.

Comment 5 Wei-Lun Chao 2017-08-08 10:02:01 UTC

Created attachment 10325
patch for hanzi collation

Oh! It's my fault.
Patch re-uploaded.

Comment 6 Mike FABIAN 2017-08-08 10:45:42 UTC

(In reply to Wei-Lun Chao from comment #5)

Thank you!

Should the new stroke count collation also be used for zh_TW, or only
for cmn_TW?

By the way, what is the difference between zh_TW and cmn_TW? Aren't
both Mandarin?

Comment 7 Wei-Lun Chao 2017-08-08 15:14:15 UTC

(In reply to Mike FABIAN from comment #6)
> Should the new collation also be used for zh_TW, or only
> for cmn_TW.
> By the way, what is the difference between zh_TW 
> and cmn_TW, isn’t both Mandarin?

For the reasons given in bug 15963: those 14 languages have been hidden behind the macrolanguage "zh" for a long time. Technically zh_TW and cmn_TW are the same, but for fairness, IMHO, the locale zh_TW should be deprecated and replaced with cmn_TW and the other Chinese locales.

Personally I would like to differentiate cmn from zh with this radical patch, which may be followed by similar patches against nan_TW, hak_TW, lzh_TW and yue_HK.

Comment 8 Mike FABIAN 2017-08-09 10:43:28 UTC

(In reply to Wei-Lun Chao from comment #7)

OK.

How should I test your patch?

I did this:

Without your patch:

$ echo -e "黄\n木\n機\n期" | LC_ALL=cmn_TW.UTF-8 sort
期
木
機
黄
$

With your patch:

$ echo -e "黄\n木\n機\n期" | LC_ALL=cmn_TW.UTF-8 sort
木
黄
期
機
$

That seems to show that I applied your patch correctly, right?
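
The same check can be sketched as a small script. This is only an illustration, assuming a POSIX shell and that `locale -a` lists the installed locales; the helper name `sort_in_locale` is hypothetical and not part of glibc or coreutils. When cmn_TW.UTF-8 is not installed, the sketch falls back to the C locale, which compares bytes and, for these four characters, gives the same order as the unpatched output above.

```shell
#!/bin/sh
# Hypothetical helper: sort stdin with the collation rules of the locale in $1.
sort_in_locale() {
    LC_ALL="$1" sort
}

# The four test characters from the comment above, one per line.
chars=$(printf '%s\n' 黄 木 機 期)

# Use cmn_TW.UTF-8 only if it is actually installed (spelled utf8 or UTF-8).
if locale -a 2>/dev/null | grep -Eqi '^cmn_TW\.utf-?8$'; then
    printf '%s\n' "$chars" | sort_in_locale cmn_TW.UTF-8
else
    echo "cmn_TW.UTF-8 is not installed; showing C locale order instead" >&2
    printf '%s\n' "$chars" | sort_in_locale C
fi
```

With the patched locale installed, the expected result is the "With your patch" order shown above; that order is consistent with ascending stroke count.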

Comment 9 Wei-Lun Chao 2017-08-09 19:04:39 UTC

(In reply to Mike FABIAN from comment #8)

Yes, I used to test bug 16905 like this:
$ touch 黄 木 機 期
$ ls
$ LC_ALL=cmn_TW.UTF-8 ls
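
A self-contained variant of this `ls` check, sketched under the assumption of a POSIX shell with `mktemp -d` available; the helper name `list_in_locale` is hypothetical. It creates the four files in a scratch directory so the current directory is not touched. If cmn_TW.UTF-8 is not installed, `ls` is expected to fall back to the C locale, so both listings would then look the same.

```shell
#!/bin/sh
# Hypothetical helper: create the four test files in a scratch directory
# and list them using the collation rules of the locale given in $1.
list_in_locale() {
    dir=$(mktemp -d) || return 1
    ( cd "$dir" && touch 黄 木 機 期 && LC_ALL="$1" ls )
    rc=$?
    rm -rf "$dir"    # clean up the scratch directory
    return "$rc"
}

echo "C locale order:"
list_in_locale C

echo "cmn_TW.UTF-8 order (same as above if the locale is missing):"
list_in_locale cmn_TW.UTF-8
```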

Comment 10 Sourceware Commits 2017-08-10 11:49:54 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  bd80111ed9cb93b2d56720dcd1d1f259616c27ae (commit)
       via  4169825556bcc23ced731e711be91819465d4a83 (commit)
       via  38dbcacb606f70ad0a35fbcacb6f3cbff5f34d94 (commit)
      from  68dc02d1dcbfb37ee22327d6a3c43f528d593035 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bd80111ed9cb93b2d56720dcd1d1f259616c27ae

commit bd80111ed9cb93b2d56720dcd1d1f259616c27ae
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Thu Aug 10 12:16:29 2017 +0200

    Fix stdlib/tst-strfmon_l.c test case to agree with the changes in Indian monetary formatting
    
    The test cases should expose non-standard grouping and the trailing
    space after the currency sign. After the changes to the Indian
    monetary formatting, the Indian formatting still shows the
    non-standard grouping. To test the trailing space after the currency
    sign I chose the hr_HR locale.
    
    See:
    
        commit 82b3124268bec0609b337dd993e771c93e44cbf2
        Author: Akhilesh Kumar <akhilesh.k@samsung.com>
    
            Remove redundant data for LC_MONETARY for Indian locales

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4169825556bcc23ced731e711be91819465d4a83

commit 4169825556bcc23ced731e711be91819465d4a83
Author: Akhilesh Kumar <akhilesh.k@samsung.com>
Date:   Wed Aug 9 18:27:14 2017 +0530

    Remove redundant data for LC_MONETARY for Indian locales
    
    	Reference is taken from
    	https://en.wikipedia.org/wiki/Indian_numbering_system
    	https://en.wikipedia.org/wiki/Indian_rupee
    
    	CLDR has the currency format pattern “¤#,##,##0.00”.
    
    	[BZ #21836]
    	* locales/ar_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/as_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/bhb_IN (LC_MONETARY): copy "hi_IN"
    	* locales/bn_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/en_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/gu_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/hi_IN (LC_MONETARY) : Fix mon_grouping,
    	p_sep_by_space and n_sep_by_space
    	* locales/kn_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/kok_IN(LC_MONETARY) : copy "hi_IN"
    	* locales/ks_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/ml_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/mr_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/or_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/pa_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/sa_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/sd_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/ta_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/tcy_IN(LC_MONETARY) : copy "hi_IN"
    	* locales/te_IN (LC_MONETARY) : copy "hi_IN"
    	* locales/ur_IN (LC_MONETARY) : copy "hi_IN"

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=38dbcacb606f70ad0a35fbcacb6f3cbff5f34d94

commit 38dbcacb606f70ad0a35fbcacb6f3cbff5f34d94
Author: Wei-Lun Chao <bluebat@member.fsf.org>
Date:   Wed Aug 9 12:19:44 2017 +0200

    cmn_TW: add hanzi collation
    
    	[BZ #17563]
    	[BZ #16905]
    	* locales/cmn_TW (LC_COLLATE): Use cns11643_stroke file for sorting.
    	* locales/cmn_TW (LC_TIME): Improve time and date formats.
    	* locales/cmn_TW (LC_MESSAGES): Add yesstr and nostr.
    	* locales/cns11643_stroke: New file, stroke count collation for
    	traditional Chinese.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                          |    7 +
 localedata/ChangeLog               |   41 +
 localedata/locales/ar_IN           |   22 +-
 localedata/locales/as_IN           |   22 +-
 localedata/locales/bhb_IN          |    2 +-
 localedata/locales/bn_IN           |   22 +-
 localedata/locales/cmn_TW          |   44 +-
 localedata/locales/cns11643_stroke |70754 ++++++++++++++++++++++++++++++++++++
 localedata/locales/en_IN           |   22 +-
 localedata/locales/gu_IN           |   21 +-
 localedata/locales/hi_IN           |   16 +-
 localedata/locales/kn_IN           |   21 +-
 localedata/locales/kok_IN          |   22 +-
 localedata/locales/ks_IN           |   23 +-
 localedata/locales/ml_IN           |   25 +-
 localedata/locales/mr_IN           |   22 +-
 localedata/locales/or_IN           |   22 +-
 localedata/locales/pa_IN           |   18 +-
 localedata/locales/sa_IN           |   21 +-
 localedata/locales/sd_IN           |   22 +-
 localedata/locales/ta_IN           |   22 +-
 localedata/locales/tcy_IN          |    2 +-
 localedata/locales/te_IN           |   22 +-
 localedata/locales/ur_IN           |    2 +-
 stdlib/Makefile                    |    2 +-
 stdlib/tst-strfmon_l.c             |   20 +-
 26 files changed, 70868 insertions(+), 371 deletions(-)
 create mode 100644 localedata/locales/cns11643_stroke

Comment 11 Mike FABIAN 2017-08-10 13:12:25 UTC

FIXED.

Comment 12 Mike FABIAN 2017-08-10 13:21:16 UTC

(In reply to Wei-Lun Chao from comment #7)

What about the translations? On Fedora 26, most translations at the moment
are in

/usr/share/locale/zh_TW/

and very few are in /usr/share/locale/cmn/ 

I also wonder why only "cmn" exists and not "cmn_TW" and "cmn_CN";
one would probably need to distinguish between traditional and simplified
here as well. As there is no cmn_CN locale, this does not matter at the
moment, but it might matter in the future ...

Users of zh_TW and cmn_TW would probably want the same translations, so maybe
one of these folders should be a symlink to the other?

Comment 13 Wei-Lun Chao 2017-08-11 02:10:04 UTC

(In reply to Mike FABIAN from comment #12)

Thanks for your concern :)