Bug 22515 - hsb_DE LC_COLLATE does not use copy "iso14651_t1"
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.26
Importance: P2 normal
Target Milestone: 2.27
Assignee: Mike FABIAN
Reported: 2017-11-29 12:28 UTC by Mike FABIAN
Modified: 2017-12-06 13:48 UTC

fweimer: security-


Description Mike FABIAN 2017-11-29 12:28:25 UTC
LC_COLLATE in localedata/locales/hsb_DE does not build upon

copy "iso14651_t1"

and therefore misses all updates made there.
Comment 1 Mike FABIAN 2017-11-29 12:33:10 UTC
https://unicode.org/cldr/trac/browser/trunk/common/collation/hsb.xml

contains:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "../../common/dtd/ldml.dtd">
<!--
Copyright © 2014 Unicode, Inc.
CLDR data files are interpreted according to the LDML specification (http://unicode.org/reports/tr35/)
For terms of use, see http://www.unicode.org/copyright.html
-->
<ldml>
  <identity>
    <version number="$Revision: 11914 $" />
    <language type="hsb" />
  </identity>
  <collations>
    <collation type="standard" references="Prawopisny słownik hornjoserbskeje rěče, Pawoł Völkel,
                                           wobdźěłał Timo Meškank, 1970/2005, ISBN 3-7420-1920-1 ">
      <cr><![CDATA[
      &C<č<<<Č<ć<<<Ć
      &E<ě<<<Ě
      &H<ch<<<cH<<<Ch<<<CH
      &[before 1] L<ł<<<Ł
      &R<ř<<<Ř
      &S<š<<<Š
      &Z<ž<<<Ž<ź<<<Ź
      ]]></cr>
    </collation>
  </collations>
</ldml>

In glibc, in localedata/locales/hsb_DE, LC_COLLATE contains:

collating-element <D-Z'> from "<U0044><U0179>"
collating-element <D-z'> from "<U0044><U017A>"
collating-element <d-Z'> from "<U0064><U0179>"
collating-element <d-z'> from "<U0064><U017A>"
[...]
<d8>
<D-Z'>	<D-Z'>;<NONE>;<CAPITAL>;IGNORE
<D-z'>	<D-Z'>;<NONE>;<CAPITAL-SMALL>;IGNORE
<d-Z'>	<D-Z'>;<NONE>;<SMALL-CAPITAL>;IGNORE
<d-z'>	<D-Z'>;<NONE>;<SMALL>;IGNORE
[...]

I.e. it contains special rules for sorting dź, which CLDR does not have.
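
To illustrate the difference described above, here is a minimal Python sketch of how a contraction (a "collating-element" in glibc terms) changes the units a word is split into before sorting. The function name split_units and the sample word are hypothetical and only serve the illustration; this is not how glibc or CLDR implement collation, and it assumes precomposed (NFC) input.

```python
def split_units(word, contractions):
    """Split a word into collation units, treating each listed
    contraction as a single unit (longest match first).
    Hypothetical helper, for illustration only."""
    s = word.lower()
    ordered = sorted(contractions, key=len, reverse=True)
    units = []
    i = 0
    while i < len(s):
        for c in ordered:
            if s.startswith(c, i):
                units.append(c)
                i += len(c)
                break
        else:
            # No contraction starts here: the single character is the unit.
            units.append(s[i])
            i += 1
    return units


# Old glibc hsb_DE style: "dź" is a unit of its own (as is "ch").
glibc_style = split_units("dźeń", ["ch", "dź"])
# CLDR style: only "ch" is a contraction, so "dź" stays "d" + "ź".
cldr_style = split_units("dźeń", ["ch"])
```

With the first list, "dź" is one unit that can be given its own position after d; with the second, the word is compared as d followed by ź.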
Comment 2 Mike FABIAN 2017-12-06 08:49:59 UTC
The current hsb_DE locale sorts ć and Ć after t:

<t8>
<U0106>	<U0106>;<NONE>;<CAPITAL>;IGNORE
<U0107>	<U0106>;<NONE>;<SMALL>;IGNORE

I.e. it sorts like this:

   S
   š
   Š
   ć
   Ć
   Z

This seems wrong.
Comment 3 Mike FABIAN 2017-12-06 08:52:27 UTC
The current hsb_DE sorting also contradicts the CLDR sort order;
it sorts like this:

   Z
   ź
   Ź
   ž
   Ž

i.e. it sorts ž after ź. In CLDR it is the other way round:

&Z<ž<<<Ž<ź<<<Ź
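
As a rough model of the primary-level order that the CLDR rules in comment 1 imply, the following Python sketch builds a sort key from a hand-written alphabet. The names HSB_ORDER and hsb_primary_key are hypothetical; the sketch ignores case and accent (secondary and tertiary) levels, assumes precomposed input, and simply sorts unknown characters last.

```python
# Primary order read off the CLDR rules: č and ć after c, ě after e,
# ch after h, ł before l, ř after r, š after s, ž and then ź after z.
HSB_ORDER = [
    "a", "b", "c", "č", "ć", "d", "e", "ě", "f", "g", "h", "ch",
    "i", "j", "k", "ł", "l", "m", "n", "o", "p", "q", "r", "ř",
    "s", "š", "t", "u", "v", "w", "x", "y", "z", "ž", "ź",
]
RANK = {letter: i for i, letter in enumerate(HSB_ORDER)}


def hsb_primary_key(word):
    """Return a tuple of primary ranks for a word (hypothetical helper)."""
    s = word.lower()
    key = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            # "ch" is a contraction and counts as one letter after h.
            key.append(RANK["ch"])
            i += 2
        else:
            # Characters outside the alphabet sort after everything else.
            key.append(RANK.get(s[i], len(HSB_ORDER)))
            i += 1
    return tuple(key)


z_words = sorted(["źa", "za", "ža"], key=hsb_primary_key)
c_words = sorted(["ta", "ća", "ca", "ča", "sa"], key=hsb_primary_key)
```

Here z_words comes out as za, ža, źa and c_words as ca, ča, ća, sa, ta, which matches the CLDR rules rather than the old hsb_DE order described in comments 2 and 3.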
Comment 4 Mike FABIAN 2017-12-06 08:59:24 UTC
There is a little bit of a contradiction in the CLDR data
for collation.

https://unicode.org/cldr/trac/browser/trunk/common/collation/hsb.xml

contains:

    &C<č<<<Č<ć<<<Ć
    &E<ě<<<Ě
    &H<ch<<<cH<<<Ch<<<CH
    &[before 1] L<ł<<<Ł
    &R<ř<<<Ř
    &S<š<<<Š
    &Z<ž<<<Ž<ź<<<Ź

but

https://unicode.org/cldr/trac/browser/trunk/common/main/hsb.xml

contains:

<exemplarCharacters type="index">[A B C Č Ć D {DŹ} E F G H {CH} I J K Ł L M N O P Q R S Š T U V W X Y Z Ž]</exemplarCharacters>

I.e. in the index, DŹ is considered as a special character whereas in
the sorting rules it is not.

Also, Ź is special in the sorting rules but not in the index.
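
The comparison above can be reproduced mechanically. The Python sketch below parses the two snippets quoted in this comment and lists the letters that are special in only one of them. The helper names are hypothetical and the parsing is deliberately minimal (it handles only the rule syntax that appears here, not the full LDML grammar).

```python
import re

INDEX = ("[A B C Č Ć D {DŹ} E F G H {CH} I J K Ł L M N O P Q R S Š "
         "T U V W X Y Z Ž]")
RULES = """
&C<č<<<Č<ć<<<Ć
&E<ě<<<Ě
&H<ch<<<cH<<<Ch<<<CH
&[before 1] L<ł<<<Ł
&R<ř<<<Ř
&S<š<<<Š
&Z<ž<<<Ž<ź<<<Ź
"""


def index_entries(exemplar):
    """Entries of an exemplar set; {XY} counts as one entry."""
    inner = exemplar.strip().strip("[]")
    return {m.group(1) or m.group(2)
            for m in re.finditer(r"\{([^}]+)\}|(\S)", inner)}


def tailored_letters(rules):
    """Upper-cased letters that the rules place relative to a reset anchor."""
    out = set()
    for line in rules.split("\n"):
        line = line.strip()
        if not line.startswith("&"):
            continue
        line = re.sub(r"\[before \d\]", "", line[1:])
        parts = [p.strip() for p in re.split(r"<+", line)]
        out.update(p.upper() for p in parts[1:])  # parts[0] is the anchor
    return out


BASIC = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
index_special = index_entries(INDEX) - BASIC
collation_special = tailored_letters(RULES)
only_in_index = index_special - collation_special
only_in_collation = collation_special - index_special
```

On the data quoted here, only_in_index contains DŹ, and only_in_collation contains Ź together with Ě and Ř. Whether the absence of Ě and Ř from the index is intentional is not stated in this report.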
Comment 5 Sourceware Commits 2017-12-06 11:33:33 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  62ea2193ee4b538b13da1c579113761e0b92376c (commit)
      from  37ac8e635a29810318f6d79902102e2e96b2b5bf (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=62ea2193ee4b538b13da1c579113761e0b92376c

commit 62ea2193ee4b538b13da1c579113761e0b92376c
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Dec 6 10:02:48 2017 +0100

    hsb_DE locale: Base collation on copy "iso14651_t1" [BZ #22515]
    
    	[BZ #22515]
    	* localedata/Makefile: Add hsb_DE.UTF-8 to test-input
    	and to the list of locales to be built for testing.
    	* localedata/hsb_DE.UTF-8.in: New file for testing the collation.
    	* localedata/locales/hsb_DE (LC_COLLATE): Use “copy "iso14651_t1"”
    	and build the collation rules upon that.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                  |    9 +
 localedata/Makefile        |    5 +-
 localedata/hsb_DE.UTF-8.in |   35 +
 localedata/locales/hsb_DE  | 2159 ++------------------------------------------
 4 files changed, 133 insertions(+), 2075 deletions(-)
 create mode 100644 localedata/hsb_DE.UTF-8.in
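
On a system where the rebuilt locale is installed, the new collation can be spot-checked from Python with the standard locale module. This is only a sketch: it assumes the locale is available under the name hsb_DE.UTF-8 (the function returns None if it is not), and sort_with_locale is a hypothetical helper, not part of the glibc test suite.

```python
import locale


def sort_with_locale(words, name="hsb_DE.UTF-8"):
    """Sort words with the given LC_COLLATE locale, or return None
    if that locale is not installed (hypothetical helper)."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares
        # according to the active LC_COLLATE rules.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


result = sort_with_locale(["źa", "ža", "za", "ća", "ta"])
if result is None:
    print("hsb_DE.UTF-8 is not installed on this system")
else:
    print(result)
```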
Comment 6 Mike FABIAN 2017-12-06 11:38:45 UTC
Fixed in glibc master.
Comment 7 Mike FABIAN 2017-12-06 13:48:44 UTC
(In reply to Mike FABIAN from comment #4)
> There is a little bit of a contradiction in the CLDR data
> for collation.
> 
> https://unicode.org/cldr/trac/browser/trunk/common/collation/hsb.xml
> 
> contains:
> 
>     &C<č<<<Č<ć<<<Ć
>     &E<ě<<<Ě
>     &H<ch<<<cH<<<Ch<<<CH
>     &[before 1] L<ł<<<Ł
>     &R<ř<<<Ř
>     &S<š<<<Š
>     &Z<ž<<<Ž<ź<<<Ź
> 
> but
> 
> https://unicode.org/cldr/trac/browser/trunk/common/main/hsb.xml
> 
> contains:
> 
> <exemplarCharacters type="index">[A B C Č Ć D {DŹ} E F G H {CH} I J K Ł L M
> N O P Q R S Š T U V W X Y Z Ž]</exemplarCharacters>
> 
> I.e. in the index, DŹ is considered as a special character whereas in
> the sorting rules it is not.
> 
> Also, Ź is special in the sorting rules but not in the index.

I reported this to CLDR:

https://unicode.org/cldr/trac/ticket/10797