10502 – sorting between Indic Languages should be as per unicode code point

Summary: sorting between Indic Languages should be as per unicode code point

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.38

Importance:	P2 normal
Target Milestone:	---
Assignee:	Mike FABIAN

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-08-10 04:21 UTC by Pravin S
Modified:	2024-01-05 11:43 UTC
CC List:	2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Description Pravin S 2009-08-10 04:21:49 UTC

Sorting between Indic languages should happen as per Unicode code point,
like the order given in http://www.unicode.org/Public/UCA/latest/allkeys.txt

Presently it is not working like that: Bengali script should come after
Devanagari, but it is coming at the end.

The order should be first Devanagari, then Bengali, Gurumukhi, Gujarati,
and so on, as per Unicode code point.

Comment 1 Mike Frysinger 2016-04-07 17:59:32 UTC

Can you post some example Unicode data that shows incorrect sorting? That will make it easier for us to integrate it into tests to prevent future regressions.

Comment 2 Pravin S 2018-09-15 14:45:55 UTC

Hi Mike,

This is not about incorrect sorting, but about what the order should be when different scripts come together. Example:

I think the order for these code points should be as follows:
  U+0915, U+0995, U+0A15, U+0A95, U+0B15, U+0B95

I don't remember any reference as of now, but when we decide the sorting order between different Unicode scripts, what should we follow? IMO the answer is http://www.unicode.org/Public/UCA/latest/allkeys.txt
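
As a possible starting point for the test data requested in comment 1, here is a minimal Python sketch (not part of the original report). It sorts the six KA letters listed above by code point and, optionally, by the C library's collation rules for a given locale. The helper names and the default locale name en_US.UTF-8 are assumptions made for illustration; if that locale is not installed, the locale-based helper returns None.

```python
import locale
import unicodedata

# Letter KA in six Indic scripts, in the order proposed in the comment above
# (Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil).
KA_LETTERS = ["\u0915", "\u0995", "\u0A15", "\u0A95", "\u0B15", "\u0B95"]


def sort_by_code_point(chars):
    """Return chars sorted by Unicode code point."""
    return sorted(chars, key=ord)


def sort_by_locale(chars, locale_name="en_US.UTF-8"):
    """Return chars sorted by the C library's LC_COLLATE rules.

    locale_name is an assumption; returns None if it is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(chars, key=locale.strxfrm)
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    shuffled = list(reversed(KA_LETTERS))
    for label, result in (
        ("code point", sort_by_code_point(shuffled)),
        ("locale", sort_by_locale(shuffled)),
    ):
        print(label)
        if result is None:
            print("  (locale not available)")
            continue
        for ch in result:
            print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Comparing the two listings on a given system shows whether the installed collation data places these scripts in the proposed order; the result depends on the glibc version and locale in use and has not been verified here.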

Comment 3 Mike FABIAN 2024-01-05 11:38:31 UTC

In 2018, we updated the iso14651_t1_common file to the 2016 version and then adapted the sort order of many locales. So the sort order of these Indic languages should now be in sync with the DUCET (http://www.unicode.org/Public/UCA/latest/allkeys.txt), approximately as it was defined in 2016. So I think the problem in the original comment is fixed.

commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Tue Jan 30 17:59:00 2018 +0100

    Update iso14651_t1_common file to ISO14651_2016_TABLE1_en.txt [BZ #14095]
    
    [BZ #14095] - Review / update collation data from Unicode / ISO 14651
    
    File downloaded from:
    http://standards.iso.org/iso-iec/14651/ed-4/ISO14651_2016_TABLE1_en.txt
    
    Updating this file alone is not enough, there are problems in the new
    file which need to be fixed and the collation rules for many locales
    need to be adapted. This is done by the following patches.
    
    This update also fixes the problem that many characters are treated as
    identical when sorting because they were not yet in the old
    iso14651_t1_common file, see:
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1336308
    - Infinite (∞) and empty set (∅) are treated as if they were the same character by sort and uniq
    
            [BZ #14095]
            * localedata/locales/iso14651_t1_common: Update file to
            latest version from ISO (ISO14651_2016_TABLE1_en.txt).

Comment 4 Mike FABIAN 2024-01-05 11:43:26 UTC

Closing as FIXED.