Bug 10501 - bn_IN collation does not have canonical equivalence definitions
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.10
Importance: P2 normal
Target Milestone: ---
Assignee: GNU C Library Locale Maintainers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-08-09 05:12 UTC by Santhosh Thottingal
Modified: 2014-07-01 07:21 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description Santhosh Thottingal 2009-08-09 05:12:46 UTC
The bn_IN collation definitions do not have canonical equivalence definitions
for the canonical decompositions of the following letters:
U+09CB BENGALI VOWEL SIGN O
U+09CC BENGALI VOWEL SIGN AU
U+09DC BENGALI LETTER RRA
U+09DD BENGALI LETTER RHA
U+09DF BENGALI LETTER YYA
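
For reference, the canonical decompositions of these five letters can be read
from the Unicode Character Database. The following is a minimal sketch using
Python's standard unicodedata module; the helper name canonical_decomposition
is illustrative and not part of glibc or of the original report.

```python
import unicodedata

# The five code points named in this report.
CODE_POINTS = [0x09CB, 0x09CC, 0x09DC, 0x09DD, 0x09DF]

def canonical_decomposition(cp):
    """Return the canonical decomposition of cp as a list of code points.

    Returns an empty list if the character has no decomposition or only a
    compatibility decomposition (marked with a leading "<tag>").
    """
    mapping = unicodedata.decomposition(chr(cp))
    if not mapping or mapping.startswith("<"):
        return []
    return [int(field, 16) for field in mapping.split()]

if __name__ == "__main__":
    for cp in CODE_POINTS:
        parts = " ".join(f"U+{p:04X}" for p in canonical_decomposition(cp))
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))} = {parts}")
```

Each of these decomposed sequences is canonically equivalent to the single
code point and would need the same collation weights.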

How to reproduce:
Sort the following sequences with LANG=bn_IN.UTF-8:
কো  written using canonical decomposition of U+09CB BENGALI VOWEL SIGN O
কৈ
কো 
The expected sorting order is 
কৈ
কো
কো
But the actual result is 
কো
কৈ
কো
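
The reproduction can also be scripted so that the precomposed and decomposed
forms are unambiguous. This is a rough sketch, assuming the bn_IN.UTF-8 locale
is installed on the test system; the constant names and the sort_in_locale
helper are illustrative. It prints code points rather than glyphs because the
two forms of the syllable render identically.

```python
import locale
import unicodedata

KA = "\u0995"                     # BENGALI LETTER KA
PRECOMPOSED = KA + "\u09CB"       # KA + VOWEL SIGN O
DECOMPOSED = KA + "\u09C7\u09BE"  # KA + VOWEL SIGN E + VOWEL SIGN AA
KAI = KA + "\u09C8"               # KA + VOWEL SIGN AI

def sort_in_locale(words, name="bn_IN.UTF-8"):
    """Sort words with the named locale's collation; None if unavailable."""
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    return sorted(words, key=locale.strxfrm)

if __name__ == "__main__":
    # Sanity check: the two spellings are canonically equivalent.
    same = unicodedata.normalize("NFC", DECOMPOSED) == PRECOMPOSED
    print("canonically equivalent:", same)

    result = sort_in_locale([DECOMPOSED, KAI, PRECOMPOSED])
    if result is None:
        print("bn_IN.UTF-8 is not installed on this system")
    else:
        for word in result:
            print(" ".join(f"U+{ord(c):04X}" for c in word))
```

With correct canonical equivalence handling, the sequence containing U+09C8
would be printed first, followed by the two equivalent spellings.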
Comment 1 Pravin S 2009-08-10 04:16:30 UTC
(In reply to comment #0)
> The bn_IN collation definitions do not have canonical equivalence definitions
> for the canonical decomposition of the following letters :
> U+09CB BENGALI VOWEL SIGN O
> U+09CC BENGALI VOWEL SIGN AU

These combinations never occur in real-world typed data, so there is no need to
handle them.

Even if somebody types them by mistake, we are supposed to tell them that this
is incorrect and can enable spoofing (note that these are not normalized
sequences). That is why rendering engines show a dotted circle for these
combinations. Please check Qt and ICU; there is a bug in Pango. If possible,
check Uniscribe as well.

 
> U+09DC BENGALI LETTER RRA
> U+09DD BENGALI LETTER RHA
> U+09DF BENGALI LETTER YYA

This is already handled.

So IMO we should close this bug as Not a Bug.
Comment 2 Santhosh Thottingal 2009-08-17 11:40:26 UTC
Refer to the UCA collation rules:
http://www.unicode.org/Public/UCA/latest/allkeys.txt
[...]
09CB  ; [.1B48.0020.0002.09CB] # BENGALI VOWEL SIGN O
09C7 09BE ; [.1B48.0020.0002.09CB] # BENGALI VOWEL SIGN O
09CC  ; [.1B49.0020.0002.09CC] # BENGALI VOWEL SIGN AU
09C7 09D7 ; [.1B49.0020.0002.09CC] # BENGALI VOWEL SIGN AU
[...]

This is implemented in UCA and should be available in glibc localedata too.
That is, the collation weights of canonically equivalent sequences should be
explicitly defined in glibc, and no assumptions should be made about the input
to the collation.
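
The point about allkeys.txt can be checked mechanically. Below is a small
sketch that parses only the four entries quoted above and groups the code point
sequences by collation element; the parse_allkeys helper is illustrative and
handles just the basic "sequence ; [elements] # comment" line shape, not the
full file format.

```python
import re

# The four entries quoted above from allkeys.txt.
ALLKEYS_EXCERPT = """\
09CB  ; [.1B48.0020.0002.09CB] # BENGALI VOWEL SIGN O
09C7 09BE ; [.1B48.0020.0002.09CB] # BENGALI VOWEL SIGN O
09CC  ; [.1B49.0020.0002.09CC] # BENGALI VOWEL SIGN AU
09C7 09D7 ; [.1B49.0020.0002.09CC] # BENGALI VOWEL SIGN AU
"""

def parse_allkeys(text):
    """Map each code point sequence (a tuple) to its collation elements."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line or line.startswith("@"):  # skip blanks and directives
            continue
        sequence, elements = line.split(";", 1)
        key = tuple(int(cp, 16) for cp in sequence.split())
        table[key] = re.findall(r"\[([^\]]+)\]", elements)
    return table

if __name__ == "__main__":
    by_weight = {}
    for key, elements in parse_allkeys(ALLKEYS_EXCERPT).items():
        by_weight.setdefault(tuple(elements), []).append(key)
    for elements, keys in by_weight.items():
        names = ", ".join(" ".join(f"{cp:04X}" for cp in k) for k in keys)
        print(f"{' '.join(elements)} <- {names}")
```

In this excerpt the precomposed code point and its decomposed sequence share
one collation element, which is the behaviour requested for bn_IN.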
Comment 3 Sayamindu Dasgupta 2009-08-17 12:18:36 UTC
(In reply to comment #2)
> Refer the collation rules of UCA -
> http://www.unicode.org/Public/UCA/latest/allkeys.txt
> [...]
> 09CB  ; [.1B48.0020.0002.09CB] # BENGALI VOWEL SIGN O
> 09C7 09BE ; [.1B48.0020.0002.09CB] # BENGALI VOWEL SIGN O
> 09CC  ; [.1B49.0020.0002.09CC] # BENGALI VOWEL SIGN AU
> 09C7 09D7 ; [.1B49.0020.0002.09CC] # BENGALI VOWEL SIGN AU
> [...]
> 
> It is implemented in UCA and should be available in glibc localedata too. ie,
> Collation weights of canonically equivalent sequences should be explicitly
> defined in glibc and there should not be any assumption on the input to the
> collation.
> 

I would tend to second Santhosh here, since we do not know where the data might
be coming from (e.g., someone might take a shortcut while implementing a
legacy-encoding-to-Unicode converter).