[Bug localedata/23421] Strange collation rules for A and space with UTF-8 locale when other characters appended

carlos at redhat dot com sourceware-bugzilla@sourceware.org
Tue Jul 17 15:21:00 GMT 2018


https://sourceware.org/bugzilla/show_bug.cgi?id=23421

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
                 CC|                            |carlos at redhat dot com
         Resolution|---                         |INVALID

--- Comment #2 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Benjamin Cama from comment #1)
> Created attachment 11137 [details]
> Test case for collation between letter and space with later letter appended
> 
> OK, this is not actually about letter “A” only: I reproduce it with any
> appended letter which is *after* the one we test against space; see attached
> test case. The different ordering does not happen with a letter *prior* to
> it appended (replace the “D” with “A” in my example).

This is expected.

In en_US.UTF-8 the space, like many special symbols, is ignored for
collation at the first three levels.

Therefore "A" < "B" < " B" < "  B" < "    B" etc.

Notes:
- On master for localedata/locales/iso14651_t1_common we have:
54827 order_start <SPECIAL>;forward;backward;forward;forward,position
55297 <U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
64325 order_start <LATIN>;forward;backward;forward;forward,position
64347 <U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A
- This follows the POSIX locale specification:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
~~~
The special keyword IGNORE as a weight shall indicate that when strings are
compared using the weights at the level where IGNORE is specified, the
collating element shall be ignored; that is, as if the string did not contain
the collating element. In regular expressions and pattern matching, all
characters that are subject to IGNORE in their primary weight form an
equivalence class.
~~~
- SPECIAL and LATIN are sorted with the same rules. When you compare "A" to
" ", the space is ignored until the 4th weight, where it is compared by
Unicode code point, and so the result is "A" > " " (which is true).
- Comparing "A" to " B" then skips the " " (because of IGNORE) and compares
"A" to "B", which results in "A" < " B".
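
As a rough illustration of the two notes above, here is a toy model in Python
of a primary-level comparison that drops ignored characters. It is only a
sketch, not how glibc implements strcoll: the names primary_key and
primary_compare are made up for this example, only ASCII letters and the space
are handled, and levels 2 to 4 are not modelled at all. It therefore reproduces
the outcomes from the bug report, not the level at which glibc decides them.

```python
# Toy model of primary-level collation with IGNORE (not glibc's implementation).
# Assumption: only ASCII letters and the space character are handled.

def primary_key(text):
    """Return the primary weights of text, dropping ignored characters."""
    key = []
    for ch in text:
        if ch == " ":
            # <U0020> has IGNORE as its primary weight, so it contributes nothing.
            continue
        # Stand-in for <S0061>-style weights: letters compare case-insensitively.
        key.append(ord(ch.lower()))
    return tuple(key)


def primary_compare(left, right):
    """Return -1, 0 or 1 from comparing the primary keys only."""
    a, b = primary_key(left), primary_key(right)
    return (a > b) - (a < b)


print(primary_compare("A", " "))    # 1: "A" sorts after " "
print(primary_compare("AD", " D"))  # -1: keys are (a, d) and (d)
print(primary_compare("A", " B"))   # -1: keys are (a) and (b)
print(primary_compare("BA", " A"))  # 1: keys are (b, a) and (a)
```

In this model the order of "A" and " " flips once a later letter such as "D" is
appended to both, and does not flip when an earlier letter is appended, which
matches what the reporter observed.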

If you want full code point sorting you need to use C.UTF-8 which some
distributions provide, and which is still not available in upstream glibc
(though I'm working on it slowly).
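
For comparison, here is a small sketch of what plain code point ordering gives
for the same kind of strings. It relies on Python's built-in string comparison,
which orders by code point, as a stand-in for a code point locale; it does not
call into glibc, and the sample list is made up for this example.

```python
# Python compares str values by code point, so sorted() without a key
# gives code point order. The space (U+0020) sorts before every letter.
strings = ["A", "B", " B", " "]
by_code_point = sorted(strings)
print(by_code_point)  # [' ', ' B', 'A', 'B']
```

Here " B" sorts before "A" because the leading space is compared like any other
character instead of being ignored.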


