Bug 18978 - The collation symbol “UNDEFINED” does not work as specified in the standard
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: 2.22
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2015-09-17 11:56 UTC by Mike FABIAN
Modified: 2021-11-02 02:55 UTC

fweimer: security-


Description Mike FABIAN 2015-09-17 11:56:07 UTC
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

says: 

opengroup> Collation Order
opengroup> 
opengroup> [...]
opengroup> 
opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a
opengroup> warning message and place such characters at the end of the
opengroup> character collation order.
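
As a rough illustration of the rule quoted above, here is a small Python
model of where unlisted characters are expected to land. It is only a
sketch of the quoted text, not of glibc's implementation; the names
`collation_order` and `UNDEFINED` are made up for this example.

```python
import warnings

# Hypothetical marker standing in for the UNDEFINED collation symbol.
UNDEFINED = object()


def collation_order(spec, charset):
    """Return the characters of `charset` in the order implied by `spec`.

    `spec` is a list of explicitly ordered characters that may contain
    the UNDEFINED marker once.  Characters of `charset` that are not
    listed are inserted at the marker, in ascending code point order.
    Without a marker they are placed at the end, with a warning.
    """
    explicit = [s for s in spec if s is not UNDEFINED]
    listed = set(explicit)
    # Characters not specified explicitly, sorted by coded value.
    unlisted = sorted(c for c in set(charset) if c not in listed)

    if not any(s is UNDEFINED for s in spec):
        if unlisted:
            warnings.warn("no UNDEFINED symbol; unlisted characters go last")
        return explicit + unlisted

    order = []
    for s in spec:
        if s is UNDEFINED:
            order.extend(unlisted)  # insert at the position of the symbol
        else:
            order.append(s)
    return order
```

For example, `collation_order(["a", "c", UNDEFINED, "d"], "abcd")` returns
`['a', 'c', 'b', 'd']`: the unlisted “b” is placed where the symbol is.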

Unfortunately, it does not work like that in glibc.

For example:

The Japanese locale source file /usr/share/i18n/locales/ja_JP
has this in the LC_COLLATE section:

    mfabian@ari:/usr/share/i18n/locales
    $ grep -A 8 ^LC_COLLATE ja_JP
    LC_COLLATE
    order_start forward
    %
    % C0
    %
    <U0000>
    <U0001>
    <U0002>
    <U0003>
    mfabian@ari:/usr/share/i18n/locales
    $ grep -B 8 '^END LC_COLLATE' ja_JP
    <U9F97>
    <U9F9E>
    <U9FA1>
    <U9FA2>
    <U9FA3>
    <U9FA5>
    UNDEFINED
    order_end
    END LC_COLLATE
    mfabian@ari:/usr/share/i18n/locales
    $

I.e. it includes the “UNDEFINED” collation symbol at the end.

Now if I choose a character which is *not* specified in
the LC_COLLATE section, neither explicitly nor via the ellipsis,
for example:

    ⅞ U+215E VULGAR FRACTION SEVEN EIGHTHS

and check how it sorts, I find:

    mfabian@ari:~/testdir
    $ LANG=ja_JP.UTF-8 ls
    ⅞ A  B  C  D  O  U  Z  a  b  c  d  o  u  z  Þ  æ  đ  ı  ß  İ  ä  ö  ü
    mfabian@ari:~/testdir
    $

I.e. it sorts at the beginning, not at the end. (The other non-ASCII
characters in that example *are* explicitly specified in the
collation order; that is why they appear after “z”, as specified.)
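
The same check can be made without creating files and running ls. The
following Python sketch sorts a few strings with the collation rules of a
given locale via `locale.strxfrm`. The helper name `sort_in_locale` is made
up for this example, and it assumes the ja_JP.UTF-8 locale is installed;
if it is not, the script only reports that.

```python
import locale


def sort_in_locale(names, locale_name):
    """Sort `names` using the LC_COLLATE rules of `locale_name`."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(names, key=locale.strxfrm)
    finally:
        # Restore the previous collation locale in all cases.
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    # U+215E is the character from the example above; U+00E4 is "ä".
    names = ["\u215e", "A", "B", "Z", "a", "z", "\u00e4"]
    try:
        print(sort_in_locale(names, "ja_JP.UTF-8"))
    except locale.Error:
        print("ja_JP.UTF-8 is not installed on this system")
```

If UNDEFINED worked as the standard describes, U+215E would be expected
last in the printed list rather than first.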

To test this further, I created my own variant of

/usr/share/i18n/locales/POSIX

by removing the entry for “B” (<U0042>) from the collation order and
placing the “UNDEFINED” symbol after “C”:

LC_COLLATE
# This is the POSIX Locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII code set.
order_start forward
<U0000>
<U0001>

normal stuff here

modified part follows:

<U0040>         <- @
<U0044>         <- D (moved here to make sure I am really using my modified locale)
<U0041>         <- A
<U0043>         <- C 
UNDEFINED       <- B is *not* specified any more! Therefore it should go here!
<U0045>         <- E
<U0046>         <- F

more normal stuff here

<U007E>
<U007F>
order_end
#
END LC_COLLATE

And when testing this (I installed this modified POSIX locale
using localedef under the name "POSIXMIKE"):

    mfabian@ari:~/testdir
    $ LANG=POSIXMIKE ls
    B  ??  ??  ??  ??  ??  ??  ??  ??  ??  ???  D  A  C  O  U  Z  a  b  c  d  o  u  z
    mfabian@ari:~/testdir
    $

So the now unspecified “B” is sorted at the beginning and *not*
after “C” where the “UNDEFINED” collation symbol is.