This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: Collation of INFINITY vs. EMPTY SET?
- From: Mike FABIAN <mfabian at redhat dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>
- Cc: Pravin Satpute <psatpute at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, Mike Frysinger <vapier at gentoo dot org>
- Date: Tue, 17 May 2016 07:33:16 +0200
- Subject: Re: Collation of INFINITY vs. EMPTY SET?
- Authentication-results: sourceware.org; auth=none
- References: <573AA6FF dot 3020604 at redhat dot com>
"Carlos O'Donell" <carlos@redhat.com> wrote:
> In en_US we use localedata/locales/iso14651_t1_common
> for collation.
>
> A recent Fedora bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1336308
>
> Shows we don't have code-point-based collation for
> elements which are not defined in the collation source
> files (the locale files themselves and the files
> they include to define their LC_COLLATE).
I reported a similar bug a while ago:
https://sourceware.org/bugzilla/show_bug.cgi?id=18978
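For anyone who wants to check how two such characters compare on their own system, here is a minimal sketch. It is not taken from either bug report: the choice of Python and the helper name collate_sign are my own assumptions, and it only relies on locale.strcoll from the standard library, which hands the comparison to the C library. I am deliberately not stating an expected result, since it depends on the installed glibc and locale data.

```python
import locale


def collate_sign(a, b, locale_name):
    """Return -1, 0 or 1 for how `a` collates against `b` in `locale_name`.

    Hypothetical helper for illustration. Returns None when the requested
    locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        result = locale.strcoll(a, b)
    finally:
        # Always restore the previous collation locale.
        locale.setlocale(locale.LC_COLLATE, saved)
    return (result > 0) - (result < 0)


if __name__ == "__main__":
    infinity = "\u221e"   # INFINITY
    empty_set = "\u2205"  # EMPTY SET
    for name in ("C", "en_US.UTF-8"):
        print(name, collate_sign(infinity, empty_set, name))
```

A result of 0 in a locale would mean the two characters are treated as equal there, while -1 or 1 would mean they are ordered.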
> In localedata/locales/iso14651_t1_common we have the
> following comment:
>
> 4827 # Any character not precisely specified will be considered as a special
> 4828 # character and considered only at the last level.
> 4829 # <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
>
> ... and then:
>
> 5001 # The comment at the beginning of this section mentions characters which
> 5002 # are not otherwise covered. But this description cannot express this.
> 5003 # Therefore we add here a few entries which are used in older implementations
> 5004 # to be compatible. --drepper
>
> I always thought we would fall back to code point
> order (former comment implies), but Drepper's comment
> makes it seem like that's not true? The Fedora bug
> also makes it seem like that's not true.
>
> Why might we not want code-point-based sorting for
> entries not defined?
>
> Is the solution to write automation to create iso14651_t1_common
> and list all the unspecified elements in code point order?
POSIX defines an UNDEFINED symbol for this purpose:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a warning
opengroup> message and place such characters at the end of the
opengroup> character collation order.
If this UNDEFINED symbol worked as specified, we could easily use code
point order as a fallback for entries not defined in the collation
order by inserting the UNDEFINED symbol in the LC_COLLATE definition
of the locale sources at an appropriate place.
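To make the specified behaviour concrete, here is a small sketch of how I read the POSIX text quoted above. It is only a single-level model in Python, not how glibc or localedef implement anything, and the names UNDEFINED and make_collation_key are made up for illustration. Explicitly listed characters sort by their position in the list. Everything else is inserted at the position of the UNDEFINED marker, in ascending code point order, or at the end if no marker is given.

```python
# Hypothetical marker standing in for the UNDEFINED symbol of LC_COLLATE.
UNDEFINED = object()


def make_collation_key(order):
    """Build a sort key from a list of characters plus an optional UNDEFINED.

    Simplified single-level model of the POSIX wording, for illustration only.
    """
    if UNDEFINED in order:
        undefined_pos = order.index(UNDEFINED)
    else:
        # No UNDEFINED symbol: unlisted characters go to the end.
        undefined_pos = len(order)
    rank = {ch: i for i, ch in enumerate(order) if ch is not UNDEFINED}

    def char_key(ch):
        if ch in rank:
            return (rank[ch], 0)
        # Unlisted character: placed at the UNDEFINED position,
        # ordered among its peers by coded character set value.
        return (undefined_pos, ord(ch))

    return lambda s: [char_key(c) for c in s]


if __name__ == "__main__":
    key = make_collation_key(["a", "b", UNDEFINED, "z"])
    print(sorted(["z", "\u221e", "b", "\u2205", "a"], key=key))
```

In this model EMPTY SET (U+2205) and INFINITY (U+221E) are not listed, so they land between "b" and "z" and are ordered by code point.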
Unfortunately, UNDEFINED does not work as specified in glibc.
Some locale sources use it, but it does not have the specified effect.
--
Mike FABIAN <mfabian@redhat.com>