This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Collation of INFINITY vs. EMPTY SET?


"Carlos O'Donell" <carlos@redhat.com> wrote:

> In en_US we use localedata/locales/iso14651_t1_common
> for collation.
>
> A recent Fedora bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1336308
>
> Shows we don't have code-point-based collation for
> elements which are not defined in the collation source
> files (the locale files themselves and the files
> they include to define their LC_COLLATE).

I reported a similar bug a while ago:

https://sourceware.org/bugzilla/show_bug.cgi?id=18978

> In localedata/locales/iso14651_t1_common we have the
> following comment:
>
> 4827 # Any character not precisely specified will be considered as a special
> 4828 # character and considered only at the last level.
> 4829 # <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
>
> ... and then:
>
> 5001 # The comment at the beginning of this section mentions characters which
> 5002 # are not otherwise covered.  But this description cannot express this.
> 5003 # Therefore we add here a few entries which are used in older implementations
> 5004 # to be compatible.  --drepper
>
> I always thought we would fall back to code point
> order (former comment implies), but Drepper's comment
> makes it seem like that's not true? The Fedora bug
> also makes it seem like that's not true.
>
> Why might we not want code-point-based sorting for
> entries not defined?
>
> Is the solution to write automation to create iso14651_t1_common
> and list all the unspecified elements in code point order?

POSIX addresses this case with the UNDEFINED symbol:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html

opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a warning
opengroup> message and place such characters at the end of the
opengroup> character collation order.

If this UNDEFINED symbol worked as specified, we could easily use code
point order as a fallback for entries not defined in the collation
order by inserting the UNDEFINED symbol at an appropriate place in the
LC_COLLATE definition of the locale sources.
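
To make the specified behaviour concrete, here is a rough Python model
of the rule quoted above. It is only a sketch of the POSIX text, not
glibc code: the UNDEFINED sentinel, make_sort_key() and the tiny
three-letter order are made up for illustration, and it models a single
comparison level only.

```python
# Toy model of the POSIX UNDEFINED rule. Hypothetical names throughout;
# this is not how glibc or localedef implement collation.

UNDEFINED = object()  # sentinel marking where unlisted characters go


def make_sort_key(order):
    """Build a string sort key from an explicit collation order.

    `order` is a list of single characters that may contain one
    UNDEFINED sentinel. If the sentinel is absent, unlisted characters
    are placed at the end, as the POSIX text describes.
    """
    slot = order.index(UNDEFINED) if UNDEFINED in order else len(order)
    rank = {ch: i for i, ch in enumerate(order) if ch is not UNDEFINED}

    def char_key(ch):
        if ch in rank:
            return (rank[ch], 0)
        # Unlisted characters all share the UNDEFINED slot and are
        # ordered among themselves by ascending code point.
        return (slot, ord(ch))

    return lambda s: [char_key(ch) for ch in s]


# "a" and "b" sort first, unlisted characters next, "c" last.
order = ["a", "b", UNDEFINED, "c"]
key = make_sort_key(order)

# U+221E INFINITY and U+2205 EMPTY SET are not listed in `order`.
result = sorted(["c", "\u221e", "b", "\u2205", "a"], key=key)
print(result)
```

With this toy order, EMPTY SET (U+2205) and INFINITY (U+221E) both land
between "b" and "c", with EMPTY SET first because its code point is
lower. Without the sentinel in the list, both would sort after "c".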

Unfortunately, UNDEFINED does not work as specified in glibc: some
locale sources use it, but it does not have the specified effect there.

-- 
Mike FABIAN <mfabian@redhat.com>

