This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug regex/23393] Handle [a-z] and [A-Z] in consistent portable fashion regardless of locale.


https://sourceware.org/bugzilla/show_bug.cgi?id=23393

--- Comment #22 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Florian Weimer from comment #20)
> The point Rich and I are making is that there is no requirement in POSIX to
> have ranges following collation sorting.  Our current implementations do
> this, but it's not required by POSIX.  We can change the code (and not the
> data).

This is not my interpretation.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

~~~
7. In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence, inclusive.
~~~

We would not meet that rule if we used code points?
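
To make rule 7 concrete, here is a minimal sketch (not taken from glibc or from any patch on this bug) that compiles a bracket expression under the "C" locale and tests a string against it. The helper name matches_in_c_locale is hypothetical; the only library calls used are the standard setlocale, regcomp, regexec and regfree. It only shows what a range does in the POSIX locale; it says nothing about en_US or the other locales discussed here.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.
   Returns 1 if TEXT matches PATTERN (POSIX extended syntax) under the
   "C" locale, 0 if it does not match, and -1 on any failure.
   Note: this switches the whole process to the "C" locale.  */
int
matches_in_c_locale (const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  assert (pattern != NULL && text != NULL);

  /* Range expressions are only specified for the POSIX ("C") locale.  */
  if (setlocale (LC_ALL, "C") == NULL)
    return -1;

  /* REG_NOSUB: we only need a yes/no answer, not match offsets.  */
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);

  if (rc == 0)
    return 1;
  return rc == REG_NOMATCH ? 0 : -1;
}
```

In the "C" locale, matches_in_c_locale ("^[a-z]$", "m") is expected to return 1 and matches_in_c_locale ("^[a-z]$", "M") to return 0.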

> > With the ISO 14651 update (derived from Unicode 9.0) we have this issue for
> > all languages that use ISO 14651 as the basis for their collation, and this
> > includes en_US.
> 
> We aren't proposing changes to the collation rules.

For 2.28 I'm going to propose we de-interlace a-zA-Z to solve the issues we
have seen so far, since we're not ready to make a decision.

This is no worse than what we had before, and scripts that users have already
fixed to portably use [:upper:] and [:lower:] will still work too.
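
For reference, the portable form looks like the sketch below. It is only an illustration, assuming a caller that has already selected whatever locale it wants; the helper name matches_in_current_locale is hypothetical. Character class expressions such as [[:upper:]] are resolved through the locale's LC_CTYPE classification, not through the collation order, which is why scripts written this way are unaffected by the interleaving.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.
   Returns 1 if TEXT matches PATTERN (POSIX extended syntax) in the
   locale currently in effect, 0 if it does not match, -1 on failure.
   It deliberately does not call setlocale.  */
int
matches_in_current_locale (const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  assert (pattern != NULL && text != NULL);

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);

  if (rc == 0)
    return 1;
  return rc == REG_NOMATCH ? 0 : -1;
}
```

With this helper, a check such as matches_in_current_locale ("^[[:upper:]]+$", "ABC") asks for "upper-case letters only" without depending on how a-z and A-Z are ordered in the collation data.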

The 15 locales that use interleaved aA-zZ will remain as-is and need not
change.

> > My opinion is that if we really want to make a change to preserve backwards
> > compatibility it should be in regex and it should be to treat a-z explicitly
> > as :lower: and A-Z explicitly as :upper: and in the case of the existing 15
> > locales, they will have to adjust all of their regexps to match
> > upper/lower-case expectations. The nominal notion of case is a far more
> > compelling argument than code-points, or equivalence classes.
> 
> This still fixes only a subset of the problematic cases.  For example, using
> [0-7] for an octal digit or [0-9a-f] for a lower-case hexadecimal digit
> would still not work, and [a-zA-Z/.] would not match base64 digits only,
> either.

These are based on an erroneous understanding of POSIX regular expressions.

Either way for 2.28 I'm suggesting we revert the lower/upper interleaving in
localedata/locales/iso14651_t1_common for now.

Thoughts?

