This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug regex/23393] Handle [a-z] and [A-Z] in consistent portable fashion regardless of locale.
- From: "carlos at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Thu, 19 Jul 2018 13:59:18 +0000
- Subject: [Bug regex/23393] Handle [a-z] and [A-Z] in consistent portable fashion regardless of locale.
- Auto-submitted: auto-generated
- References: <bug-23393-131@http.sourceware.org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=23393
--- Comment #22 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Florian Weimer from comment #20)
> The point Rich and I are making is that there is no requirement in POSIX to
> have ranges following collation sorting. Our current implementations do
> this, but it's not required by POSIX. We can change the code (and not the
> data).
This is not my interpretation.
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
~~~
7. In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence, inclusive.
~~~
Wouldn't we violate that rule if we used code points?
> > With the ISO 14651 update (derived from Unicode 9.0) we have this issue for
> > all languages that use ISO 14651 as the basis for their collation, and this
> > includes en_US.
>
> We aren't proposing changes to the collation rules.
For 2.28 I'm going to propose we de-interlace a-zA-Z to solve the issues we
have seen so far, since we're not ready to make a decision.
This is no worse than what we had before, and scripts that users have already
fixed to use the portable :upper: and :lower: classes will still work too.
The 15 locales that use interleaved aA-zZ will remain as-is and need not
change.
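As a rough illustration of what de-interlacing changes for a range expression,
here is a toy model in Python. It is not glibc's regex code: the two sequences
cover only the ASCII letters, the helper range_members is hypothetical, and the
de-interlaced order (all lower-case letters, then all upper-case letters) is an
assumption about how the reverted data would sort.

```python
import string

# Toy collation sequences over ASCII letters only (hypothetical; real
# locale data covers far more characters and uses multi-level weights).
# Interleaved order: a A b B ... z Z
INTERLEAVED = [
    c
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for c in pair
]
# De-interlaced order (assumed): a ... z A ... Z
DEINTERLACED = list(string.ascii_lowercase + string.ascii_uppercase)


def range_members(sequence, first, last):
    """Return the elements from first to last, inclusive, in sequence order."""
    start, end = sequence.index(first), sequence.index(last)
    return sequence[start:end + 1]


interleaved = "".join(range_members(INTERLEAVED, "a", "z"))
deinterlaced = "".join(range_members(DEINTERLACED, "a", "z"))

print(interleaved)   # a-z also picks up A through Y, but not Z
print(deinterlaced)  # a-z picks up only the lower-case letters
```

In this model the interleaved a-z range contains 51 elements and the
de-interlaced one contains exactly the 26 lower-case letters, which is the
behaviour most existing scripts expect.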
> > My opinion is that if we really want to make a change to preserve backwards
> > compatibility it should be in regex and it should be to treat a-z explicitly
> > as :lower: and A-Z explicitly as :upper: and in the case of the existing 15
> > locales, they will have to adjust all of their regexp's to match
> > upper/lower-case expectations. The nominal notion of case is a far more
> > compelling argument than code-points, or equivalence classes.
>
> This still fixes only a subset of the problematic cases. For example, using
> [0-7] for an octal digit or [0-9a-f] for a lower-case hexadecimal digit
> would still not work, and [a-zA-Z/.] would not match base64 digits only,
> either.
These are based on an erroneous understanding of POSIX regular expressions.
Either way, for 2.28 I'm suggesting we revert the lower/upper interleaving in
localedata/locales/iso14651_t1_common for now.
Thoughts?