This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.

[Bug regex/23393] Handle [a-z] and [A-Z] in consistent portable fashion regardless of locale.

From: "carlos at redhat dot com" <sourceware-bugzilla at sourceware dot org>
To: glibc-bugs at sourceware dot org
Date: Thu, 19 Jul 2018 13:59:18 +0000
Subject: [Bug regex/23393] Handle [a-z] and [A-Z] in consistent portable fashion regardless of locale.
Auto-submitted: auto-generated
References: <bug-23393-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=23393

--- Comment #22 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Florian Weimer from comment #20)
> The point Rich and I are making is that there is no requirement in POSIX to
> have ranges following collation sorting.  Our current implementations do
> this, but it's not required by POSIX.  We can change the code (and not the
> data).

This is not my interpretation.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

~~~
7. In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence, inclusive.
~~~

Wouldn't we fail to meet that rule if we used code points?
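
To make the difference concrete, here is a minimal sketch of how the same
range expression selects different sets under the two readings. It is a toy
model, not glibc's regex code: the helper range_members and the interleaved
sequence "a A b B ... z Z" are assumptions made for illustration, and the
real collation tables contain far more than the 52 ASCII letters.

```python
import string


def range_members(first, last, sequence):
    """Return the elements of `sequence` from `first` to `last`, inclusive."""
    start = sequence.index(first)
    end = sequence.index(last)
    return sequence[start:end + 1]


# Code point order: in ASCII the uppercase block precedes the lowercase block.
CODE_POINT_ORDER = sorted(string.ascii_letters)

# Hypothetical interleaved collation order (assumption): a A b B ... z Z.
INTERLEAVED_ORDER = [
    ch
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for ch in pair
]

# The range "a-z" evaluated against each ordering.
by_code_point = set(range_members("a", "z", CODE_POINT_ORDER))
by_collation = set(range_members("a", "z", INTERLEAVED_ORDER))

print("B" in by_code_point)  # False: only lowercase letters fall in the range
print("B" in by_collation)   # True: uppercase A..Y sit between a and z
print("Z" in by_collation)   # False: Z sorts after z in this model
```

In this model [a-z] by code point is exactly the 26 lowercase letters, while
[a-z] by interleaved collation picks up 51 letters, everything except Z.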

> > With the ISO 14651 update (derived from Unicode 9.0) we have this issue for
> > all languages that use ISO 14651 as the basis for their collation, and this
> > includes en_US.
> 
> We aren't proposing changes to the collation rules.

For 2.28 I'm going to propose we de-interlace a-zA-Z to solve the issues we
have seen so far, since we're not ready to make a decision.
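
As a rough sketch of what de-interlacing buys us, assume the letters end up
as one lowercase block followed by one uppercase block. That block order and
the helper collation_range are assumptions for illustration only; the actual
ordering in the locale data may differ. Under that assumption a
collation-based range lines up with the usual notion of case again:

```python
import string

# Assumed de-interlaced collation order: a..z followed by A..Z.
DEINTERLACED_ORDER = string.ascii_lowercase + string.ascii_uppercase


def collation_range(first, last, order=DEINTERLACED_ORDER):
    """Return the set of letters from `first` to `last` in `order`, inclusive."""
    return set(order[order.index(first):order.index(last) + 1])


lower = collation_range("a", "z")
upper = collation_range("A", "Z")

print(lower == set(string.ascii_lowercase))  # True
print(upper == set(string.ascii_uppercase))  # True
print(lower.isdisjoint(upper))               # True
```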

This is no worse than what we had before, and scripts that users already fixed
to use the portable [:upper:] and [:lower:] classes still work too.

The 15 locales that use interleaved aA-zZ will remain as-is and need not
change.

> > My opinion is that if we really want to make a change to preserve backwards
> > compatibility it should be in regex and it should be to treat a-z explicitly
> > as :lower: and A-Z explicitly as :upper: and in the case of the existing 15
> > locales, they will have to adjust all of their regexps to match
> > upper/lower-case expectations. The nominal notion of case is a far more
> > compelling argument than code-points, or equivalence classes.
> 
> This still fixes only a subset of the problematic cases.  For example, using
> [0-7] for an octal digit or [0-9a-f] for a lower-case hexadecimal digit
> would still not work, and [a-zA-Z/.] would not match base64 digits only,
> either.

These are based on an erroneous understanding of POSIX regular expressions.

Either way for 2.28 I'm suggesting we revert the lower/upper interleaving in
localedata/locales/iso14651_t1_common for now.

Thoughts?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
