This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).


On 07/19/2018 03:43 PM, Carlos O'Donell wrote:
> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
> the collation data to harmonize with the new version of ISO 14651
> which is derived from Unicode 9.0.0.  This collation update brought
> with it some changes to locales which some users found undesirable;
> in particular it altered the meaning of the locale-dependent range
> expressions, namely [a-z] and [A-Z], and
> for en_US it caused uppercase letters to be matched by [a-z] for the
> first time.  The matching of uppercase letters by [a-z] is something
> which is already known to users of other locales which have this
> property, but this change could cause significant problems for users
> of en_US and other similar locales that had never had this behaviour
> before.
> Whether this behaviour is desirable or not is contentious and GNU Awk
> has this to say on the topic:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
> While the POSIX standard also has this further to say: "RE Bracket
> Expression":
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
> "The current standard leaves unspecified the behavior of a range
> expression outside the POSIX locale. ... As noted above, efforts were
> made to resolve the differences, but no solution has been found that
> would be specific enough to allow for portable software while not
> invalidating existing implementations."
> In glibc we implement the requirement of ISO POSIX-2:1993 and use
> collation element order (CEO) to construct the range expression, the
> API internally is __collseq_table_lookup().  The fact that we use CEO
> and also have 4-level weights on each collation rule means that we can
> in practice reorder the collation rules in iso14651_t1_common (the new
> data) to provide consistent range expression resolution *and* the
> weights should maintain the expected total order.  Therefore this
> patch does three things:
> 
> * Reorder the collation rules for the LATIN script in
>   iso14651_t1_common to deinterlace uppercase and lowercase letters in
>   the collation element orders.
> 
> * Add new test data en_US.UTF-8.in for sort-test.sh which exercises
>   strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
> 
> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>   verify that [a-z] does not match A or Z.
> 
> The reordering of the ISO 14651 data is done in an entirely mechanical
> fashion using the following program attached to the bug:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28
> 
> It is up for discussion if the iso14651_t1_common data should be
> refined further to have 3 very tight collation element ranges that
> include only a-z, A-Z, and 0-9, which would implement the solution
> sought after in:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12
> 
> No regressions on x86_64.
> Verified that removal of the iso14651_t1_common change causes tst-fnmatch
> to regress with:
> 422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ...
> 425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ---
>  ChangeLog                             |   11 +
>  localedata/Makefile                   |    1 +
>  localedata/en_US.UTF-8.in             | 2159 +++++++++++++++++++++++++++++++++
>  localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
>  posix/tst-fnmatch.input               |  125 +-
>  posix/tst-regexloc.c                  |    8 +-
>  6 files changed, 3224 insertions(+), 1008 deletions(-)
>  create mode 100644 localedata/en_US.UTF-8.in
> 
> I'm suggesting this change immediately for 2.28 to avoid further
> problems with users' expectations and sorting with [a-z] and [A-Z] until
> a clearer consensus can be reached for a final solution.
> 
> File attached as .tar.gz to get past spam detectors. There is a lot
> of UTF-8 data in en_US.UTF-8 (every possible character in the LATIN
> set that can be sorted with the existing test case infrastructure).
> 

I have committed only the most conservative fix for this issue, which is
to deinterlace the lower and upper case ranges.

I think it is too late to commit rational ranges for 2.28; we can do that
in 2.29 when it opens. Right now I want to remove the blocker that is
causing regressions for en_US.UTF-8 scripts that use [a-z] and [A-Z].

We have consensus that this is the right direction for a solution. If
anyone objects, please speak up before I cut the branch on August 1st
(if we can still achieve that and get good machine coverage).

Cheers,
Carlos.

