This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).


On 07/25/2018 09:33 PM, Jonathan Nieder wrote:
> Hi,
> 
> Carlos O'Donell wrote:
> 
>> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
>> the collation data to harmonize with the new version of ISO 14651
>> which is derived from Unicode 9.0.0.  This collation update brought
>> with it some changes to locales which were not desirable by some
>> users, in particular it altered the meaning of the
>> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
>> for en_US it caused uppercase letters to be matched by [a-z] for the
>> first time.
> 
> The Debian system where it is most convenient for me to test has
> Debian's libc6 package, version 2.24-12.  [a-z] matches uppercase
> letters.  I've always considered that undesirable but I'm confused
> about the described regression.  Did one of Debian's patches to
> localedata cause it to pick up the regression early (by which I mean,
> more than 5 years ago)?

It depends entirely on the locale you use. Some locales already have
[a-z] matching uppercase and have had it for years. The problem is that
this is new for en_US.UTF-8.
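
To make the mechanism concrete, here is a toy sketch in Python (not glibc code, and not real locale data). It assumes a range [lo-hi] is resolved as "every element between lo and hi in the locale's collation element order". Both orderings below are made up for illustration: one interleaves the cases (a A b B ... z Z), the other keeps lowercase contiguous. The helper name range_by_element_order is hypothetical.

```python
import string


def range_by_element_order(order, lo, hi):
    """Return every element from lo to hi (inclusive) in the given order."""
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]


# Hypothetical element order with the cases interleaved: a A b B ... z Z
interleaved = [ch
               for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
               for ch in pair]

# Hypothetical element order with lowercase contiguous: a..z then A..Z
grouped = list(string.ascii_lowercase) + list(string.ascii_uppercase)

if __name__ == "__main__":
    for name, order in (("interleaved", interleaved), ("grouped", grouped)):
        members = range_by_element_order(order, "a", "z")
        print(name, "M" in members, "Z" in members)
```

With the interleaved order, [a-z] picks up A through Y (but not Z, which sorts after z); with the grouped order it contains lowercase only. Which of these a real locale resembles depends on its collation data.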

Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
done something different with iso14651_t1_common to change this, or added
something else. I took a quick look at the Debian patches for 2.24-12 and
didn't see anything that would change this materially for en_US.

>> In glibc we implement the requirement of ISO POSIX-2:1993 and use
>> collation element order (CEO) to construct the range expression, the
>> API internally is __collseq_table_lookup().  The fact that we use CEO
>> and also have 4-level weights on each collation rule means that we can
>> in practice reorder the collation rules in iso14651_t1_common (the new
>> data) to provide consistent range expression resolution *and* the
>> weights should maintain the expected total order.
> [...]
>> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
>>   strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
> 
> Cool!  Checking my understanding: does this mean that if I have files
> 
> 	lll
> 	MMM
> 	nnn
> 
> that with this patch,
> 
> 	echo [a-z]*
> 
> would no longer match MMM, and

Correct.

> 
> 	ls | sort
> 
> would continue to sort in the order lll < MMM < nnn?

Yes.

> 
> I wish we had done it 10 years ago. ;-)  Thanks for getting it done.

The rational ranges follow code point order.

The sorting follows collation sequence.
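
A minimal Python sketch of that split, for illustration only. The range check is modelled with a plain code point comparison (it does not call glibc's regex or fnmatch code), while the sort goes through the C library's strxfrm via Python's locale module. The function names are hypothetical, and the en_US.UTF-8 result depends on that locale being installed and on the glibc version, so no output is claimed for it.

```python
import locale


def in_codepoint_range(ch, lo="a", hi="z"):
    """Model a range [lo-hi] as a plain code point comparison."""
    return ord(lo) <= ord(ch) <= ord(hi)


def glob_like_match(names, lo="a", hi="z"):
    """Keep names whose first character falls in [lo-hi], like `echo [a-z]*`."""
    return [n for n in names if n and in_codepoint_range(n[0], lo, hi)]


def sort_with_locale(names, locale_name):
    """Sort names by the collation of locale_name; None if it is unavailable."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting


if __name__ == "__main__":
    names = ["nnn", "MMM", "lll"]
    print(glob_like_match(names))                  # MMM is not in [a-z]
    print(sort_with_locale(names, "C"))            # byte order: MMM first
    print(sort_with_locale(names, "en_US.UTF-8"))  # locale collation, if present
```

In the C locale the sort is plain byte order, so MMM comes first; with the patched en_US.UTF-8 data the expectation stated above is lll < MMM < nnn.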

I think this never came up because most locales following ISO 14651
were using an old data set which never exhibited the problem. However, thanks
to Mike Fabian's hard work (and no good deed goes unpunished) we have updated
the collation data all the way to the Unicode 9.0.0 era, and so ran into it.

Cheers,
Carlos.

