This is the mail archive of the mailing list for the glibc project.


Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).

On 07/30/2018 01:54 PM, Florian Weimer wrote:
> On 07/30/2018 07:45 PM, Carlos O'Donell wrote:
>> On 07/30/2018 01:39 PM, Florian Weimer wrote:
>>> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>>>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
>>>> This is a WIP, because the number of tests now is too big
>>>> to simply add them to tst-fnmatch.input, and so I'm writing
>>>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>>>> expecting all of the locales to be built for testing, and
>>>> then running through all the rational ranges to test
>>>> inclusion of the required datums.
>>> Let me repeat my suggestion that we should initially fix the locales
>>> with the common collation order, where glibc 2.28 regresses.
>> I do not think it is appropriate to release rational range support on
>> only a subset of the SUPPORTED set of locales. Either we support it on
>> all SUPPORTED locales or we work until we are ready.
>> At present glibc 2.28 does not regress because of commit
>> 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
>> uppercase.
>> In glibc 2.28 we simply have ~2500 characters in the range of a-z,
>> and in 2.27 we had ~250, it's still a large set of non-ASCII characters
>> accepted by the range, all because we caught up to Unicode 9.0.0 with
>> the ISO 14651 collation update (and will soon update to Unicode 10.0.0
>> with the next release, and probably always lagging a bit).
> Ahh.  So it's more complex and a regression longer in the making.

I'm worried I don't quite follow your statement of "longer in the making,"
but let me summarize what I think you wrote, and tell me if I have
it right.

The regression, from the perspective of en_US, is that [a-z] in master
accepts uppercase ASCII characters, and this breaks user expectations.

This is the only regression I consider serious enough to block the
release, and we've fixed it for now.
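
To make the uppercase problem concrete, here is a toy model in Python. It
is not glibc's implementation: it only assumes that a range is evaluated
by position in a collation sequence, and both sequences below are
hypothetical (in particular, placing lowercase before uppercase within
each pair is an assumption).

```python
import string

def in_range_by_sequence(ch, lo, hi, sequence):
    """True if ch lies between lo and hi (inclusive) in the given sequence."""
    pos = {c: i for i, c in enumerate(sequence)}
    if ch not in pos:
        return False
    return pos[lo] <= pos[ch] <= pos[hi]

lower, upper = string.ascii_lowercase, string.ascii_uppercase

# Hypothetical interleaved sequence: a A b B ... z Z
interleaved = "".join(l + u for l, u in zip(lower, upper))
# Hypothetical deinterlaced sequence: a..z followed by A..Z
deinterlaced = lower + upper

# Which uppercase ASCII letters would [a-z] accept under each sequence?
accepted_interleaved = [c for c in upper
                        if in_range_by_sequence(c, "a", "z", interleaved)]
accepted_deinterlaced = [c for c in upper
                         if in_range_by_sequence(c, "a", "z", deinterlaced)]

print("interleaved:  ", "".join(accepted_interleaved))
print("deinterlaced: ", "".join(accepted_deinterlaced))
```

In this model the interleaved sequence lets every uppercase letter except
Z into [a-z], while the deinterlaced one lets none in.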

The regression which you say is "longer in the making" is that at some
point in the past the collation data for en_US contained only ASCII
ranges for a-z, A-Z, and 0-9. Then, at some later point, the ranges,
particularly a-z and A-Z, began accepting non-ASCII characters.

Thus the regression, from your perspective, happened far in the past.

As far as I can tell the regression has existed since the first import
for en_US which copied LC_COLLATE from en_DK (showing en_DK):

f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000  967) <A> <A>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000  968) <a> <A>;<NONE>;<SMALL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'>        <Z>;<ACUTE>;<CAPITAL>;IGNORE

Is this what you mean by "longer in the making?"
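
For readers less familiar with the format, the en_DK lines quoted above
can be picked apart with a few lines of Python. This is only a sketch
for reading the excerpt: parse_entry is a hypothetical helper, not part
of localedef, and it handles nothing beyond these five lines.

```python
# The five collation entries from the en_DK excerpt, without the blame prefix.
lines = [
    "<A> <A>;<NONE>;<CAPITAL>;IGNORE",
    "<a> <A>;<NONE>;<SMALL>;IGNORE",
    "<Z> <Z>;<NONE>;<CAPITAL>;IGNORE",
    "<z> <Z>;<NONE>;<SMALL>;IGNORE",
    "<Z'>        <Z>;<ACUTE>;<CAPITAL>;IGNORE",
]

def parse_entry(line):
    """Split an entry into (symbol, weights), one weight per level."""
    symbol, weights = line.split(None, 1)
    return symbol, tuple(weights.strip().split(";"))

entries = [parse_entry(line) for line in lines]
order = [symbol for symbol, _ in entries]           # order of appearance
primary = {symbol: w[0] for symbol, w in entries}   # first-level weight

# Symbols that sit strictly between <a> and <z> in the excerpt's order.
between = order[order.index("<a>") + 1:order.index("<z>")]

print(order)
print(primary["<A>"] == primary["<a>"])
print(between)
```

Within the excerpt, <A> and <a> share a first-level weight and differ
only at the third level, and <Z> sits between <a> and <z> in the order
of appearance, which is the interleaving at issue.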

I expect that en_US was at some point along the way switched to use the
iso14651_t1 data, and so gained non-interleaved a-z/A-Z CEO, but it's hard
to tell exactly whether CEO was fully functional or whether fnmatch worked
as expected.
Either way this is all a poorly understood and poorly structured solution
at this point, and I hope that in 1 or 2 releases we go from "unusable
interface" to "rational ranges (data)" to "full rational ranges (code point
ranges)" and end up with a sensible portable solution.

>> I don't see an urgent need to get rational range support into 2.28.
>> I was happy to get it in earlier, but now with deeper testing showing
>> that not all locales are working correctly, I'm not happy to see this
>> go out the door. I think it will be ready very shortly, and we can check
>> it in immediately into 2.29, and then continue our work on code point
>> ranges as the next step, which will require even more testing, and
>> internal API cleanup.
> Sounds reasonable.

That sounds great. I will continue to update this patch set and get some
independent checking from your scripts, and my own testing. I also need
to add collation tests for all the locales I touch to ensure that the
reordering is just that, and that it doesn't materially change the collation
sequence (if it does, it's a bug). This all adds more coverage to the
SUPPORTED set of languages, which is a positive thing.
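
As a rough illustration of the kind of check I mean, here is a Python
sketch. It is not the actual test: the three-level weights, the tiny
alphabet, and the helper names are all made up, and it assumes that
range membership follows the order of entries while sorting follows
their weights.

```python
def make_key(definition):
    """Build a multi-level sort key from (char, (w1, w2, w3)) entries."""
    weights = dict(definition)
    def key(word):
        # Compare all first-level weights, then second, then third.
        return tuple(tuple(weights[ch][level] for ch in word)
                     for level in range(3))
    return key

def same_sort_order(def_a, def_b, words):
    """True if both definitions sort the words identically."""
    return sorted(words, key=make_key(def_a)) == sorted(words, key=make_key(def_b))

def range_members(definition, lo, hi):
    """Characters from lo to hi by order of appearance in the definition."""
    seq = [ch for ch, _ in definition]
    return seq[seq.index(lo):seq.index(hi) + 1]

interleaved = [("a", (1, 0, 0)), ("A", (1, 0, 1)),
               ("b", (2, 0, 0)), ("B", (2, 0, 1)),
               ("c", (3, 0, 0)), ("C", (3, 0, 1))]
# Same entries and weights, only the order of appearance changes.
reordered = [e for e in interleaved if e[0].islower()] + \
            [e for e in interleaved if e[0].isupper()]
# A buggy variant that also renumbers first-level weights by position.
broken = [(ch, (i + 1, 0, 0)) for i, (ch, _) in enumerate(reordered)]

words = ["cab", "Abc", "abc", "Bac"]
print(range_members(interleaved, "a", "c"), range_members(reordered, "a", "c"))
print(same_sort_order(interleaved, reordered, words))
print(same_sort_order(interleaved, broken, words))
```

The reordered definition changes what falls inside a-c but sorts the
word list exactly as before, whereas the broken variant changes the sort
order and would be flagged.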

