This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
On 07/19/2018 04:39 PM, Florian Weimer wrote:
On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.
[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
Sorry, I don't follow, it absolutely matches ASCII z.
The z I wrote above is one of the non-BMP math characters.
We deinterlace the collation element ordering (not sequence) to get
the right range expression resolution.
See the added fnmatch tests:
+en_US.UTF-8 "a" "[a-z]" 0
+en_US.UTF-8 "z" "[a-z]" 0
+en_US.UTF-8 "A" "[a-z]" NOMATCH
+en_US.UTF-8 "Z" "[a-z]" NOMATCH
+en_US.UTF-8 "a" "[A-Z]" NOMATCH
+en_US.UTF-8 "z" "[A-Z]" NOMATCH
+en_US.UTF-8 "A" "[A-Z]" 0
+en_US.UTF-8 "Z" "[A-Z]" 0
+en_US.UTF-8 "0" "[0-9]" 0
+en_US.UTF-8 "9" "[0-9]" 0
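For readers who want to try expectations like these outside the glibc test harness, here is a minimal sketch in C. The helper name range_matches is hypothetical and is not part of glibc or tst-fnmatch; it only wraps the standard setlocale and fnmatch calls. Whether en_US.UTF-8 is installed on a given system is an assumption, so the helper reports an unavailable locale separately instead of guessing.

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of glibc or its test suite.
   Returns 1 if STRING matches PATTERN under LOCALE, 0 if it does not,
   and -1 if the locale cannot be selected or fnmatch reports an error.  */
static int
range_matches (const char *locale, const char *pattern, const char *string)
{
  /* The locale decides how a range such as [a-z] is resolved.  */
  if (setlocale (LC_ALL, locale) == NULL)
    return -1;

  /* fnmatch takes the pattern first, then the string to test.  */
  int rc = fnmatch (pattern, string, 0);
  if (rc == 0)
    return 1;
  if (rc == FNM_NOMATCH)
    return 0;
  return -1;
}
```

Each line of the test input above maps to one call, for example range_matches ("en_US.UTF-8", "[a-z]", "A"), where a 0 in the input file corresponds to a return value of 1 here and NOMATCH to 0. In the "C" locale the same calls should give the traditional ASCII answers, which makes that locale a convenient baseline.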
[a-z] matches a-z (including z), *and* all the lowercase in between,
and so effectively behaves like [:lower:].
There are characters equivalent to ASCII z (like the z above), but which
sort after z, so they are not matched. This is one reason why I think
this is a bad idea: it looks like [:lower:], but it's not. Same for
[0-9], I assume.
It's an improvement, and it may be good enough for glibc 2.28, but I would
rather see us implement the rational ranges interpretation.
That requires all ranges behave rationally?
We could fix a-z, A-Z, and 0-9 easily.
Patch attached.
(NB: Patch is relative to the previous patch.)
My enumeration tester likes it much more. 8-)
actual: "abcdefghijklmnopqrstuvwxyz"
actual: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
actual: "0123456789"
That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1.
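The enumeration tester itself is not shown in this message. As a rough illustration of the idea only, the sketch below lists every ASCII character that a bracket expression matches under a given locale, using the POSIX regcomp and regexec interfaces. The function name enumerate_ascii_chars, the restriction to ASCII, and the static result buffer are all assumptions made for this example and are not taken from the real tester.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not the tester referred to in this thread.
   Returns a pointer to a static string holding every ASCII character
   (codes 1..127) matched by BRACKET under LOCALE, or NULL on failure.
   The static buffer is overwritten by the next call.  */
static const char *
enumerate_ascii_chars (const char *locale, const char *bracket)
{
  static char result[128];
  char pattern[64];
  regex_t re;
  size_t n = 0;

  if (setlocale (LC_ALL, locale) == NULL)
    return NULL;

  /* Room is needed for '^', '$' and the terminating NUL.  */
  if (strlen (bracket) + 3 > sizeof pattern)
    return NULL;

  /* Anchor the expression so that it must match the whole subject.  */
  snprintf (pattern, sizeof pattern, "^%s$", bracket);
  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return NULL;

  /* Try each ASCII character as a one-character subject string.  */
  for (int c = 1; c < 128; c++)
    {
      char subject[2] = { (char) c, '\0' };
      if (regexec (&re, subject, 0, NULL, 0) == 0)
        result[n++] = (char) c;
    }
  result[n] = '\0';

  regfree (&re);
  return result;
}
```

Comparing the returned string with the expected alphabet gives the same kind of expected/actual report as the output quoted in this message. A locale that is not installed makes the function return NULL, so a caller should check for that before comparing.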
However, I still get this:
tst-regex-classes.script:85:0: result character set difference in locale
tr_TR.ISO-8859-9
enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
^
expected: "abcdefghijklmnopqrstuvwxyz"
actual: "abcdefghjklmnopqrstuvwxyz"
tst-regex-classes.script:86:0: result character set difference in locale
tr_TR.ISO-8859-9
enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
^
expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
actual: "ABCDEFGHJKLMNOPQRSTUVWXYZ"
error: 2 test failures
Can you fix this with data-only changes, too?
posix/bug-regex17 regresses as well in the test for bug 9697, but I can
incorporate that into my enumeration tester. I don't think the bug is
actually regressing, it's just that the test objective is not expressed
properly in it.
posix/tst-rxspencer fails as well, presumably due to this:
UTF-8 aA FAIL regcomp failed: Invalid range end
UTF-8 aAcC FAIL regcomp failed: Invalid range end
I think this happens because the test blindly replaces ASCII characters
with non-ASCII characters, which causes issues if they are not ordered
as expected.
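To show where a message like "Invalid range end" comes from, here is a small sketch that compiles a pattern and returns the regcomp status together with the regerror text. The helper name compile_status is hypothetical. The reversed range [z-a] is only a stand-in for a range whose end point sorts before its start point; it is not the pattern that tst-rxspencer generates.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a POSIX basic regular
   expression and return the regcomp status code.  If MSG is not NULL,
   the corresponding regerror text is copied into it.  */
static int
compile_status (const char *pattern, char *msg, size_t msgsize)
{
  regex_t re;
  int rc = regcomp (&re, pattern, REG_NOSUB);

  /* regerror translates the status code into a readable message.  */
  if (msg != NULL && msgsize > 0)
    regerror (rc, &re, msg, msgsize);

  /* Only a successfully compiled pattern needs to be freed.  */
  if (rc == 0)
    regfree (&re);
  return rc;
}
```

With glibc in the default "C" locale, compile_status ("[z-a]", buf, sizeof buf) should return REG_ERANGE. If the character substitution in the test produces a range whose end points are ordered the other way round in the locale's collation order, the same status would be expected there.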
Thanks,
Florian