This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).


On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
On 07/19/2018 04:39 PM, Florian Weimer wrote:
On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.

[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.

Sorry, I don't follow, it absolutely matches ASCII z.

The 𝚣 I wrote above is not ASCII z; it is one of the non-BMP math characters.

We deinterlace the collation element ordering (not sequence) to get
the right range expression resolution.

See the added fnmatch tests:

+en_US.UTF-8     "a"                    "[a-z]"                0
+en_US.UTF-8     "z"                    "[a-z]"                0
+en_US.UTF-8     "A"                    "[a-z]"                NOMATCH
+en_US.UTF-8     "Z"                    "[a-z]"                NOMATCH
+en_US.UTF-8     "a"                    "[A-Z]"                NOMATCH
+en_US.UTF-8     "z"                    "[A-Z]"                NOMATCH
+en_US.UTF-8     "A"                    "[A-Z]"                0
+en_US.UTF-8     "Z"                    "[A-Z]"                0
+en_US.UTF-8     "0"                    "[0-9]"                0
+en_US.UTF-8     "9"                    "[0-9]"                0

[a-z] matches a-z (including z), *and* all the lowercase characters in
between, and so effectively behaves like [:lower:].
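
To make the range resolution concrete, here is a toy model in Python. It is not glibc code: both orderings and the helper in_range are hypothetical, and real locale collation data is far larger. It only assumes that a range matches every character whose position in the ordering lies between the two endpoints.

```python
import string

LOWER = string.ascii_lowercase
UPPER = string.ascii_uppercase

# Hypothetical interleaved ordering: a A b B ... z Z.
INTERLEAVED = "".join(lo + up for lo, up in zip(LOWER, UPPER))

# Hypothetical deinterlaced ordering: all lowercase first (with ñ placed
# after n), then all uppercase (with Ñ placed after N).
DEINTERLACED = LOWER.replace("n", "nñ") + UPPER.replace("N", "NÑ")


def in_range(order, lo, hi, ch):
    """Return True if ch lies between lo and hi in the given ordering."""
    if ch not in order:
        return False
    return order.index(lo) <= order.index(ch) <= order.index(hi)


if __name__ == "__main__":
    # Compare how [a-z] resolves under the two toy orderings.
    for ch in "azAZñ":
        print(ch,
              in_range(INTERLEAVED, "a", "z", ch),
              in_range(DEINTERLACED, "a", "z", ch))
```

In the interleaved model, [a-z] picks up A (it sits between a and z) but not Z (it sorts after z). In the deinterlaced model, [a-z] excludes both A and Z while still including ñ, which is consistent with the fnmatch test lines above.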

There are characters equivalent to ASCII z (like the 𝚣 above) that sort after z, so they are not matched. This is one reason why I think this is a bad idea: it looks like [:lower:], but it is not. I assume the same applies to [0-9].

It's an improvement, and it may be good enough for glibc 2.28, but I would
rather see us implement the rational ranges interpretation.

Does that require that all ranges behave rationally?

We could fix a-z, A-Z, and 0-9 easily.

Patch attached.

(NB: Patch is relative to the previous patch.)

My enumeration tester likes it much more. 8-)

  actual:   "abcdefghijklmnopqrstuvwxyz"
  actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  actual:   "0123456789"

That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1. However, I still get this:

tst-regex-classes.script:85:0: result character set difference in locale tr_TR.ISO-8859-9
enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
^
  expected: "abcdefghijklmnopqrstuvwxyz"
  actual:   "abcdefghjklmnopqrstuvwxyz"
tst-regex-classes.script:86:0: result character set difference in locale tr_TR.ISO-8859-9
enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
^
  expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
error: 2 test failures

Can you fix this with data-only changes, too?
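
For readers without the enumeration tester at hand, the idea can be sketched in Python. This is an assumption-laden illustration, not the actual tester: the name enumerate_chars is borrowed from the script output above, but its body and the helper missing_chars are hypothetical. Python's re module resolves ranges by code point and ignores the locale, so it shows the rational-ranges result rather than what glibc produces.

```python
import re


def enumerate_chars(pattern, candidates):
    """Return the candidates, in order, that the bracket expression matches."""
    compiled = re.compile(pattern)
    return "".join(ch for ch in candidates if compiled.fullmatch(ch))


def missing_chars(expected, actual):
    """Return the characters of expected that are absent from actual."""
    return "".join(ch for ch in expected if ch not in actual)


if __name__ == "__main__":
    # Candidate set: every code point in the Latin-1 range.
    latin1 = [chr(cp) for cp in range(256)]
    print(enumerate_chars("[a-z]", latin1))

    # Difference between the expected and actual strings reported
    # for tr_TR.ISO-8859-9 above.
    print(missing_chars("abcdefghijklmnopqrstuvwxyz",
                        "abcdefghjklmnopqrstuvwxyz"))
    print(missing_chars("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                        "ABCDEFGHJKLMNOPQRSTUVWXYZ"))
```

Applied to the two failures above, the difference is exactly "i" for [a-z] and "I" for [A-Z].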

posix/bug-regex17 regresses as well in the test for bug 9697, but I can incorporate that into my enumeration tester. I don't think the bug itself is regressing; the test simply does not express its objective properly.

posix/tst-rxspencer fails as well, presumably due to this:

UTF-8 aA FAIL regcomp failed: Invalid range end
UTF-8 aAcC FAIL regcomp failed: Invalid range end

I think this happens because the test blindly replaces ASCII characters with non-ASCII characters, which causes issues if they are not ordered as expected.
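
A toy Python model of that failure mode, under stated assumptions: the orderings, the REPLACEMENT map, and the helper names are all hypothetical and are not taken from tst-rxspencer or from any real locale. The sketch only shows how a range that is valid in ASCII can end up with its start sorting after its end once the characters are substituted.

```python
# Hypothetical orderings: plain ASCII, and a collation in which the
# replacement for "c" happens to sort before the replacement for "a".
ASCII_ORDER = "abc"
NON_ASCII_ORDER = "çàá"

# Hypothetical blind substitution of ASCII characters.
REPLACEMENT = {"a": "à", "b": "á", "c": "ç"}


def substitute(text):
    """Replace each ASCII character that has a non-ASCII stand-in."""
    return "".join(REPLACEMENT.get(ch, ch) for ch in text)


def range_is_valid(order, lo, hi):
    """A range is valid only if lo does not sort after hi."""
    return order.index(lo) <= order.index(hi)


if __name__ == "__main__":
    lo, hi = "a", "c"
    print(range_is_valid(ASCII_ORDER, lo, hi))
    print(range_is_valid(NON_ASCII_ORDER, substitute(lo), substitute(hi)))
```

In this model the ASCII range a-c is valid, while the substituted range is not, which is the situation a regcomp implementation would be expected to reject with an error such as "Invalid range end".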

Thanks,
Florian

