This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).

From: Florian Weimer <fweimer at redhat dot com>
To: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, Rich Felker <dalias at aerifal dot cx>, Mike Fabian <mfabian at redhat dot com>, Zorro Lang <zlang at redhat dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>
Date: Fri, 20 Jul 2018 21:19:28 +0200
Subject: Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
References: <9d6f47ec-f9eb-ead0-889c-3b9aae66551c@redhat.com> <d4edb4e5-0750-4dad-eb5a-d5d9fd4d3a53@redhat.com> <f905879a-fd42-331e-eac1-46ed54d06d9e@redhat.com>

On 07/20/2018 08:49 PM, Carlos O'Donell wrote:

On 07/19/2018 04:39 PM, Florian Weimer wrote:

On 07/19/2018 09:43 PM, Carlos O'Donell wrote:

* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.


[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.


Sorry, I don't follow, it absolutely matches ASCII z.


The z I wrote above is one of the non-BMP math characters.

We deinterlace the collation element ordering (not sequence) to get
the right range expression resolution.

See the added fnmatch tests:

+en_US.UTF-8     "a"                    "[a-z]"                0
+en_US.UTF-8     "z"                    "[a-z]"                0
+en_US.UTF-8     "A"                    "[a-z]"                NOMATCH
+en_US.UTF-8     "Z"                    "[a-z]"                NOMATCH
+en_US.UTF-8     "a"                    "[A-Z]"                NOMATCH
+en_US.UTF-8     "z"                    "[A-Z]"                NOMATCH
+en_US.UTF-8     "A"                    "[A-Z]"                0
+en_US.UTF-8     "Z"                    "[A-Z]"                0
+en_US.UTF-8     "0"                    "[0-9]"                0
+en_US.UTF-8     "9"                    "[0-9]"                0
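
For readers who want to try rows like these by hand, here is a minimal sketch that runs a single string/pattern pair through fnmatch() under a chosen locale. The helper name match_in_locale and the LOCALE_UNAVAILABLE sentinel are made up for illustration; they are not part of the patch or of tst-fnmatch.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical sentinel: the requested locale is not installed. */
#define LOCALE_UNAVAILABLE (-100)

/* Hypothetical helper, not glibc API: switch to LOCALE, then match
   STRING against PATTERN with no fnmatch flags.  Returns 0 on a
   match, FNM_NOMATCH on a mismatch, LOCALE_UNAVAILABLE if setlocale
   rejects the locale name.  Note that this changes the process-wide
   locale as a side effect.  */
int
match_in_locale (const char *locale, const char *string,
                 const char *pattern)
{
  assert (locale != NULL && string != NULL && pattern != NULL);
  if (setlocale (LC_ALL, locale) == NULL)
    return LOCALE_UNAVAILABLE;
  return fnmatch (pattern, string, 0);
}
```

Calling match_in_locale ("en_US.UTF-8", "A", "[a-z]") corresponds to the third row of the table, provided that locale is installed. The result there depends on the glibc version and locale data in use, which is the point of the tests.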

[a-z] matches a-z (including z), *and* all the lowercase characters in
between, and so effectively behaves like :lower:.

There are characters equivalent to ASCII z (like the z above), but which
sort after z, so they are not matched. This is one reason why I think
this is a bad idea: it looks like [:lower:], but it's not. Same for
[0-9], I assume.

It's an improvement, and it may be good enough for glibc 2.28, but I would
rather see us implement the rational ranges interpretation.


That requires all ranges behave rationally?

We could fix a-z, A-Z, and 0-9 easily.

Patch attached.


(NB: Patch is relative to the previous patch.)

My enumeration tester likes it much more. 8-)

  actual:   "abcdefghijklmnopqrstuvwxyz"
  actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  actual:   "0123456789"

That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1.
However, I still get this:

tst-regex-classes.script:85:0: result character set difference in locale tr_TR.ISO-8859-9

enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
^
  expected: "abcdefghijklmnopqrstuvwxyz"
  actual:   "abcdefghjklmnopqrstuvwxyz"

tst-regex-classes.script:86:0: result character set difference in locale tr_TR.ISO-8859-9

enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
^
  expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
error: 2 test failures

Can you fix this with data-only changes, too?
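
The enumeration tester itself is not included in this message. As a rough sketch of the idea only, the following compiles a bracket expression with regcomp() and collects every ASCII character that matches it as a one-character string. The function name enumerate_ascii_matches is hypothetical, and restricting the scan to ASCII is a simplifying assumption (the real tester presumably covers the full character set of each locale).

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not the tester from this thread: store in OUT
   every ASCII character (1..127) that matches BRACKET, e.g. "[a-z]",
   under LOCALE.  Returns the number of characters stored, or -1 if
   the locale is unavailable or the expression does not compile.  */
int
enumerate_ascii_matches (const char *locale, const char *bracket,
                         char *out, size_t outsize)
{
  assert (bracket != NULL && out != NULL && outsize > 0);
  out[0] = '\0';
  if (setlocale (LC_ALL, locale) == NULL)
    return -1;

  /* Anchor the expression so it must match the whole subject.  */
  char pattern[256];
  int n = snprintf (pattern, sizeof pattern, "^%s$", bracket);
  if (n < 0 || (size_t) n >= sizeof pattern)
    return -1;

  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  size_t used = 0;
  for (int c = 1; c < 128; c++)
    {
      char subject[2] = { (char) c, '\0' };
      /* Keep one byte free for the terminating NUL.  */
      if (regexec (&re, subject, 0, NULL, 0) == 0 && used + 1 < outsize)
        out[used++] = (char) c;
    }
  out[used] = '\0';
  regfree (&re);
  return (int) used;
}
```

Comparing the resulting string against the expected alphabet, locale by locale, gives the same kind of expected/actual report as shown above.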

posix/bug-regex17 regresses as well in the test for bug 9697, but I can
incorporate that into my enumeration tester. I don't think the bug is
actually regressing, it's just that the test objective is not expressed
properly in it.


posix/tst-rxspencer fails as well, presumably due to this:

UTF-8 aA FAIL regcomp failed: Invalid range end
UTF-8 aAcC FAIL regcomp failed: Invalid range end

I think this happens because the test blindly replaces ASCII characters
with non-ASCII characters, which causes issues if they are not ordered
as expected.
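
To illustrate the failure mode (not the rxspencer test itself): regcomp() rejects a range whose end point sorts before its start point, and regerror() turns the status into a message such as the one quoted above. The sketch below uses the deliberately reversed range [z-a]; the helper name compile_status and the LOCALE_UNAVAILABLE sentinel are hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical sentinel: the requested locale is not installed. */
#define LOCALE_UNAVAILABLE (-100)

/* Hypothetical helper: compile PATTERN as an extended regular
   expression under LOCALE and return the regcomp status.  If MSG is
   not NULL, it receives the regerror text for that status.  */
int
compile_status (const char *locale, const char *pattern,
                char *msg, size_t msgsize)
{
  assert (locale != NULL && pattern != NULL);
  if (setlocale (LC_ALL, locale) == NULL)
    return LOCALE_UNAVAILABLE;

  regex_t re;
  int rc = regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB);
  if (msg != NULL && msgsize > 0)
    regerror (rc, &re, msg, msgsize);
  if (rc == 0)
    regfree (&re);   /* Only a successful compile needs freeing.  */
  return rc;
}
```

With the "C" locale, compile_status ("C", "[z-a]", msg, sizeof msg) should return REG_ERANGE. In a locale where substituted characters collate in an unexpected order, a range that looks well-formed can end up reversed in the same way.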


Thanks,
Florian

