Symptom: GNU Grep does not handle Syriac characters (U+0700 – U+074F) correctly
$ echo 'ܫܠܡܐ' > peace
$ egrep '\<[ܐ-ܬ]' peace
grep: Invalid collation character
$ awk /'\<[ܐ-ܬ]'/ peace
However, when grep is built with ./configure --with-included-regex
it works just fine and there is no REG_ECOLLATE error
$ echo ܫܠܡܐ | src/egrep [ܫ-ܬ]
$ echo ܫܠܡܐ | src/egrep [ܒ-ܓ]
This is because GNU Grep contains an improved version of regcomp.
The bug was found here: http://forum.rosalab.ru/viewtopic.php?f=53&t=6219&p=54747 (in Russian)
It has also been tested and confirmed on Gentoo (both glibc and grep are 2.22).
I expect there are other bugs that could be fixed with this upgrade.
This seems like a bug in the locale definitions (similar to the infamous "[A-Z] matches some lowercase characters" one), not in regex.
What is your locale?
My locale is ru_RU.UTF-8
Yes, I had the same idea at the beginning.
With LC_CTYPE=en_GB there was no error,
but with LC_ALL=en_US.UTF-8 the bug appeared.
Next, I found that both Glibc and Grep contain a file named regcomp.c.
I compared them with diff. They are very similar, but not identical.
The one from Grep is obviously newer. But for some reason grep links with glibc by default.
./configure --with-included-regex forces linking with the newer built-in version.
Then it works flawlessly with the same locale.
I am not a native English speaker and hope my explanation is clear enough.
It works flawlessly because it bypasses the localedata. That's why I moved the bug to localedata. :)
(In reply to Paolo Bonzini from comment #3)
> It works flawlessly because it bypasses the localedata. That's why I moved
> the bug to localedata. :)
Well the localedata is updated as much as possible and we're on Unicode 8.0.0 right now for UTF-8 charsets.
How might we determine exactly what's wrong?
On Fri, 18 Dec 2015, carlos at redhat dot com wrote:
> Well the localedata is updated as much as possible and we're on Unicode 8.0.0
> right now for UTF-8 charsets.
Collation, however, is much more out of date (and probably harder to
correlate with Unicode so we can make sure we're not losing desirable
local changes if we update it). See bug 14095.
(In reply to firstname.lastname@example.org from comment #5)
> On Fri, 18 Dec 2015, carlos at redhat dot com wrote:
> > Well the localedata is updated as much as possible and we're on Unicode 8.0.0
> > right now for UTF-8 charsets.
> Collation, however, is much more out of date (and probably harder to
> correlate with Unicode so we can make sure we're not losing desirable
> local changes if we update it). See bug 14095.
Correct, so if it's a collation issue, which seems likely, it would be good to find a reproducer that shows the problem with Syriac characters via strcoll. Until then, an English-speaking developer is going to have a hard time figuring this out, or the issue will go away once we also start automating the collation data updates (which should be our plan).
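
A minimal sketch of what such a reproducer might look like, written in Python for brevity. It assumes that Python's locale.strcoll hands the comparison to the C library's collation routine, so on a glibc system it should exercise the same locale data. The helper names collation_sign and codepoint_sign are made up for this illustration, and the locale list is only a guess based on the locales mentioned in this report. It compares the two range endpoints from the egrep example (U+0710 and U+072C) by code point and by collation order.

```python
import locale


def collation_sign(a, b, locale_name):
    """Return -1, 0 or 1 for strcoll(a, b) under locale_name.

    Returns None when the locale is not installed on this system.
    (Hypothetical helper, not part of glibc or grep.)
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        result = locale.strcoll(a, b)
    finally:
        # Always restore the previous collation locale.
        locale.setlocale(locale.LC_COLLATE, saved)
    return (result > 0) - (result < 0)


def codepoint_sign(a, b):
    """Return -1, 0 or 1 comparing the strings by Unicode code point."""
    return (a > b) - (a < b)


if __name__ == "__main__":
    low, high = "\u0710", "\u072c"  # endpoints of the range in the egrep example
    # Locale names are assumptions taken from this report; adjust as needed.
    for name in ("C", "en_US.UTF-8", "ru_RU.UTF-8"):
        print(name, codepoint_sign(low, high), collation_sign(low, high, name))
```

If a UTF-8 locale reports a collation sign that differs from the code point sign, or reports 0 for two distinct letters, that might point at the collation data rather than at regcomp. A C program calling strcoll directly would be the more authoritative check.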