Bug 19376 - regex reports "Invalid collation character" for Syriac characters
Summary: regex reports "Invalid collation character" for Syriac characters
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.22
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-12-18 09:26 UTC by t.rus76
Modified: 2015-12-19 17:25 UTC
CC List: 4 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description t.rus76 2015-12-18 09:26:14 UTC
Symptom: GNU Grep does not handle Syriac characters (U+0700 – U+074F) correctly

$ echo 'ܫܠܡܐ' > peace
$ egrep '\<[ܐ-ܬ]' peace
grep: Invalid collation character
$ awk /'\<[ܐ-ܬ]'/ peace
ܫܠܡܐ

However, when grep is built with ./configure --with-included-regex,
it works fine and there is no REG_ECOLLATE error:

$ echo ܫܠܡܐ | src/egrep [ܫ-ܬ]
ܫܠܡܐ
$ echo ܫܠܡܐ | src/egrep [ܒ-ܓ]
$

This is because GNU Grep bundles an improved version of regcomp.
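
To take grep out of the picture, the same bracket expression can be passed straight to the C library's regcomp(). The following is only a minimal sketch: check_bracket_range and LOCALE_UNAVAILABLE are hypothetical names invented for this illustration, not part of glibc or grep, and the locale names passed in are assumed to be installed on the test machine.

```c
#include <locale.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical sentinel for "setlocale() refused this locale name". */
enum { LOCALE_UNAVAILABLE = -100 };

/* Hypothetical helper: switch to `locale_name`, compile `pattern` as a
   POSIX extended regex and print the outcome.  Returns LOCALE_UNAVAILABLE
   if the locale cannot be selected, otherwise regcomp()'s status
   (0 on success, non-zero such as REG_ECOLLATE on failure). */
int check_bracket_range(const char *locale_name, const char *pattern)
{
    if (setlocale(LC_ALL, locale_name) == NULL) {
        fprintf(stderr, "locale %s is not available\n", locale_name);
        return LOCALE_UNAVAILABLE;
    }

    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        char msg[128];
        /* Ask the library for its own description of the error code. */
        regerror(rc, &re, msg, sizeof msg);
        printf("%s: regcomp(\"%s\") failed: %s\n", locale_name, pattern, msg);
        return rc;
    }

    printf("%s: regcomp(\"%s\") succeeded\n", locale_name, pattern);
    regfree(&re);
    return 0;
}
```

Calling check_bracket_range("en_US.UTF-8", "[ܐ-ܬ]") and check_bracket_range("C", "[a-z]") from a small main() would show whether the failure depends on the locale alone, independently of grep.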

The bug was found here: http://forum.rosalab.ru/viewtopic.php?f=53&t=6219&p=54747 (in Russian)

It has also been tested and confirmed on Gentoo (both glibc and grep are version 2.22).


I expect that upgrading glibc's regcomp to the version bundled with Grep would fix other bugs as well.
Comment 1 Paolo Bonzini 2015-12-18 12:26:56 UTC
This seems like a bug in the locale definitions (similar to the infamous "[A-Z] matches some lowercase characters" one), not in regex.

What is your locale?
Comment 2 t.rus76 2015-12-18 15:05:32 UTC
My locale is ru_RU.UTF-8 

Yes, I had the same idea at first:
with LC_CTYPE=en_GB there was no error,
but with LC_ALL=en_US.UTF-8 the bug appeared.

Next, I found that both Glibc and Grep contain a regcomp.c file.
I compared them with diff. They are very similar, but not identical.
The one from Grep is obviously newer, but for some reason grep links against glibc's version by default.
./configure --with-included-regex forces linking with the newer built-in version.
Then it works flawlessly with the same locale.

I am not a native English speaker and hope my explanation is clear enough.
Comment 3 Paolo Bonzini 2015-12-18 15:51:33 UTC
It works flawlessly because it bypasses the localedata. That's why I moved the bug to localedata. :)
Comment 4 Carlos O'Donell 2015-12-18 16:24:01 UTC
(In reply to Paolo Bonzini from comment #3)
> It works flawlessly because it bypasses the localedata. That's why I moved
> the bug to localedata. :)

Well the localedata is updated as much as possible and we're on Unicode 8.0.0 right now for UTF-8 charsets.

How might we determine exactly what's wrong?
Comment 5 joseph@codesourcery.com 2015-12-18 16:31:06 UTC
On Fri, 18 Dec 2015, carlos at redhat dot com wrote:

> Well the localedata is updated as much as possible and we're on Unicode 8.0.0
> right now for UTF-8 charsets.

Collation, however, is much more out of date (and probably harder to 
correlate with Unicode so we can make sure we're not losing desirable 
local changes if we update it).  See bug 14095.
Comment 6 Carlos O'Donell 2015-12-18 16:41:50 UTC
(In reply to joseph@codesourcery.com from comment #5)
> On Fri, 18 Dec 2015, carlos at redhat dot com wrote:
> 
> > Well the localedata is updated as much as possible and we're on Unicode 8.0.0
> > right now for UTF-8 charsets.
> 
> Collation, however, is much more out of date (and probably harder to 
> correlate with Unicode so we can make sure we're not losing desirable 
> local changes if we update it).  See bug 14095.

Correct. So if it's a collation issue, which seems likely, it would be good to find a reproducer that demonstrates the problem with Syriac characters via strcoll. Until then, an English-speaking developer is going to have a hard time figuring this out; alternatively, the issue will go away once we also start automating the collation data updates (which should be our plan).
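
A strcoll-based reproducer along those lines might start from something like the sketch below. It is only an illustration: collate_sign and report_pairs are hypothetical helper names, the locale is assumed to be installed, and deciding which results count as wrong for Syriac is left to someone who knows the expected ordering.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compare `a` and `b` with strcoll() under
   `locale_name`.  Returns 1 and stores -1, 0 or +1 in *sign, or returns 0
   if the locale cannot be selected.  LC_ALL is used so that LC_CTYPE
   matches the UTF-8 encoding of the strings being compared. */
int collate_sign(const char *locale_name, const char *a, const char *b,
                 int *sign)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return 0;
    int r = strcoll(a, b);
    *sign = (r > 0) - (r < 0);   /* normalise to -1, 0 or +1 */
    return 1;
}

/* Hypothetical helper: print the collation order of each neighbouring
   pair in `letters`, so unexpected results stand out. */
void report_pairs(const char *locale_name, const char *const letters[],
                  size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        int sign;
        if (!collate_sign(locale_name, letters[i], letters[i + 1], &sign)) {
            printf("locale %s is not available\n", locale_name);
            return;
        }
        printf("%s: strcoll(%s, %s) -> %d\n",
               locale_name, letters[i], letters[i + 1], sign);
    }
}
```

For example, report_pairs("en_US.UTF-8", letters, 4) with letters set to { "ܐ", "ܒ", "ܓ", "ܬ" } (the characters used in the grep commands in the description) would print one line per neighbouring pair, which could then be compared across locales.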