19376 – regex reports "Invalid collation character" for Syriac characters

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.22

Importance:	P2 normal
Target Milestone:	---
Assignee:	Mike FABIAN

Reported:	2015-12-18 09:26 UTC by t.rus76
Modified:	2024-01-08 10:43 UTC
CC List:	5 users

Flags:	fweimer: security-

Description t.rus76 2015-12-18 09:26:14 UTC

Symptom: GNU Grep does not handle Syriac characters (U+0700 – U+074F) correctly

$ echo 'ܫܠܡܐ' > peace
$ egrep '\<[ܐ-ܬ]' peace
grep: Invalid collation character
$ awk /'\<[ܐ-ܬ]'/ peace
ܫܠܡܐ

However, when grep is built with ./configure --with-included-regex,
it works fine and there is no REG_ECOLLATE error:

$ echo ܫܠܡܐ | src/egrep [ܫ-ܬ]
ܫܠܡܐ
$ echo ܫܠܡܐ | src/egrep [ܒ-ܓ]
$

This is because GNU Grep contains an improved version of regcomp.

The bug was found here: http://forum.rosalab.ru/viewtopic.php?f=53&t=6219&p=54747 (in Russian)

It has also been tested and confirmed on Gentoo (both glibc and grep are 2.22).


I expect there are other bugs that could be fixed with this upgrade.

Comment 1 Paolo Bonzini 2015-12-18 12:26:56 UTC

This seems like a bug in the locale definitions (similar to the infamous "[A-Z] matches some lowercase characters" one), not in regex.

What is your locale?

Comment 2 t.rus76 2015-12-18 15:05:32 UTC

My locale is ru_RU.UTF-8.

Yes, I had the same idea at first.
With LC_CTYPE=en_GB there was no error,
but with LC_ALL=en_US.UTF-8 the bug appeared.

Next, I found that both Glibc and Grep contain a regcomp.c file.
I compared them with diff. They are very similar, but not identical.
The one from Grep is obviously newer, but for some reason grep links against the glibc version by default.
./configure --with-included-regex forces linking with the newer built-in version.
Then it works flawlessly with the same locale.

I am not a native English speaker and hope my explanation is clear enough.

Comment 3 Paolo Bonzini 2015-12-18 15:51:33 UTC

It works flawlessly because it bypasses the localedata. That's why I moved the bug to localedata. :)

Comment 4 Carlos O'Donell 2015-12-18 16:24:01 UTC

(In reply to Paolo Bonzini from comment #3)
> It works flawlessly because it bypasses the localedata. That's why I moved
> the bug to localedata. :)

Well the localedata is updated as much as possible and we're on Unicode 8.0.0 right now for UTF-8 charsets.

How might we determine exactly what's wrong?

Comment 5 jsm-csl@polyomino.org.uk 2015-12-18 16:31:06 UTC

On Fri, 18 Dec 2015, carlos at redhat dot com wrote:

> Well the localedata is updated as much as possible and we're on Unicode 8.0.0
> right now for UTF-8 charsets.

Collation, however, is much more out of date (and probably harder to 
correlate with Unicode so we can make sure we're not losing desirable 
local changes if we update it).  See bug 14095.

Comment 6 Carlos O'Donell 2015-12-18 16:41:50 UTC

(In reply to joseph@codesourcery.com from comment #5)
> On Fri, 18 Dec 2015, carlos at redhat dot com wrote:
> 
> > Well the localedata is updated as much as possible and we're on Unicode 8.0.0
> > right now for UTF-8 charsets.
> 
> Collation, however, is much more out of date (and probably harder to 
> correlate with Unicode so we can make sure we're not losing desirable 
> local changes if we update it).  See bug 14095.

Correct, so if it's a collation issue, which seems likely, then it would be good to find a reproducer that shows the problem with Syriac characters via strcoll. Until then, an English-speaking developer is going to have a hard time figuring this out, or the issue will go away once we start automating the collation data updates as well (which should be our plan).
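
A rough way to look at collation order from the shell, as a stand-in for a direct strcoll test: GNU sort compares lines according to LC_COLLATE. The sketch below is only an illustration; `collate_order` is a hypothetical helper name, and en_US.UTF-8 is assumed to be installed. Note that sort falls back to byte order for lines that collate as equal, so a plausible-looking result does not prove that the locale defines weights for these characters.

```shell
# Hypothetical helper: print two strings in the order that the given
# locale's collation rules put them (sort honours LC_COLLATE).
collate_order() {
    # $1 = locale name, $2 and $3 = strings to compare
    printf '%s\n%s\n' "$2" "$3" | LC_ALL="$1" sort
}

# Compare the two endpoints of the failing range [ܐ-ܬ].
# Assumption: the en_US.UTF-8 locale is installed on this system.
collate_order en_US.UTF-8 'ܬ' 'ܐ'
```

Running the same call with several locale names makes it easy to compare an affected locale against an unaffected one.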

Comment 7 Mike FABIAN 2024-01-08 10:37:42 UTC

mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (convert-RU-to-utf8 $%)
$ rpm -q glibc
glibc-2.38-14.fc39.x86_64
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (convert-RU-to-utf8 $%)
$ for i in $(locale -a | grep utf8$); do echo $i; env LC_MESSAGES=en_US.UTF-8 LC_COLLATE=$i grep -E '\<[ܐ-ܬ]' peace ; done
C.utf8
grep: Invalid collation character
aa_DJ.utf8
ܫܠܡܐ
aa_ER.utf8
ܫܠܡܐ
aa_ET.utf8
ܫܠܡܐ
af_ZA.utf8
ܫܠܡܐ
agr_PE.utf8
ܫܠܡܐ
ak_GH.utf8
ܫܠܡܐ
am_ET.utf8
ܫܠܡܐ
an_ES.utf8
ܫܠܡܐ
anp_IN.utf8
ܫܠܡܐ
ar_AE.utf8
ܫܠܡܐ
ar_BH.utf8
ܫܠܡܐ
ar_DZ.utf8
ܫܠܡܐ
ar_EG.utf8
ܫܠܡܐ
ar_IN.utf8
ܫܠܡܐ
ar_IQ.utf8
ܫܠܡܐ
ar_JO.utf8
ܫܠܡܐ
ar_KW.utf8
ܫܠܡܐ
ar_LB.utf8
ܫܠܡܐ
ar_LY.utf8
ܫܠܡܐ
ar_MA.utf8
ܫܠܡܐ
ar_OM.utf8
ܫܠܡܐ
ar_QA.utf8
ܫܠܡܐ
ar_SA.utf8
grep: Invalid collation character
ar_SD.utf8
ܫܠܡܐ
ar_SS.utf8
ܫܠܡܐ
ar_SY.utf8
ܫܠܡܐ
ar_TN.utf8
ܫܠܡܐ
ar_YE.utf8
ܫܠܡܐ
as_IN.utf8
ܫܠܡܐ
ast_ES.utf8
ܫܠܡܐ
ayc_PE.utf8
ܫܠܡܐ
az_AZ.utf8
ܫܠܡܐ
az_IR.utf8
ܫܠܡܐ
be_BY.utf8
ܫܠܡܐ
bem_ZM.utf8
ܫܠܡܐ
ber_DZ.utf8
ܫܠܡܐ
ber_MA.utf8
ܫܠܡܐ
bg_BG.utf8
ܫܠܡܐ
bhb_IN.utf8
ܫܠܡܐ
bho_IN.utf8
ܫܠܡܐ
bho_NP.utf8
ܫܠܡܐ
bi_VU.utf8
ܫܠܡܐ
bn_BD.utf8
ܫܠܡܐ
bn_IN.utf8
ܫܠܡܐ
bo_CN.utf8
ܫܠܡܐ
bo_IN.utf8
ܫܠܡܐ
br_FR.utf8
ܫܠܡܐ
brx_IN.utf8
ܫܠܡܐ
bs_BA.utf8
ܫܠܡܐ
byn_ER.utf8
ܫܠܡܐ
ca_AD.utf8
ܫܠܡܐ
ca_ES.utf8
ܫܠܡܐ
ca_FR.utf8
ܫܠܡܐ
ca_IT.utf8
ܫܠܡܐ
ce_RU.utf8
ܫܠܡܐ
chr_US.utf8
ܫܠܡܐ
ckb_IQ.utf8
ܫܠܡܐ
cmn_TW.utf8
ܫܠܡܐ
crh_UA.utf8
ܫܠܡܐ
cs_CZ.utf8
ܫܠܡܐ
csb_PL.utf8
ܫܠܡܐ
cv_RU.utf8
ܫܠܡܐ
cy_GB.utf8
ܫܠܡܐ
da_DK.utf8
ܫܠܡܐ
de_AT.utf8
ܫܠܡܐ
de_BE.utf8
ܫܠܡܐ
de_CH.utf8
ܫܠܡܐ
de_DE.utf8
ܫܠܡܐ
de_IT.utf8
ܫܠܡܐ
de_LI.utf8
ܫܠܡܐ
de_LU.utf8
ܫܠܡܐ
doi_IN.utf8
ܫܠܡܐ
dsb_DE.utf8
ܫܠܡܐ
dv_MV.utf8
ܫܠܡܐ
dz_BT.utf8
ܫܠܡܐ
el_CY.utf8
ܫܠܡܐ
el_GR.utf8
ܫܠܡܐ
en_AG.utf8
ܫܠܡܐ
en_AU.utf8
ܫܠܡܐ
en_BW.utf8
ܫܠܡܐ
en_CA.utf8
ܫܠܡܐ
en_DK.utf8
ܫܠܡܐ
en_GB.utf8
ܫܠܡܐ
en_HK.utf8
ܫܠܡܐ
en_IE.utf8
ܫܠܡܐ
en_IL.utf8
ܫܠܡܐ
en_IN.utf8
ܫܠܡܐ
en_NG.utf8
ܫܠܡܐ
en_NZ.utf8
ܫܠܡܐ
en_PH.utf8
ܫܠܡܐ
en_SC.utf8
ܫܠܡܐ
en_SG.utf8
ܫܠܡܐ
en_US.utf8
ܫܠܡܐ
en_ZA.utf8
ܫܠܡܐ
en_ZM.utf8
ܫܠܡܐ
en_ZW.utf8
ܫܠܡܐ
eo.utf8
ܫܠܡܐ
es_AR.utf8
ܫܠܡܐ
es_BO.utf8
ܫܠܡܐ
es_CL.utf8
ܫܠܡܐ
es_CO.utf8
ܫܠܡܐ
es_CR.utf8
ܫܠܡܐ
es_CU.utf8
ܫܠܡܐ
es_DO.utf8
ܫܠܡܐ
es_EC.utf8
ܫܠܡܐ
es_ES.utf8
ܫܠܡܐ
es_GT.utf8
ܫܠܡܐ
es_HN.utf8
ܫܠܡܐ
es_MX.utf8
ܫܠܡܐ
es_NI.utf8
ܫܠܡܐ
es_PA.utf8
ܫܠܡܐ
es_PE.utf8
ܫܠܡܐ
es_PR.utf8
ܫܠܡܐ
es_PY.utf8
ܫܠܡܐ
es_SV.utf8
ܫܠܡܐ
es_US.utf8
ܫܠܡܐ
es_UY.utf8
ܫܠܡܐ
es_VE.utf8
ܫܠܡܐ
et_EE.utf8
ܫܠܡܐ
eu_ES.utf8
ܫܠܡܐ
fa_IR.utf8
ܫܠܡܐ
ff_SN.utf8
ܫܠܡܐ
fi_FI.utf8
ܫܠܡܐ
fil_PH.utf8
ܫܠܡܐ
fo_FO.utf8
ܫܠܡܐ
fr_BE.utf8
ܫܠܡܐ
fr_CA.utf8
ܫܠܡܐ
fr_CH.utf8
ܫܠܡܐ
fr_FR.utf8
ܫܠܡܐ
fr_LU.utf8
ܫܠܡܐ
fur_IT.utf8
ܫܠܡܐ
fy_DE.utf8
ܫܠܡܐ
fy_NL.utf8
ܫܠܡܐ
ga_IE.utf8
ܫܠܡܐ
gd_GB.utf8
ܫܠܡܐ
gez_ER.utf8
ܫܠܡܐ
gez_ET.utf8
ܫܠܡܐ
gl_ES.utf8
ܫܠܡܐ
gu_IN.utf8
ܫܠܡܐ
gv_GB.utf8
ܫܠܡܐ
ha_NG.utf8
ܫܠܡܐ
hak_TW.utf8
ܫܠܡܐ
he_IL.utf8
ܫܠܡܐ
hi_IN.utf8
ܫܠܡܐ
hif_FJ.utf8
ܫܠܡܐ
hne_IN.utf8
ܫܠܡܐ
hr_HR.utf8
ܫܠܡܐ
hsb_DE.utf8
ܫܠܡܐ
ht_HT.utf8
ܫܠܡܐ
hu_HU.utf8
ܫܠܡܐ
hy_AM.utf8
ܫܠܡܐ
ia_FR.utf8
ܫܠܡܐ
id_ID.utf8
ܫܠܡܐ
ig_NG.utf8
ܫܠܡܐ
ik_CA.utf8
ܫܠܡܐ
is_IS.utf8
ܫܠܡܐ
it_CH.utf8
ܫܠܡܐ
it_IT.utf8
ܫܠܡܐ
iu_CA.utf8
ܫܠܡܐ
ja_JP.utf8
grep: Invalid collation character
ka_GE.utf8
ܫܠܡܐ
kab_DZ.utf8
ܫܠܡܐ
kk_KZ.utf8
ܫܠܡܐ
kl_GL.utf8
ܫܠܡܐ
km_KH.utf8
grep: Invalid collation character
kn_IN.utf8
ܫܠܡܐ
ko_KR.utf8
grep: Invalid collation character
kok_IN.utf8
ܫܠܡܐ
ks_IN.utf8
ܫܠܡܐ
ku_TR.utf8
ܫܠܡܐ
kw_GB.utf8
ܫܠܡܐ
ky_KG.utf8
ܫܠܡܐ
lb_LU.utf8
ܫܠܡܐ
lg_UG.utf8
ܫܠܡܐ
li_BE.utf8
ܫܠܡܐ
li_NL.utf8
ܫܠܡܐ
lij_IT.utf8
ܫܠܡܐ
ln_CD.utf8
ܫܠܡܐ
lo_LA.utf8
grep: Invalid collation character
lt_LT.utf8
ܫܠܡܐ
lv_LV.utf8
ܫܠܡܐ
lzh_TW.utf8
ܫܠܡܐ
mag_IN.utf8
ܫܠܡܐ
mai_IN.utf8
ܫܠܡܐ
mai_NP.utf8
ܫܠܡܐ
mfe_MU.utf8
ܫܠܡܐ
mg_MG.utf8
ܫܠܡܐ
mhr_RU.utf8
ܫܠܡܐ
mi_NZ.utf8
ܫܠܡܐ
miq_NI.utf8
ܫܠܡܐ
mjw_IN.utf8
ܫܠܡܐ
mk_MK.utf8
ܫܠܡܐ
ml_IN.utf8
ܫܠܡܐ
mn_MN.utf8
ܫܠܡܐ
mni_IN.utf8
ܫܠܡܐ
mnw_MM.utf8
ܫܠܡܐ
mr_IN.utf8
ܫܠܡܐ
ms_MY.utf8
ܫܠܡܐ
mt_MT.utf8
ܫܠܡܐ
my_MM.utf8
ܫܠܡܐ
nan_TW.utf8
ܫܠܡܐ
nb_NO.utf8
ܫܠܡܐ
nds_DE.utf8
ܫܠܡܐ
nds_NL.utf8
ܫܠܡܐ
ne_NP.utf8
ܫܠܡܐ
nhn_MX.utf8
ܫܠܡܐ
niu_NU.utf8
ܫܠܡܐ
niu_NZ.utf8
ܫܠܡܐ
nl_AW.utf8
ܫܠܡܐ
nl_BE.utf8
ܫܠܡܐ
nl_NL.utf8
ܫܠܡܐ
nn_NO.utf8
ܫܠܡܐ
nr_ZA.utf8
ܫܠܡܐ
nso_ZA.utf8
ܫܠܡܐ
oc_FR.utf8
ܫܠܡܐ
om_ET.utf8
ܫܠܡܐ
om_KE.utf8
ܫܠܡܐ
or_IN.utf8
ܫܠܡܐ
os_RU.utf8
ܫܠܡܐ
pa_IN.utf8
ܫܠܡܐ
pa_PK.utf8
ܫܠܡܐ
pap_AW.utf8
ܫܠܡܐ
pap_CW.utf8
ܫܠܡܐ
pl_PL.utf8
ܫܠܡܐ
ps_AF.utf8
ܫܠܡܐ
pt_BR.utf8
ܫܠܡܐ
pt_PT.utf8
ܫܠܡܐ
quz_PE.utf8
ܫܠܡܐ
raj_IN.utf8
ܫܠܡܐ
rif_MA.utf8
ܫܠܡܐ
ro_RO.utf8
ܫܠܡܐ
ru_RU.utf8
ܫܠܡܐ
ru_UA.utf8
ܫܠܡܐ
rw_RW.utf8
ܫܠܡܐ
sa_IN.utf8
ܫܠܡܐ
sah_RU.utf8
ܫܠܡܐ
sat_IN.utf8
ܫܠܡܐ
sc_IT.utf8
ܫܠܡܐ
sd_IN.utf8
ܫܠܡܐ
se_NO.utf8
ܫܠܡܐ
sgs_LT.utf8
ܫܠܡܐ
shn_MM.utf8
ܫܠܡܐ
shs_CA.utf8
ܫܠܡܐ
si_LK.utf8
ܫܠܡܐ
sid_ET.utf8
ܫܠܡܐ
sk_SK.utf8
ܫܠܡܐ
sl_SI.utf8
grep: Invalid collation character
sm_WS.utf8
ܫܠܡܐ
so_DJ.utf8
ܫܠܡܐ
so_ET.utf8
ܫܠܡܐ
so_KE.utf8
ܫܠܡܐ
so_SO.utf8
ܫܠܡܐ
sq_AL.utf8
ܫܠܡܐ
sq_MK.utf8
ܫܠܡܐ
sr_ME.utf8
ܫܠܡܐ
sr_RS.utf8
ܫܠܡܐ
ss_ZA.utf8
ܫܠܡܐ
st_ZA.utf8
ܫܠܡܐ
sv_FI.utf8
ܫܠܡܐ
sv_SE.utf8
ܫܠܡܐ
sw_KE.utf8
ܫܠܡܐ
sw_TZ.utf8
ܫܠܡܐ
syr.utf8
ܫܠܡܐ
szl_PL.utf8
ܫܠܡܐ
ta_IN.utf8
ܫܠܡܐ
ta_LK.utf8
ܫܠܡܐ
tcy_IN.utf8
ܫܠܡܐ
te_IN.utf8
ܫܠܡܐ
tg_TJ.utf8
ܫܠܡܐ
th_TH.utf8
grep: Invalid collation character
the_NP.utf8
ܫܠܡܐ
ti_ER.utf8
ܫܠܡܐ
ti_ET.utf8
ܫܠܡܐ
tig_ER.utf8
ܫܠܡܐ
tk_TM.utf8
ܫܠܡܐ
tl_PH.utf8
ܫܠܡܐ
tn_ZA.utf8
ܫܠܡܐ
to_TO.utf8
ܫܠܡܐ
tpi_PG.utf8
ܫܠܡܐ
tr_CY.utf8
ܫܠܡܐ
tr_TR.utf8
ܫܠܡܐ
ts_ZA.utf8
ܫܠܡܐ
tt_RU.utf8
ܫܠܡܐ
ug_CN.utf8
ܫܠܡܐ
uk_UA.utf8
ܫܠܡܐ
unm_US.utf8
ܫܠܡܐ
ur_IN.utf8
ܫܠܡܐ
ur_PK.utf8
ܫܠܡܐ
uz_UZ.utf8
ܫܠܡܐ
ve_ZA.utf8
ܫܠܡܐ
vi_VN.utf8
ܫܠܡܐ
wa_BE.utf8
ܫܠܡܐ
wae_CH.utf8
ܫܠܡܐ
wal_ET.utf8
ܫܠܡܐ
wo_SN.utf8
ܫܠܡܐ
xh_ZA.utf8
ܫܠܡܐ
yi_US.utf8
ܫܠܡܐ
yo_NG.utf8
ܫܠܡܐ
yue_HK.utf8
ܫܠܡܐ
yuw_PG.utf8
ܫܠܡܐ
zh_CN.utf8
ܫܠܡܐ
zh_HK.utf8
ܫܠܡܐ
zh_SG.utf8
ܫܠܡܐ
zh_TW.utf8
ܫܠܡܐ
zu_ZA.utf8
ܫܠܡܐ
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (convert-RU-to-utf8 $%)
$
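
The loop above prints a result for every locale, which makes the few failures hard to spot. A minimal sketch of a filter that keeps only the failing locale names follows; `failing_locales` is a hypothetical helper, and the usage part assumes the file `peace` from the description is in the current directory.

```shell
# Hypothetical helper: read alternating "locale name / grep result" lines
# and print only the locale names whose next line is the collation error.
failing_locales() {
    awk '/Invalid collation character/ { print prev } { prev = $0 }'
}

# Same loop as above; grep's error goes to stderr, so merge it into
# stdout before filtering.
for i in $(locale -a | grep 'utf8$'); do
    echo "$i"
    env LC_MESSAGES=en_US.UTF-8 LC_COLLATE="$i" grep -E '\<[ܐ-ܬ]' peace
done 2>&1 | failing_locales
```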

Comment 8 Mike FABIAN 2024-01-08 10:43:49 UTC

According to my last comment, the only locales where this problem still occurs are:

C.utf8
ar_SA.utf8
ja_JP.utf8
km_KH.utf8
ko_KR.utf8
lo_LA.utf8
sl_SI.utf8
th_TH.utf8

These are all locales which do not yet use 

LC_COLLATE

% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
...


The iso14651_t1_common file was updated in 2017 to a much newer version from 2016, which apparently fixed this problem except for the locales which still do not use

copy "iso14651_t1"


I think that is good enough for the moment to close this bug here as FIXED.
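
To see which locale sources lack the template line, something like the following could be used. It is a sketch only: `locales_without_template` is a hypothetical helper, and it checks for the literal line, so locales that pull in the template indirectly by copying another locale are listed too, as are any non-locale files in the directory.

```shell
# Hypothetical helper: list files in a locale source directory that do
# not contain a direct `copy "iso14651_t1"` line.
locales_without_template() {
    # $1 = directory holding the locale source files
    grep -L 'copy "iso14651_t1"' "$1"/*
}
```

From the top of a glibc source tree it would be called as `locales_without_template localedata/locales`.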