Symptom: GNU Grep does not handle Syriac characters (U+0700 – U+074F) correctly:
$ echo 'ܫܠܡܐ' > peace
$ egrep '\<[ܐ-ܬ]' peace
grep: Invalid collation character
$ awk /'\<[ܐ-ܬ]'/ peace
ܫܠܡܐ
However, when grep is built with ./configure --with-included-regex, it works just fine and there is no REG_ECOLLATE error:
$ echo ܫܠܡܐ | src/egrep [ܫ-ܬ]
ܫܠܡܐ
$ echo ܫܠܡܐ | src/egrep [ܒ-ܓ]
$
This is because GNU Grep contains an improved version of regcomp. The bug was found here: http://forum.rosalab.ru/viewtopic.php?f=53&t=6219&p=54747 (in Russian). It is tested and confirmed also on Gentoo (both glibc and grep are 2.22). I expect there are other bugs that could be fixed with this upgrade.
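The symptom can probably be checked without grep by calling the C library's regcomp directly. What follows is a minimal sketch, not taken from the report: it uses Python's ctypes to call regcomp from the system libc under a chosen locale and returns the error code together with the regerror message. The helper name regcomp_status, the oversized buffer standing in for regex_t, and the REG_EXTENDED value (1, as in glibc's regex.h) are assumptions made for illustration.

```python
import ctypes
import ctypes.util
import locale

# Assumption: REG_EXTENDED is 1, as defined in glibc's <regex.h>.
REG_EXTENDED = 1

# Load the system C library (falls back to the already-loaded process image).
_libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

_libc.regcomp.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_int]
_libc.regcomp.restype = ctypes.c_int
_libc.regerror.argtypes = [ctypes.c_int, ctypes.c_void_p,
                           ctypes.c_char_p, ctypes.c_size_t]
_libc.regerror.restype = ctypes.c_size_t
_libc.regfree.argtypes = [ctypes.c_void_p]
_libc.regfree.restype = None


def regcomp_status(pattern, locale_name=None):
    """Compile `pattern` with the libc regcomp and return (code, message).

    If `locale_name` is given, it is installed for all categories first;
    locale.Error is raised when that locale is not available.
    """
    if locale_name is not None:
        locale.setlocale(locale.LC_ALL, locale_name)

    # Assumption: 512 bytes is comfortably larger than sizeof(regex_t).
    preg = ctypes.create_string_buffer(512)
    code = _libc.regcomp(preg, pattern.encode("utf-8"), REG_EXTENDED)

    message = ""
    if code != 0:
        buf = ctypes.create_string_buffer(256)
        _libc.regerror(code, preg, buf, len(buf))
        message = buf.value.decode("utf-8", "replace")
    else:
        _libc.regfree(preg)  # only free a successfully compiled pattern
    return code, message
```

Calling regcomp_status(r"\<[ܐ-ܬ]", "ru_RU.UTF-8") on an affected system should then return a non-zero code with the same "Invalid collation character" text that grep prints, while a zero code would suggest the installed glibc and locale data are not affected. Note that \< is a GNU extension, so this only makes sense against glibc.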
This seems like a bug in the locale definitions (similar to the infamous "[A-Z] matches some lowercase characters" one), not in regex. What is your locale?
My locale is ru_RU.UTF-8. Yes, I had the same idea at the beginning: with LC_CTYPE=en_GB there was no error, but with LC_ALL=en_US.UTF-8 the bug appeared. As a next step, I found that there are two regcomp.c files, one in Glibc and one in Grep. I compared them with diff. They are very similar, but not identical. The one from Grep is obviously newer. But for some reason grep links with the glibc one by default. ./configure --with-included-regex enforces linking with the newer built-in version. Then it works flawlessly with the same locale. I am not a native English speaker and hope my explanation is clear enough.
It works flawlessly because it bypasses the localedata. That's why I moved the bug to localedata. :)
(In reply to Paolo Bonzini from comment #3)
> It works flawlessly because it bypasses the localedata. That's why I moved
> the bug to localedata. :)
Well the localedata is updated as much as possible and we're on Unicode 8.0.0 right now for UTF-8 charsets. How might we determine exactly what's wrong?
On Fri, 18 Dec 2015, carlos at redhat dot com wrote:
> Well the localedata is updated as much as possible and we're on Unicode 8.0.0
> right now for UTF-8 charsets.
Collation, however, is much more out of date (and probably harder to correlate with Unicode so we can make sure we're not losing desirable local changes if we update it). See bug 14095.
(In reply to joseph@codesourcery.com from comment #5)
> On Fri, 18 Dec 2015, carlos at redhat dot com wrote:
>
> > Well the localedata is updated as much as possible and we're on Unicode 8.0.0
> > right now for UTF-8 charsets.
>
> Collation, however, is much more out of date (and probably harder to
> correlate with Unicode so we can make sure we're not losing desirable
> local changes if we update it). See bug 14095.
Correct, so if it's a collation issue, which seems likely, then it would be good to find a reproducer that shows the problem with Syriac characters via strcoll. Until then an English-speaking developer is going to have a hard time figuring this out, or the issue will go away once we start automating the collation data updates as well (which should be our plan).
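As a starting point for such a reproducer, here is a hedged sketch using Python's locale.strcoll, which calls into the C library's collation routines. It compares the two endpoints of a range and lists adjacent code points that the locale treats as equal, which might hint at characters missing from the collation tables. Whether strcoll alone exposes the regcomp failure is not established in this thread; the function name strcoll_report and the choice of checks are assumptions.

```python
import locale


def strcoll_report(first, last, collate_locale):
    """Probe collation of the code point range first..last in a locale.

    Returns (endpoint_cmp, ties):
      endpoint_cmp  result of strcoll(first, last); negative means the
                    locale sorts `first` before `last`
      ties          adjacent code point pairs that compare as equal
    Raises locale.Error if `collate_locale` is not installed.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, collate_locale)
        chars = [chr(cp) for cp in range(ord(first), ord(last) + 1)]
        endpoint_cmp = locale.strcoll(first, last)
        ties = [(a, b) for a, b in zip(chars, chars[1:])
                if locale.strcoll(a, b) == 0]
        return endpoint_cmp, ties
    finally:
        # Restore the previous collation locale for the rest of the process.
        locale.setlocale(locale.LC_COLLATE, saved)
```

For the range from the report, one would call strcoll_report("\u0710", "\u072c", "ru_RU.UTF-8"), where U+0710 and U+072C are the endpoints ܐ and ܬ. A non-negative endpoint_cmp or a long list of ties would be worth attaching to this bug.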
mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (convert-RU-to-utf8 $%) $ rpm -q glibc glibc-2.38-14.fc39.x86_64 mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (convert-RU-to-utf8 $%) $ for i in $(locale -a | grep utf8$); do echo $i; env LC_MESSAGES=en_US.UTF-8 LC_COLLATE=$i grep -E '\<[ܐ-ܬ]' peace ; done C.utf8 grep: Invalid collation character aa_DJ.utf8 ܫܠܡܐ aa_ER.utf8 ܫܠܡܐ aa_ET.utf8 ܫܠܡܐ af_ZA.utf8 ܫܠܡܐ agr_PE.utf8 ܫܠܡܐ ak_GH.utf8 ܫܠܡܐ am_ET.utf8 ܫܠܡܐ an_ES.utf8 ܫܠܡܐ anp_IN.utf8 ܫܠܡܐ ar_AE.utf8 ܫܠܡܐ ar_BH.utf8 ܫܠܡܐ ar_DZ.utf8 ܫܠܡܐ ar_EG.utf8 ܫܠܡܐ ar_IN.utf8 ܫܠܡܐ ar_IQ.utf8 ܫܠܡܐ ar_JO.utf8 ܫܠܡܐ ar_KW.utf8 ܫܠܡܐ ar_LB.utf8 ܫܠܡܐ ar_LY.utf8 ܫܠܡܐ ar_MA.utf8 ܫܠܡܐ ar_OM.utf8 ܫܠܡܐ ar_QA.utf8 ܫܠܡܐ ar_SA.utf8 grep: Invalid collation character ar_SD.utf8 ܫܠܡܐ ar_SS.utf8 ܫܠܡܐ ar_SY.utf8 ܫܠܡܐ ar_TN.utf8 ܫܠܡܐ ar_YE.utf8 ܫܠܡܐ as_IN.utf8 ܫܠܡܐ ast_ES.utf8 ܫܠܡܐ ayc_PE.utf8 ܫܠܡܐ az_AZ.utf8 ܫܠܡܐ az_IR.utf8 ܫܠܡܐ be_BY.utf8 ܫܠܡܐ bem_ZM.utf8 ܫܠܡܐ ber_DZ.utf8 ܫܠܡܐ ber_MA.utf8 ܫܠܡܐ bg_BG.utf8 ܫܠܡܐ bhb_IN.utf8 ܫܠܡܐ bho_IN.utf8 ܫܠܡܐ bho_NP.utf8 ܫܠܡܐ bi_VU.utf8 ܫܠܡܐ bn_BD.utf8 ܫܠܡܐ bn_IN.utf8 ܫܠܡܐ bo_CN.utf8 ܫܠܡܐ bo_IN.utf8 ܫܠܡܐ br_FR.utf8 ܫܠܡܐ brx_IN.utf8 ܫܠܡܐ bs_BA.utf8 ܫܠܡܐ byn_ER.utf8 ܫܠܡܐ ca_AD.utf8 ܫܠܡܐ ca_ES.utf8 ܫܠܡܐ ca_FR.utf8 ܫܠܡܐ ca_IT.utf8 ܫܠܡܐ ce_RU.utf8 ܫܠܡܐ chr_US.utf8 ܫܠܡܐ ckb_IQ.utf8 ܫܠܡܐ cmn_TW.utf8 ܫܠܡܐ crh_UA.utf8 ܫܠܡܐ cs_CZ.utf8 ܫܠܡܐ csb_PL.utf8 ܫܠܡܐ cv_RU.utf8 ܫܠܡܐ cy_GB.utf8 ܫܠܡܐ da_DK.utf8 ܫܠܡܐ de_AT.utf8 ܫܠܡܐ de_BE.utf8 ܫܠܡܐ de_CH.utf8 ܫܠܡܐ de_DE.utf8 ܫܠܡܐ de_IT.utf8 ܫܠܡܐ de_LI.utf8 ܫܠܡܐ de_LU.utf8 ܫܠܡܐ doi_IN.utf8 ܫܠܡܐ dsb_DE.utf8 ܫܠܡܐ dv_MV.utf8 ܫܠܡܐ dz_BT.utf8 ܫܠܡܐ el_CY.utf8 ܫܠܡܐ el_GR.utf8 ܫܠܡܐ en_AG.utf8 ܫܠܡܐ en_AU.utf8 ܫܠܡܐ en_BW.utf8 ܫܠܡܐ en_CA.utf8 ܫܠܡܐ en_DK.utf8 ܫܠܡܐ en_GB.utf8 ܫܠܡܐ en_HK.utf8 ܫܠܡܐ en_IE.utf8 ܫܠܡܐ en_IL.utf8 ܫܠܡܐ en_IN.utf8 ܫܠܡܐ en_NG.utf8 ܫܠܡܐ en_NZ.utf8 ܫܠܡܐ en_PH.utf8 ܫܠܡܐ en_SC.utf8 ܫܠܡܐ en_SG.utf8 ܫܠܡܐ en_US.utf8 ܫܠܡܐ en_ZA.utf8 ܫܠܡܐ en_ZM.utf8 ܫܠܡܐ en_ZW.utf8 ܫܠܡܐ eo.utf8 ܫܠܡܐ es_AR.utf8 ܫܠܡܐ es_BO.utf8 ܫܠܡܐ 
es_CL.utf8 ܫܠܡܐ es_CO.utf8 ܫܠܡܐ es_CR.utf8 ܫܠܡܐ es_CU.utf8 ܫܠܡܐ es_DO.utf8 ܫܠܡܐ es_EC.utf8 ܫܠܡܐ es_ES.utf8 ܫܠܡܐ es_GT.utf8 ܫܠܡܐ es_HN.utf8 ܫܠܡܐ es_MX.utf8 ܫܠܡܐ es_NI.utf8 ܫܠܡܐ es_PA.utf8 ܫܠܡܐ es_PE.utf8 ܫܠܡܐ es_PR.utf8 ܫܠܡܐ es_PY.utf8 ܫܠܡܐ es_SV.utf8 ܫܠܡܐ es_US.utf8 ܫܠܡܐ es_UY.utf8 ܫܠܡܐ es_VE.utf8 ܫܠܡܐ et_EE.utf8 ܫܠܡܐ eu_ES.utf8 ܫܠܡܐ fa_IR.utf8 ܫܠܡܐ ff_SN.utf8 ܫܠܡܐ fi_FI.utf8 ܫܠܡܐ fil_PH.utf8 ܫܠܡܐ fo_FO.utf8 ܫܠܡܐ fr_BE.utf8 ܫܠܡܐ fr_CA.utf8 ܫܠܡܐ fr_CH.utf8 ܫܠܡܐ fr_FR.utf8 ܫܠܡܐ fr_LU.utf8 ܫܠܡܐ fur_IT.utf8 ܫܠܡܐ fy_DE.utf8 ܫܠܡܐ fy_NL.utf8 ܫܠܡܐ ga_IE.utf8 ܫܠܡܐ gd_GB.utf8 ܫܠܡܐ gez_ER.utf8 ܫܠܡܐ gez_ET.utf8 ܫܠܡܐ gl_ES.utf8 ܫܠܡܐ gu_IN.utf8 ܫܠܡܐ gv_GB.utf8 ܫܠܡܐ ha_NG.utf8 ܫܠܡܐ hak_TW.utf8 ܫܠܡܐ he_IL.utf8 ܫܠܡܐ hi_IN.utf8 ܫܠܡܐ hif_FJ.utf8 ܫܠܡܐ hne_IN.utf8 ܫܠܡܐ hr_HR.utf8 ܫܠܡܐ hsb_DE.utf8 ܫܠܡܐ ht_HT.utf8 ܫܠܡܐ hu_HU.utf8 ܫܠܡܐ hy_AM.utf8 ܫܠܡܐ ia_FR.utf8 ܫܠܡܐ id_ID.utf8 ܫܠܡܐ ig_NG.utf8 ܫܠܡܐ ik_CA.utf8 ܫܠܡܐ is_IS.utf8 ܫܠܡܐ it_CH.utf8 ܫܠܡܐ it_IT.utf8 ܫܠܡܐ iu_CA.utf8 ܫܠܡܐ ja_JP.utf8 grep: Invalid collation character ka_GE.utf8 ܫܠܡܐ kab_DZ.utf8 ܫܠܡܐ kk_KZ.utf8 ܫܠܡܐ kl_GL.utf8 ܫܠܡܐ km_KH.utf8 grep: Invalid collation character kn_IN.utf8 ܫܠܡܐ ko_KR.utf8 grep: Invalid collation character kok_IN.utf8 ܫܠܡܐ ks_IN.utf8 ܫܠܡܐ ku_TR.utf8 ܫܠܡܐ kw_GB.utf8 ܫܠܡܐ ky_KG.utf8 ܫܠܡܐ lb_LU.utf8 ܫܠܡܐ lg_UG.utf8 ܫܠܡܐ li_BE.utf8 ܫܠܡܐ li_NL.utf8 ܫܠܡܐ lij_IT.utf8 ܫܠܡܐ ln_CD.utf8 ܫܠܡܐ lo_LA.utf8 grep: Invalid collation character lt_LT.utf8 ܫܠܡܐ lv_LV.utf8 ܫܠܡܐ lzh_TW.utf8 ܫܠܡܐ mag_IN.utf8 ܫܠܡܐ mai_IN.utf8 ܫܠܡܐ mai_NP.utf8 ܫܠܡܐ mfe_MU.utf8 ܫܠܡܐ mg_MG.utf8 ܫܠܡܐ mhr_RU.utf8 ܫܠܡܐ mi_NZ.utf8 ܫܠܡܐ miq_NI.utf8 ܫܠܡܐ mjw_IN.utf8 ܫܠܡܐ mk_MK.utf8 ܫܠܡܐ ml_IN.utf8 ܫܠܡܐ mn_MN.utf8 ܫܠܡܐ mni_IN.utf8 ܫܠܡܐ mnw_MM.utf8 ܫܠܡܐ mr_IN.utf8 ܫܠܡܐ ms_MY.utf8 ܫܠܡܐ mt_MT.utf8 ܫܠܡܐ my_MM.utf8 ܫܠܡܐ nan_TW.utf8 ܫܠܡܐ nb_NO.utf8 ܫܠܡܐ nds_DE.utf8 ܫܠܡܐ nds_NL.utf8 ܫܠܡܐ ne_NP.utf8 ܫܠܡܐ nhn_MX.utf8 ܫܠܡܐ niu_NU.utf8 ܫܠܡܐ niu_NZ.utf8 ܫܠܡܐ nl_AW.utf8 ܫܠܡܐ nl_BE.utf8 ܫܠܡܐ nl_NL.utf8 ܫܠܡܐ nn_NO.utf8 ܫܠܡܐ nr_ZA.utf8 ܫܠܡܐ nso_ZA.utf8 ܫܠܡܐ oc_FR.utf8 ܫܠܡܐ 
om_ET.utf8 ܫܠܡܐ om_KE.utf8 ܫܠܡܐ or_IN.utf8 ܫܠܡܐ os_RU.utf8 ܫܠܡܐ pa_IN.utf8 ܫܠܡܐ pa_PK.utf8 ܫܠܡܐ pap_AW.utf8 ܫܠܡܐ pap_CW.utf8 ܫܠܡܐ pl_PL.utf8 ܫܠܡܐ ps_AF.utf8 ܫܠܡܐ pt_BR.utf8 ܫܠܡܐ pt_PT.utf8 ܫܠܡܐ quz_PE.utf8 ܫܠܡܐ raj_IN.utf8 ܫܠܡܐ rif_MA.utf8 ܫܠܡܐ ro_RO.utf8 ܫܠܡܐ ru_RU.utf8 ܫܠܡܐ ru_UA.utf8 ܫܠܡܐ rw_RW.utf8 ܫܠܡܐ sa_IN.utf8 ܫܠܡܐ sah_RU.utf8 ܫܠܡܐ sat_IN.utf8 ܫܠܡܐ sc_IT.utf8 ܫܠܡܐ sd_IN.utf8 ܫܠܡܐ se_NO.utf8 ܫܠܡܐ sgs_LT.utf8 ܫܠܡܐ shn_MM.utf8 ܫܠܡܐ shs_CA.utf8 ܫܠܡܐ si_LK.utf8 ܫܠܡܐ sid_ET.utf8 ܫܠܡܐ sk_SK.utf8 ܫܠܡܐ sl_SI.utf8 grep: Invalid collation character sm_WS.utf8 ܫܠܡܐ so_DJ.utf8 ܫܠܡܐ so_ET.utf8 ܫܠܡܐ so_KE.utf8 ܫܠܡܐ so_SO.utf8 ܫܠܡܐ sq_AL.utf8 ܫܠܡܐ sq_MK.utf8 ܫܠܡܐ sr_ME.utf8 ܫܠܡܐ sr_RS.utf8 ܫܠܡܐ ss_ZA.utf8 ܫܠܡܐ st_ZA.utf8 ܫܠܡܐ sv_FI.utf8 ܫܠܡܐ sv_SE.utf8 ܫܠܡܐ sw_KE.utf8 ܫܠܡܐ sw_TZ.utf8 ܫܠܡܐ syr.utf8 ܫܠܡܐ szl_PL.utf8 ܫܠܡܐ ta_IN.utf8 ܫܠܡܐ ta_LK.utf8 ܫܠܡܐ tcy_IN.utf8 ܫܠܡܐ te_IN.utf8 ܫܠܡܐ tg_TJ.utf8 ܫܠܡܐ th_TH.utf8 grep: Invalid collation character the_NP.utf8 ܫܠܡܐ ti_ER.utf8 ܫܠܡܐ ti_ET.utf8 ܫܠܡܐ tig_ER.utf8 ܫܠܡܐ tk_TM.utf8 ܫܠܡܐ tl_PH.utf8 ܫܠܡܐ tn_ZA.utf8 ܫܠܡܐ to_TO.utf8 ܫܠܡܐ tpi_PG.utf8 ܫܠܡܐ tr_CY.utf8 ܫܠܡܐ tr_TR.utf8 ܫܠܡܐ ts_ZA.utf8 ܫܠܡܐ tt_RU.utf8 ܫܠܡܐ ug_CN.utf8 ܫܠܡܐ uk_UA.utf8 ܫܠܡܐ unm_US.utf8 ܫܠܡܐ ur_IN.utf8 ܫܠܡܐ ur_PK.utf8 ܫܠܡܐ uz_UZ.utf8 ܫܠܡܐ ve_ZA.utf8 ܫܠܡܐ vi_VN.utf8 ܫܠܡܐ wa_BE.utf8 ܫܠܡܐ wae_CH.utf8 ܫܠܡܐ wal_ET.utf8 ܫܠܡܐ wo_SN.utf8 ܫܠܡܐ xh_ZA.utf8 ܫܠܡܐ yi_US.utf8 ܫܠܡܐ yo_NG.utf8 ܫܠܡܐ yue_HK.utf8 ܫܠܡܐ yuw_PG.utf8 ܫܠܡܐ zh_CN.utf8 ܫܠܡܐ zh_HK.utf8 ܫܠܡܐ zh_SG.utf8 ܫܠܡܐ zh_TW.utf8 ܫܠܡܐ zu_ZA.utf8 ܫܠܡܐ mfabian@hathi:/local/mfabian/src/glibc/localedata/locales (convert-RU-to-utf8 $%) $
According to my last comment, the only locales where this problem still occurs are:
C.utf8
ar_SA.utf8
ja_JP.utf8
km_KH.utf8
ko_KR.utf8
lo_LA.utf8
sl_SI.utf8
th_TH.utf8
These are all locales which do not yet use
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
...
The iso14651_t1_common file was updated in 2017 to a much newer version from 2016, which apparently fixed this problem except for the locales which still do not use copy "iso14651_t1". I think that is good enough for the moment to close this bug here as FIXED.
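To double-check which locale sources fall into that group, something like the following sketch could be run over the locale source files. It only detects a direct copy "iso14651_t1" line inside the LC_COLLATE section and skips lines starting with %, assuming % is the comment character as in the snippet above. Locales that pull in the template indirectly, by copying another locale's LC_COLLATE, would be reported as not using it. The function names and the directory argument are hypothetical.

```python
import re
from pathlib import Path

# Capture everything between "LC_COLLATE" and "END LC_COLLATE".
_COLLATE_SECTION = re.compile(
    r'^LC_COLLATE\b(.*?)^END\s+LC_COLLATE\b', re.S | re.M)
_COPY_TEMPLATE = re.compile(r'copy\s+"iso14651_t1"')


def copies_iso14651(locale_source):
    """Return True if LC_COLLATE directly copies "iso14651_t1"."""
    section = _COLLATE_SECTION.search(locale_source)
    if section is None:
        return False
    for line in section.group(1).splitlines():
        stripped = line.strip()
        if stripped.startswith("%"):
            continue  # assumed comment line
        if _COPY_TEMPLATE.match(stripped):
            return True
    return False


def locales_without_template(directory):
    """List file names in `directory` whose LC_COLLATE lacks a direct copy."""
    missing = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            if not copies_iso14651(text):
                missing.append(path.name)
    return missing
```

Run against a checkout, for example locales_without_template("localedata/locales"), the result could be compared with the eight failing locales listed above.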