Bug 12051 - CEO has confusing differences across locales
Status: RESOLVED WONTFIX
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.12
Importance: P2 normal
Target Milestone: ---
Assignee: GNU C Library Locale Maintainers
Reported: 2010-09-24 12:47 UTC by Paolo Bonzini
Modified: 2018-11-27 14:58 UTC
fweimer: security-


Description Paolo Bonzini 2010-09-24 12:47:46 UTC
According to POSIX 2008, older editions of POSIX required range expressions
to be evaluated in CEO (collating element order) in all locales.  POSIX
mentions some disadvantages of CEO but omits one in particular, which glibc
exhibits: even when only ASCII characters and a single implementation are
considered, the behavior with respect to case varies across locales.  In some
locales "[a-e]" may match 'A' or 'E', while in others it matches neither.

CEO in glibc is inconsistent for these locales:

  ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
  sl_SI th_TH tr_CY tr_TR

which are the only ones following this model (from cs_CZ):

  <U0041> <U0041>;<NONE>;<CAPITAL>;<U0041>    # A
  <U0061> <U0041>;<NONE>;<SMALL>;<U0041>    # a
  <U00AA> <U0041>;<NONE>;<U00AA>;<U0041>    # ª
  <U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041>    # Á
  <U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041>    # á
  ...
  <U005A> <U005A>;<NONE>;<CAPITAL>;<U005A>    # Z
  <U007A> <U005A>;<NONE>;<SMALL>;<U005A>    # z

rather than the one in localedata/locales/iso14651_t1_common:

  <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a    start lowercase
  <U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
  <U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
  ...
  <U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
  ...
  <U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ     end lowercase
  <U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A    start uppercase
  <U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
  ...
  <U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
  ...
  <U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ    end uppercase
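
To make the difference between the two models concrete, here is a toy
simulation in Python.  It is only a sketch, not glibc's implementation: it
assumes a range "[x-y]" matches every character whose position in the listed
collation sequence lies between those of x and y, it covers only the ASCII
letters, and the names range_matches, interleaved and blocked are made up
for this illustration.

```python
def range_matches(order, lo, hi):
    """Return the characters positioned between lo and hi (inclusive)
    in the sequence `order`: a toy model of a CEO range expression."""
    start, end = order.index(lo), order.index(hi)
    return set(order[start:end + 1])


letters = "abcdefghijklmnopqrstuvwxyz"

# cs_CZ-style model (assumption: accented letters left out):
# each capital is listed immediately before its small form, A a B b ... Z z.
interleaved = [c for ch in letters for c in (ch.upper(), ch)]

# iso14651_t1_common-style model (assumption: accented letters left out):
# all small letters first, then all capitals.
blocked = list(letters) + list(letters.upper())

for name, order in (("interleaved", interleaved), ("blocked", blocked)):
    matched = range_matches(order, "a", "e")
    print(name, "".join(sorted(matched)),
          "matches A:", "A" in matched,
          "matches E:", "E" in matched)
```

Under these assumptions the interleaved order makes "[a-e]" pick up 'E' (and
'B' through 'D') but not 'A', while the blocked order matches no capital
letter at all.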

As an aside, the CEO requirement was specifically relaxed in POSIX 2001, so
glibc is insisting on CEO ordering because of a version of POSIX two editions
ago (without documenting it).  At the same time, other glibc interfaces no
longer comply with the stricter requirements in older POSIX that have since been
relaxed (for example, whether getopt() must include an error message with
"illegal" in the string).  So, there is no reason to tie regex to the older
standard's CEO ordering.
Comment 1 Ulrich Drepper 2010-10-04 02:42:45 UTC
This stays as it is.  If individual locale maintainers think the current
behavior is unintentional, they can change it.  But in general this is the
long-implemented behavior and won't be changed.  Collating elements are just not
really useful outside the POSIX locale or when the locale is guaranteed to stay
the same.