This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/12051] New: CEO has confusing differences across locales


According to POSIX 2008, older versions of POSIX required range expressions to
be treated as CEO (collating element order) in all locales.  POSIX mentions some
disadvantages of CEO but omits one in particular, which glibc exhibits: even
when considering only ASCII characters and a single implementation, the behavior
with respect to case varies across locales.  In some locales, "[a-e]" may match
either 'A' or 'E', while in others it matches neither.

CEO in glibc is inconsistent for these locales:

  ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
  sl_SI th_TH tr_CY tr_TR

which are the only ones following this model (from cs_CZ):

  <U0041> <U0041>;<NONE>;<CAPITAL>;<U0041>    # A
  <U0061> <U0041>;<NONE>;<SMALL>;<U0041>    # a
  <U00AA> <U0041>;<NONE>;<U00AA>;<U0041>    # ª
  <U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041>    # Á
  <U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041>    # á
  ...
  <U005A> <U005A>;<NONE>;<CAPITAL>;<U005A>    # Z
  <U007A> <U005A>;<NONE>;<SMALL>;<U005A>    # z

rather than the one in localedata/locales/iso14651_t1_common:

  <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a    start lowercase
  <U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
  <U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
  ...
  <U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
  ...
  <U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ     end lowercase
  <U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A    start uppercase
  <U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
  ...
  <U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
  ...
  <U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ    end uppercase
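
To illustrate how the two models lead to different results for "[a-e]", here is
a minimal Python sketch.  It does not call glibc or its regex code.  It only
models each collation sequence as a flat list restricted to ASCII letters,
leaving out the accented characters and the entries elided with "..." above.
The function names and the "interleaved"/"blocked" labels are made up for this
illustration, and the assumption that the cs_CZ-style sequence alternates
uppercase and lowercase for every ASCII letter is inferred from the excerpt.

```python
import string


def interleaved_order():
    """ASCII-only model of the cs_CZ-style sequence: A a B b ... Z z (assumed)."""
    pairs = zip(string.ascii_uppercase, string.ascii_lowercase)
    return [ch for pair in pairs for ch in pair]


def blocked_order():
    """ASCII-only model of the iso14651_t1_common-style sequence: a..z, then A..Z."""
    return list(string.ascii_lowercase + string.ascii_uppercase)


def ceo_range_matches(order, lo, hi):
    """Return every element located between lo and hi (inclusive) in `order`.

    This mimics a range expression evaluated by position in the collating
    sequence rather than by character code.
    """
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]


if __name__ == "__main__":
    for name, order in (("interleaved", interleaved_order()),
                        ("blocked", blocked_order())):
        matched = "".join(ceo_range_matches(order, "a", "e"))
        print(name, "[a-e] ->", matched)
```

Under this simplification the interleaved sequence yields aBbCcDdEe for "[a-e]",
so 'E' matches but 'A' does not, while the blocked sequence yields abcde and no
uppercase letter matches.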

As an aside, the CEO requirement was specifically relaxed in POSIX 2001, so
glibc is insisting on CEO ordering because of a version of POSIX two editions
ago (without documenting it).  At the same time, other glibc interfaces no
longer comply with the stricter requirements in older POSIX that have since been
relaxed (for example, whether getopt() must include an error message with
"illegal" in the string).  So, there is no reason to tie regex to the older
standard's CEO ordering.

-- 
           Summary: CEO has confusing differences across locales
           Product: glibc
           Version: 2.12
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales at sources dot redhat dot com
        ReportedBy: bonzini at gnu dot org
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=12051


