Summary: | regex range semantics outside of POSIX should be documented | ||
---|---|---|---|
Product: | glibc | Reporter: | Eric Blake <eblake> |
Component: | manual | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | normal | CC: | fweimer, glibc-bugs-regex, glibc-bugs |
Priority: | P2 | Flags: | fweimer: security- |
Version: | 2.12 | ||
Target Milestone: | --- | ||
See Also: | https://sourceware.org/bugzilla/show_bug.cgi?id=23393 | ||
Host: | Target: | ||
Build: | Last reconfirmed: |
Description
Eric Blake
2010-09-21 15:24:45 UTC
Possibly related to the resolution of http://sources.redhat.com/bugzilla/show_bug.cgi?id=10290

Another similarly confusing example:

$ echo 'ach' | LANG=cs_CZ.UTF-8 sed -n '/a[d-p]/p'
ch
$ echo 'ach' | LANG=cs_CZ.UTF-8 sed -n '/a[^d-p]/p'
ch

Actually, according to POSIX 2008, there was a requirement in older POSIX that range expressions be treated as CEO (collating element order) for all locales, but this was specifically relaxed in POSIX 2001. If glibc is going to insist on CEO ordering because of a version of POSIX two editions ago, it would be nice to see that documented. Then again, other glibc interfaces no longer comply with the stricter requirements in older POSIX that have since been relaxed (for example, whether getopt() must include an error message with "illegal" in the string), so I see no reason to tie regex to the older standard's CEO ordering either.

XRAT A.9.3.5: http://www.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html

Historical implementations used native character order to interpret range expressions. The ISO POSIX-2:1993 standard instead required collating element order (CEO): the order that collating elements were specified between the order_start and order_end keywords in the LC_COLLATE category of the current locale. CEO had some advantages in portability over the native character order, but it also had some disadvantages:

* CEO could not feasibly be mimicked in user code, leading to inconsistencies between POSIX matchers and matchers in popular user programs like Emacs, ksh, and Perl.

* CEO caused range expressions to match accented and capitalized letters contrary to many users' expectations. For example, "[a-e]" typically matched both 'E' and 'á' but neither 'A' nor 'é'.

* CEO was not consistent across implementations. In practice, CEO was often less portable than native character order. For example, it was common for the CEOs of two implementation-supplied locales to disagree, even if both locales were named "da_DK".
Because of these problems, some implementations of regular expressions continued to use native character order. Others used the collation sequence, which is more consistent with sorting than either CEO or native order, but which departs further from the traditional POSIX semantics because it generally requires "[a-e]" to match either 'A' or 'E' but not both. As a result of this kind of implementation variation, programmers who wanted to write portable regular expressions could not rely on the ISO POSIX-2:1993 standard guarantees in practice.

While revising the standard, lengthy consideration was given to proposals to attack this problem by adding an API for querying the CEO to allow user-mode matchers, but none of these proposals had implementation experience and none achieved consensus. Leaving the standard alone was also considered, but rejected due to the problems described above.

The current standard leaves unspecified the behavior of a range expression outside the POSIX locale. This makes it clearer that conforming applications should avoid range expressions outside the POSIX locale, and it allows implementations and compatible user-mode matchers to interpret range expressions using native order, CEO, collation sequence, or other, more advanced techniques. The concerns which led to this change were raised in IEEE PASC interpretation 1003.2 #43 and others, and related to ambiguities in the specification of how multi-character collating elements should be handled in range expressions. These ambiguities had led to multiple interpretations of the specification, in conflicting ways, which led to varying implementations. As noted above, efforts were made to resolve the differences, but no solution has been found that would be specific enough to allow for portable software while not invalidating existing implementations.

It turns out that regex range semantics for glibc are "CEO". They _are_ consistent, it's the locale definition files that are not consistent.
I created a file with the 52 uppercase and lowercase letters and did a "sed -n /[A-Z]/p" on this file. The results I get are either this

26 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

or this

51 AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ

Here are the "51" locales:

ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK sl_SI th_TH tr_CY tr_TR

These return 51 for both $l and $l.utf8. Every other locale returns 26 for both unibyte and multibyte variants.

Locales using glibc's localedata/locales/iso14651_t1_common template return 26. This template defines the collation like this:

<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a start lowercase
<U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
<U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
...
<U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
...
<U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ end lowercase
<U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A start uppercase
<U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
...
<U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
...
<U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ end uppercase

(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why [A-z] fails but [a-Z] works.)

Instead, the "special" locales above use a different sequence, for example in cs_CZ:

<U0041> <U0041>;<NONE>;<CAPITAL>;<U0041> # A
<U0061> <U0041>;<NONE>;<SMALL>;<U0041> # a
<U00AA> <U0041>;<NONE>;<U00AA>;<U0041> # ª
<U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041> # Á
<U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041> # á
...
<U005A> <U005A>;<NONE>;<CAPITAL>;<U005A> # Z
<U007A> <U005A>;<NONE>;<SMALL>;<U005A> # z

So, it looks like __collseq_table_lookup is what the POSIX rationale document calls "CEO". I'll open a bug on the inconsistencies caused by using CEO. In the meantime, this bug remains open for the documentation part.
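
For anyone who wants to see why the two families of locales give 26 versus 51, here is a rough Python sketch of the CEO rule: a range [lo-hi] matches every element whose position in the locale's definition order lies between the positions of lo and hi. This is only a model, not glibc code. The helper ceo_range and the two order strings are made up for illustration, and both orders are reduced to the 52 ASCII letters (no accented letters, no multi-character elements such as the Czech "ch").

```python
import string


def ceo_range(order, lo, hi):
    """Return the characters [lo-hi] would match if ranges follow `order`.

    `order` is a string listing collating elements in definition order
    (a simplified stand-in for the LC_COLLATE order_start..order_end list).
    """
    pos = {ch: i for i, ch in enumerate(order)}
    if pos[lo] > pos[hi]:
        # Start point sorts after end point: the range is invalid.
        raise ValueError(f"invalid range end: [{lo}-{hi}]")
    return [ch for ch in order if pos[lo] <= pos[ch] <= pos[hi]]


# Assumption: iso14651_t1_common reduced to "all lowercase, then all uppercase".
ISO14651_LIKE = string.ascii_lowercase + string.ascii_uppercase

# Assumption: cs_CZ reduced to interleaved "A a B b ... Z z".
CS_CZ_LIKE = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)

if __name__ == "__main__":
    for name, order in [("iso14651-like", ISO14651_LIKE),
                        ("cs_CZ-like", CS_CZ_LIKE)]:
        matched = ceo_range(order, "A", "Z")
        print(name, len(matched), "".join(matched))
```

Under these assumptions the model gives 26 letters for the iso14651-like order and 51 (everything except z) for the cs_CZ-like order. In the iso14651-like order, ceo_range(ISO14651_LIKE, "a", "Z") covers all 52 letters, while ceo_range(ISO14651_LIKE, "A", "z") raises because A sorts after z. That is consistent with the observations above, but it says nothing about how glibc actually implements the lookup.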