Bug 11561 - Collation characters represented by internal name instead of character sequence
Summary: Collation characters represented by internal name instead of character sequence
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: regex
Version: unspecified
Importance: P2 normal
Target Milestone: 2.18
Assignee: Paolo Bonzini
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-04-30 07:36 UTC by Paolo Bonzini
Modified: 2014-06-30 18:14 UTC
CC List: 4 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
seek also the hyphenated form of a collation-element name (321 bytes, patch)
2013-02-03 15:42 UTC, Benno Schulenberg

Description Paolo Bonzini 2010-04-30 07:36:27 UTC
In the glibc locale definitions, collating elements have a hyphenated name:

    collating-symbol  <zs>
    collating-element <z-s> from "<U007A><U0073>"

and the hyphenated name has to be used in regular expressions for [[. .]] to
work properly:

    $ echo '*ch*' | LC_COLLATE=cs_CZ.UTF-8 sed 's/[[.c-h.]]//'
    **
    $ echo 'ch' | LC_COLLATE=cs_CZ.UTF-8 sed 's/[[.ch.]]//'
    sed: -e expression #1, char 12: Invalid collation character

However, POSIX.1-2008 says:

        A collating symbol is a collating element enclosed within
        bracket-period ( "[." and ".]" ) delimiters. Collating
        elements are defined as described in Collation Order.
        Conforming applications shall represent multi-character
        collating elements as collating symbols when it is
        necessary to distinguish them from a list of the
        individual characters that make up the multi-character
        collating element. For example, if the string "ch" is a
        collating element defined using the line:

        collating-element <ch-digraph> from "<c><h>"

        in the locale definition, the expression "[[.ch.]]" shall
        be treated as an RE containing the collating symbol 'ch',
        while "[ch]" shall be treated as an RE matching 'c' or
        'h'. Collating symbols are recognized only inside
        bracket expressions. If the string is not a collating
        element in the current locale, the expression is invalid.

POSIX specifically uses [[.ch.]] in the example rather than [[.ch-digraph.]], so
this is a bug in glibc.  It shouldn't be hard to fix it in regcomp.
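
For anyone who wants to check this without going through sed, here is a
minimal sketch in C that calls regcomp directly.  The helper name
compile_in_locale is made up for this example; only setlocale, regcomp and
regfree are real library calls.  It assumes the locale you pass is installed
on the system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of glibc): switch LC_COLLATE to
   LOCALE_NAME, compile PATTERN as a POSIX basic regular expression,
   and store regcomp's result code in *RC_OUT (0 means success).
   Returns false if the locale is not available.
   Note that this changes the process-wide LC_COLLATE setting.  */
static bool
compile_in_locale (const char *locale_name, const char *pattern, int *rc_out)
{
  assert (locale_name != NULL && pattern != NULL && rc_out != NULL);

  if (setlocale (LC_COLLATE, locale_name) == NULL)
    return false;

  regex_t re;
  int rc = regcomp (&re, pattern, REG_NOSUB);
  if (rc == 0)
    regfree (&re);   /* Only a successfully compiled regex needs freeing.  */

  *rc_out = rc;
  return true;
}
```

Going by the sed output above, calling this with "cs_CZ.UTF-8" and
"[[.ch.]]" on an affected glibc should report REG_ECOLLATE, while
"[[.c-h.]]" should compile; I would expect both to compile once the bug is
fixed.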
Comment 1 Benno Schulenberg 2013-02-03 15:37:37 UTC
> POSIX especially mentions [[.ch.]] in the example instead of [[.ch-digraph.]]
> so this is a bug in glibc.  It shouldn't be hard to fix it in regcomp.

The easiest fix would be, in my opinion, to rename all the
collation-element names for digraphs from their hyphenated
form to the non-hyphenated form.  But a few users may have
gotten used to using the hyphenated forms, working around
this bug in glibc.  They would be pissed.  So for quite a
while both forms will have to be recognized.  Attached patch
is an attempt to do this -- when the user specifies [.xx.],
it will first try to look up "xx" in the table of collation
elements, and when that fails, it will look up "x-x".  Is
this what you had in mind, Paolo?
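
The lookup order described here can be sketched in C as follows.  This is
not the attached patch and not glibc's regcomp code: the element_names
array and both function names are hypothetical stand-ins for the real
collation-element table, and the sketch only handles names that are exactly
two bytes long (multi-byte characters would need more care).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the locale's table of collation-element
   names; the two entries are the hyphenated names mentioned in this
   bug.  glibc's real table is a binary locale structure, not this.  */
static const char *const element_names[] = { "c-h", "z-s" };

/* Return the index of the LEN-byte string NAME in the table, or -1.  */
static int
find_name (const char *name, size_t len)
{
  size_t n = sizeof element_names / sizeof element_names[0];
  for (size_t i = 0; i < n; i++)
    if (strlen (element_names[i]) == len
        && memcmp (element_names[i], name, len) == 0)
      return (int) i;
  return -1;
}

/* Look up NAME as given; if that fails and NAME is exactly two bytes
   long, retry with a hyphen inserted between them ("ch" -> "c-h").
   Returns the table index, or -1 if neither form is known.  */
static int
lookup_collating_element (const char *name)
{
  assert (name != NULL);

  size_t len = strlen (name);
  int idx = find_name (name, len);
  if (idx >= 0 || len != 2)
    return idx;

  /* Not NUL-terminated on purpose: find_name takes an explicit length.  */
  char hyphenated[3] = { name[0], '-', name[1] };
  return find_name (hyphenated, sizeof hyphenated);
}
```

With this table, both "ch" and "c-h" resolve to the same entry, so existing
users of the hyphenated form keep working while the POSIX spelling is
accepted too.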
Comment 2 Benno Schulenberg 2013-02-03 15:42:18 UTC
Created attachment 6843
seek also the hyphenated form of a collation-element name

(Untested patch.  I'm just majorly annoyed that collation elements
don't work as one would expect from the documentation.)
Comment 3 Andreas Schwab 2013-02-12 08:26:48 UTC
Fixed in 2.18.