Bug 31149 - combining characters (accents) misclassified as punct rather than alpha
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.37
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2023-12-12 15:53 UTC by Vincent Lefèvre
Modified: 2023-12-13 10:15 UTC
CC List: 2 users

fweimer: security-


Description Vincent Lefèvre 2023-12-12 15:53:21 UTC
Combining characters such as U+0301 COMBINING ACUTE ACCENT are misclassified as punct, whereas they should be classified as alpha.

With glibc 2.37 under Debian/unstable, I get for this character:

Property alnum : no
Property alpha : no
Property cntrl : no
Property digit : no
Property graph : yes
Property lower : no
Property print : yes
Property punct : yes
Property space : no
Property upper : no
Property xdigit: no

This affects grep: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27681 (where it is said that the bug is in the GNU libc).

Corresponding Debian bug:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=868654
(which was reported on 2017-07-17 and has not seen any activity since).
Comment 1 Florian Weimer 2023-12-13 10:01:12 UTC
Isn't the larger issue here that it's reasonable to expect that [[:alpha:]] matches a single letter as perceived by the user: an entire grapheme cluster comprising the base character(s), its associated combining characters and other marks. We do not implement any of that in glibc, and there are no plans to do so.
Comment 2 Vincent Lefèvre 2023-12-13 10:15:18 UTC
(In reply to Florian Weimer from comment #1)
> Isn't the larger issue here that it's reasonable to expect that [[:alpha:]]
> matches a single letter as perceived by the user: an entire grapheme cluster
> comprising the base character(s), its associated combining characters and
> other marks. [...]

I don't think so. The functions iswctype(), iswalpha(), etc. take a single code point (type wint_t), and the regex(7) man page says:

  Within a bracket expression, the name of a character class enclosed
  in "[:" and ":]" stands for the list of all characters belonging to
  that class. Standard character class names are:

        alnum   digit   punct
        alpha   graph   space
        blank   lower   upper
        cntrl   print   xdigit

  These stand for the character classes defined in wctype(3). [...]

so [[:alpha:]] is expected to match a single character, just as the functions above do.