Bug 31075

Summary: fnmatch("??") matches on one two byte valid character (as well as two any-length characters)
Product: glibc Reporter: Stephane Chazelas <stephane+sourceware>
Component: glob Assignee: Not yet assigned to anyone <unassigned>
Status: UNCONFIRMED ---    
Severity: normal Flags: stephane+sourceware: security?
Priority: P2    
Version: 2.34   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description Stephane Chazelas 2023-11-18 10:03:11 UTC
Regression introduced in 2.34 by commit a79328c745219dcb395070cdcd3be065a8347f24; reproduced on Ubuntu 22.04, Debian sid (libc6:amd64 2.37-12), and current git HEAD (dae3cf4134d476a4b4ef86fd7012231d6436c15e) built on that sid system.

find . -name '??'

In a UTF-8 locale, this matches a UTF-8 encoded éé (0xc3 0xa9 0xc3 0xa9), as expected, but also a single UTF-8 encoded é (0xc3 0xa9).

To reproduce, from a shell with support for ksh93-style $'...' quotes (ksh93, zsh, bash...) and on a system where the C.UTF-8 locale has been enabled (change to any other UTF-8 locale if not):

(
  mkdir new-dir && cd new-dir || exit
  touch $'\xc3\xa9' $'\xc3\xa9\xc3\xa9'
  export LC_ALL=C.UTF-8
  locale charmap
  find . -name '??'
)

UTF-8
./é
./éé

It seems that when fnmatch() fails to match in wchar_t mode, it tries again in char mode. The pattern is then also treated as a char[] array, which makes it even worse than the (already quite buggy) behaviour of bash pattern matching (https://lists.gnu.org/archive/html/bug-bash/2021-02/msg00054.html), as that's done even when both the subject and pattern are properly encoded in the user's locale charmap.
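
A minimal sketch of that suspected two-pass behaviour, for illustration only. It uses Python's fnmatch module, which is an independent implementation and not glibc's code, to stand in for the two modes: matching on str models wchar_t mode, matching on the UTF-8 encoded bytes models char mode. The function names are hypothetical.

```python
import fnmatch

E_ACUTE = "\u00e9"  # é, encoded in UTF-8 as 0xc3 0xa9


def match_as_characters(pattern, name):
    """Model of wchar_t mode: '?' consumes one decoded character."""
    return fnmatch.fnmatchcase(name, pattern)


def match_as_bytes(pattern, name, encoding="utf-8"):
    """Model of char mode: '?' consumes one byte of the encoded name."""
    return fnmatch.fnmatchcase(name.encode(encoding), pattern.encode(encoding))


def match_with_fallback(pattern, name):
    """Model of the suspected behaviour: retry in byte mode when
    character mode fails (an assumption drawn from this report)."""
    return match_as_characters(pattern, name) or match_as_bytes(pattern, name)


for name in (E_ACUTE, E_ACUTE * 2):
    print(
        repr(name),
        "characters:", match_as_characters("??", name),
        "bytes:", match_as_bytes("??", name),
        "with fallback:", match_with_fallback("??", name),
    )
```

In this model a lone é fails in character mode (one character against two ?) but succeeds in byte mode (two bytes against two ?), so the fallback reports a match, which is consistent with the find output above.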

$ find . -name '*[á-ä]*'
./é
./éé

Those didn't match in wchar_t mode but matched in char mode, where the pattern became *[\303\241-\303\244]*, which matches anything containing a byte in the range 0241 to 0303 (octal).
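
To make the byte arithmetic concrete, here is a small sketch that expands a bracket body the way a byte-oriented matcher would see it. The parser is deliberately simplified (no negation, no character classes, no escapes) and is not glibc's; bracket_byte_set is a hypothetical helper. The character-mode comparison uses plain code points, which is a simplification of how a locale-aware matcher orders characters.

```python
def bracket_byte_set(bracket_body, encoding="utf-8"):
    """Expand a bracket body byte by byte: each byte is a member,
    and 'x-y' is a range over byte values (simplified parser)."""
    raw = bracket_body.encode(encoding)
    members = set()
    i = 0
    while i < len(raw):
        if i + 2 < len(raw) and raw[i + 1] == ord("-"):
            # byte-level range: from raw[i] to raw[i + 2] inclusive
            members.update(range(raw[i], raw[i + 2] + 1))
            i += 3
        else:
            members.add(raw[i])
            i += 1
    return members


A_ACUTE, A_DIAERESIS, E_ACUTE = "\u00e1", "\u00e4", "\u00e9"  # á, ä, é

members = bracket_byte_set(A_ACUTE + "-" + A_DIAERESIS)
print("byte members: %#o to %#o" % (min(members), max(members)))

# Character mode (code point comparison): é is outside á..ä.
print("é in á..ä by code point:", A_ACUTE <= E_ACUTE <= A_DIAERESIS)

# Byte mode: every byte of é falls inside the expanded byte set.
print("bytes of é in set:", [b in members for b in E_ACUTE.encode("utf-8")])
```

The bytes of á-ä are 0xc3 0xa1 0x2d 0xc3 0xa4, so the only range a byte-wise parser sees is \241-\303, with \303 and \244 as individual members that already fall inside it.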

As with bash, it becomes worse in locales that have characters whose encoding contains the encoding of [, ] or \, as fnmatch() can end up matching against a pattern completely different from the one intended by the user.
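
As an illustration of that last point, the sketch below looks for characters whose multibyte encoding contains the byte value of [, ] or \. The choice of Shift_JIS and of the character ソ (U+30BD) is mine and is not taken from this report; embedded_metachar_bytes is a hypothetical helper, and whether a given system actually offers a locale with such a charmap is a separate question.

```python
META = b"[]\\"  # byte values 0x5b, 0x5d and 0x5c


def embedded_metachar_bytes(text, encoding):
    """Return (character, encoded bytes) pairs for characters whose
    multibyte encoding contains the byte value of [, ] or \\."""
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) > 1 and any(b in META for b in raw):
            hits.append((ch, raw))
    return hits


KATAKANA_SO = "\u30bd"  # ソ

# In Shift_JIS the second byte of this character is 0x5c, the byte for '\'.
print(embedded_metachar_bytes(KATAKANA_SO, "shift_jis"))

# UTF-8 never reuses ASCII byte values inside a multibyte sequence.
print(embedded_metachar_bytes(KATAKANA_SO + "\u00e9", "utf-8"))
```

A pattern containing such a character, once reinterpreted as a char[] array, gains a stray backslash (or bracket) that the user never typed.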
Comment 1 Stephane Chazelas 2023-11-18 11:53:33 UTC
(In reply to Stephane Chazelas from comment #0)
> Regression introduced in 2.34 by commit
> a79328c745219dcb395070cdcd3be065a8347f24  reproduced on Ubuntu 22.04, Debian
> sid libc6:amd64 2.37-12, and current git HEAD
> (dae3cf4134d476a4b4ef86fd7012231d6436c15e) built on that sid system.
> 
> find . -name '??'
[...]

To clarify, "find" is used here to demonstrate the behaviour of the libc's fnmatch(). Here with GNU find.

$ LD_DEBUG=bindings find é -name '??' |& grep fnmatch
    323895:     binding file /lib/x86_64-linux-gnu/libselinux.so.1 [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fnmatch' [GLIBC_2.2.5]
    323895:     binding file find [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fnmatch' [GLIBC_2.2.5]

$ ltrace -e 'fnmatch' find é éé $'\U10FFFF\U10FFFF' -name '??'
find->fnmatch("foo", "foo", 0)                                                                                        = 0
find->fnmatch("Foo", "foo", 0)                                                                                        = 1
find->fnmatch("Foo", "foo", 16)                                                                                       = 0
find->fnmatch("??", "\303\251", 0)                                                                                    = 0
é
find->fnmatch("??", "\303\251\303\251", 0)                                                                            = 0
éé
find->fnmatch("??", "\364\217\277\277\364\217\277\277", 0)                                                            = 0
??
+++ exited (status 0) +++

(here showing ?? matching one 2-byte character, two 2-byte characters and two 4-byte characters).
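
For anyone who wants to take find out of the picture entirely, here is a hedged sketch that calls the C library's fnmatch() directly through Python's ctypes. It assumes a POSIX system where ctypes.CDLL(None) exposes the already-loaded libc; libc_fnmatch is a hypothetical wrapper name. What it prints for the C.UTF-8 locale depends on the libc version in use, so no expected output is given.

```python
import ctypes
import locale


def libc_fnmatch(pattern, name, locale_name):
    """Call the C library's fnmatch(pattern, name, 0) under locale_name.

    Returns True on a match, False otherwise, or None when the
    requested locale is not available on this system.
    """
    libc = ctypes.CDLL(None)  # assumption: POSIX, libc already loaded
    libc.fnmatch.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
    libc.fnmatch.restype = ctypes.c_int

    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        return None
    try:
        return libc.fnmatch(pattern, name, 0) == 0  # 0 means "matched"
    finally:
        locale.setlocale(locale.LC_ALL, saved)


for loc in ("C", "C.UTF-8"):
    for name in (b"\xc3\xa9", b"\xc3\xa9\xc3\xa9"):
        print(loc, name, libc_fnmatch(b"??", name, loc))
```

In the C locale each byte is its own character, so ?? matching the two bytes of é there is expected; the C.UTF-8 rows are the ones relevant to this bug.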
Comment 2 Stephane Chazelas 2023-11-24 19:56:04 UTC
Can likely be considered a security issue, as it means patterns match things that were not intended to be matched (I'll let you guys decide on that). On the other hand, that bug works around long-standing issues whereby, for instance,

find . ! -name '*evil*' -exec ... {} +

was failing to exclude file names containing "evil" when what's on either side is not valid text in the user's locale (a common issue these days, when UTF-8 is the norm).

Though of course falling back to treating both pattern and subject as char[] arrays when the subject cannot be decoded as text, as bash does (and which might have been the intent of a79328c745219dcb395070cdcd3be065a8347f24), is incorrect (see https://lists.gnu.org/archive/html/bug-bash/2021-02/msg00054.html for more details).