Bug 14185

Summary:	fnmatch() fails when '*' wildcard is applied on the file name containing multi-byte character(s)
Product:	glibc	Reporter:	Patsy Franklin <pfrankli>
Component:	glob	Assignee:	Adhemerval Zanella <adhemerval.zanella>
Status:	RESOLVED FIXED
Severity:	normal	CC:	adhemerval.zanella, bugdal, drepper.fsp, fweimer, law, neleai
Priority:	P2	Flags:	fweimer: security-
Version:	unspecified
Target Milestone:	2.34
Host:		Target:
Build:		Last reconfirmed:
Attachments:	simple testcase patch file

Description Patsy Franklin 2012-05-30 19:01:45 UTC

Created attachment 6427 [details]
simple testcase

fnmatch tries to convert the string (filename) from multibyte to a wide character representation via mbsrtowcs.  When the encoding is invalid (say \366 when using en_US.UTF-8), mbsrtowcs returns EILSEQ (illegal byte sequence).

Comment 1 Rich Felker 2012-05-31 00:29:51 UTC

While this is not a conformance issue (actually, the most correct/conformant thing may be to reject as non-matching or give an error on invalid strings like this), I believe it is a quality-of-implementation issue. It makes it unnecessarily difficult to do things like deleting junk files created by an obnoxious user or extracting a corrupt archive file (or just one created by someone with poor taste in character encoding); "rm -f *" will fail.

As a related (same fundamental cause: the conversion to a wchar_t string) quality-of-implementation issue, fnmatch("*", huge_string, 0) fails on glibc even though it should obviously match without even having to inspect huge_string, much less make a 4x-size copy of it.

Unfortunately glibc's fnmatch implementation is just really ugly, and I don't think issues like this can be fixed without piling on ugly hacks, or just replacing the implementation with something saner...

Comment 2 law 2012-05-31 02:49:09 UTC

AFAICT there are no defined errors from fnmatch.  It's defined as returning 0 for a match, FNM_NOMATCH for no match and another non-zero value for an error.  However, no specific error codes are defined.

But what doesn't make sense about the current implementation is if pattern is something like *.csv and the string is <invalid>.csv, then we return an error.  To me that seems plainly wrong as we can get a positive match regardless of the invalid character in the string.

As you note, it makes rm -f * fail as will "find" and a variety of other tools which rely upon fnmatch.  This actually came to our attention because "find" was missing an obvious match because of an invalid character in a filename.

It's worth noting that gnulib's fnmatch has been fixed to address this problem.


Yes, there's an additional issue with handling out of memory conditions with large strings.  We don't have a fix for that and I think it should be tracked as a separate and distinct issue.

Comment 3 Patsy Franklin 2012-05-31 20:18:10 UTC

Created attachment 6429 [details]
patch file

Comment 4 Rich Felker 2012-06-01 02:32:05 UTC

This patch fixes the bug in the test case, but I'm not convinced that it doesn't create other (possibly much worse!) bugs in the process. For instance, the pattern "[á]*" should not match "\xa1.txt", but with fallback to treating everything as bytes, it does match. Basically this patch would make "rm -f pattern" potentially delete lots of files with illegal byte sequences in their names whenever pattern contains a bracket expression with multibyte characters in it. (Note that the effect becomes even more severe with a range expression like [à-á]...)

Comment 5 law 2012-06-18 17:17:34 UTC

Rich,
Yes, the proposed patch may have an issue around pattern "[á]*" should not match "\xa1.txt"; not sure if that's a worse situation or not.

The problem the patch fixes has been reported multiple times over the last 10 years or so without anyone taking action in glibc.  gnulib fixed it in effectively the same way as the patch above in 2005 and I don't think anyone's reported problems.

SuSE had the patch attached to this BZ in their distro for a while, but pulled it for reasons unknown.

Comment 6 Ondrej Bilka 2013-10-09 09:18:53 UTC

Was patch above send to libc-alpha?

A proper solution would be use new function called say mbsrtowcs_with_errors.
At illegal sequence it would write special widechar and tried to resumed matching next byte.

It would be slow but fnmatch is not performance-critical anyway, on UTF8 2/3 of time is spend at conversion to wide chars which is not neccessary when needle is ascii only.

Comment 7 Rich Felker 2013-10-10 16:50:50 UTC

A much better solution would be to rewrite fnmatch not to go through a temporary wchar_t string at all. It's actually just as easy, if not easier, to do the matching directly as a multibyte string. It can be done in-place (no additional temporary storage at all) for standard fnmatch, but I'm not clear on whether this can be made efficient for the GNU extensions. I believe it may also be possible to adapt the twoway algorithm to fnmatch so that (at least most) matches take linear time rather than quadratic time, but it's still quite possible that I'm wrong on this (I haven't worked out enough of the details yet).

Comment 8 Florian Weimer 2018-08-17 13:16:34 UTC

Our current mbrtowc implementation is quite slow because it has to call into gconv to convert a single character.  It's not extremely bad (~40 MiB/s on a somewhat dated laptop, Core i7-2620M CPU), but I can see why there is a desire to have a fast path for single-byte strings.  Replacing *p++ with mbrtowc for string decoding would probably need a UTF-8 path inside mbrtowc, or a different kind of interface to reduce the setup overhead.

Comment 9 Adhemerval Zanella 2018-08-17 17:44:00 UTC

(In reply to Florian Weimer from comment #8)
> Our current mbrtowc implementation is quite slow because it has to call into
> gconv to convert a single character.  It's not extremely bad (~40 MiB/s on a
> somewhat dated laptop, Core i7-2620M CPU), but I can see why there is a
> desire to have a fast path for single-byte strings.  Replacing *p++ with
> mbrtowc for string decoding would probably need a UTF-8 path inside mbrtowc,
> or a different kind of interface to reduce the setup overhead.

We will need to call mbrtowc for both pattern and string, so it would be at least
two call per iteration. I am not sure if real world user cases pay off this optimization, some cases or early bailout (where there is no need to check for all pattern and/or string) might be faster by converting character by character. 

I am also not sure how extensively multibyte patterns are used since there are not exactly stardard-defined, so if it the case where multi-byte character enconding is the most common, we can simplify the implementation to just use
the multi-byte version by always converting internally and adjusting the use of expected internal collation tables (it might require some adjustments of the findidx code).

However the main issue imho is the total memory usage (without considering the GNU extensions), which I tend to prefer to handle internally character by character instead of converting the full pattern/string altogether.

Comment 10 Sourceware Commits 2021-02-23 21:06:53 UTC

The master branch has been updated by Adhemerval Zanella <azanella@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a79328c745219dcb395070cdcd3be065a8347f24

commit a79328c745219dcb395070cdcd3be065a8347f24
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Mon Jan 4 14:26:30 2021 -0300

    posix: Falling back to non wide mode in case of encoding error [BZ #14185]
    
    Gnulib has added the proposed fix with aed23714d60 (done in 2005), but
    recently with a glibc merge with 67306f6 (done in 2020 with sync back)
    it has fallback to old semantic to return -1 on in case of failure.
    
    From gnulib developer feedback it was an oversight.  Although the full
    fix for BZ #14185 would require to rewrite fnmatch implementation to use
    mbrtowc instead of mbsrtowcs on the full input, this mitigate the issue
    and it has been used by gnulib for a long time.
    
    This patch also removes the alloca usage on the string convertion to
    wide characters before calling the internal function.
    
    Checked on x86_64-linux-gnu.

Comment 11 Adhemerval Zanella 2021-02-23 21:07:43 UTC

Fixed on 2.34.