Created attachment 6427 [details]
fnmatch tries to convert the string (filename) from multibyte to a wide character representation via mbsrtowcs. When the encoding is invalid (say \366 when using en_US.UTF-8), mbsrtowcs returns EILSEQ (illegal byte sequence).
While this is not a conformance issue (actually, the most correct/conformant thing may be to reject as non-matching or give an error on invalid strings like this), I believe it is a quality-of-implementation issue. It makes it unnecessarily difficult to do things like deleting junk files created by an obnoxious user or extracting a corrupt archive file (or just one created by someone with poor taste in character encoding); "rm -f *" will fail.
As a related (same fundamental cause: the conversion to a wchar_t string) quality-of-implementation issue, fnmatch("*", huge_string, 0) fails on glibc even though it should obviously match without even having to inspect huge_string, much less make a 4x-size copy of it.
Unfortunately glibc's fnmatch implementation is just really ugly, and I don't think issues like this can be fixed without piling on ugly hacks, or just replacing the implementation with something saner...
AFAICT there are no defined errors from fnmatch. It's defined as returning 0 for a match, FNM_NOMATCH for no match and another non-zero value for an error. However, no specific error codes are defined.
But what doesn't make sense about the current implementation is if pattern is something like *.csv and the string is <invalid>.csv, then we return an error. To me that seems plainly wrong as we can get a positive match regardless of the invalid character in the string.
As you note, it makes rm -f * fail as will "find" and a variety of other tools which rely upon fnmatch. This actually came to our attention because "find" was missing an obvious match because of an invalid character in a filename.
It's worth noting that gnulib's fnmatch has been fixed to address this problem.
Yes, there's an additional issue with handling out of memory conditions with large strings. We don't have a fix for that and I think it should be tracked as a separate and distinct issue.
Created attachment 6429 [details]
This patch fixes the bug in the test case, but I'm not convinced that it doesn't create other (possibly much worse!) bugs in the process. For instance, the pattern "[á]*" should not match "\xa1.txt", but with fallback to treating everything as bytes, it does match. Basically this patch would make "rm -f pattern" potentially delete lots of files with illegal byte sequences in their names whenever pattern contains a bracket expression with multibyte characters in it. (Note that the effect becomes even more severe with a range expression like [à-á]...)
Yes, the proposed patch may have an issue around pattern "[á]*" should not match "\xa1.txt"; not sure if that's a worse situation or not.
The problem the patch fixes has been reported multiple times over the last 10 years or so without anyone taking action in glibc. gnulib fixed it in effectively the same way as the patch above in 2005 and I don't think anyone's reported problems.
SuSE had the patch attached to this BZ in their distro for a while, but pulled it for reasons unknown.
Was patch above send to libc-alpha?
A proper solution would be use new function called say mbsrtowcs_with_errors.
At illegal sequence it would write special widechar and tried to resumed matching next byte.
It would be slow but fnmatch is not performance-critical anyway, on UTF8 2/3 of time is spend at conversion to wide chars which is not neccessary when needle is ascii only.
A much better solution would be to rewrite fnmatch not to go through a temporary wchar_t string at all. It's actually just as easy, if not easier, to do the matching directly as a multibyte string. It can be done in-place (no additional temporary storage at all) for standard fnmatch, but I'm not clear on whether this can be made efficient for the GNU extensions. I believe it may also be possible to adapt the twoway algorithm to fnmatch so that (at least most) matches take linear time rather than quadratic time, but it's still quite possible that I'm wrong on this (I haven't worked out enough of the details yet).