readdir() returns inaccessible name if file was created with invalid UTF-8

Christian Franke Christian.Franke@t-online.de
Fri Jun 27 13:32:53 GMT 2025


Hi Corinna,

Corinna Vinschen via Cygwin wrote:
> Hi Christian,
>
> On Jun 26 19:07, Christian Franke via Cygwin wrote:
>> Corinna Vinschen via Cygwin wrote:
>>> On Jun 25 16:59, Christian Franke via Cygwin wrote:
>>>> On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>> does not refuse to create the file. Later readdir() returns a different
>>>>> name which could not be used to access the file.
>>>>>
>>>>> Testcase with U+1F321 (Thermometer):
>>>>>
>>>>> $ uname -r
>>>>> 3.5.4-1.x86_64
>>>>>
>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>    f0 9f 8c a1
>>>>>
>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>
>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>
>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>
>>>>> $ ls -1
>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>> ls: cannot access 'file3-': No such file or directory
>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>> file2-.?ext
>>>>> file3-
>>>>> [...]
>>> I don't know exactly where this happens, but the input of the
>>> conversion is invalid UTF-8 because it's missing the 4th byte.
>>> There's no way to represent these filenames on Windows
>>> filesystems storing filenames as UTF-16 values.
>>>
>>> So the problem here is that the conversion somehow misses that
>>> the 4th byte is invalid and just plods forward and converts the
>>> leading three bytes into the matching high surrogate value and
>>> then stumbles over the conversion for the low surrogate.
>>>
>>> It would be really helpful to have an STC for this problem.
>> With some trial and error I found a testcase for this more serious problem
>> reported yesterday but not quoted above:
>>
>>>> In cases like file3-... above, the converted Windows path ends with
>>>> 0xF000. This suggests that this is an accidental conversion of the
>>>> terminating null to the 0xF0xx range.
>>>>
>>>> In some cases, the created Windows file name has random garbage
>>>> behind the 0xF000. Then even Cygwin is not able to access or unlink
>>>> the file after creation.
>> Testcase (attached):
> Thanks for the testcase!
>
> I found the problem in the newlib core function creating wchar_t from
> UTF-8 input.  In case of 4 byte UTF-8 sequences, the code created the
> low surrogate already after reading byte 3, without checking if byte 4
> of the UTF-8 sequence is a valid byte. Hilarity ensues.
>
> Fortunately this bug has only been introduced very recently, to wit, on
> 2009-03-24, a mere 16 years ago.  And it is my bug and mine alone :}
>
> I'm just prep'ing a fix which I'll push in a minute or two.

This fixes the problem demonstrated by the testcase, thanks.
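
For reference, the check described above (validate byte 4 before emitting any surrogate) can be sketched as follows. This is only an illustration in Python, not the newlib code; the function name utf8_4byte_to_utf16 is made up for this example.

```python
def utf8_4byte_to_utf16(seq):
    """Decode one 4-byte UTF-8 sequence into a (high, low) surrogate pair.

    Hypothetical helper for illustration; raises ValueError on
    truncated or malformed input instead of emitting a partial result.
    """
    if len(seq) != 4:
        raise ValueError("truncated 4-byte sequence")
    b0, b1, b2, b3 = seq
    if not 0xF0 <= b0 <= 0xF4:
        raise ValueError("not a 4-byte lead byte")
    # Check every continuation byte, including the 4th,
    # before any surrogate is produced.
    for b in (b1, b2, b3):
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
    cp = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) \
        | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("overlong or out-of-range code point")
    cp -= 0x10000
    return 0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)


if __name__ == "__main__":
    high, low = utf8_4byte_to_utf16(b"\xf0\x9f\x8c\xa1")  # U+1F321
    print(f"{high:04X} {low:04X}")
```

With the sequences from the first post, the complete f0 9f 8c a1 yields a surrogate pair, while the truncated f0 9f 8c and f0 9f 8c followed by '.' are both rejected.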


The original problem reported last year in the very first post of this 
thread still persists:

Example:

$ uname -r
3.7.0-dev-163-g5c8475417bc3.x86_64

$ mkdir test.tmp

$ cd test.tmp

$ touch $'t-\xef\x80\x80'

$ ls
ls: cannot access 't-': No such file or directory
t-

$ touch t-

$ ls -1
t-
t-

$ rm t-

$ ls
ls: cannot access 't-': No such file or directory
t-

$ cd ..

$ rm -rf test.tmp
rm: cannot remove 'test.tmp': Directory not empty

$ rm test.tmp/$'t-\xef\x80\x80'

$ rmdir test.tmp


The name mapping is:
"t-\xEF\x80\x80" -(open, ...)-> L"t-\xDB59" -(readdir)-> "t-"

This is possibly difficult to fix unless the creation of such files is rejected.
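
A note on the bytes used in this example: unlike the truncated f0 9f 8c from the first post, ef 80 80 is a well-formed UTF-8 encoding of U+F000, a private-use code point. It falls into the 0xF0xx range mentioned in the quoted text above; whether that is what triggers the mapping shown here is an assumption on my part. A quick check, sketched in Python (the helper name last_code_point is made up, and nothing here is Cygwin code):

```python
import unicodedata


def last_code_point(name):
    """Strictly decode a UTF-8 file name and describe its last character.

    Hypothetical helper for illustration; bytes.decode() raises
    UnicodeDecodeError if the name is not well-formed UTF-8.
    """
    decoded = name.decode("utf-8")
    last = decoded[-1]
    return ord(last), unicodedata.category(last)


if __name__ == "__main__":
    cp, cat = last_code_point(b"t-\xef\x80\x80")
    print(f"U+{cp:04X} {cat}")  # category "Co" means private use

    try:
        last_code_point(b"file3-\xf0\x9f\x8c")
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

So a strict UTF-8 validity check alone would reject file2/file3 from the first post, but not this name.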

-- 
Thanks,
Christian


