readdir() returns inaccessible name if file was created with invalid UTF-8

Christian Franke Christian.Franke@t-online.de
Thu Sep 19 13:27:29 GMT 2024


Mark Liam Brown via Cygwin wrote:
> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
> <cygwin@cygwin.com> wrote:
>> Christian Franke via Cygwin wrote:
>>> Thomas Wolff via Cygwin wrote:
>>>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
>>>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>>> does not refuse to create the file. Later readdir() returns a
>>>>>> different name which could not be used to access the file.
>>>>>>
>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>
>>>>>> $ uname -r
>>>>>> 3.5.4-1.x86_64
>>>>>>
>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>>   f0 9f 8c a1
>>>>>>
>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>
>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>
>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>
>>>>>> $ ls -1
>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>> file2-.?ext
>>>>>> file3-
>>>>> I don't reproduce this.
>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>> occur then.
>>>
>>>
>>>>> While the file name gets mangled, all resulting file names are valid
>>>>> and
>>>>> listed:
>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
>>>>> In file3 the same sequence is just dropped.
>>>>> $ ls -1|cat
>>>>> file1-🌡.ext
>>>>> file2-.ឳext
>>>>> file3-
>>>>>
>>>>> However, ls file2* fails, as does ls *.
>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>> internally.
>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>> 'rm' using the original names works for file2-..., but not for file3-...
>>>
>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>> removed 'file2-'$'\360\237\214''.ext'
>>>
>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>
>> Further tests suggest that the problem only occurs with:
>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>> 'high surrogate' range (0xD800..0xDBFF).
> Makes perfect sense, the Windows kernel uses UTF16 internally.


Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> 
UTF-16 mappings. This makes no sense:

$ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS

$ strace ls -F
...
... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
...
  ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
...
ls: cannot access 'file-?.ext': No such file or directory
file-?.ext

$ rm -v 'file-'$'\xed\xa0\x80''.ext'
removed 'file-'$'\355\240\200''.ext'

The UTF-8 sequence returned by readdir() decodes to U+27B3 
(White-Feathered Rightwards Arrow).
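
For anyone who wants to double-check the decoding, here is a small Python
sketch (not part of the original test, variable names are made up). It
decodes the bytes from the strace line above and shows that the bytes given
to touch are rejected by a strict UTF-8 decoder but would encode the code
unit 0xD800 if surrogates were allowed:

```python
import unicodedata

# Bytes that readdir() returned in place of the original sequence,
# taken from the strace line above.
returned = b"\xe2\x9e\xb3"
ch = returned.decode("utf-8")
name = unicodedata.name(ch)
print(f"U+{ord(ch):04X} {name}")

# Bytes that were passed to touch. A strict UTF-8 decoder rejects them
# because they encode a code point in the surrogate range.
created = b"\xed\xa0\x80"
try:
    created.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogatepass' is a standard Python error handler that lets the
# surrogate through, so the encoded code unit becomes visible.
unit = created.decode("utf-8", errors="surrogatepass")
print(f"strict decode accepted: {strict_ok}, code unit: 0x{ord(unit):04X}")
```

So the name that comes back from readdir() is a valid but unrelated
character, which is why the later stat() cannot find the file.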


This could be fixed by handling UTF-8 encodings of the surrogate range like 
other invalid sequences: map each invalid byte to the Unicode range U+F080 
to U+F0FF. This already works as expected if the above UTF-8 sequence is truncated:

$ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS

$ ls -F
'file-'$'\355\240''.ext'
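
To illustrate the idea, here is a minimal Python sketch of such a mapping.
It is not Cygwin's code and only models the behaviour suggested above: every
byte that is not part of a well-formed UTF-8 sequence (including the
surrogate encodings and truncated sequences) becomes the UTF-16 code unit
0xF000 + byte, and the reverse direction turns 0xF080..0xF0FF back into the
raw byte. The error handler name "f0xx-sketch" and both helper functions are
hypothetical. The sketch ignores that a name could legitimately contain a
character from U+F080..U+F0FF, which would be ambiguous on the way back.

```python
import codecs


def _to_private_use(exc):
    """Error handler: map each undecodable byte to U+F000 + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF000 + b) for b in bad), exc.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("f0xx-sketch", _to_private_use)


def to_utf16_units(name: bytes) -> list:
    """Forward mapping: UTF-8 file name bytes -> UTF-16 code units."""
    text = name.decode("utf-8", errors="f0xx-sketch")
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def from_utf16_units(units: list) -> bytes:
    """Reverse mapping: UTF-16 code units -> original file name bytes."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    out = bytearray()
    for ch in raw.decode("utf-16-le"):
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp - 0xF000)      # restore the raw invalid byte
        else:
            out += ch.encode("utf-8")    # ordinary character
    return bytes(out)


names = [
    b"file1-\xf0\x9f\x8c\xa1.ext",  # valid U+1F321
    b"file2-\xf0\x9f\x8c.ext",      # truncated 4 byte sequence
    b"file3-\xf0\x9f\x8c",          # truncated 4 byte sequence at the end
    b"file-\xed\xa0\x80.ext",       # encodes high surrogate 0xD800
    b"file-\xed\xa0.ext",           # truncated surrogate encoding
]
for n in names:
    units = to_utf16_units(n)
    print(n, [f"{u:04X}" for u in units], from_utf16_units(units) == n)
```

With a mapping like this, every name used in this thread survives the round
trip unchanged, so readdir() would return a name that stat() and unlink()
can resolve again.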

-- 
Regards,
Christian


