Cygwin, Unicode and "long" path names

Sat Jun 26 05:33:12 GMT 2021

On Fri, 25 Jun 2021 at 19:55, Vadim <vad@syping.de> wrote:
>
> Ah, this beautiful topic. Windows 7 x64.
>
> This is the summary written as post-scriptum, tests and findings below:
>
> 1) Cygwin limits individual names to 255 bytes, Windows seems to follow
> UTF-16 chars and work fine: 256 bytes in 108 characters works.
>
> Basically, this becomes a bytes vs characters story.
>
> 2) Bash file name auto-expansion detects the file of that name, but it
> gets truncated to 255 bytes. find's behaviour is the same ("No such file
> or directory" due to trying to access a non-existing truncated name)
>
> 2.1) If you try to correct the above mistake by adding truncated
> characters, then the program (cat) will complain about "File name too long"
>
> 2.2) If there exists a folder with a 255-byte name, equal to the
> truncated name, then "find ." will do a listing on that folder twice
> (effectively hiding the long-named folder from tools without leaving an
> error message)
>
> 3) UNC Paths get the same treatment: File name too long.
>
> I expected Cygwin to handle these names without problems just like
> Windows, Explorer, cmd etc. do. Is this particular problem new or known?
> All I could find on the mailing list is around the time when Cygwin
> hadn't yet implemented Unicode support (UTF-8?), ~2004-2008.
>
> These names were created by youtube-dl.exe executed from within Cygwin.
>
> - Vadim

I believe this is the result of the difference between Pascal-style
strings, which have a length byte followed by data bytes, and C-style
strings, which have data bytes followed by a zero byte or, worse, in
the case of two-byte characters, data words followed by a zero word.

For single-byte characters, both the Pascal and C styles need 256
bytes to hold a 255-character name. Applying the 255 length limit
without accounting for the trailing zero byte could account for some
of the observed problems.

More likely, the problem relates to double-byte character sets. For
double-byte characters, 255 bytes of UTF-16 characters, or more likely
255 bytes of MBCS (multi-byte character set) or DBCS (double-byte
character set), can encode to more or less than 255 UTF-8 bytes,
depending on the average bytes per character of the UTF-8 encoding.
This could account for the failure to handle all bytes of the NTFS
file name when converted to UTF-8. Converted Linux programs may fail
to allocate a large enough encoding buffer, leading to the observed
truncation. The same applies to 510 bytes containing 255 words of DBCS
characters.
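
To make the bytes-versus-characters point concrete, here is a minimal
Python sketch. It only measures one name in different units; it does
not reproduce what Cygwin actually does internally. The helper names
and the 255-byte constant are assumptions chosen for illustration.

```python
# Hypothetical per-component limit, in bytes, used only for illustration.
NAME_MAX_BYTES = 255


def name_lengths(name: str) -> dict:
    """Return the size of one file name component in several units."""
    return {
        "code_points": len(name),
        # UTF-16 code units: two bytes each in the little-endian encoding.
        "utf16_units": len(name.encode("utf-16-le")) // 2,
        "utf8_bytes": len(name.encode("utf-8")),
    }


def fits_byte_limit(name: str, limit: int = NAME_MAX_BYTES) -> bool:
    """True if the UTF-8 form of the name fits in `limit` bytes."""
    return len(name.encode("utf-8")) <= limit


# 128 Cyrillic letters: 128 UTF-16 code units, but 256 UTF-8 bytes.
cyrillic = "\u0436" * 128
print(name_lengths(cyrillic))
print(fits_byte_limit(cyrillic))
```

A name like this is well under 255 UTF-16 code units, yet its UTF-8
form is one byte over a 255-byte limit, which matches the kind of
mismatch described above.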

youtube-dl.exe is basically a Windows Python 3 program with
C extensions. Python 3 properly handles Unicode and the encoding and
decoding of the aforementioned character encodings.

I would look for library functions which decode NTFS file names into
UTF-8 names, verify their correctness, and follow the path of their
output through the system. I think this means that the Windows
255-byte limit cannot be used at all in any Cygwin program that
handles international file names. Unfortunately, that sounds like a
lot of work. In theory, if all 255 characters in the file name
component required 4-byte UTF-8 encodings, this would require about
1020 bytes. However, this does not even touch on emojis, where a
one-character emoji can expand to as much as 35 or so bytes! That
basically means the end of static allocation for file and directory
names and name component buffers. That may be a major job in the
Cygwin kernel, not to mention all the available packages!
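
The arithmetic can be checked with a short Python sketch. It assumes
the limit is counted in code points, as in the paragraph above; the
helper names are made up for this example, and the emoji is just one
sample sequence, not a worst case.

```python
def utf8_len(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of s."""
    return len(s.encode("utf-16-le")) // 2


# One code point outside the BMP: 4 UTF-8 bytes, 2 UTF-16 code units.
four_byte = "\U0001F600"
print(utf8_len(four_byte * 255))  # 255 * 4 = 1020

# A "family" emoji: four emoji code points joined by three
# zero-width joiners (U+200D). It displays as a single character.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family), utf16_units(family), utf8_len(family))  # 7 11 25
```

So a name component that a user sees as one character can already take
25 UTF-8 bytes, and a fixed-size buffer sized in "characters" has no
obvious safe upper bound.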

HTH
Doug

-- 
Doug Henderson, Calgary, Alberta, Canada - from gmail.com