This is the mail archive of the
mailing list for the Cygwin project.
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
On 13.05.2009 at 16:29, Corinna Vinschen wrote:
On May 12 19:37, Corinna Vinschen wrote:
On May 13 02:29, IWAMURO Motonori wrote:
> I propose that the filename encoding in C locale uses UTF-8 instead
> of SO/UTF-8.
> There are three reasons:
That's an interesting thought. Do you have a patch and, if so, did you
try it? Does it, for instance, help for the issue reported in the
thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
After examining the issue Lenik reported in the above thread, I'm at
a loss how to solve this problem in a generic way.
The problem is that the filename changes depending on the character
set used in $LANG. The reason is that every time a multibyte filename
has to be generated, it has to be converted from UTF-16 to multibyte.
For instance, take one of the filenames from Lenik's example. It's
stored on the filesystem as the UTF-16 sequence \u684c \u9762. If I set
LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence
0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
If I set LANG to en_US.GBK, `ls' returns the filename
0xd7 0xc0 0xc3 0xe6
And in case LANG=C, `ls' returns
0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
So, depending on the character set setting in the application, the idea
of the filename differs. That's not exactly helpful for interoperability
between different applications.
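
The three byte sequences above can be reproduced with a short Python
sketch. This is only an illustration, not Cygwin's conversion code:
`so_utf8` and `hexbytes` are hypothetical helpers, and `so_utf8` simply
assumes the pattern visible in the LANG=C output above, namely a 0x0e
byte in front of each non-ASCII character's UTF-8 bytes.

```python
# The name as stored on disk, written as UTF-16 code units.
name = "\u684c\u9762"


def so_utf8(text):
    """Hypothetical helper: prefix each non-ASCII character's UTF-8
    bytes with 0x0e, mirroring the LANG=C byte sequence quoted above."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:      # non-ASCII characters take 2+ bytes
            out.append(0x0E)
        out += encoded
    return bytes(out)


def hexbytes(data):
    """Format bytes the same way they are written in this message."""
    return " ".join(f"0x{b:02x}" for b in data)


utf8_name = name.encode("utf-8")  # LANG=en_US.UTF-8
gbk_name = name.encode("gbk")     # LANG=en_US.GBK
c_name = so_utf8(name)            # LANG=C (assumed scheme, see above)

for label, data in (("UTF-8", utf8_name), ("GBK", gbk_name), ("C", c_name)):
    print(f"{label:6} {hexbytes(data)}")
```

All three results name the same file on disk, yet no two of them compare
equal as byte strings.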
I can think of two potential solutions to fix this problem:
(1) Always return filenames in UTF-8 encoding and pretend that UTF-8
is the way files are stored on disk. That results in unchangeable
filenames which are always valid.
But what if an application sets LANG="xxxx.SJIS" and tries to create
a file using SJIS character encoding? Should the file be created
using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
That's not good.
Why would it have to be interpreted at all? Aren't filenames just opaque
strings - with exceptions, say, for / and NUL to UNIX kernels?
(2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
Cygwin uses the LC_CTYPE setting which corresponds to the current
codepage. If one of $LC_ALL/$LC_CTYPE/$LANG is set in the
environment, Cygwin uses that to convert pathnames. If the application uses
setlocale, Cygwin uses that setting to convert pathnames.
One problem can't be solved this way: If an application fetches
and stores a filename, then switches the locale, and then tries
to use the filename in another system call, the filename is
Any better ideas?
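
The locale-switch problem described under (2) can be illustrated with a
small Python sketch. It is not how Cygwin converts pathnames;
`to_on_disk_name` is a hypothetical stand-in for the multibyte to UTF-16
step that a later system call would have to perform, and the charsets
are chosen only for the example.

```python
# The name as stored on disk, written as UTF-16 code units.
stored = "\u684c\u9762"

# What the application would receive while running under a GBK locale.
fetched = stored.encode("gbk")


def to_on_disk_name(raw, charset):
    """Hypothetical stand-in for the multibyte -> UTF-16 conversion.

    Returns None when the bytes are not valid in the given charset,
    which is where a real system call would have to report an error.
    """
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


same_locale = to_on_disk_name(fetched, "gbk")     # locale unchanged
after_switch = to_on_disk_name(fetched, "utf-8")  # switched to UTF-8
as_latin1 = to_on_disk_name(fetched, "latin-1")   # switched to Latin-1

print(same_locale == stored)   # the stored name is recovered
print(after_switch)            # bytes are not valid UTF-8
print(as_latin1 == stored)     # decodes, but to a different name
```

The two failure modes differ: under UTF-8 the saved bytes cannot be
converted at all, while under a single-byte charset they convert cleanly
to a name that does not exist on disk.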
Just questions to kindle some brainstorming:
- why do you need to touch the filename at all? I haven't read all of it.
Is the UTF-16 on disk and we need to work around UTF-16 being intractable
as C string?
- some applications in the GNOME ballpark, for instance Gnumeric, do
something like "treat as Unicode" and fall back to
SOME_ENVIRONMENT_VARIABLE specified encoding (perhaps as a colon-separated
list - not sure)
- adding to my interspersed comment above: isn't the issue more about
*presentation* of filenames to the user than internal workings? To me the
main issue appears to be that filenames should look alike in a Cygwin
application and in a native Windows application. I'd assume that
applications can get really confused if you change file names behind their
backs.
- if you speak of UTF-8, do you want to normalize file names? (I'd think
you do.) Which normalization form will you choose? NFC (canonical) or NFD
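
To make the normalization question concrete, here is a minimal Python
sketch using the standard `unicodedata` module. The name "café" is just
an assumed example, not one taken from the thread.

```python
import unicodedata

# Example name containing a character with a canonical decomposition.
name = "caf\u00e9"

nfc = unicodedata.normalize("NFC", name)  # composed: U+00E9
nfd = unicodedata.normalize("NFD", name)  # decomposed: U+0065 U+0301

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes.hex(" "))
print(nfd_bytes.hex(" "))

# The CJK name from Lenik's example has no canonical decomposition,
# so both forms leave it unchanged.
cjk = "\u684c\u9762"
cjk_unchanged = (unicodedata.normalize("NFC", cjk) == cjk
                 and unicodedata.normalize("NFD", cjk) == cjk)
print(cjk_unchanged)
```

Both forms display identically, but their UTF-8 bytes differ, so without
a fixed normalization form two visually identical names can refer to
different files.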