This is the mail archive of the cygwin mailing list for the Cygwin project.

RE: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>>> I propose that the filename encoding in C locale uses UTF-8 instead
>>> of SO/UTF-8. 
>>> There are three reasons:
>> That's an interesting thought.  Do you have a patch and, if so, did
>> you try it?  Does it, for instance, help for the issue reported in
>> the thread starting at
> After examining the issue Lenik reported in the above thread,
> I'm at a loss how to solve this problem in a generic way.

I may be dense, as all of my internationalization experience is from the late
'90s. But in my experience the only solution for this is a conscious effort on
the part of the user (or admin).

> The problem is that the filename changes dependent on the
> character set used in $LANG.  The reason is that every time a
> multibyte filename has to be generated, it has to be
> converted from UTF-16 to multibyte.
> For instance, take one of the filenames from Lenik's
> example.  It's stored on the filesystem as the UTF-16
> sequence \u684c \u9762.  If I set LANG to en_US.UTF-8, a
> readdir(2) call returns the multibyte sequence
>  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
> If I set LANG to en_US.GBK, `ls' returns the filename
>  0xd7 0xc0 0xc3 0xe6
> And in case LANG=C, `ls' returns
>  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
> So, dependent on the character set setting in the
> application, the idea of the filename differs.  That's not
> exactly helpful for interoperability between different applications.
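
To make the three byte sequences quoted above concrete, here is a small
Python sketch that reproduces them from the same two characters. The helper
so_utf8() is a hypothetical name of mine and only mimics the bytes shown for
LANG=C (a 0x0e byte in front of each UTF-8 sequence); it is an assumption
about the output format, not Cygwin's actual conversion code.

```python
# Illustration only: reproduce the byte sequences from the quoted example.
SO = b"\x0e"  # Shift Out, the marker byte seen in the LANG=C output


def so_utf8(name: str) -> bytes:
    """Hypothetical helper: put SO before each non-ASCII character's UTF-8 bytes."""
    out = bytearray()
    for ch in name:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:  # multibyte in UTF-8 means non-ASCII
            out += SO
        out += encoded
    return bytes(out)


name = "\u684c\u9762"  # the UTF-16 sequence \u684c \u9762 from the example

as_utf8 = name.encode("utf-8")  # what LANG=en_US.UTF-8 yields
as_gbk = name.encode("gbk")     # what LANG=en_US.GBK yields
as_c = so_utf8(name)            # what LANG=C yields in the example

for label, raw in (("UTF-8", as_utf8), ("GBK", as_gbk), ("C", as_c)):
    print(label, " ".join(f"0x{b:02x}" for b in raw))
```

The same name comes out as three different byte strings, which is the
interoperability problem being described.
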
> I can think of two potential solutions to fix this problem:
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>     is the way files are stored on disk.  That results in unchangeable
>     filenames which are always valid.
>     But what if an application sets LANG="xxxx.SJIS" and tries to create
>     a file using SJIS character encoding?  Should the file be created
>     using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
>     That's not good.
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>     Cygwin uses the LC_CTYPE setting which corresponds to the current
>     codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,

If nothing is set, use UTF-8, since it will work with existing code.

>     Cygwin uses that to convert pathnames.  If the application uses
>     setlocale, Cygwin uses that setting to convert pathnames.
>     One problem can't be solved this way:  If an application fetches
>     and stores a filename, then switches the locale, and then tries
>     to use the filename in another system call, the filename is
>     potentially broken.

This is the user's problem to resolve.
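
For what it is worth, the locale-switch failure described in (2) is easy to
show outside of Cygwin. The sketch below is only an analogy in Python: it
assumes the application fetched the name as GBK bytes and later interprets
the same bytes as UTF-8. The function name reinterpret() is made up for the
illustration.

```python
# Analogy only: a filename fetched under one charset, reused under another.
name = "\u684c\u9762"

# Bytes the application stored while its locale used GBK.
stored = name.encode("gbk")


def reinterpret(raw: bytes, charset: str) -> str:
    """Decode stored filename bytes the way a given charset would read them."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return raw.decode(charset, errors="replace")


same_locale = reinterpret(stored, "gbk")     # round-trips cleanly
after_switch = reinterpret(stored, "utf-8")  # bytes are not valid UTF-8

print(same_locale == name)   # the name survives if the locale is unchanged
print(after_switch == name)  # the name is broken after the switch
```

Nothing in the stored bytes records which charset produced them, so only the
application (or its user) can know how to read them back.
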

> Any better ideas?

Not necessarily better, but here is a chart:

Sys:    App:    Function expects/returns:
SPEC    NULL    System Locale
NULL    SPEC    Application Locale
C/UA    SPEC    Application Locale
SPEC    SPEC    Application Locale


Sys= System's current locale
App= Application's current locale
NULL= No setting
C/UA= C or any Unicode aware locale
SPEC= Some other locale (i.e. SJIS)
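
Read as a rule, the chart says: the application's locale wins whenever it is
set, otherwise the system locale is used. A minimal Python sketch of that
rule follows. The function name filename_locale() is hypothetical, and the
final fallback to UTF-8 when neither is set is my suggestion from earlier in
this message rather than a row of the chart.

```python
from typing import Optional


def filename_locale(system: Optional[str], application: Optional[str]) -> str:
    """Pick the locale that filename functions should expect/return.

    None stands for NULL (no setting) in the chart.
    """
    if application:   # rows 2-4: any application setting takes priority
        return application
    if system:        # row 1: only the system locale is set
        return system
    return "UTF-8"    # assumption: nothing set at all, fall back to UTF-8


print(filename_locale("ja_JP.SJIS", None))           # System Locale
print(filename_locale(None, "ja_JP.SJIS"))           # Application Locale
print(filename_locale("en_US.UTF-8", "ja_JP.SJIS"))  # Application Locale
print(filename_locale("zh_CN.GBK", "ja_JP.SJIS"))    # Application Locale
```
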


-                                                               -
- Jason Pyeron                      PD Inc.                     -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
This message is copyright PD Inc, subject to license 20080407P00.
