[1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
Wed May 13 15:39:00 GMT 2009
Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>>> I propose that the filename encoding in C locale uses UTF-8 instead
>>> of SO/UTF-8.
>>> There are three reasons:
>> That's an interesting thought. Do you have a patch and, if so, did
>> you try it? Does it, for instance, help for the issue reported in
>> the thread starting at
> After examining the issue Lenik reported in the above thread,
> I'm at a loss how to solve this problem in a generic way.
I may be dense, as all of my internationalization experience is from the late
1990s. But in my experience, the only solution for this is a conscious effort on
the part of the user (or admin).
> The problem is that the filename changes dependent on the
> character set used in $LANG. The reason is that every time a
> multibyte filename has to be generated, it has to be
> converted from UTF-16 to multibyte.
> For instance, taking one of the filenames from Lenik's
> example. It's stored on the filesystem as the UTF-16
> sequence \u684c \u9762. If I set LANG to en_US.UTF-8, a
> readdir(2) call returns the multibyte sequence
> 0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
> If I set LANG to en_US.GBK, `ls' returns the filename
> 0xd7 0xc0 0xc3 0xe6
> And in case LANG=C, `ls' returns
> 0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
> So, dependent on the character set setting in the
> application, the idea of the filename differs. That's not
> exactly helpful for interoperability between different applications.
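The three byte sequences quoted above can be reproduced with a short sketch. This is only an illustration in Python, not Cygwin code. The helper `so_utf8` is hypothetical: it assumes, based solely on the LANG=C dump above, that each non-ASCII character's UTF-8 bytes are prefixed with SO (0x0e).

```python
# Code points taken from the message: U+684C U+9762.
name = "\u684c\u9762"


def so_utf8(text: str) -> bytes:
    """Hypothetical helper: prefix each non-ASCII character's UTF-8
    bytes with SO (0x0e), mirroring the LANG=C dump in the message."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:  # non-ASCII character
            out.append(0x0E)
        out += encoded
    return bytes(out)


def hexdump(data: bytes) -> str:
    """Format bytes the way the message prints them."""
    return " ".join(f"0x{b:02x}" for b in data)


utf8_bytes = name.encode("utf-8")  # what LANG=en_US.UTF-8 yields
gbk_bytes = name.encode("gbk")     # what LANG=en_US.GBK yields
so_bytes = so_utf8(name)           # what LANG=C yields (assumed scheme)

for label, data in (("UTF-8", utf8_bytes), ("GBK", gbk_bytes), ("C", so_bytes)):
    print(f"{label:6s}{hexdump(data)}")
```

The same two UTF-16 code units produce three different byte strings, which is the interoperability problem described above.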
> I can think of two potential solutions to fix this problem:
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
> is the way files are stored on disk. That results in unchangeable
> filenames which are always valid.
> But what if an application sets LANG="xxxx.SJIS" and tries to create
> a file using SJIS character encoding? Should the file be created
> using the SJIS->UTF-16 conversion or should open fail with
> EILSEQ? That's not good.
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
> Cygwin uses the LC_CTYPE setting which corresponds to the current
> codepage. If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
If nothing is set, use UTF-8, as it will work with existing code.
> Cygwin uses that to convert pathnames. If the application uses
> setlocale, Cygwin uses that setting to convert pathnames.
> One problem can't be solved this way: If an application fetches
> and stores a filename, then switches the locale, and then tries
> to use the filename in another system call, the filename is
> potentially broken.
This is the user's problem to resolve.
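To make the failure mode concrete, here is a rough sketch of what "potentially broken" means. It uses Python codecs as a stand-in for the locale-dependent conversion and is not Cygwin's actual code path; the variable names are illustrative only.

```python
# The filename from the example, as stored on disk (U+684C U+9762).
on_disk = "\u684c\u9762"

# The application fetches the name while its locale charset is GBK ...
stale = on_disk.encode("gbk")

# ... then switches to a UTF-8 locale and hands the same bytes back.
try:
    stale.decode("utf-8")
    reinterpreted_ok = True
except UnicodeDecodeError:
    # 0xd7 announces a two-byte sequence, but 0xc0 is not a continuation byte.
    reinterpreted_ok = False

# The opposite direction does not need to fail to be wrong: UTF-8 bytes
# read as GBK name some other string (or contain replacement characters).
misread = on_disk.encode("utf-8").decode("gbk", errors="replace")

print("GBK bytes valid as UTF-8:", reinterpreted_ok)
print("UTF-8 bytes read as GBK match the original:", misread == on_disk)
```

Either way, the application has to convert the stored name itself when it switches locales.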
> Any better ideas?
Not necessarily better, but here is a chart:
Sys     App     Function expects/returns
NULL    NULL    UTF-8
C/UA    NULL    UTF-8
NULL    C/UA    UTF-8
C/UA    C/UA    UTF-8
SPEC    NULL    System Locale
SPEC    C/UA    UTF-8
NULL    SPEC    Application Locale
C/UA    SPEC    Application Locale
SPEC    SPEC    Application Locale
Sys  = System's current locale
App  = Application's current locale
NULL = No setting
C/UA = C or any Unicode-aware locale
SPEC = Some other locale (e.g. SJIS)
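The chart boils down to two rules, which could be sketched as follows. This is only an illustration. The function names and the way locale strings are classified are my assumptions: "POSIX" is treated like "C", only a UTF-8 charset counts as Unicode aware, and a locale without a charset suffix counts as SPEC.

```python
from typing import Optional


def classify(locale: Optional[str]) -> str:
    """Map a locale string to NULL, C/UA or SPEC (hypothetical rules)."""
    if not locale:
        return "NULL"
    if locale in ("C", "POSIX"):  # assumption: POSIX behaves like C
        return "C/UA"
    # Take the charset between "." and an optional "@modifier".
    charset = locale.partition(".")[2].partition("@")[0]
    if charset.upper().replace("-", "") == "UTF8":
        return "C/UA"
    return "SPEC"


def filename_encoding(sys_locale: Optional[str], app_locale: Optional[str]) -> str:
    """Return which encoding the chart assigns to filename arguments."""
    sys_kind = classify(sys_locale)
    app_kind = classify(app_locale)
    if app_kind == "SPEC":
        return "Application Locale"  # an explicit application choice wins
    if app_kind == "NULL" and sys_kind == "SPEC":
        return "System Locale"       # fall back to the system's charset
    return "UTF-8"                   # every remaining row of the chart


print(filename_encoding(None, None))
print(filename_encoding("ja_JP.SJIS", None))
print(filename_encoding("ja_JP.SJIS", "en_US.UTF-8"))
print(filename_encoding("en_US.UTF-8", "ja_JP.SJIS"))
```

In other words, UTF-8 is the default, and a specific charset only takes effect when the application asks for it or when the application says nothing and the system has one.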
- Jason Pyeron PD Inc. http://www.pdinc.us -
- Principal Consultant 10 West 24th Street #100 -
- +1 (443) 269-1555 x333 Baltimore, Maryland 21218 -
This message is copyright PD Inc, subject to license 20080407P00.