This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: filenames with characters that have the high bit set


David Byron:
>> > And my ~/.inputrc contains:
>> >
>> > set meta-flag on
>> > set convert-meta off
>> > set input-meta on
>> > set output-meta on
>>
>> Makes plenty of sense. But note that meta-flag is a synonym for
>> input-meta, so you can remove one of them.
>
> I was just following the instructions at
> http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode

I see. FAQ maintainers, can we have the meta-flag removed?

[time passes]

Actually, it appears that bash/readline automatically sets those flags
as shown if the locale is anything but "C". Since the default locale is
"C.UTF-8" and non-ASCII stuff can't be expected to work in the "C"
locale anyway, I think the whole FAQ entry could just be removed.

Similarly, the commented out settings of those flags in
/etc/skel/.inputrc could go.


>> > $ echo $LC_ALL
>> > en_US
>>
>> Hang on, where did that come from?
>
> When my cygwin.bat has set LANG=en_US.UTF-8, I get LANG=en_US.UTF-8 and
> LC_ALL=en_US in bash. When my cygwin.bat doesn't set LANG, I get
> LC_ALL=en_US and LANG isn't set.

So where does LC_ALL get set? In the system-wide environment (in
Computer->Properties->Advanced->Environment Variables)? Or in one of
the bash startup files?

> I unset LC_ALL and...

Where? I'm asking because if it's set to 'en_US' at the point bash is
invoked, but unset afterwards, then bash will be using CP1252 while
programs invoked by it will use UTF-8, which of course is bound to
cause trouble ...


> Now ls foo<tab> adds the actual accented character to the command line, but
> when I press return I get:
>
> ls: cannot access foo<a gray box>: No such file or directory

... like that ...

> when I pipe the error message to od -c, the gray box is octal 351 or 0xE9.
>
> I still get the right answer from test -f, when using the shell builtin.
> /usr/bin/test tells me the file doesn't exist.

... and that.
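
In bytes, the mismatch looks roughly like this. It is only a sketch: it assumes an iconv that knows CP1252, and the function names are made up for illustration. U+00E9 is the single byte 0xE9 (octal 351) in CP1252 but the pair C3 A9 in UTF-8, so a lone 0xE9 handed to a program that expects UTF-8 is not a valid character.

```shell
# U+00E9 written as UTF-8 bytes, using POSIX octal escapes.
utf8_bytes() {
  printf '\303\251' | od -An -tx1 | tr -d ' \n'
}

# The same character converted to CP1252
# (assumption: the local iconv supports CP1252).
cp1252_bytes() {
  printf '\303\251' | iconv -f UTF-8 -t CP1252 | od -An -tx1 | tr -d ' \n'
}

utf8_bytes;   echo    # c3a9
cp1252_bytes; echo    # e9
```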


>> The \x18 scheme is only used for codepoints that can not be
>> represented in the selected character set, yet U+00E9 can be
>> represented in CP1252. By definition, any Unicode codepoint can be
>> represented in UTF-8, so the \x18 scheme is never used when that is
>> selected.
>>
>> To enable C-style backslash interpretation, you need to use
>> $'...' quoting.
>
> I now see the bash man page explains this. Must have missed it the first
> time. The above paragraphs with some examples (where \x18 is needed and
> where it isn't) added to
> http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> would have gotten me farther before posting.

But what I said is explained there already:

"If you don't want or can't use UTF-8 as character set for whatever
reason, you will nevertheless be able to access the file. How does
that work? When Cygwin converts the filename from UTF-16 to your
character set, it recognizes characters which can't be converted. If
that occurs, Cygwin replaces the non-convertible character with a
special character sequence. The sequence starts with an ASCII CAN
character (hex code 0x18, equivalent Control-X), followed by the UTF-8
representation of the character. The result is a filename containing
some ugly looking characters. While it doesn't look nice, it is nice,
because Cygwin knows how to convert this filename back to UTF-16. The
filename will be converted using your usual character set. However,
when Cygwin recognizes an ASCII CAN character, it skips over the ASCII
CAN and handles the following bytes as a UTF-8 character. Thus, the
filename is symmetrically converted back to UTF-16 and you can access
the file."
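
As a rough picture of that byte layout (not Cygwin's actual code), take a hypothetical name "fooé" seen through a character set that lacks é, such as plain ASCII. The function names are invented for this sketch, and the reverse step shown only works this simply because the rest of the name is ASCII.

```shell
# "foo", then CAN (octal 030 = 0x18), then é as UTF-8 (c3 a9).
can_name_bytes() {
  printf 'foo\030\303\251' | od -An -tx1 | tr -d ' \n'
}

# Reverse direction: skip the CAN and read what follows as UTF-8.
decoded_bytes() {
  printf 'foo\030\303\251' | tr -d '\030' | od -An -tx1 | tr -d ' \n'
}

can_name_bytes; echo    # 666f6f18c3a9
decoded_bytes;  echo    # 666f6fc3a9
```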

Best to use UTF-8, though, and forget that you've ever heard about the
^X scheme. You're certainly not expected to have to enter \x18 on the
command line to access non-ASCII filenames.


>> Have a look in your root directory. There should be a file
>> called x18 there.
>
> I don't see anything in my cygwin root (/) but I do see x18 in the root of
> my C drive. Thanks.

Ah yes, '\x18' is interpreted as a DOS path, so you get the root of
your system drive rather than the Cygwin root.
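
To make the quoting difference concrete, here is a small bash sketch (the function names are mine, purely for illustration). Inside plain single quotes, \x18 stays as four literal characters starting with a backslash, which is why it ends up treated as a DOS path. With $'...' quoting, bash produces the single CAN byte.

```shell
# Plain single quotes: backslash, x, 1, 8 are passed through unchanged.
literal_bytes() {
  printf '%s' '\x18' | od -An -tx1 | tr -d ' \n'
}

# ANSI-C quoting (bash): \x18 becomes the one byte 0x18.
ansi_c_bytes() {
  printf '%s' $'\x18' | od -An -tx1 | tr -d ' \n'
}

literal_bytes; echo    # 5c783138
ansi_c_bytes;  echo    # 18
```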


> And finally here are the steps that illustrate what's going on.
>
> $ touch $'\x18'; echo $?
> 0
>
> ls shows a file named up-arrow (0x18):

What do you mean by up-arrow? I'm getting a question mark, because
that's what ls prints for non-printable characters by default. You can
choose various quoting styles using the --quoting-style option.
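
For instance, assuming GNU ls from coreutils, a file whose name is the single CAN byte shows up differently depending on the style. The helper below is a throwaway of my own that works in a temporary directory, not something shipped with any package.

```shell
# List a scratch directory holding one file named CAN (0x18),
# passing the given ls option through. $1 = option for ls.
show_can_name() {
  dir=$(mktemp -d) || return 1
  touch "$dir/$(printf '\030')"
  ls "$1" "$dir"
  rm -r "$dir"
}

show_can_name -q                        # ?
show_can_name --quoting-style=escape    # \030
show_can_name --quoting-style=c         # "\030"
```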

> $ ls<tab>
> ^X
>
> which seems inconsistent.

Yep, but that's a bash vs ls issue rather than a Cygwin one. You'd get
the same on Linux. But if you use control characters in filenames, you'd
better know what you're doing anyway. Some argue that they shouldn't be
allowed in the first place, e.g.
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html


> $ mkshortcut -n shortcut$'\xC3\xA9' plain; echo $?
> $ readshortcut shortcut$'\xE9'

I'm afraid these aren't yet Unicode-ready, i.e. they still use Windows
"ANSI" APIs.

Andy

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

