Unconsistent command-line parsing in case of UTF-8 quoted arguments

Kaz Kylheku (Cygwin) 743-406-3965@kylheku.com
Tue Oct 13 16:30:37 GMT 2020


On 2020-10-06 14:36, Jérôme Froissart wrote:
> Here is an example C file
>     $ cat example.c
>     #include <stdio.h>
> 
>     const char *GetCommandLineA(void);
> 
>     int main(int argc, char *argv[])
>     {
>         const char *s = GetCommandLineA();
>         printf("C=%s\n", s);
> 
>         for (int i = 0; argc > i; i++)
>             printf("%d=%s\n", i, argv[i]);
> 
>         return 0;
>     }

Your program's comparison seems to be based on the
hypothesis that Cygwin parses the GetCommandLineA() command line.

But this hypothesis is almost certainly wrong.

> Now, let's start a Windows shell (cmd.exe)
> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows
> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"

The "A" command line from GetCommandLineA has "tofu"
characters: é and ô were not decoded properly.

The é and ô characters we see in the Cygwin-parsed
arguments coming into main could not have been recovered
from these "tofu" replacement characters.

What is actually being parsed must be the WCHAR command line
corresponding to what comes from GetCommandLineW().

It's necessary to show that one to get a more complete understanding.



More information about the Cygwin mailing list