This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: bug/deficiency in zip: non-ascii chars in file names work, but fail in directory names
- From: Brent <yhbrent at yahoo dot com>
- To: "cygwin at cygwin dot com" <cygwin at cygwin dot com>
- Date: Sun, 2 Nov 2014 05:32:54 -0800
- Subject: Re: bug/deficiency in zip: non-ascii chars in file names work, but fail in directory names
- Authentication-results: sourceware.org; auth=none
- References: <1414818040 dot 70941 dot YahooMailNeo at web122101 dot mail dot ne1 dot yahoo dot com> <1414818837 dot 68078 dot YahooMailNeo at web122102 dot mail dot ne1 dot yahoo dot com> <1414900169 dot 77776 dot YahooMailNeo at web122102 dot mail dot ne1 dot yahoo dot com>
- Reply-to: Brent <yhbrent at yahoo dot com>
Doug Henderson wrote:
"You need to add the -r option to recurse into directories:"
You are 100% correct; my oversight.
Actually, it was a copy and paste error: the real code that I want to test does use -r, but when I tried to adapt that code to a simpler format for my email, I accidentally dropped the -r.
The code that I really want to test fails with a different error, so you solved a mystery that was really bugging me: why the console code in my email behaved differently from the test code I really care about.
I returned to analysing my real test code more carefully, and I still see a problem with cygwin's unzip: it fails to extract zip files with unicode names that are produced by OTHER programs (i.e. some other program besides cygwin zip).
In particular, one part of my test code creates a zip archive using Java (ZipOutputStream and ZipEntry), and then confirms that the archive can be extracted and exactly reproduced by multiple other means.
The first extraction method is to again use Java (ZipFile and ZipEntry); this works perfectly, as it should.
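A minimal sketch of the Java round trip described above, assuming the entry names are written and read as UTF-8. The class name `ZipRoundTrip`, its helper methods, and the all-zero 2048-byte payload are hypothetical stand-ins, not the actual test code; only `ZipOutputStream`, `ZipEntry`, and `ZipFile` come from the description. It checks the entry names only, not the extracted contents.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class; not the original test code.
class ZipRoundTrip {
    // Names from the test, written as escapes so the source file's own
    // encoding cannot alter them.
    static final String DIR = "\u00E5\u00D8\u00E2\u00E9\u00F1";
    static final String FILE = "\u3400\u4E01\u9FA6\uF900\uFA30_file#2_length2048.txt";

    // Writes a single entry to the archive; entry names are encoded as UTF-8.
    static void write(Path zip, String entryName, byte[] payload) throws IOException {
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out, StandardCharsets.UTF_8)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(payload);
            zos.closeEntry();
        }
    }

    // Returns every entry name in the archive, decoding names as UTF-8.
    static List<String> listNames(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile(), StandardCharsets.UTF_8)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path zip = Files.createTempFile("unicode-names", ".zip");
        String entryName = DIR + "/" + FILE;
        write(zip, entryName, new byte[2048]); // assumed payload: 2048 zero bytes
        System.out.println("name survived round trip: " + listNames(zip).contains(entryName));
        Files.delete(zip);
    }
}
```

The archive this writes to a temporary file could then be handed to other extractors to compare the names they produce.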
The second extraction method is to use cygwin's unzip; this fails: IT MANGLES THE NAMES. In particular:
1) the directory should be åØâéñ (\u00E5\u00D8\u00E2\u00E9\u00F1)
2) the file should be 㐀丁龦豈侮_file#2_length2048.txt (first 5 chars \u3400\u4E01\u9FA6\uF900\uFA30)
but what cygwin unzip actually produces during extraction is
1) the directory is +ï++ï+ï+ï
2) the file is ïïïïïïÚïïïÇïï_file#2_length2048.txt
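Mangling of this kind is what typically appears when UTF-8 name bytes are decoded with a single-byte code page. The sketch below illustrates that general effect only. Which code page cygwin unzip actually applied is not known here, so ISO-8859-1 is used purely as a stand-in, and `MojibakeDemo` and `misdecode` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; illustrates misdecoding in general, not unzip's code.
class MojibakeDemo {
    // Encodes the name as UTF-8, then decodes those bytes with the wrong charset.
    static String misdecode(String name, Charset wrong) {
        return new String(name.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        String dir = "\u00E5\u00D8\u00E2\u00E9\u00F1";
        // ISO-8859-1 is an assumed stand-in for whatever code page was in effect.
        String mangled = misdecode(dir, StandardCharsets.ISO_8859_1);

        // Each of the 5 characters takes 2 bytes in UTF-8, so the misdecoded
        // name comes out as 10 characters.
        System.out.println(dir.length() + " chars -> " + mangled.length() + " chars");
        for (char c : mangled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

With a different single-byte code page the resulting characters would differ, but the name would still come out longer than the original and unreadable.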
To rule out Java as being non-standard, I manually took the zip archive it produced and extracted it using the latest 7-zip (9.20), which worked perfectly (the directory and file names came out exact). To further verify, I also temporarily installed the latest WinZip (19.0 build 11293) and once again, it extracted Java's zip file with non-ASCII names perfectly. If anyone wants to verify these claims, I am attaching the zip file produced by Java (and extractable by 7zip and WinZip, but NOT by cygwin unzip) to this email. [UPDATE: my original email yesterday had this attachment, but I do not see it showing up on the mailing list. I take it that cygwin mailing lists auto reject emails with attachments?]
So, I reckon that cygwin unzip is the odd man out.
Oh, when I try to view this zip file using Windows 7's integrated zip viewer in Windows Explorer, it displays mangled directory and file names that differ yet again from what cygwin unzip produced. This link
claims that Windows 7 does not really support unicode names, so this is perhaps expected.
Also, I found that this inter-program incompatibility is limited to cygwin unzip: cygwin zip seems to produce archives with unicode names that other programs can extract just fine.
I did some web research, and the most relevant link that I could find about cygwin unzip and unicode is this old announcement from 2009:
That announcement contains this ominous text:
Currently, on Windows the UTF-8 handling is limited to the character subset
contained in the configured non-unicode "system code page".
Is it possible that the deficiency mentioned above has simply not been fixed in the last 5 years?
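As a rough way to see what "limited to the character subset contained in the system code page" would mean for the two test names, the sketch below asks whether each name can be encoded in a given single-byte charset. ISO-8859-1 is an assumed stand-in for a Western system code page, and `CodePageCheck` and `representable` are hypothetical names. The check says nothing about what cygwin unzip itself does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; only tests encodability, it does not model unzip.
class CodePageCheck {
    // True if every character of the name can be encoded in the given charset.
    static boolean representable(String name, Charset codePage) {
        return codePage.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        String dir = "\u00E5\u00D8\u00E2\u00E9\u00F1";
        String file = "\u3400\u4E01\u9FA6\uF900\uFA30_file#2_length2048.txt";
        Charset assumed = StandardCharsets.ISO_8859_1; // stand-in code page

        // The directory name uses only Latin-1 characters; the file name
        // starts with CJK characters that Latin-1 cannot encode.
        System.out.println("directory representable: " + representable(dir, assumed));
        System.out.println("file representable: " + representable(file, assumed));
    }
}
```

Under this stand-in the directory name is representable and the file name is not, so a code-page limit on its own would not obviously account for the directory name being mangled as well.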