Cygwin fails to utilize Unicode replacement character
Steven Penny
svnpenn@gmail.com
Tue Sep 4 21:05:00 GMT 2018
On Tue, 4 Sep 2018 13:59:10, Doug Henderson wrote:
> My preference is to remove the output fiddling code that Corinna has
> been working on. It is trying to solve the wrong problem.
> I think we have gone down a rabbit hole at the wrong end of cat's data flow.
This has nothing to do with "cat". It has to do with the unfounded design
decision to use U+2592. Granted, at this point we are bikeshedding, but an
official standard does exist, namely Unicode, with 2 applicable characters for
this use case:
1. U+FFFD: http://unicode.org/charts/nameslist/n_FFF0.html
2. U+25A1: http://unicode.org/charts/nameslist/n_25A0.html
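For anyone who wants to check which characters these code points are, here is
a minimal sketch using Python's bundled Unicode database. The helper name
"describe" is made up for illustration; U+2592 is included only for comparison
with the two candidates above.

```python
import unicodedata

# The two candidates listed above, plus U+2592 for comparison.
CANDIDATES = ["\uFFFD", "\u25A1", "\u2592"]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

if __name__ == "__main__":
    for ch in CANDIDATES:
        print(describe(ch))
```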
> Should any changes to the way a character is displayed be required, it
> needs to be in the terminal program that displays the character, not in
> cygwin which should pass the character along unmodified.
The "terminal" in this case is either "cygwin" or "xterm". In both cases code
changes have already been made in response to this thread, so I don't think
your comment here holds weight.
> Both cygwin and Debian 9.5 show:
>
> $ file alfa.txt
> alfa.txt: ISO-8859 text
>
> When Linux reads the file, it assumes the encoding is UTF-8.
> When cygwin reads the file, it assumes the encoding is CP1252.
> This command shows the problem
>
> $ iconv -f utf8 alfa.txt
> iconv: alfa.txt:1:0: incomplete character or shift sequence
>
> On Linux, this shows a slightly different message, with the same intent.
>
> Try using this string:
>
> $ printf "\xC3\xAB\353\n"
> ë▒
>
> to get a better understanding of the problem. It contains two
> representations of LATIN SMALL LETTER E WITH DIAERESIS, first encoded
> in UTF-8, then using ISO-8859-1.
Now it appears *you* are going down the rabbit hole. Both Cygwin and Mintty were
in violation of the Unicode standard. However, this has already been remedied in
the code.
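To make the standard behavior concrete, here is a rough sketch of how a
conforming decoder treats the bytes from the quoted printf string. It uses
Python's built-in UTF-8 codec, not Cygwin or Mintty code, and the names "RAW",
"is_valid_utf8" and "decode_lenient" are made up for illustration.

```python
# Same bytes as the quoted command: printf "\xC3\xAB\353\n" (octal 353 == 0xEB).
RAW = b"\xC3\xAB\xEB\n"

def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for each invalid sequence."""
    return data.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(is_valid_utf8(RAW))            # the lone 0xEB byte is not valid UTF-8
    print(ascii(decode_lenient(RAW)))    # escaped form of the lenient decode
    print(ascii(RAW.decode("latin-1")))  # the same bytes read as ISO-8859-1
```

The valid pair C3 AB decodes to U+00EB, and the stray EB byte becomes U+FFFD
rather than U+2592.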
> There are two different reasons for the MEDIUM SHADE. Here it
> indicates an invalid UTF-8 character, and the font does not have a
> glyph for REPLACEMENT CHARACTER. The MEDIUM SHADE is also used in
> place of an ordinary character without a glyph in the font.
This is flat wrong. U+2592 MEDIUM SHADE is *only* used in cases of invalid
UTF-8. When a character has no glyph in the font, the ".notdef" glyph is used
instead, as has been discussed several times in this thread. That is not an
actual character, so I cannot paste it here, but as an example, with
"DejaVu Sans Mono" the glyph is an empty rectangle.