iconv and combining characters
Chris Heath
chris@heathens.co.nz
Thu Jan 22 13:45:00 GMT 2004
Bruno,
> Hi Chris,
>
> > I noticed that iconv isn't able to convert UTF-8 containing combining
> > characters into Latin1. I really think that iconv should be able to do this.
>
> Why? The preferred way of exchange of Unicode strings is in normalization
> form C, see [1], [2].
I agree with that. But will everyone follow that rule? Sooner or later,
you will come across a file or string that is not in NFC, and I think it
would be very useful if iconv could handle it.
I guess I like to live by the motto "be lenient in what you accept,
conservative with what you generate". In other words, I would want iconv
to handle any unnormalized Unicode, but would expect it to generate only
NFC.
Since non-NFC Unicode is rather uncommon, another less intrusive
approach may be better: use a separate codeset name for Unicode that may
be non-NFC. Something like:
    iconv -f UTF-8-UNNORMALIZED -t L1
This has the advantage of not having any speed/memory penalty for those
who know their data is NFC. Also the normalization could be programmed
just once in an INTERNAL-UNNORMALIZED -> INTERNAL transcoder.
If this is something you think would be appropriate to add to the gconv
converter collection, I would be happy to work on it.
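To illustrate the behaviour I have in mind, here is a rough Python sketch,
not gconv code. It accepts UTF-8 that may contain combining characters,
composes it to NFC, and only then converts to Latin1. The function name
utf8_to_latin1_lenient is made up for this example.

```python
import unicodedata


def utf8_to_latin1_lenient(data: bytes) -> bytes:
    """Decode UTF-8, compose to NFC, then encode as Latin-1.

    Hypothetical helper for illustration only; it is not part of iconv
    or glibc.
    """
    text = data.decode("utf-8")
    # Compose base letter + combining mark pairs where a precomposed
    # character exists (e.g. U+0065 U+0301 -> U+00E9).
    composed = unicodedata.normalize("NFC", text)
    # Raises UnicodeEncodeError if a character still has no Latin-1 form.
    return composed.encode("latin-1")


# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed_input = "e\u0301".encode("utf-8")
# U+00E9 LATIN SMALL LETTER E WITH ACUTE (already NFC)
precomposed_input = "\u00e9".encode("utf-8")

result_decomposed = utf8_to_latin1_lenient(decomposed_input)
result_precomposed = utf8_to_latin1_lenient(precomposed_input)
```

Both inputs should come out as the same single Latin1 byte, which is the
"lenient on input" behaviour described above. A real converter would work
incrementally on a buffer rather than on a whole string.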
> > do you agree that we should make the L1 converter
> > do the same kind of thing?
>
> No. It's better if you avoid generating Unicode strings which are not in NFC.
> This way, you'll not only get no problems with iconv, you'll also avoid
> problems with XML and HTML parsers and lots of other software.
Agreed. But I'm talking about reading non-NFC Unicode, not generating
it.
> > But on the other hand, the CP1255 converter handles it either way:
>
> Interesting. Probably the authors thought, like you do now, that handling of
> combining characters on input is better than not handling them.
Moreover, I just noticed that U+FB1D HEBREW LETTER YOD WITH HIRIQ is in
the composition exclusion list, so that means iconv is not producing NFC
in this case.
> printf '\xE9\xC4' | iconv -f CP1255 -t UTF-8 | od -tx1
0000000 ef ac 9d
0000003
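The claim about the exclusion list can be checked independently of iconv.
The following is a small Python sketch using the standard unicodedata
module; the variable names are only for illustration. It shows that the
bytes ef ac 9d are the UTF-8 encoding of U+FB1D, and that normalizing
U+FB1D to NFC gives the two-character sequence YOD + HIRIQ rather than
the precomposed character.

```python
import unicodedata

yod_with_hiriq = "\uFB1D"  # HEBREW LETTER YOD WITH HIRIQ

# UTF-8 bytes of the precomposed character, as hex digits.
utf8_hex = yod_with_hiriq.encode("utf-8").hex()

# NFC first decomposes, then recomposes. Because U+FB1D is a
# composition exclusion, the recomposition step leaves it decomposed.
nfc_form = unicodedata.normalize("NFC", yod_with_hiriq)
nfc_code_points = [f"U+{ord(ch):04X}" for ch in nfc_form]

# Python's cp1255 codec is a plain byte-to-character table, so it
# maps the two input bytes to two separate characters.
table_decoded = b"\xe9\xc4".decode("cp1255")

print(utf8_hex)
print(nfc_code_points)
print(table_decoded == nfc_form)
```

Note that Python's codec behaves differently from the glibc CP1255
converter here: it does no composition at all, so its output for these
two bytes happens to be in NFC already.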
> > When I did the same conversions from UTF-8 to CP1255
> > in a C program, I noticed that iconv returned 0 in both instances. Shouldn't
> > the second one return a non-zero value since it is irreversible?
>
> Good question as well. Actually the term in POSIX is "non-identical"
> conversions, not "irreversible" conversions. If you consider the combined
> and decomposed forms as the same, then the return value should be 0. If
> you consider it different, then the return value should be 1. I don't see
> convincing arguments for either choice.
OK, yes, I could go either way with this, too.
Chris
> Bruno
>
>
> [1] http://www.unicode.org/reports/tr15/
> [2] http://www.w3.org/TR/charmod/#sec-ChoiceNFC