Bug 26984 - conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: 2.31
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2020-11-30 13:59 UTC by Vincent Lefèvre
Modified: 2021-04-30 18:03 UTC

Description Vincent Lefèvre 2020-11-30 13:59:05 UTC
Conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters:

$ for i in C en_US.iso885915 en_US.utf8 ; do echo '.éèêëâîôûàùïçå.€.²³⁴.Ææ.' | LC_ALL=$i /usr/bin/iconv -f utf-8 -t ascii//TRANSLIT ; done
.?????????????.EUR.???.AEae.
.eeeeaiouauica.EUR.234.AEae.
.eeeeaiouauica.EUR.234.AEae.

On this example, only €, Æ and æ are correctly handled in the C locale.

This is similar to bug 12031 comment 8, but this bug has been fixed (and indeed, I can't reproduce the issue when using LANG).
Comment 1 Carlos O'Donell 2021-04-29 21:58:10 UTC
The POSIX and C locale have no specified transliteration rules. That means that the LC_CTYPE contains no transliteration rules for any of the characters you have specified if they are outside of ASCII and so they are replaced with REPLACEMENT CHARACTER (roughly '?' in the C locale).

In practice this is not a bug, but an expectation mismatch of the C locale. The C locale is very small and very restricted. Do you have access to a distribution that is shipping C.UTF-8?

Example on Fedora:
$ for i in C.UTF-8 en_US.iso885915 en_US.utf8 ; do echo '.éèêëâîôûàùïçå.€.²³⁴.Ææ.' | LC_ALL=$i /usr/bin/iconv -f utf-8 -t ascii//TRANSLIT ; done
.eeeeaiouauica.EUR.234.AEae.
.eeeeaiouauica.EUR.234.AEae.
.eeeeaiouauica.EUR.234.AEae.

Does that answer your question?
Comment 2 Vincent Lefèvre 2021-04-30 00:55:09 UTC
(In reply to Carlos O'Donell from comment #1)
> The POSIX and C locale have no specified transliteration rules.

But why does this depend on the locale, anyway? This is silly! And this doesn't behave as documented: "The iconv program reads in text in one encoding and outputs the text in another encoding."
Comment 3 Carlos O'Donell 2021-04-30 02:05:39 UTC
(In reply to Vincent Lefèvre from comment #2)
> (In reply to Carlos O'Donell from comment #1)
> > The POSIX and C locale have no specified transliteration rules.
> 
> But why does this depend on the locale, anyway? This is silly! And this
> doesn't behave as documented: "The iconv program reads in text in one
> encoding and outputs the text in another encoding."

It depends on locale because transliteration is influenced by localization.

There are more than 100 language-influenced transliterations across all the implemented localizations.

What might be obvious to an English speaker as a way to break down a letter or sound into another letter or sound may not be obvious to an Arabic speaker. 

There are some "neutral" transliterations (language independent), but today the POSIX and C locales specify no transliterations.

It does behave as documented, the text is read in from one encoding and output to the other encoding. It is just that the transliteration rules applied depend on your localization.
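
The locale dependence described here can be observed directly. The following is a minimal sketch, not part of the original comment: the helper name `translit_in`, the sample string, and the list of locale names are assumptions chosen for illustration, and the output will vary with the glibc version and with which locales are installed.

```shell
# translit_in LOCALE STRING: convert STRING from UTF-8 to ASCII with
# //TRANSLIT while LOCALE is in effect for the iconv process only.
# The exit status of iconv is ignored; only the output is inspected.
translit_in() {
    printf '%s' "$2" | LC_ALL="$1" iconv -f utf-8 -t ascii//TRANSLIT 2>/dev/null
}

# The locale names below are assumptions. With glibc, a name that is
# not installed typically falls back to the C locale without a warning,
# so check `locale -a` before drawing conclusions from the output.
for loc in C C.UTF-8 en_US.UTF-8 de_DE.UTF-8; do
    printf '%-12s %s\n' "$loc" "$(translit_in "$loc" 'Käse, 5 €')"
done
```

Setting `LC_ALL` as a prefix to the `iconv` command limits the change to that one process, so the locale of the calling shell and terminal is left alone.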

The behaviour you are seeing has existed for decades at this point, and changing it could have an impact on existing applications, particularly for POSIX and C locales.

My suggestion is to use C.UTF-8 if available in your distribution (we are working to add a harmonized generic C.UTF-8 to glibc).

The Linux man pages for iconv could also have a clarifying sentence added about where the transliterations come from. Note that the Linux man pages are a distinct project (https://www.kernel.org/doc/man-pages/contributing.html).

Does that answer your question?
Comment 4 Vincent Lefèvre 2021-04-30 08:00:34 UTC
(In reply to Carlos O'Donell from comment #3)
> It depends on locale because transliteration is influenced by localization.
> 
> There are more than 100 language-influenced transliterations across all the
> implemented localizations.

Could you explain how the "²" transliteration depends on the language?

And why not use a commonly used transliteration for letters, derived from the Unicode description? For instance, since "é" is "e" with an accent, the transliteration to ASCII should be just "e". In any case, this is better than the replacement character.

> There are some "neutral" transliterations (language independent), but today
> the POSIX and C locales specify no transliterations.

The POSIX standard does not define the concept of transliteration at all. So it doesn't make sense to rely on it. Everything else is an extension.

> My suggestion is to use C.UTF-8 if available in your distribution (we are
> working to add a harmonized generic C.UTF-8 to glibc).

C.UTF-8 doesn't specify a language, just like C. So the fact that it behaves differently from C is wrong.

Note also that using the C.UTF-8 locale is incorrect when using a terminal that doesn't support UTF-8 (because error messages could potentially contain non-ASCII characters), which is precisely the case where transliteration is needed.
Comment 5 Andreas Schwab 2021-04-30 10:21:53 UTC
The C locale does not contain any characters that need transliteration.
Comment 6 Vincent Lefèvre 2021-04-30 11:13:02 UTC
(In reply to Andreas Schwab from comment #5)
> The C locale does not contain any characters that need transliteration.

Yes, but the input text does.
Comment 7 Andreas Schwab 2021-04-30 11:59:21 UTC
Characters outside the range of the locale charset are forbidden.
Comment 8 Vincent Lefèvre 2021-04-30 12:05:18 UTC
(In reply to Andreas Schwab from comment #7)
> Characters outside the range of the locale charset are forbidden.

Then what's the point of the -f option?
And in the C locale, why does "€" get converted?
Comment 9 Carlos O'Donell 2021-04-30 16:18:48 UTC
(In reply to Vincent Lefèvre from comment #8)
> (In reply to Andreas Schwab from comment #7)
> > Characters outside the range of the locale charset are forbidden.
> 
> Then what's the point of the -f option?
> And in the C locale, why does "€" get converted?

This got me digging. It turns out I was wrong.

Because the C locale is "builtin" to the library to be able to provide it 100% of the time regardless of the upgraded state of the system, and to provide it in a consistent way, there is a *limited* set of C locale transliteration rules... but they are embedded (locale/C-translit.h.in).

There are indeed more than 1650 transliteration rules defined for the C locale.

So this refutes my argument that we shouldn't be doing transliterations in C/POSIX.

In that case we *could* attempt to take all the neutral transliterations and autogenerate locale/C-translit.h.in from them and thus resolve this issue by providing all "neutral" transliterations to C/POSIX in a builtin way.

This would increase the .data for libc.so.6 to carry these builtin...

Thoughts?
Comment 10 Carlos O'Donell 2021-04-30 16:25:31 UTC
If we automatically processed localedata/locales/translit_neutral, we would be adding ~25,000 transliterations to the builtin C/POSIX locale. These transliterations don't always produce ASCII, so we may need to do some more processing of the mappings to end up at ASCII. That's a lot of transliterations.

The intermediate fix is to add just the mappings for a subset, say all accented characters.
Comment 11 Carlos O'Donell 2021-04-30 16:31:16 UTC
So as a start I would accept and review a patch to update locale/C-translit.h.in with blocks for all the accented characters and that has immediate value until we solve the bigger issue.
Comment 12 Vincent Lefèvre 2021-04-30 16:47:07 UTC
I'm wondering. In the C/POSIX locale, can't iconv internally use the transliteration rules from C.UTF-8 when they are available?

During a system upgrade, transliteration may not fully work, but I don't think that this is a big problem compared to other things that will not work.
Comment 13 Carlos O'Donell 2021-04-30 18:03:15 UTC
(In reply to Vincent Lefèvre from comment #12)
> I'm wondering. In the C/POSIX locale, can't iconv internally use the
> transliteration rules from C.UTF-8 when they are available?

It can. It would be a bespoke thing. We are working to make C.UTF-8 become builtin, but for now it's a distinct locale (that on Fedora can't be uninstalled with a package manager).

Right now I'm going to make sure C.UTF-8 does what you're asking for, and then we still need to circle back and implement something like this check:

* When the locale is C or POSIX
* When the system has a usable C.UTF-8
* Use the transliteration rules from C.UTF-8 (superset of C/POSIX) to support broader transliteration.
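
Until something like this exists inside glibc, the three steps above can be approximated from the outside with a shell wrapper. This is only a sketch under stated assumptions: the function names `effective_ctype`, `have_c_utf8`, and `iconv_translit` are hypothetical, the availability test relies on `locale -a` listing the locale as `C.UTF-8` or `C.utf8`, and it does not reflect how a fix in the library would be implemented.

```shell
# Effective LC_CTYPE setting, following the usual precedence
# LC_ALL > LC_CTYPE > LANG, defaulting to C when none is set.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# True when `locale -a` lists C.UTF-8 (spelled C.UTF-8 or C.utf8).
have_c_utf8() {
    locale -a 2>/dev/null | grep -qiE '^C\.utf-?8$'
}

# Hypothetical wrapper taking the same arguments as iconv.
# In the C or POSIX locale, and only if C.UTF-8 is usable, run iconv
# under C.UTF-8 so that its transliteration rules apply.
iconv_translit() {
    case "$(effective_ctype)" in
        C|POSIX)
            if have_c_utf8; then
                LC_ALL=C.UTF-8 iconv "$@"
                return
            fi
            ;;
    esac
    iconv "$@"
}
```

A call such as `echo 'é' | iconv_translit -f utf-8 -t ascii//TRANSLIT` then behaves like plain `iconv` in any other locale, and switches to C.UTF-8 only for the duration of that one command when the caller is in C or POSIX.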

The short term fix is to update C-translit.h.in though... and you or others could work on that and we'd have a fairly robust fix right away :-)