Bug 12031 - iconv -t ascii//translit with Greek characters
Summary: iconv -t ascii//translit with Greek characters
Status: RESOLVED DUPLICATE of bug 2872
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: GNU C Library Locale Maintainers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-17 15:09 UTC by Al Bogner
Modified: 2015-09-18 14:58 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Greeklish trasliteration (2.20 KB, application/octet-stream)
2012-04-28 19:40 UTC, Nick Andrik

Description Al Bogner 2010-09-17 15:09:59 UTC
I don't know whether the example below is a bug or a feature request:


TITLE=`printf "„Χρώματα“" | sed -e "s/_/ /g;s/'/ /g;s/\+/ /g;s/.*/\U&/"`

printf "$TITLE\n"
„ΧΡΏΜΑΤΑ“

TITLE2=$(printf "$TITLE\n" | iconv -t ascii//translit)

printf "$TITLE2\n"
,,???????"
Comment 1 Ulrich Drepper 2011-05-15 04:43:52 UTC
What would the transliteration look like?  And is it locale-independent?
Comment 2 squarious 2011-06-26 15:17:00 UTC
In general, Greek transliteration (Greeklish) can be orthographically correct or just phonetically correct.

Standards [1][3]
=======
[ISO 843]: It is orthographically correct and provides a complete transliteration map from Greek characters to Latin ones. It also includes accented letters and digraphs.
http://en.wikipedia.org/wiki/ISO_843

Examples using ISO 843:
Ψάρι : Psári   (notice 1-letter Ψ -> 2-letters Ps)
Όργανο : Órgano
Φλάουτο : Fláouto (notice exception double vowel ου -> ou)
Φύτρο : Fýtro
Αυτοκίνητο : Autokínīto
Αυγό : Augó
Φεγγάρι : Feggári
Μπάμια: Mpámia

----
[UN ELOT 743]: It is much like ISO 843. The key difference is that it has different rules for consonant and vowel digraphs, which become phonetically correct. It also uses more mainstream Latin letters, like "i" rather than the "ī" used by ISO 843.

Αυτοκίνητο : Aftokínito (notice the αυ becomes af to be phonetic-correct)
Αυγό : Avgo (notice that here αυ becomes av to be phonetic-correct)
Φεγγάρι : Fengári ( notice that γγ becomes ng to be phonetic-correct)
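
The context dependence of the αυ digraph can be sketched in a few lines of Python. This is only an illustration, not a full ELOT 743 implementation: the function name and the set of consonants treated as voiceless are assumptions made for the example, and accented forms such as αύ are deliberately not handled.

```python
# Sketch only: handles the unaccented digraph "αυ" and nothing else.
# Assumption: these are the consonants treated as voiceless.
VOICELESS = set("θκξπστφχψ")


def romanize_alpha_upsilon(word: str) -> str:
    """Replace each "αυ" with "av" or "af"; leave other characters untouched."""
    out = []
    i = 0
    while i < len(word):
        if word[i:i + 2].lower() == "αυ":
            following = word[i + 2:i + 3].lower()
            # "af" at the end of a word or before a voiceless consonant,
            # "av" before vowels and voiced consonants.
            latin = "af" if following == "" or following in VOICELESS else "av"
            if word[i].isupper():
                latin = latin.capitalize()
            out.append(latin)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


print(romanize_alpha_upsilon("Αυτοκίνητο"))  # starts with "Af" (τ follows)
print(romanize_alpha_upsilon("Αυγό"))        # starts with "Av" (γ follows)
```

The point is that a one-character-to-one-string table cannot express this rule by itself, because the output depends on the letter that follows.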

-----
[ALA-LC]: http://www.loc.gov/catdir/cpso/romanization/greek.pdf

Everyday greeklish
===========
In everyday usage (SMS, IM, forums etc.) no one uses accented Greeklish. There is no standard conversion table, and most people use only phonetically correct and sometimes visually correct transliteration.

Ψάρι : Psari
Όργανο : Organo
Φλάουτο : Flaouto, Flauto (you can drop the "o" from "ou" because "ou" sounds like "u")
Φύτρο : Fytro, Fitro
Αυτοκίνητο : Aftokinito, Aytokinhto, Aftokinito
Αυγό : Avgo, Aygo
Φεγγάρι : Feggari, Fegari
Θα έρθω: Tha erthw, Tha ertho, 8a er8w, 8a er8o  (a lot of times "8" is used for "θ" because it looks a bit the same - visually correct)

Personal opinion
==========
Accents are not used because people usually have a Greek and a US/English keyboard layout, and the US/English layout has no accents. Idioms like θ -> 8, which try to mimic the shape of letters, can be confusing, so they can be skipped. What I would like to see is a transliteration that users can also type themselves. E.g. ISO 843 is not a good one because η => ī, and I have no clue which layout I would need to type this accented i!?
Of course, if it is possible to support multiple systems, that would be best for all.

In any case, there are many higher-level frameworks that need transliteration, and it is very annoying to special-case Greek if you are based on iconv. Any solution is welcome :)

Please ask if you need more information.

Info - References
----------------------------
[1] http://transliteration.eki.ee/pdf/Greek.pdf
[2] http://en.wikipedia.org/wiki/Greeklish
[3] http://en.wikipedia.org/wiki/Romanization_of_Greek
Comment 3 -EMail Hidden- 2011-09-24 02:33:54 UTC
Absolutely any transliteration scheme is good if it gives some ASCII characters instead of the error this function produces now.
Comment 4 Alexander Karlstad 2012-02-03 19:28:54 UTC
I have a similar problem with later versions of iconv (2.13 in Ubuntu).

iconv -t ascii//TRANSLIT <<< 'æ,ø,å'

gives me "ae,?,a" but in my opinion it should give me "ae,o,a".

Tested this on several machines with the same version (2.13) and on an old SunOS box with 1.9. The latter returned the desired result.

My LC_ALL and LANG variables are all set to nb_NO.UTF-8 and I've tried changing it to other available locales, without getting the wanted result.

Is this a bug?
Comment 5 Petter Reinholdtsen 2012-02-04 11:20:39 UTC
(In reply to comment #4)
> gives me "ae,?,a" but in my opinion it should give me "ae,o,a".
[...]
> Is this a bug?

I believe it is a bug.

The request to change transliteration for æøå is http://sourceware.org/bugzilla/show_bug.cgi?id=89 .  Please explain there why you believe it should transliterate to ae,o,a and not ae,oe,aa.
Comment 6 Nick Andrik 2012-04-28 19:40:48 UTC
Created attachment 6380
Greeklish trasliteration

I have created a first version of a file to use for greeklish (greek to ascii) transliteration.

The conversion scheme is:

alpha -> a
beta -> b
gamma -> g
delta -> d
epsilon -> e
zeta -> z
eta -> h
theta -> 8
iota -> i
kappa -> k
lamda -> l
mu -> m
nu -> n
xi -> ks
omikron -> o
pi -> p
ro -> r
sigma -> s
tau -> t
ypsilon -> y
phi -> f
chi  -> x
psi -> ps
omega -> w
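
For readers who want to try the scheme outside glibc, the table above can be sketched as a Python lookup. This is not the attached locale file: the names `GREEKLISH` and `to_greeklish`, the handling of final sigma, and the choice to upper-case the whole replacement for capital letters are assumptions made for illustration.

```python
# Lookup table following the conversion scheme listed above.
GREEKLISH = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "h", "θ": "8", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "ks", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "τ": "t", "υ": "y", "φ": "f", "χ": "x", "ψ": "ps", "ω": "w",
    "ς": "s",  # assumption: final sigma is treated like sigma
}


def to_greeklish(text: str) -> str:
    """Map unaccented Greek letters to ASCII; pass everything else through."""
    out = []
    for ch in text:
        latin = GREEKLISH.get(ch.lower())
        if latin is None:
            out.append(ch)             # not in the table (e.g. accented letters)
        elif ch.isupper():
            out.append(latin.upper())  # assumption: capitals map to all caps
        else:
            out.append(latin)
    return "".join(out)


print(to_greeklish("θα ερθω"))  # 8a er8w
```

Accented letters are left untouched here on purpose; they need the extra step discussed next.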

From my experiments I realized that there is no "chained" transliteration.
By this, I mean that I had to specify the Greeklish transliterations for all accented versions of the letters, even though I had already specified one for the plain letter.

Example:
ETA with PERISPOMENI -> ETA (this is already in translit_combining)
ETA -> H (this is my addition)
If I try to convert "ETA with PERISPOMENI" to ascii then I get ?, I had to edit it to this:
ETA with PERISPOMENI -> ETA;H
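
What "chained" lookup would mean can be sketched in Python using Unicode canonical decomposition. This does not describe how glibc processes its translit tables; `BASE_TO_ASCII`, `strip_marks` and `chained` are hypothetical names, and the table is only an excerpt.

```python
import unicodedata

# Excerpt of a base-letter table (hypothetical, for illustration only).
BASE_TO_ASCII = {"η": "h", "α": "a", "ω": "w"}


def strip_marks(ch: str) -> str:
    """Step 1: accented letter -> base letter, via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def chained(ch: str) -> str:
    """Step 2: base letter -> ASCII, with "?" for anything unmapped."""
    return "".join(BASE_TO_ASCII.get(c, "?") for c in strip_marks(ch))


eta_perispomeni = "\u1fc6"  # GREEK SMALL LETTER ETA WITH PERISPOMENI

print(BASE_TO_ASCII.get(eta_perispomeni, "?"))  # ? (single lookup, no chaining)
print(chained(eta_perispomeni))                 # h (two chained steps)
```

Without chaining, the only way to get the second result from a single table is to list the ASCII fallback next to every accented form, which is what the ETA;H entry does.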
Comment 8 Mike FABIAN 2015-05-04 18:53:52 UTC
(In reply to Petter Reinholdtsen from comment #5)
> (In reply to comment #4)
> > gives me "ae,?,a" but in my opinion it should give me "ae,o,a".
> [...]
> > Is this a bug?
> 
> I believe it is a bug.

It works in recent glibc (glibc-2.20-8.fc21.x86_64)
in *all* locales except C/POSIX. 

$ echo 'Æ,æ,Ø,ø,Å,å' | LANG=nb_NO.UTF-8 iconv -t ascii//TRANSLIT 
AE,ae,OE,oe,A,a

$ echo 'Æ,æ,Ø,ø,Å,å' | LANG=en_US.UTF-8 iconv -t ascii//TRANSLIT 
AE,ae,OE,oe,A,a

$ echo 'Æ,æ,Ø,ø,Å,å' | LANG=POSIX iconv -t ascii//TRANSLIT 
iconv: illegal input sequence at position 0

It is independent of the locale because all locales (except C/POSIX)
include translit_neutral where this is defined.

> The request to change transliteration for æøå is
> http://sourceware.org/bugzilla/show_bug.cgi?id=89 .  Please explain there
> why you believe it should transliterate to ae,o,a and not ae,oe,aa.

For Scandinavian locales, transliterating 'Æ,æ,Ø,ø,Å,å' to 'Ae, ae,
Oe, oe, Aa, aa' is more appropriate. For most other locales,
transliterating å to a is probably OK.  I am a bit puzzled about Æ ->
AE, shouldn’t this be transliterated to Ae, even in English locales?
(Same with Ø, transliterating to just O or maybe Oe in
translit_neutral for all locales which do not have special rules
seems better.)

The patch attached to

https://sourceware.org/bugzilla/show_bug.cgi?id=89#c5

fixes the transliteration for Norwegian locales (nn_NO and nb_NO).
Probably the same fix should be applied also for Swedish and Finnish
locales (and maybe Icelandic locales as well).
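
The layering described here, a neutral table shared by all locales plus optional per-locale rules, can be sketched in Python as follows. The neutral entries mirror the iconv output shown above; the Norwegian override values are an assumption for illustration and are not taken from the patch, and the error position is a character index rather than a byte offset.

```python
# Entries taken from the iconv output shown in this comment.
NEUTRAL = {"Æ": "AE", "æ": "ae", "Ø": "OE", "ø": "oe", "Å": "A", "å": "a"}

# Hypothetical per-locale overrides; the values are NOT taken from the patch.
LOCALE_OVERRIDES = {
    "nb_NO": {"Å": "AA", "å": "aa"},
    "nn_NO": {"Å": "AA", "å": "aa"},
}


def transliterate(text: str, locale: str) -> str:
    """Try locale-specific rules first, then the neutral table."""
    if locale in ("C", "POSIX"):
        tables = []  # no transliteration tables at all
    else:
        tables = [LOCALE_OVERRIDES.get(locale, {}), NEUTRAL]
    out = []
    for position, ch in enumerate(text):
        if ch.isascii():
            out.append(ch)
            continue
        for table in tables:
            if ch in table:
                out.append(table[ch])
                break
        else:  # no table had an entry for this character
            raise ValueError(f"illegal input sequence at position {position}")
    return "".join(out)


print(transliterate("Æ,æ,Ø,ø,Å,å", "en_US"))  # AE,ae,OE,oe,A,a
print(transliterate("Æ,æ,Ø,ø,Å,å", "nb_NO"))  # AE,ae,OE,oe,AA,aa
```

With the locale set to POSIX the same call raises an error at position 0, which matches the shape of the failure shown above.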
Comment 9 Petter Reinholdtsen 2015-05-04 21:00:36 UTC
(In reply to Mike FABIAN from comment #8)

> I am a bit puzzled about Æ ->
> AE, shouldn’t this be transliterated to Ae, even in English locales?
> (Same with Ø, transliterating to just O or maybe Oe in
> translit_neutral for all locales which do not have special rules
> seems better.

For me it makes more sense to transliterate a capital letter to all capital
letters, to ensure words with only capital letters look sane.  For example
SØRING would end up like SOERING, not SOeRING.  Sure, if the capital letter is the first one in the sentence, it would make more sense to use Øvelse -> Oevelse,
but I suspect special Norwegian characters at the start of the sentence
are less common than capital special Norwegian letters in an all capital word.  Most Norwegian words do not start with æ, ø or å. :)

-- 
Happy hacking
Petter Reinholdtsen
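
The trade-off between the two casings can be made concrete with a small Python sketch. It is a word-level heuristic written only for illustration, not something the per-character translit tables are claimed to do; `EXPANSIONS` and `transliterate_word` are hypothetical names.

```python
# Upper-case expansions for the capital letters discussed above.
EXPANSIONS = {"Æ": "AE", "Ø": "OE", "Å": "AA"}


def transliterate_word(word: str) -> str:
    """Use "OE" inside all-caps words and "Oe" in mixed-case words."""
    all_caps = word.isupper()  # True when every cased letter is a capital
    out = []
    for ch in word:
        if ch in EXPANSIONS:
            expansion = EXPANSIONS[ch]
            out.append(expansion if all_caps else expansion.capitalize())
        else:
            out.append(ch)
    return "".join(out)


print(transliterate_word("SØRING"))  # SOERING
print(transliterate_word("Øvelse"))  # Oevelse
```

A fixed table has to pick one of the two forms for every occurrence, which is why the all-caps form is argued to be the safer default.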
Comment 10 keld@keldix.com 2015-05-04 21:11:22 UTC
On Mon, May 04, 2015 at 09:00:36PM +0000, pere at hungry dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=12031
> 
> --- Comment #9 from Petter Reinholdtsen <pere at hungry dot com> ---
> (In reply to Mike FABIAN from comment #8)
> 
> > I am a bit puzzled about Æ ->
> > AE, shouldn’t this be transliterated to Ae, even in English locales?
> > (Same with Ø, transliterating to just O or maybe Oe in
> > translit_neutral for all locales which do not have special rules
> > seems better.
> 
> For me it make more sense to transliterate a capital letter to all capital
> letters, to ensure words with only capital letters look sane.  For example
> SØRING would end up like SOERING, not SOeRING.  Sure, if the capital letter is
> the first one in the sentence, it would make more sense to use Øvelse ->
> Oevelse,
> but I suspect special norwegian characters at the start of the sentence
> is less common than capital special norwegian letters in an all capital word. 
> Most Norwegian words do not start with æ, ø or å. :)

The same goes for Danish, which due to some common heritage uses the same letters
and to some extent the same transliteration rules.

I would also recommend transliterating Æ, Ø, Å to AE, OE, AA

Best regards
Keld
Comment 11 Egor Kobylkin 2015-09-18 09:29:03 UTC
The problem is present for many languages and was reported earlier:
https://sourceware.org/bugzilla/show_bug.cgi?id=2872 
I have created a spreadsheet to generate transliteration tables
https://sourceware.org/bugzilla/attachment.cgi?id=8590
The table should look like this https://sourceware.org/bugzilla/attachment.cgi?id=8591
And the list of unicode characters can be found here http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Those who are interested in their language being included for transliteration, would you spend some time to generate the needed table/file?

*** This bug has been marked as a duplicate of bug 2872 ***
Comment 12 Egor Kobylkin 2015-09-18 14:58:19 UTC
I have tested the translit_greeklish by Nick Andrik and will try to get it included in the fix along with the translit_cyrilic that I have generated myself.