Hello, I have a feature request for iconv. It would be useful if transliteration could be restricted to mapping one character to a string that is exactly one character long. In the context of EDIFACT or flat files, the length in number of characters matters, so transliterating "€" to "EUR" is not an option. Hence I deactivated //TRANSLIT in our source code. However, transliteration would still be useful if it were applied only when the result is one character long. It could also truncate the transliteration to one character, like "€" to "E". I don't know what the best way to implement a //TRANSLITONETOONE feature would be. Maybe the more general approach would be a new parameter for choosing a "transliteration profile", with two kinds of choices: either predefined transliteration profiles (classic, onetoonetruncate, onetoonediscard) integrated in iconv, or user-defined profiles. Please let me know what you think. Best regards, Laurent Lyaudet PS: You can also run iconv first, with transliteration, and then truncate your fields for EDIFACT or flat files, but this is not a satisfactory method, since information at the end of a field may be more important than 10 new characters introduced by the transliteration. PPS: I also sent this feature request to bug-gnu-libiconv@gnu.org
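To make the three proposed profiles concrete, here is a minimal sketch in C of how each one could post-process the transliteration of a single source character. The enum, the function name apply_profile and its signature are hypothetical illustrations of the request, not part of any iconv implementation, and the sketch assumes a single-byte destination encoding so that bytes and characters coincide.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical profile names taken from the request; NOT an iconv API. */
enum translit_profile {
    PROFILE_CLASSIC,            /* keep the full transliteration ("EUR") */
    PROFILE_ONETOONE_TRUNCATE,  /* keep only its first character ("E")   */
    PROFILE_ONETOONE_DISCARD    /* keep it only if it is one character   */
};

/* Post-process the transliteration of ONE source character.
   `translit` is what a classic transliteration produced, assumed to be in
   a single-byte encoding (bytes == characters).
   Returns the number of bytes stored in `out` (excluding the NUL),
   or (size_t)-1 if `out` is too small. */
size_t apply_profile(enum translit_profile profile, const char *translit,
                     char *out, size_t outsize)
{
    size_t len = strlen(translit);
    size_t keep;

    switch (profile) {
    case PROFILE_ONETOONE_TRUNCATE:
        keep = len > 0 ? 1 : 0;
        break;
    case PROFILE_ONETOONE_DISCARD:
        keep = (len == 1) ? 1 : 0;
        break;
    default: /* PROFILE_CLASSIC */
        keep = len;
        break;
    }
    if (keep + 1 > outsize)
        return (size_t)-1;
    memcpy(out, translit, keep);
    out[keep] = '\0';
    return keep;
}
```

With "EUR" as the classic result for the euro sign, the truncating profile yields "E" and the discarding profile yields an empty string, so the field length never grows.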
Laurent, you could write a custom locale that does what you want, for example one that transliterates "€" to "E". Is that a solution for you? It is not possible to simply truncate a transliteration to its first character; doing it properly requires context and an understanding of the character being converted.
You can implement your desired feature as a string processing function in C, based on iconv(), in a way that 1. works with all iconv() implementations that support '//TRANSLIT' (which include glibc and GNU libiconv), 2. works with all encodings and in all locales. The starting point of this implementation is a function that converts one character at a time, like the function iconv_carefully_1 in http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/striconveh.c . Then you need a function that determines the number of characters in the conversion result of a character; this one too is based on iconv_carefully_1, but on iconv_open(DESTINATION,DESTINATION). With these two building blocks you can do it. Therefore there is no need to bother glibc (or any iconv() implementation) to achieve your desired feature.
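The idea of converting one character at a time can be sketched as follows. This is a simplified illustration, not gnulib's iconv_carefully_1: it assumes UTF-8 input (so character boundaries can be read from the lead byte) and a single-byte destination such as ASCII or ISO-8859-1 (so the number of output bytes equals the number of output characters). The names utf8_len and translit_one_to_one are hypothetical, and '//TRANSLIT' support depends on the iconv() implementation.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Byte length of the UTF-8 sequence introduced by lead byte c. */
static size_t utf8_len(unsigned char c)
{
    if (c >= 0xF0) return 4;
    if (c >= 0xE0) return 3;
    if (c >= 0xC0) return 2;
    return 1;
}

/* Convert UTF-8 `in` to the single-byte encoding `to`, one character at a
   time. A character is kept only if its (possibly transliterated) result is
   exactly one byte long; otherwise `fallback` is stored instead.
   Returns 0 on success, -1 on error (unknown encoding, `out` too small). */
int translit_one_to_one(const char *to, const char *in,
                        char *out, size_t outsize, char fallback)
{
    char tocode[64];
    size_t inleft = strlen(in), o = 0;
    iconv_t cd;

    if (outsize == 0)
        return -1;
    snprintf(tocode, sizeof tocode, "%s//TRANSLIT", to);
    cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    while (inleft > 0) {
        char buf[16];
        char *inptr = (char *)in, *outptr = buf;
        size_t clen = utf8_len((unsigned char)*in);
        size_t insize, outleft = sizeof buf, res;

        if (clen > inleft)
            clen = inleft;          /* truncated trailing sequence */
        if (o + 1 >= outsize) {     /* need room for this byte plus NUL */
            iconv_close(cd);
            return -1;
        }
        insize = clen;
        res = iconv(cd, &inptr, &insize, &outptr, &outleft);
        if (res != (size_t)-1 && insize == 0 && sizeof buf - outleft == 1) {
            out[o++] = buf[0];      /* one character in, one character out */
        } else {
            out[o++] = fallback;    /* failed, or result longer than one */
            iconv(cd, NULL, NULL, NULL, NULL);  /* reset conversion state */
        }
        in += clen;
        inleft -= clen;
    }
    out[o] = '\0';
    iconv_close(cd);
    return 0;
}
```

For example, translit_one_to_one("ASCII", text, out, sizeof out, '?') keeps the character count of text unchanged; what an implementation produces for a given character (and therefore whether the fallback is used) varies between glibc, GNU libiconv and others.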
I don't think we can offer a general-purpose solution here. If you convert from UTF-8 to ASCII, "€" (three bytes in UTF-8) and "EUR" have the same number of bytes.
Hello,

Thanks for the quick answers. First off, I must say that I'm using PHP at work, where I need this feature, and that I was hoping the additional "transliteration profile" parameter could be added to the iconv function in PHP in a few years. I'm not in a hurry ;).

In fact, converting "é" in UTF-8 to "é" in ISO-8859-1 is also a transliteration at the byte level. We don't notice it immediately because ASCII characters are transliterated by the identity function, but everything fits a model of transliteration with multi-byte character input and multi-byte character output. Maybe what is needed is a new library, or a new function in iconv, that takes a transliteration profile as a parameter:

function TPiconv(tp, iStream, oStream)

It shouldn't be hard to generate the standard transliteration profiles for all pairs of standard encodings.

>You could write a custom locale that does what you want.
>Create a new locale that transliterates "€" to "E"?
I looked at the files in /usr/share/i18n/locales. I didn't see any transliteration defined there. Moreover, I thought that transliteration had to be defined for a pair of encodings, not a single encoding. I don't see how I can make the iconv() function in PHP use a locale depending on the output encoding. I would be happy to give it a try, but right now I don't see how to do it.

>Therefore there is no need to bother glibc (or any iconv() implementation) to achieve your desired feature.
I'm using PHP, so it is complicated and inefficient to implement, and it is always good to bother people to improve things ;P

>I don't think we can offer a general-purpose solution here.
>If you translate from UTF-8 to ASCII, "€" and "EUR" both have same number of bytes.
I'm not interested in the number of bytes but in the number of characters. TPiconv as I suggested would be a general solution.

Best regards, Laurent Lyaudet
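The byte/character distinction discussed here can be illustrated with a short sketch in C. The function name utf8_char_count is hypothetical, and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string by
   counting every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

The euro sign is the three-byte sequence E2 82 AC in UTF-8, so strlen() reports 3 for it while utf8_char_count() reports 1; for "EUR" both report 3. A fixed-width field counted in characters therefore grows by two when "€" becomes "EUR", even though the byte count is unchanged.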
Hello, I looked at the source code of libiconv and glibc. Libiconv applies a hardcoded map with translit_data and translit_index before using any locale. Glibc only applies the transliteration rules from the locale. I have found how the locales define transliteration. I'll do some further testing at work to see if PHP uses glibc for iconv on Debian and if the locale approach works with PHP. I'll post the results of my experiments here. Thank you, best regards, Laurent Lyaudet
(In reply to Laurent Lyaudet from comment #5)
> I have found how the locales define transliteration.

Try this:

$ echo ə | LC_ALL=az_AZ iconv -f UTF-8 -t ISO-8859-1//TRANSLIT | iconv -f ISO-8859-1
ä

That's a locale-specific transliteration rule. Quoting from localedata/locales/az_AZ:

translit_start
% schwa -> a:
<U0259> "<U00E4>"
<U018F> "<U00C4>"
translit_end
Hello, I did some testing at work. Here are the results:

root@StretchDevLaurent:/home/web/test_iconv_locale# php test_iconv_locale.php
Sizes:
- Before: 27
- After:
-- iconv_translit_str_replace: 27
-- iconv_translit_locale_ko: 27
-- iconv_translit_locale_ok: 27
-- iconv_translit_str_replace_opt: 27
Results between iconv_translit_str_replace and iconv_translit_locale_ko are different:
abcdef▒+'''''------- abcdef
abcdef▒+'''''-------▒abcdef
Results between iconv_translit_str_replace and iconv_translit_locale_ok are identical: abcdef▒+'''''------- abcdef
Results between iconv_translit_str_replace and iconv_translit_str_replace_opt are identical: abcdef▒+'''''------- abcdef
Test perf. iconv_translit_str_replace(): 0.723961
Test perf. iconv_translit_locale_ko(): 0.51453
Test perf. iconv_translit_locale_ok(): 0.55342
Test perf. iconv_translit_str_replace_opt(): 0.310989

iconv_translit_locale_ko relies purely on the locale and doesn't work exactly as I would like, since I also want to transliterate non-breakable spaces, although they are valid ISO-8859-1. iconv_translit_locale_ok adds a str_replace to deal with non-breakable spaces. As you can see, the fastest way is to use str_replace optimized with arrays, without using the locale.
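The table-driven idea behind the fastest variant (iconv_translit_str_replace_opt in the PHP test script) can also be sketched in C without iconv at all. The sketch below is an illustration under stated assumptions, not a port of the script: it assumes well-formed UTF-8 input and ISO-8859-1 output, the names map_entry, next_code_point and utf8_to_latin1_one_to_one are hypothetical, and characters that have neither a table entry nor an ISO-8859-1 representation are dropped, mimicking //IGNORE.

```c
#include <stddef.h>
#include <stdint.h>

struct map_entry { uint32_t from; unsigned char to; };

/* Same substitutions as the PHP arrays; 0xA4 is the ISO-8859-1 currency sign. */
static const struct map_entry table[] = {
    {0x20AC, 0xA4}, {0x2795, '+'},
    {0x2019, '\''}, {0x02BC, '\''}, {0x02B9, '\''}, {0x02C8, '\''}, {0x2032, '\''},
    {0x2212, '-'}, {0x2796, '-'}, {0x2010, '-'}, {0x2012, '-'},
    {0x2013, '-'}, {0x2014, '-'}, {0x2015, '-'},
    {0x00A0, ' '},
};

/* Decode one UTF-8 sequence (assumed well-formed) and advance *p past it. */
static uint32_t next_code_point(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;
    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return cp;
}

/* Returns the number of bytes stored in `out` (excluding the NUL),
   or (size_t)-1 if `out` is too small. */
size_t utf8_to_latin1_one_to_one(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t o = 0;

    if (outsize == 0)
        return (size_t)-1;
    while (*p != '\0') {
        uint32_t cp = next_code_point(&p);
        unsigned char byte = 0;
        int mapped = 0;

        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (table[i].from == cp) { byte = table[i].to; mapped = 1; break; }
        if (!mapped) {
            if (cp > 0xFF)
                continue;               /* not representable: drop it */
            byte = (unsigned char)cp;   /* U+0000..U+00FF map to themselves */
        }
        if (o + 1 >= outsize)
            return (size_t)-1;
        out[o++] = (char)byte;
    }
    out[o] = '\0';
    return o;
}
```

Because every table entry maps one character to one byte, the output has exactly as many characters as the input whenever nothing is dropped, which is the property needed for fixed-width fields.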
Below is my test script, best regards, Laurent Lyaudet

<?php
// The character before the second "abcdef" is a no-break space (U+00A0),
// spelled out as its UTF-8 bytes so that it stays visible in the source.
$sInputTest = 'abcdef€➕’ʼʹˈ′−➖‐‒–—―' . "\xC2\xA0" . 'abcdef';

function iconv_translit_str_replace($sString){
    // Convert the euro sign by hand
    $sString = str_replace('€', '¤', $sString);
    // Replace the heavy plus sign with the ordinary plus sign
    $sString = str_replace('➕', '+', $sString);
    // Replace the apostrophe variants
    $sString = str_replace('’', '\'', $sString); // RIGHT SINGLE QUOTATION MARK
    $sString = str_replace('ʼ', '\'', $sString); // MODIFIER LETTER APOSTROPHE
    $sString = str_replace('ʹ', '\'', $sString); // MODIFIER LETTER PRIME
    $sString = str_replace('ˈ', '\'', $sString); // MODIFIER LETTER VERTICAL LINE
    $sString = str_replace('′', '\'', $sString); // PRIME
    // Replace the dash variants
    $sString = str_replace('−', '-', $sString); // MINUS SIGN
    $sString = str_replace('➖', '-', $sString); // HEAVY MINUS SIGN
    $sString = str_replace('‐', '-', $sString); // HYPHEN
    $sString = str_replace('‒', '-', $sString); // FIGURE DASH
    $sString = str_replace('–', '-', $sString); // EN DASH
    $sString = str_replace('—', '-', $sString); // EM DASH
    $sString = str_replace('―', '-', $sString); // HORIZONTAL BAR
    // Replace no-break spaces
    $sString = str_replace("\xC2\xA0", ' ', $sString);
    $sString = iconv('UTF-8', 'ISO-8859-1//IGNORE', $sString);
    return $sString;
}

function iconv_translit_str_replace_opt($sString){
    $sString = str_replace(
        array(
            '€', '➕',
            '’', 'ʼ', 'ʹ', 'ˈ', '′',
            '−', '➖', '‐', '‒', '–', '—', '―',
            "\xC2\xA0", // no-break space
        ),
        array(
            '¤', '+',
            '\'', '\'', '\'', '\'', '\'',
            '-', '-', '-', '-', '-', '-', '-',
            ' ',
        ),
        $sString
    );
    $sString = iconv('UTF-8', 'ISO-8859-1//IGNORE', $sString);
    return $sString;
}

function iconv_translit_locale_ko($sString){
    // '0' returns the current locale without changing it
    $currentLocal = setlocale(LC_ALL, '0');
    setlocale(LC_ALL, "fr_FR@test");
    /* Transliteration section of the fr_FR@test locale definition:
    translit_start
    %euro
    <U20AC> "<U00A4>"
    %heavy plus
    <U2795> "<U002B>"
    %apostrophe
    <U2019> "<U0027>"
    <U02BC> "<U0027>"
    <U02B9> "<U0027>"
    <U02C8> "<U0027>"
    <U2032> "<U0027>"
    %dash-hyphen-minus
    <U2212> "<U002D>"
    <U2796> "<U002D>"
    <U2010> "<U002D>"
    <U2012> "<U002D>"
    <U2013> "<U002D>"
    <U2014> "<U002D>"
    <U2015> "<U002D>"
    %nb space
    <U00A0> "<U0020>"
    translit_end
    */
    $sString = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $sString);
    setlocale(LC_ALL, $currentLocal);
    return $sString;
}

function iconv_translit_locale_ok($sString){
    // '0' returns the current locale without changing it
    $currentLocal = setlocale(LC_ALL, '0');
    setlocale(LC_ALL, "fr_FR@test");
    /* Transliteration section of the fr_FR@test locale definition:
    translit_start
    %euro
    <U20AC> "<U00A4>"
    %heavy plus
    <U2795> "<U002B>"
    %apostrophe
    <U2019> "<U0027>"
    <U02BC> "<U0027>"
    <U02B9> "<U0027>"
    <U02C8> "<U0027>"
    <U2032> "<U0027>"
    %dash-hyphen-minus
    <U2212> "<U002D>"
    <U2796> "<U002D>"
    <U2010> "<U002D>"
    <U2012> "<U002D>"
    <U2013> "<U002D>"
    <U2014> "<U002D>"
    <U2015> "<U002D>"
    %nb space
    <U00A0> "<U0020>"
    translit_end
    */
    // Replace no-break spaces by hand, since the locale rule is not applied to them
    $sString = str_replace("\xC2\xA0", ' ', $sString);
    $sString = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $sString);
    setlocale(LC_ALL, $currentLocal);
    return $sString;
}

// Tests
$sOutputTest1 = iconv_translit_str_replace($sInputTest);
$sOutputTest2 = iconv_translit_locale_ko($sInputTest);
$sOutputTest3 = iconv_translit_locale_ok($sInputTest);
$sOutputTest4 = iconv_translit_str_replace_opt($sInputTest);

echo "Sizes:\n",
    "- Before: ", mb_strlen($sInputTest), "\n",
    "- After:\n",
    "-- iconv_translit_str_replace: ", strlen($sOutputTest1), "\n",
    "-- iconv_translit_locale_ko: ", strlen($sOutputTest2), "\n",
    "-- iconv_translit_locale_ok: ", strlen($sOutputTest3), "\n",
    "-- iconv_translit_str_replace_opt: ", strlen($sOutputTest4), "\n"
;

echo "Results between iconv_translit_str_replace and iconv_translit_locale_ko";
if($sOutputTest1 === $sOutputTest2){
    echo " are identical: ", $sOutputTest1, "\n";
}
else{
    echo " are different:\n", $sOutputTest1, "\n", $sOutputTest2, "\n";
}

echo "Results between iconv_translit_str_replace and iconv_translit_locale_ok";
if($sOutputTest1 === $sOutputTest3){
    echo " are identical: ", $sOutputTest1, "\n";
}
else{
    echo " are different:\n", $sOutputTest1, "\n", $sOutputTest3, "\n";
}

echo "Results between iconv_translit_str_replace and iconv_translit_str_replace_opt";
if($sOutputTest1 === $sOutputTest4){
    echo " are identical: ", $sOutputTest1, "\n";
}
else{
    echo " are different:\n", $sOutputTest1, "\n", $sOutputTest4, "\n";
}

function getFDifferenceMicrotime($p_sMicrotimeDebut, $p_sMicrotimeFin){
    list($sMicroSecondes1, $sSecondes1) = explode(' ', $p_sMicrotimeDebut);
    list($sMicroSecondes2, $sSecondes2) = explode(' ', $p_sMicrotimeFin);
    // Subtract the seconds first so as not to lose precision in the calculation
    return ((float)($sSecondes2 - $sSecondes1)) + $sMicroSecondes2 - $sMicroSecondes1;
}

$iMax = 100000;

echo "Test perf. iconv_translit_str_replace(): ";
$microtimeBefore = microtime();
for($i = 0; $i < $iMax; ++$i){
    $sOutputTest1 = iconv_translit_str_replace($sInputTest);
}
$microtimeAfter = microtime();
echo getFDifferenceMicrotime($microtimeBefore, $microtimeAfter), "\n";

echo "Test perf. iconv_translit_locale_ko(): ";
$microtimeBefore = microtime();
for($i = 0; $i < $iMax; ++$i){
    $sOutputTest1 = iconv_translit_locale_ko($sInputTest);
}
$microtimeAfter = microtime();
echo getFDifferenceMicrotime($microtimeBefore, $microtimeAfter), "\n";

echo "Test perf. iconv_translit_locale_ok(): ";
$microtimeBefore = microtime();
for($i = 0; $i < $iMax; ++$i){
    $sOutputTest1 = iconv_translit_locale_ok($sInputTest);
}
$microtimeAfter = microtime();
echo getFDifferenceMicrotime($microtimeBefore, $microtimeAfter), "\n";

echo "Test perf. iconv_translit_str_replace_opt(): ";
$microtimeBefore = microtime();
for($i = 0; $i < $iMax; ++$i){
    $sOutputTest1 = iconv_translit_str_replace_opt($sInputTest);
}
$microtimeAfter = microtime();
echo getFDifferenceMicrotime($microtimeBefore, $microtimeAfter), "\n";
?>
It's unclear to me if the current locale facilities work for you. Do they?
Hello,

>It's unclear to me if the current locale facilities work for you. Do they?
They work partly. I have to do a str_replace for the "non-breakable space" character, since I want to transliterate it although it can be converted directly.

I started a library as a proof of concept of what I mean by transliteration profile. The source code is here: https://github.com/LLyaudet/transliteration_profile_iconv It is still a work in progress, but you should be able to understand the idea more precisely from it. I welcome any feedback.

Best regards, Laurent Lyaudet
(In reply to Laurent Lyaudet from comment #9)
> Hello,
>
> >It's unclear to me if the current locale facilities work for you. Do they?
> They work partly. I have to do a str_replace for the character "non
> breakable space" since I want to transliterate it although it could be
> converted.

But couldn't you use a custom charmap for that, one that drops the non-breakable space?
(In reply to Florian Weimer from comment #10)
> (In reply to Laurent Lyaudet from comment #9)
> > Hello,
> >
> > >It's unclear to me if the current locale facilities work for you. Do they?
> > They work partly. I have to do a str_replace for the character "non
> > breakable space" since I want to transliterate it although it could be
> > converted.
>
> But couldn't you use a custom charmap for that, one that drops the non-breakable
> space?

Hello,

I tried to redefine the transliteration of the non-breakable space in the locale file (extract from comment 7):
> %nb space
> <U00A0> "<U0020>"
but it didn't work, since transliteration is used only when the conversion fails.

I don't know exactly what you mean by using a custom charmap:
a)- is it modifying the source code of iconv? (In libiconv I saw that charmaps are compiled.)
b)- or is there some file that defines the charmap and is read at execution time?

One of the main ideas of "transliteration_profile_iconv" is that everything is user-defined by profiles at execution time, because I thought that was not possible with current iconv. However, if b) applies, there may already be enough flexibility.

Thanks, best regards, Laurent Lyaudet
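The point that transliteration only comes into play when the direct conversion fails can be observed with a small C sketch. It is only an illustration: the function name utf8_to_latin1_translit is hypothetical, and it assumes an iconv() implementation that accepts the "ISO-8859-1//TRANSLIT" destination (as glibc and GNU libiconv do). It does not install any custom locale; it merely shows that U+00A0, which has a direct ISO-8859-1 representation, comes out as the byte 0xA0.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert `inlen` bytes of UTF-8 to ISO-8859-1 with //TRANSLIT requested.
   Returns the number of bytes written to `out`, or (size_t)-1 on error. */
size_t utf8_to_latin1_translit(const char *in, size_t inlen,
                               char *out, size_t outsize)
{
    iconv_t cd = iconv_open("ISO-8859-1//TRANSLIT", "UTF-8");
    char *inptr = (char *)in;
    char *outptr = out;
    size_t inleft = inlen, outleft = outsize, res;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    res = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    return res == (size_t)-1 ? (size_t)-1 : outsize - outleft;
}
```

Passing the two UTF-8 bytes C2 A0 yields the single byte 0xA0, not a space. Since the character converts successfully, a locale rule mapping it to U+0020 is never consulted, which is why the script needs a separate replacement step for non-breakable spaces.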
(In reply to Laurent Lyaudet from comment #11)
> I don't know exactly what you mean by using a custom charmap:
> a)- is it modifying the source code of iconv ? (In libiconv I saw that
> charmaps are compiled).
> b)- or is there some file that defines the charmap that is read from at
> execution time ?

The charmap is another input file for localedef.
Hello, I tried to use a charmap as you suggested, but I did not succeed. I also finished my library transliteration_profile_iconv. It is available here: https://github.com/LLyaudet/transliteration_profile_iconv. It solves the problem of tight coupling with locales. Can you help me with the benchmark and/or give me feedback on my library? I tried to benchmark my library against glibc iconv to see how much slower it is. Here is what I tried: I edited a locale definition file fr_FR@test in this directory (the Makefile puts it in the correct directory on Debian). I edited a charmap definition file ISO-8859-1-test in this directory (the Makefile puts it in the correct directory on Debian). The Makefile then edits the file /etc/locale.gen and executes locale-gen. Everything goes well so far. But iconv_open() below fails because there is no gconv module for ISO-8859-1-test. If I rename ISO-8859-1-test to ISO-8859-1 everywhere, which I don't recommend (make copies of the correct files for ISO-8859-1 first), iconv_open() works but iconv() does not transliterate as specified by the fr_FR@test file. (You can check that the setlocale call is correct, since it returns "fr_FR@test". The euro symbol is transliterated to EUR instead of the currency symbol... What I could test with PHP is not working with C.) I also tried adding the locale to /usr/share/i18n/SUPPORTED, then running locale-gen and iconvconfig, but that didn't generate a new gconv module for ISO-8859-1-test. So far I don't know whether it's possible to generate a new gconv module for a new charmap without compiling glibc. I didn't find any command to do so. Thanks, best regards, Laurent Lyaudet
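When experimenting with a new charmap name like the one above, a quick way to see whether the iconv() implementation knows an encoding is to check the result of iconv_open(). The helper below is a minimal sketch; the name charset_is_known is hypothetical.

```c
#include <iconv.h>

/* Return 1 if iconv_open() accepts `name` as a source encoding for a
   conversion to UTF-8, and 0 otherwise. */
int charset_is_known(const char *name)
{
    iconv_t cd = iconv_open("UTF-8", name);

    if (cd == (iconv_t)-1)
        return 0;   /* typically errno == EINVAL: conversion not supported */
    iconv_close(cd);
    return 1;
}
```

On a system without a converter registered under that name, charset_is_known("ISO-8859-1-test") returns 0. This matches the failure described above: generating a locale with localedef or locale-gen does not by itself create a converter for a new charmap name.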
Sorry, I was mistaken about the charmap approach. glibc currently does not have a way to alter charmaps based on locale definitions. I still think that configurable charmaps would go a long way towards solving this issue, but declarative charmap processing is a lot of work (and we need to keep support for gconv modules around, for backwards compatibility).
A note about transliteration in general: glibc implements transliteration with respect to a character set, and thus addresses the case, frequent in the years 2000-2005, of an application that needs to process a file in UTF-8 while the locale is an 8-bit locale. Nowadays, the more frequent use of transliteration is a culture-aware transliteration from one script to another script. The charset is not the important factor here. For example, when transliterating Punjabi in Gurmukhi script to Punjabi in Shahmukhi (Arabic) script, the input and output are both UTF-8, so glibc's transliteration system does not help. And such use cases are outside glibc's scope anyway, since 99.9% of applications don't need culture-aware transliteration. Therefore I think there is no need to extend glibc's transliteration facilities any further. Separate packages are the way to go (and they are also easier to implement if you can assume Unicode input and Unicode output, without the charset-related baggage that glibc's transliteration carries).