23076 – Iconv translitterate with profile

Status:	NEW

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	libc
Version:	unspecified

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

Reported:	2018-04-17 18:04 UTC by Laurent Lyaudet
Modified:	2019-05-11 13:32 UTC
CC List:	4 users

Last reconfirmed:	2018-04-17 00:00:00

Flags:	fweimer: security-

Description Laurent Lyaudet 2018-04-17 18:04:12 UTC

Hello,

I have a feature request for iconv.
It would be useful if transliteration could be restricted to mapping one
character to a string that is one character long.
In the context of EDIFACT, or flat files, the length in number of
characters matters.
So transliterating "€" to "EUR" is not an option.
Hence I deactivated //TRANSLIT in our source code.

However, if transliteration could be applied only when the result is one
character long, that would still be useful.
It could also truncate the transliteration to one character, e.g. "€" to "E".
I don't know what would be the best way to implement a
//TRANSLITONETOONE feature.

Maybe the more general way of implementing it would be to add a new
parameter for choosing the "transliteration profile", with
two choices: either predefined transliteration profiles (classic,
onetoonetruncate, onetoonediscard) integrated in iconv, or
user-defined profiles.
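
To make the three predefined profiles concrete, here is a minimal Python sketch. The profile names come from the request above; the rule table, the function name `apply_profile`, and the behaviour chosen for `onetoonediscard` (ignore a rule whose result is longer than one character and emit a one-character placeholder, so the field length is preserved) are assumptions made for illustration, not existing iconv behaviour.

```python
# Hypothetical rule table: each source character maps to its usual
# transliteration (illustrative entries only).
RULES = {
    "€": "EUR",
    "œ": "oe",
    "’": "'",
}


def apply_profile(text, profile, placeholder="?"):
    """Apply RULES to text under one of three illustrative profiles.

    classic          -- use the full replacement string
    onetoonetruncate -- keep only the first character of the replacement
    onetoonediscard  -- use the replacement only if it is one character
                        long, otherwise emit `placeholder` (an assumption)
    """
    if profile not in ("classic", "onetoonetruncate", "onetoonediscard"):
        raise ValueError("unknown profile: " + profile)
    out = []
    for ch in text:
        repl = RULES.get(ch)
        if repl is None:
            out.append(ch)  # no rule: keep the character unchanged
        elif profile == "classic":
            out.append(repl)
        elif profile == "onetoonetruncate":
            out.append(repl[0])
        elif len(repl) == 1:
            out.append(repl)
        else:
            out.append(placeholder)
    return "".join(out)


for name in ("classic", "onetoonetruncate", "onetoonediscard"):
    print(name, apply_profile("10€ l’œuf", name))
```

Only the `classic` profile can change the number of characters; the two one-to-one profiles always return a string as long as the input.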

Let me know what you think about it, please.

Best regards,
   Laurent Lyaudet

PS: You can also run iconv first, with transliteration, and then
truncate your fields for EDIFACT or flat files, but that is not a
satisfactory method, since the information at the end of a field may be
more important than the 10 new characters introduced by transliteration.

PPS: I also sent this feature request to bug-gnu-libiconv@gnu.org

Comment 1 Carlos O'Donell 2018-04-17 20:50:00 UTC

Laurent,

You could write a custom locale that does what you want.

Create a new locale that transliterates "€" to "E"?

Is that a solution for you?

It is not possible to simply truncate a transliteration to the first character; doing it properly requires context and an understanding of the character being converted.

Comment 2 Bruno Haible 2018-04-17 21:42:14 UTC

You can implement your desired feature as a string processing function in C, based on iconv(), in a way that
  1. works with all iconv() implementations that support '//TRANSLIT' (these include glibc and GNU libiconv),
  2. works with all encodings and in all locales.

The starting point of this implementation is a function that converts one character at a time, like the function iconv_carefully_1 in http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/striconveh.c . Then you need a function that determines the number of characters in the conversion result of a character; this one too is based on iconv_carefully_1, but uses iconv_open(DESTINATION,DESTINATION). With these two building blocks you can do it.

Therefore there is no need to bother glibc (or any iconv() implementation) to achieve your desired feature.
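
The two building blocks described above can be illustrated with a rough Python stand-in. This is not gnulib's iconv_carefully_1 and does not call iconv(); Python's own codecs play the role of the converter, and the table `FALLBACK` is a hypothetical substitute for whatever '//TRANSLIT' would produce. All function names are made up for the sketch.

```python
# Hypothetical stand-in for the transliterations //TRANSLIT would supply.
FALLBACK = {"€": "EUR", "’": "'", "—": "-"}


def convert_one(ch, encoding):
    """Convert a single character; transliterate only if encoding fails.

    Returns the encoded bytes, or None if nothing usable is available.
    """
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        repl = FALLBACK.get(ch)
        if repl is None:
            return None
        try:
            return repl.encode(encoding)
        except UnicodeEncodeError:
            return None


def char_count(data, encoding):
    """Number of characters (not bytes) in an encoded conversion result."""
    return len(data.decode(encoding))


def convert_one_to_one(text, encoding, placeholder="?"):
    """Keep a conversion result only when it is exactly one character long."""
    out = bytearray()
    for ch in text:
        data = convert_one(ch, encoding)
        if data is None or char_count(data, encoding) != 1:
            data = placeholder.encode(encoding)
        out += data
    return bytes(out)


print(convert_one_to_one("é€’", "iso-8859-1"))
```

Note that iterating over a Python string yields code points, so combining sequences would need extra handling that this sketch leaves out.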

Comment 3 Florian Weimer 2018-04-18 09:53:14 UTC

I don't think we can offer a general-purpose solution here.  If you translate from UTF-8 to ASCII, "€" and "EUR" both have the same number of bytes.
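
The byte counts mentioned above are easy to check. The short Python snippet below (variable names are arbitrary) prints both the byte lengths and the character lengths, since the original request is about the number of characters in fixed-width fields.

```python
euro_utf8 = "€".encode("utf-8")    # the input as a converter would read it
eur_ascii = "EUR".encode("ascii")  # the transliterated output

print(len(euro_utf8), len(eur_ascii))  # byte counts
print(len("€"), len("EUR"))            # character counts
```

The byte counts are equal (3 and 3) while the character counts differ (1 and 3).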

Comment 4 Laurent Lyaudet 2018-04-18 18:00:15 UTC

Hello,

Thanks for the quick answers.

First off, I must say that I'm using PHP at work, where I need this feature, and that I was hoping the additional "transliteration profile" parameter could be added to the iconv function in PHP within a few years. I'm not in a hurry ;).

In fact, when you convert "é" in UTF-8 to "é" in ISO-8859-1, that is also transliteration at the byte level.
We don't see it immediately because the ASCII characters are transliterated with the identity function.
But everything fits into a model of transliteration from multi-byte character input to multi-byte character output.
Maybe what is needed is a new library, or a new function in iconv, whose only configuration parameter is a transliteration profile:
function TPiconv(tp, iStream, oStream)
It shouldn't be hard to generate the standard transliteration profiles for all pairs of standard encodings.
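
As a rough illustration of this byte-level view, the Python sketch below treats a profile as a plain mapping from input byte sequences to output byte sequences and applies the longest matching rule at each position. Only the parameter order (profile, input stream, output stream) comes from the suggestion above; the function name `tp_iconv`, the dictionary representation, and the error handling are assumptions, and this is not the API of any existing library.

```python
import io


def tp_iconv(profile, in_stream, out_stream):
    """Rewrite in_stream into out_stream using a byte-level profile.

    `profile` maps input byte sequences to output byte sequences.
    At each position the longest matching key wins; ValueError is
    raised when no key matches.
    """
    data = in_stream.read()
    max_len = max((len(key) for key in profile), default=0)
    pos = 0
    while pos < len(data):
        for size in range(min(max_len, len(data) - pos), 0, -1):
            chunk = data[pos:pos + size]
            if chunk in profile:
                out_stream.write(profile[chunk])
                pos += size
                break
        else:
            raise ValueError(
                "no rule for byte 0x%02x at offset %d" % (data[pos], pos)
            )


# Illustrative UTF-8 -> ISO-8859-1 profile: identity on ASCII, plus two rules.
profile = {bytes([b]): bytes([b]) for b in range(0x80)}
profile["é".encode("utf-8")] = "é".encode("iso-8859-1")  # C3 A9 -> E9
profile["€".encode("utf-8")] = "¤".encode("iso-8859-1")  # E2 82 AC -> A4

src = io.BytesIO("café 5€".encode("utf-8"))
dst = io.BytesIO()
tp_iconv(profile, src, dst)
print(dst.getvalue())
```

In this representation ordinary conversion ("é") and transliteration ("€" to "¤") are the same kind of rule, which is the point made in the paragraph above.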

>You could write a custom locale that does what you want.
>Create a new locale that transliterates "€" to "E"?
I looked at the files in /usr/share/i18n/locales.
I didn't see any transliteration defined there.
Moreover, I thought that transliteration had to be defined for a pair of encodings, not a single encoding.
I don't see how I can make the iconv() function in PHP use a locale depending on the output encoding.
I would be happy to give it a try, but right now I don't see how to do it.

>Therefore there is no need to bother glibc (or any iconv() implementation) to achieve your desired feature.
I'm using PHP, so it is complicated and inefficient to implement, and it is always good to bother people to improve things ;P

>I don't think we can offer a general-purpose solution here.
>If you translate from UTF-8 to ASCII, "€" and "EUR" both have same number of bytes.
I'm not interested in the number of bytes but in the number of characters.
TPiconv as I suggested would be a general solution.

Best regards,
   Laurent Lyaudet

Comment 5 Laurent Lyaudet 2018-04-20 20:03:06 UTC

Hello,

I looked at the source code of libiconv and glibc.
Libiconv applies a hardcoded map with translit_data and translit_index before using any locale.
Glibc only applies the transliteration rules from the locale.

I have found how the locales define transliteration.

I'll do some further testing at work to see if PHP uses glibc for iconv on Debian and if the locale approach works with PHP.

I'll post the results of my experiments here.

Thank you, best regards,
   Laurent Lyaudet

Comment 6 Florian Weimer 2018-04-20 20:09:29 UTC

(In reply to Laurent Lyaudet from comment #5)
> I have found how the locales define transliteration.

Try this:

$ echo ə | LC_ALL=az_AZ iconv -f UTF-8 -t ISO-8859-1//TRANSLIT | iconv -f ISO-8859-1
ä

That's a locale-specific transliteration rule.  Quoting from localedata/locales/az_AZ:

translit_start

% schwa -> a:
<U0259> "<U00E4>"
<U018F> "<U00C4>"

translit_end
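
For readers who want to inspect such rules programmatically, here is a small Python sketch that reads a translit_start/translit_end block into a dictionary. It only handles the simple `<Uxxxx> "<Uyyyy>..."` form shown above, assumes `%` is the comment character, uses only the first quoted alternative, and skips anything else (such as include lines); the function name `parse_translit` is made up for this example.

```python
import re

# Matches one <Uxxxx> code point reference.
_UCS = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def parse_translit(source, comment_char="%"):
    """Parse simple `<Uxxxx> "<Uyyyy>..."` rules between translit_start/end."""
    rules = {}
    inside = False
    for raw in source.splitlines():
        line = raw.strip()
        if line == "translit_start":
            inside = True
            continue
        if line == "translit_end":
            break
        if not inside or not line or line.startswith(comment_char):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        src = _UCS.fullmatch(parts[0])
        if src is None:
            continue  # e.g. include lines, not handled in this sketch
        quoted = re.search(r'"([^"]*)"', parts[1])  # first alternative only
        if quoted is None:
            continue
        target = "".join(
            chr(int(h, 16)) for h in _UCS.findall(quoted.group(1))
        )
        rules[chr(int(src.group(1), 16))] = target
    return rules


AZ_RULES = """\
translit_start

% schwa -> a:
<U0259> "<U00E4>"
<U018F> "<U00C4>"

translit_end
"""

print(parse_translit(AZ_RULES))
```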

Comment 7 Laurent Lyaudet 2018-04-23 11:56:56 UTC

Hello,

I did some testing at work. Here are the results:
  root@StretchDevLaurent:/home/web/test_iconv_locale# php test_iconv_locale.php
  Sizes:
  - Before: 27
  - After:
  -- iconv_translit_str_replace: 27
  -- iconv_translit_locale_ko: 27
  -- iconv_translit_locale_ok: 27
  -- iconv_translit_str_replace_opt: 27
  Results between iconv_translit_str_replace and iconv_translit_locale_ko are different:
  abcdef▒+'''''------- abcdef
  abcdef▒+'''''-------▒abcdef
  Results between iconv_translit_str_replace and iconv_translit_locale_ok are identical: abcdef▒+'''''------- abcdef
  Results between iconv_translit_str_replace and iconv_translit_str_replace_opt are identical: abcdef▒+'''''------- abcdef
  Test perf. iconv_translit_str_replace(): 0.723961
  Test perf. iconv_translit_locale_ko(): 0.51453
  Test perf. iconv_translit_locale_ok(): 0.55342
  Test perf. iconv_translit_str_replace_opt(): 0.310989

iconv_translit_locale_ko relies purely on the locale and doesn't work exactly as I would like, since I also want to transliterate non-breaking spaces, although they are valid ISO-8859-1.
iconv_translit_locale_ok adds a str_replace to deal with non-breaking spaces.

As you can see, the fastest way is to use str_replace optimized with arrays, without using the locale.

Below is my test script, best regards,
   Laurent Lyaudet

<?php

$sInputTest = 'abcdef€➕’ʼʹˈ′−➖‐‒–—― abcdef';

function iconv_translit_str_replace($sString){
  //Convert the euro sign by hand
  $sString = str_replace('€', '¤', $sString);
  //Replace the heavy plus sign with the regular plus sign
  $sString = str_replace('➕', '+', $sString);
  //Replace the apostrophe variants
  $sString = str_replace('’', '\'', $sString);//RIGHT SINGLE QUOTATION MARK
  $sString = str_replace('ʼ', '\'', $sString);//MODIFIER LETTER APOSTROPHE
  $sString = str_replace('ʹ', '\'', $sString);//MODIFIER LETTER PRIME
  $sString = str_replace('ˈ', '\'', $sString);//MODIFIER LETTER VERTICAL LINE
  $sString = str_replace('′', '\'', $sString);//PRIME
  //Replace the dash variants
  $sString = str_replace('−', '-', $sString);//MINUS SIGN
  $sString = str_replace('➖', '-', $sString);//HEAVY MINUS SIGN
  $sString = str_replace('‐', '-', $sString);//HYPHEN
  $sString = str_replace('‒', '-', $sString);//FIGURE DASH
  $sString = str_replace('–', '-', $sString);//EN DASH
  $sString = str_replace('—', '-', $sString);//EM DASH
  $sString = str_replace('―', '-', $sString);//HORIZONTAL BAR
  //Replace non-breaking spaces (U+00A0; the escape requires PHP 7+)
  $sString = str_replace("\u{00A0}", ' ', $sString);

  $sString = iconv('UTF-8', 'ISO-8859-1//IGNORE', $sString);
  return $sString;
}



function iconv_translit_str_replace_opt($sString){
  $sString = str_replace(
    array(
      '€',
      '➕',
      '’',
      'ʼ',
      'ʹ',
      'ˈ',
      '′',
      '−',
      '➖',
      '‐',
      '‒',
      '–',
      '—',
      '―',
      "\u{00A0}",//NO-BREAK SPACE (U+00A0; the escape requires PHP 7+)
    ), 
    array(
      '¤',
      '+',
      '\'',
      '\'',
      '\'',
      '\'',
      '\'',
      '-',
      '-',
      '-',
      '-',
      '-',
      '-',
      '-',
      ' ',
    ), 
    $sString
  );

  $sString = iconv('UTF-8', 'ISO-8859-1//IGNORE', $sString);
  return $sString;
}



function iconv_translit_locale_ko($sString){
  $currentLocal = setlocale(LC_ALL, 0);
  setlocale(LC_ALL, "fr_FR@test");
  /*
translit_start
%euro
<U20AC> "<U00A4>"
%heavy plus
<U2795> "<U002B>"
%apostrophe
<U2019> "<U0027>"
<U02BC> "<U0027>"
<U02B9> "<U0027>"
<U02C8> "<U0027>"
<U2032> "<U0027>"
%dash-hyphen-minus
<U2212> "<U002D>"
<U2796> "<U002D>"
<U2010> "<U002D>"
<U2012> "<U002D>"
<U2013> "<U002D>"
<U2014> "<U002D>"
<U2015> "<U002D>"
%nb space
<U00A0> "<U0020>"
translit_end
  */
  $sString = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $sString);
  setlocale(LC_ALL, $currentLocal);
  return $sString;
}



function iconv_translit_locale_ok($sString){
  $currentLocal = setlocale(LC_ALL, 0);
  setlocale(LC_ALL, "fr_FR@test");
  /*
translit_start
%euro
<U20AC> "<U00A4>"
%heavy plus
<U2795> "<U002B>"
%apostrophe
<U2019> "<U0027>"
<U02BC> "<U0027>"
<U02B9> "<U0027>"
<U02C8> "<U0027>"
<U2032> "<U0027>"
%dash-hyphen-minus
<U2212> "<U002D>"
<U2796> "<U002D>"
<U2010> "<U002D>"
<U2012> "<U002D>"
<U2013> "<U002D>"
<U2014> "<U002D>"
<U2015> "<U002D>"
%nb space
<U00A0> "<U0020>"
translit_end
  */
  //Replace non-breaking spaces (U+00A0; the escape requires PHP 7+)
  $sString = str_replace("\u{00A0}", ' ', $sString);
  $sString = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $sString);
  setlocale(LC_ALL, $currentLocal);
  return $sString;
}



//Tests

$sOutputTest1 = iconv_translit_str_replace($sInputTest);
$sOutputTest2 = iconv_translit_locale_ko($sInputTest);
$sOutputTest3 = iconv_translit_locale_ok($sInputTest);
$sOutputTest4 = iconv_translit_str_replace_opt($sInputTest);

echo "Sizes:\n",
     "- Before: ", mb_strlen($sInputTest), "\n",
     "- After:\n",
     "-- iconv_translit_str_replace: ", strlen($sOutputTest1), "\n",
     "-- iconv_translit_locale_ko: ", strlen($sOutputTest2), "\n",
     "-- iconv_translit_locale_ok: ", strlen($sOutputTest3), "\n",
     "-- iconv_translit_str_replace_opt: ", strlen($sOutputTest4), "\n"
;
echo "Results between iconv_translit_str_replace and iconv_translit_locale_ko";
if($sOutputTest1 === $sOutputTest2){
  echo " are identical: ", $sOutputTest1, "\n";
}
else{
  echo " are different:\n",
       $sOutputTest1, "\n",
       $sOutputTest2, "\n";
}

echo "Results between iconv_translit_str_replace and iconv_translit_locale_ok";
if($sOutputTest1 === $sOutputTest3){
  echo " are identical: ", $sOutputTest1, "\n";
}
else{
  echo " are different:\n",
       $sOutputTest1, "\n",
       $sOutputTest3, "\n";
}

echo "Results between iconv_translit_str_replace and iconv_translit_str_replace_opt";
if($sOutputTest1 === $sOutputTest4){
  echo " are identical: ", $sOutputTest1, "\n";
}
else{
  echo " are different:\n",
       $sOutputTest1, "\n",
       $sOutputTest4, "\n";
}

function getFDifferenceMicrotime($p_sMicrotimeDebut, $p_sMicrotimeFin){
  list($sMicroSecondes1, $sSecondes1) = explode(' ', $p_sMicrotimeDebut);
  list($sMicroSecondes2, $sSecondes2) = explode(' ', $p_sMicrotimeFin);

  //Subtract the seconds first so as not to lose precision in the computation
  return ((float)($sSecondes2 - $sSecondes1)) + $sMicroSecondes2 - $sMicroSecondes1;
}

$iMax = 100000;

echo "Test perf. iconv_translit_str_replace(): ";
$microtimeBefore = microtime();
for($i = 0; $i < $iMax; ++$i){
  $sOutputTest1 = iconv_translit_str_replace($sInputTest);
}
$microtimeAfter = microtime();
echo getFDifferenceMicrotime($microtimeBefore, $microtimeAfter), "\n";

echo "Test perf. iconv_translit_locale_ko(): ";
$microtimeBefore = microtime();
for($i = 0; $i < $iMax; ++$i){
  $sOutputTest1 = iconv_translit_locale_ko($sInputTest);
}
$microtimeAfter = microtime();
echo getFDifferenceMicrotime($microtimeBefore, $microtimeAfter), "\n";

echo "Test perf. iconv_translit_locale_ok(): ";
$microtimeBefore = microtime();
for($i = 0; $i < $iMax; ++$i){
  $sOutputTest1 = iconv_translit_locale_ok($sInputTest);
}
$microtimeAfter = microtime();
echo getFDifferenceMicrotime($microtimeBefore, $microtimeAfter), "\n";

echo "Test perf. iconv_translit_str_replace_opt(): ";
$microtimeBefore = microtime();
for($i = 0; $i < $iMax; ++$i){
  $sOutputTest1 = iconv_translit_str_replace_opt($sInputTest);
}
$microtimeAfter = microtime();
echo getFDifferenceMicrotime($microtimeBefore, $microtimeAfter), "\n";

?>
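
For comparison outside PHP, the table-driven variant (the approach of iconv_translit_str_replace_opt) can be sketched in Python with `str.translate`. The replacement pairs are the ones from the script above; the function name `to_latin1_one_to_one` is made up, and `errors="ignore"` is used here as a rough analogue of //IGNORE, not as a claim about how iconv behaves internally.

```python
# One-to-one replacements taken from the PHP arrays above.
TABLE = str.maketrans({
    "\u20ac": "\u00a4",  # EURO SIGN -> CURRENCY SIGN
    "\u2795": "+",       # HEAVY PLUS SIGN
    "\u2019": "'",       # RIGHT SINGLE QUOTATION MARK
    "\u02bc": "'",       # MODIFIER LETTER APOSTROPHE
    "\u02b9": "'",       # MODIFIER LETTER PRIME
    "\u02c8": "'",       # MODIFIER LETTER VERTICAL LINE
    "\u2032": "'",       # PRIME
    "\u2212": "-",       # MINUS SIGN
    "\u2796": "-",       # HEAVY MINUS SIGN
    "\u2010": "-",       # HYPHEN
    "\u2012": "-",       # FIGURE DASH
    "\u2013": "-",       # EN DASH
    "\u2014": "-",       # EM DASH
    "\u2015": "-",       # HORIZONTAL BAR
    "\u00a0": " ",       # NO-BREAK SPACE
})


def to_latin1_one_to_one(text):
    """Replace known characters one-to-one, then drop anything unencodable."""
    return text.translate(TABLE).encode("iso-8859-1", errors="ignore")


sample = (
    "abcdef\u20ac\u2795\u2019\u02bc\u02b9\u02c8\u2032"
    "\u2212\u2796\u2010\u2012\u2013\u2014\u2015\u00a0abcdef"
)
result = to_latin1_one_to_one(sample)
print(len(sample), len(result), result)
```

Because every rule maps one character to one character, the output has as many bytes in ISO-8859-1 as the input has characters, as long as nothing is dropped by the final encoding step.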

Comment 8 Florian Weimer 2018-04-30 09:31:19 UTC

It's unclear to me if the current locale facilities work for you.  Do they?

Comment 9 Laurent Lyaudet 2018-05-13 19:57:55 UTC

Hello,

>It's unclear to me if the current locale facilities work for you.  Do they?
They work partly. I have to do a str_replace for the "non-breakable space" character, since I want to transliterate it although it could be converted.

I started a library to give a proof of concept of what I mean by transliteration profile.
Source code is here
https://github.com/LLyaudet/transliteration_profile_iconv
It is still a work in progress, but it should let you understand the idea more precisely.
I welcome any feedback.

Best regards,
  Laurent Lyaudet

Comment 10 Florian Weimer 2018-05-14 07:55:04 UTC

(In reply to Laurent Lyaudet from comment #9)
> Hello,
> 
> >It's unclear to me if the current locale facilities work for you.  Do they?
> They work partly. I have to do an str_replace for the character "non
> breakable space" since I want to transliterate it although it could be
> converted.

But couldn't you use a custom charmap for that, one that drops the non-breakable space?

Comment 11 Laurent Lyaudet 2018-05-14 19:17:49 UTC

(In reply to Florian Weimer from comment #10)
> (In reply to Laurent Lyaudet from comment #9)
> > Hello,
> > 
> > >It's unclear to me if the current locale facilities work for you.  Do they?
> > They work partly. I have to do an str_replace for the character "non
> > breakable space" since I want to transliterate it although it could be
> > converted.
> 
> But couldn't you use a custom charmap for that, one that drops the non-breakable
> space?

Hello,

I tried to redefine the transliteration of the non-breakable space in the locale file (extract from comment 7):
> %nb space
> <U00A0> "<U0020>"
but it didn't work, since transliteration is used only when conversion fails.
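
The same "fallback only on failure" behaviour can be observed with Python's codec error handlers, which makes for a convenient analogy (it is not glibc's implementation). In the sketch below the handler name `translit1to1`, the table `RULES_1TO1`, and the list `calls` are all made up; the handler is invoked only for characters that ISO-8859-1 cannot encode, so the rule for U+00A0 is never consulted.

```python
import codecs

# Hypothetical one-to-one rules, mirroring part of the locale extract above.
RULES_1TO1 = {"\u20ac": "\u00a4", "\u2014": "-", "\u00a0": " "}

calls = []  # records which characters reached the handler


def translit_handler(err):
    """Error handler: invoked only for characters the codec cannot encode."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    calls.append(bad)
    return "".join(RULES_1TO1.get(ch, "?") for ch in bad), err.end


codecs.register_error("translit1to1", translit_handler)

text = "5\u20ac\u00a0\u2014\u00a0ok"
encoded = text.encode("iso-8859-1", errors="translit1to1")
print(encoded)
print(calls)
```

The non-breaking spaces come out as byte 0xA0 unchanged, which is why a separate replacement step before the conversion is needed to turn them into ordinary spaces.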
I don't know exactly what you mean by using a custom charmap:
a)- is it modifying the source code of iconv ? (In libiconv I saw that charmaps are compiled).
b)- or is there some file that defines the charmap that is read from at execution time ?
One of the main ideas of "transliteration_profile_iconv" is that everything is user defined by profiles at execution time because I thought that was not possible with current iconv.
However if b) applies there may be already enough flexibility.

Thanks, best regards,
   Laurent Lyaudet

Comment 12 Florian Weimer 2018-05-15 11:31:40 UTC

(In reply to Laurent Lyaudet from comment #11)
> I don't know exactly what you mean by using a custom charmap:
> a)- is it modifying the source code of iconv ? (In libiconv I saw that
> charmaps are compiled).
> b)- or is there some file that defines the charmap that is read from at
> execution time ?

The charmap is another input file for localedef.

Comment 13 Laurent Lyaudet 2018-07-28 17:38:02 UTC

Hello,

I tried to use a charmap as you suggested, but I did not succeed.
I also finished my library transliteration_profile_iconv.
It is available here: https://github.com/LLyaudet/transliteration_profile_iconv.
It solves the problem of tight coupling with locales.
Can you help me with the benchmark and/or give me feedback on my library?
I tried to benchmark my library against glibc iconv to see how much slower it is.
Here is what I tried:

I edited a locale definition file fr_FR@test in this directory (the Makefile puts it in the correct directory on Debian).
I edited a charmap definition file ISO-8859-1-test in this directory (the Makefile puts it in the correct directory on Debian).
The Makefile then edits the file /etc/locale.gen and executes locale-gen.
Everything goes well so far.
But iconv_open() below fails because there is no gconv module for ISO-8859-1-test.
If I rename ISO-8859-1-test to ISO-8859-1 everywhere (which I don't recommend; make copies of the original ISO-8859-1 files first), iconv_open() works but iconv() does not transliterate as specified by the fr_FR@test file.
(You can check that the setlocale call is correct, since it returns "fr_FR@test".
The euro symbol is transliterated to EUR instead of the currency symbol...
What I could test with PHP is not working with C.)
I also tried adding the locale to /usr/share/i18n/SUPPORTED,
then running locale-gen and iconvconfig.
But it didn't generate a new gconv module for ISO-8859-1-test.
So far I don't know whether it's possible to generate a new gconv module for a new charmap without compiling glibc.
I didn't find any command to do so.

Thanks, best regards,
Laurent Lyaudet

Comment 14 Florian Weimer 2018-10-24 11:30:41 UTC

Sorry, I was mistaken about the charmap approach.  glibc currently does not have a way to alter charmaps based on locale definitions.

I still think that configurable charmaps would go a long way towards solving this issue, but declarative charmap processing is a lot of work (and we need to keep support for gconv modules around, for backwards compatibility).

Comment 15 Bruno Haible 2019-05-11 13:32:37 UTC

A note about transliteration in general:

glibc implements transliteration with respect to a character set, and thus addresses the case, frequent in the years 2000-2005, where an application needs to process a file in UTF-8 while the locale is an 8-bit locale.

Nowadays, the more frequent use of transliteration is a culture-aware transliteration from one script to another script. The charset is not the important factor here. For example, when doing transliteration from Punjabi in Gurmukhi script to Punjabi in Shahmukhi (Arabic) script, the input and output are both UTF-8, therefore glibc's transliteration system does not help. And such use-cases are outside glibc anyway, since 99.9% of the applications don't need culture-aware transliteration.

Therefore I think there is no need to extend glibc's transliteration facilities any more. Separate packages are the way to go (and also easier to implement if you can assume Unicode input and Unicode output, without the charset-related baggage that glibc's transliteration carries).