This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH COMMITTED] locale/C-translit.h.in: Cyrillic -> ASCII transliteration [BZ #2872]


Egor,

Here are my doubts and questions about the patch which I have
committed.  It would be good to resolve them before the final
release, but it is fine if we don't.

Sorry if they were discussed and answered before; I may have
lost track of the earlier discussion.


20.07.2019 22:01 Rafal Luzynski <digitalfreak@lingonborough.com> wrote:
>  [...]
>  	* sysdeps/unix/sysv/linux/syscall-names.list: Add system calls
> diff --git a/locale/C-translit.h.in b/locale/C-translit.h.in
> index d5f00df0f3..758171c394 100644
> --- a/locale/C-translit.h.in
> +++ b/locale/C-translit.h.in
> @@ -56,6 +56,175 @@
>  "\x02cd"	"_"	# <U02CD> MODIFIER LETTER LOW MACRON
>  "\x02d0"	":"	# <U02D0> MODIFIER LETTER TRIANGULAR COLON
>  "\x02dc"	"~"	# <U02DC> SMALL TILDE

There are gaps.  For example, here
<U0400> CYRILLIC CAPITAL LETTER IE WITH GRAVE (Ѐ)
is missing.  Should we add it and transliterate as, e.g., "E`"?

> +"\x0401"	"YO"	# <U0401> CYRILLIC CAPITAL LETTER IO
> +"\x0402"	"DJ"	# <U0402> CYRILLIC CAPITAL LETTER DJE
> +"\x0403"	"G`"	# <U0403> CYRILLIC CAPITAL LETTER GJE
> +"\x0404"	"YE"	# <U0404> CYRILLIC CAPITAL LETTER UKRAINIAN IE
> +"\x0405"	"Z`"	# <U0405> CYRILLIC CAPITAL LETTER DZE
> +"\x0406"	"I"	# <U0406> CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
> +"\x0407"	"YI"	# <U0407> CYRILLIC CAPITAL LETTER YI
> +"\x0408"	"J"	# <U0408> CYRILLIC CAPITAL LETTER JE
> +"\x0409"	"L`"	# <U0409> CYRILLIC CAPITAL LETTER LJE
> +"\x040a"	"N`"	# <U040A> CYRILLIC CAPITAL LETTER NJE

Isn't this ambiguous if we transliterate:

"Љ" -> "L`"
"Њ" -> "N`"

but also:

"Ль" -> "L`"
"Нь" -> "N`"

?
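
To make the question concrete, here is a minimal sketch (not the glibc
implementation) which applies a few table entries character by character.
The entries for Љ, Њ and the soft sign are taken from the patch; the
entries for Л and Н are elided in the quote, so mapping them to plain
"L" and "N" is my assumption.  The `translit` helper is hypothetical.

```python
# Entries for U+0409, U+040A, U+042C and U+044C are quoted from the patch.
# The entries for U+041B and U+041D are ASSUMED (elided in the quote).
TABLE = {
    "\u0409": "L`",  # CYRILLIC CAPITAL LETTER LJE (from the patch)
    "\u040a": "N`",  # CYRILLIC CAPITAL LETTER NJE (from the patch)
    "\u042c": "`",   # CYRILLIC CAPITAL LETTER SOFT SIGN (from the patch)
    "\u044c": "`",   # CYRILLIC SMALL LETTER SOFT SIGN (from the patch)
    "\u041b": "L",   # CYRILLIC CAPITAL LETTER EL (assumed)
    "\u041d": "N",   # CYRILLIC CAPITAL LETTER EN (assumed)
}


def translit(text):
    """Replace each character by its table entry, if it has one."""
    return "".join(TABLE.get(ch, ch) for ch in text)


# Compare a single letter with the two-letter sequence in question.
for single, pair in (("\u0409", "\u041b\u044c"), ("\u040a", "\u041d\u044c")):
    same = translit(single) == translit(pair)
    print(repr(translit(single)), repr(translit(pair)), "ambiguous" if same else "ok")
```

Under these assumptions both inputs of each pair give the same ASCII
string, so the original text cannot be recovered from the result.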

> +"\x040b"	"TSH"	# <U040B> CYRILLIC CAPITAL LETTER TSHE
> +"\x040c"	"K`"	# <U040C> CYRILLIC CAPITAL LETTER KJE
> +"\x040e"	"U`"	# <U040E> CYRILLIC CAPITAL LETTER SHORT U

<U040D> CYRILLIC CAPITAL LETTER I WITH GRAVE (Ѝ)
is missing here.  Shouldn't we add it?  "I`" maybe?
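
Gaps like these two could be listed mechanically.  A minimal sketch,
assuming only the code points visible in the quoted part of the patch
for the range U+0400..U+0411 (the helper name `find_gaps` is
hypothetical, and the rest of the table is not considered):

```python
import unicodedata

# Code points that appear in the quoted part of the patch for the
# range U+0400..U+0411: U+0401..U+040C, then U+040E..U+0411.
COVERED = set(range(0x0401, 0x040D)) | {0x040E, 0x040F, 0x0410, 0x0411}


def find_gaps(covered, first, last):
    """Return named code points in [first, last] which are not covered."""
    gaps = []
    for cp in range(first, last + 1):
        # unicodedata.name() returns the default for unnamed code points.
        if cp not in covered and unicodedata.name(chr(cp), None) is not None:
            gaps.append(cp)
    return gaps


for cp in find_gaps(COVERED, 0x0400, 0x0411):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```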

> +"\x040f"	"DH"	# <U040F> CYRILLIC CAPITAL LETTER DZHE
> +"\x0410"	"A"	# <U0410> CYRILLIC CAPITAL LETTER A
> +"\x0411"	"B"	# <U0411> CYRILLIC CAPITAL LETTER BE
> [...]

> [...]
> +"\x042a"	"A`"	# <U042A> CYRILLIC CAPITAL LETTER HARD SIGN
> [...]
> +"\x044a"	"``"	# <U044A> CYRILLIC SMALL LETTER HARD SIGN
> [...]

This is slightly reordered to illustrate my question.  Isn't it a problem
that the uppercase hard sign is transliterated to "A`" while the lowercase
one is transliterated to "``"?  My doubt is that the transliterated graphemes
are not each other's upper/lower case variants.  If you look at the soft
sign:

> [...]
> +"\x042c"	"`"	# <U042C> CYRILLIC CAPITAL LETTER SOFT SIGN
> [...]
> +"\x044c"	"`"	# <U044C> CYRILLIC SMALL LETTER SOFT SIGN
> [...]

they don't have this problem.
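
A check of this kind can be sketched as follows.  It is only an
illustration on the six entries quoted above, and it assumes that the
expected lowercase output is simply the ASCII lowercase of the uppercase
output; the helper name `case_mismatches` is hypothetical.

```python
# The six entries quoted above from the patch.
TABLE = {
    "\u042a": "A`",  # CYRILLIC CAPITAL LETTER HARD SIGN
    "\u044a": "``",  # CYRILLIC SMALL LETTER HARD SIGN
    "\u042c": "`",   # CYRILLIC CAPITAL LETTER SOFT SIGN
    "\u044c": "`",   # CYRILLIC SMALL LETTER SOFT SIGN
    "\u042d": "E`",  # CYRILLIC CAPITAL LETTER E
    "\u044d": "e`",  # CYRILLIC SMALL LETTER E
}


def case_mismatches(table):
    """List uppercase entries whose lowercase partner maps inconsistently."""
    mismatches = []
    for upper, upper_out in sorted(table.items()):
        lower = upper.lower()
        # Skip entries which are already lowercase or have no partner.
        if lower == upper or lower not in table:
            continue
        if table[lower] != upper_out.lower():
            mismatches.append((upper, upper_out, table[lower]))
    return mismatches


for upper, upper_out, lower_out in case_mismatches(TABLE):
    print(f"U+{ord(upper):04X}: {upper_out!r} vs. lowercase {lower_out!r}")
```

On these entries only the hard sign is reported; the soft sign and
the letter E pass.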

> [...]
> +"\x042d"	"E`"	# <U042D> CYRILLIC CAPITAL LETTER E
> [...]
> +"\x044d"	"e`"	# <U044D> CYRILLIC SMALL LETTER E
> [...]
> +"\x048c"	"E`"	# <U048C> CYRILLIC CAPITAL LETTER SEMISOFT SIGN
> +"\x048d"	"e`"	# <U048D> CYRILLIC SMALL LETTER SEMISOFT SIGN
> [...]

Isn't this again an ambiguity problem?

> +"\x045c"	"k`"	# <U045C> CYRILLIC SMALL LETTER KJE
> +"\x045e"	"u`"	# <U045E> CYRILLIC SMALL LETTER SHORT U
> +"\x045f"	"dh"	# <U045F> CYRILLIC SMALL LETTER DZHE

There is a gap here.  It is not critical because this range holds some
archaic letters which are hardly used, and it is probably difficult to find
the correct transliterations for them.  But somehow you have managed to
find a transliteration for this:

> +"\x046a"	"O`"	# <U046A> CYRILLIC CAPITAL LETTER BIG YUS
> +"\x046b"	"o`"	# <U046B> CYRILLIC SMALL LETTER BIG YUS

Similarly, is it possible to find and provide transliterations for:

- little yus (Ѧ/ѧ)?
- iotified big yus (Ѭ/ѭ) and little yus (Ѩ/ѩ)?

While we are at it, the transliteration of big yus ("O`"/"o`")
is again ambiguous because it is the same as that of Abkhasian Ha (Ҩ),
O with diaeresis (Ӧ), and barred O (Ө).

> [...]
> +"\x049a"	"K`"	# <U049A> CYRILLIC CAPITAL LETTER KA WITH DESCENDER
> +"\x049b"	"k`"	# <U049B> CYRILLIC SMALL LETTER KA WITH DESCENDER
> +"\x049e"	"K`"	# <U049E> CYRILLIC CAPITAL LETTER KA WITH STROKE
> +"\x049f"	"k`"	# <U049F> CYRILLIC SMALL LETTER KA WITH STROKE
> +"\x04a2"	"N`"	# <U04A2> CYRILLIC CAPITAL LETTER EN WITH DESCENDER
> +"\x04a3"	"n`"	# <U04A3> CYRILLIC SMALL LETTER EN WITH DESCENDER
> [...]

As you can see, there are many more ambiguities.  But while we are here,
wouldn't "K," be a better transliteration for Ka with descender (Қ), and
"N," for En with descender (Ң)?
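
Such collisions can be found by grouping the table entries by their
output string.  A minimal sketch, assuming the two-column quoted-string
layout of the entries in the patch; the sample contains only entries
quoted in this message (whitespace normalized), and the helper name
`find_collisions` is hypothetical.

```python
import re
from collections import defaultdict

# Capital-letter entries quoted in this message, not the full table.
SAMPLE = r'''
"\x0409" "L`" # <U0409> CYRILLIC CAPITAL LETTER LJE
"\x040a" "N`" # <U040A> CYRILLIC CAPITAL LETTER NJE
"\x040c" "K`" # <U040C> CYRILLIC CAPITAL LETTER KJE
"\x042d" "E`" # <U042D> CYRILLIC CAPITAL LETTER E
"\x046a" "O`" # <U046A> CYRILLIC CAPITAL LETTER BIG YUS
"\x048c" "E`" # <U048C> CYRILLIC CAPITAL LETTER SEMISOFT SIGN
"\x049a" "K`" # <U049A> CYRILLIC CAPITAL LETTER KA WITH DESCENDER
"\x049e" "K`" # <U049E> CYRILLIC CAPITAL LETTER KA WITH STROKE
"\x04a2" "N`" # <U04A2> CYRILLIC CAPITAL LETTER EN WITH DESCENDER
"\x04a8" "O`" # <U04A8> CYRILLIC CAPITAL LETTER ABKHASIAN HA
'''

# One entry: "\xNNNN", whitespace, then the quoted replacement string.
ENTRY = re.compile(r'^"\\x([0-9a-fA-F]{4})"\s+"([^"]*)"')


def find_collisions(text):
    """Map each output string to the code points sharing it (2 or more)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            groups[m.group(2)].append(int(m.group(1), 16))
    return {out: cps for out, cps in groups.items() if len(cps) > 1}


for out, cps in sorted(find_collisions(SAMPLE).items()):
    print(repr(out), " ".join(f"U+{cp:04X}" for cp in cps))
```

Run over the full table, a check like this would also list the
collisions which I have not mentioned here.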

> [...]
> +"\x04a8"	"O`"	# <U04A8> CYRILLIC CAPITAL LETTER ABKHASIAN HA
> +"\x04a9"	"o`"	# <U04A9> CYRILLIC SMALL LETTER ABKHASIAN HA

Is Abkhasian Ha (Ҩ) pronounced like "H"?  Then why is it transliterated
as "O" (with some additional punctuation character) instead of "H"?

There are more doubts about ambiguous transliterations and gaps which
I don't list here for the sake of brevity.  They can be easily found.

Regards,

Rafal

