This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29


Hi,

I have now implemented all the changes requested for the translit_cyrillic
file, but I have started hitting what looks like a bug:

- If the line <U0425> <U0048>;<U0058> is present in translit_cyrillic, the
locale compilation fails, i.e. the following command hangs:
    grep CYRILLIC < $testfile | \
    LOCPATH=$workdir/compiled_locales/"$locale"/ LC_ALL="$locale".UTF-8 \
    iconv -f UTF-8 -t ASCII//TRANSLIT

- If the line <U0425> <U0048>;<U0058> is absent from translit_cyrillic,
everything works, except that the transliteration of <U0425> fails as
expected (? is displayed).

- If translit_cyrillic contains <U0425> <U0048>;<U0058> as its _only_
line, the transliteration of <U0425> works again (all other letters are
displayed as ?).

Would you have any idea in which direction I should look? The new
translit_cyrillic is attached.

(<U0425> is % CYRILLIC CAPITAL LETTER HA)
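
As a side note on the cosmetic requests quoted below (no duplicate
alternatives, no spaces after ";", no trailing whitespace), here is a
minimal sketch of a checker for them. It is only an illustration: it
assumes one rule per line in the form shown in this thread, and the
function names lint_translit_line and lint_text are hypothetical, not
part of glibc. It does not parse the real locale file grammar and it
does not diagnose the iconv hang described above.

```python
import re


def lint_translit_line(line):
    """Return a list of problems found in one rule line.

    Hypothetical helper: assumes rules look like
    '<U0435> <U0065>;<U0066>' and that '%' starts a comment.
    """
    problems = []
    if line != line.rstrip():
        problems.append("trailing whitespace")
    stripped = line.strip()
    # Skip blank lines, comments, and anything that is not a rule line
    # (headers, keywords such as translit_start, and so on).
    if not stripped or stripped.startswith("%") or not stripped.startswith("<U"):
        return problems
    parts = stripped.split(None, 1)
    if len(parts) < 2:
        problems.append("no replacement given")
        return problems
    _source, targets = parts
    if re.search(r";\s", targets):
        problems.append("space after ';'")
    # Compare alternatives after trimming, so '<U0065>; <U0065>' counts
    # as a duplicate.
    alternatives = [alt.strip() for alt in targets.split(";")]
    if len(set(alternatives)) != len(alternatives):
        problems.append("duplicate alternative")
    return problems


def lint_text(text):
    """Return (line_number, problem) pairs for every problem in text."""
    results = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for problem in lint_translit_line(line):
            results.append((lineno, problem))
    return results


if __name__ == "__main__":
    sample = "% CYRILLIC SMALL LETTER IE\n<U0435> <U0065>; <U0065>\n"
    for lineno, problem in lint_text(sample):
        print(f"{lineno}: {problem}")
```

Something like this could be run over the generated file before each
upload, so that the issues found in review do not come back.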

Best regards,
Egor

On 09.10.2018 01:35, Egor Kobylkin wrote:
> On 09.10.2018 00:23, Rafal Luzynski wrote:
>> 8.10.2018 14:40 Marko Myllynen <myllynen@redhat.com> wrote:
>>> Hi,
>>>
>>> Thanks for the update. I have a few mostly cosmetic comments below,
>>> hopefully we'll hear from others whether they agree with this direction.
>>>
> 
> Yeah, the earlier we have feedback the more productive we are. I'd be
> happy to get as much feedback on this as early as possible, so would
> everybody concerned please chime in.
> 
>>
>>> - No duplicates:
>>>
>>> % CYRILLIC SMALL LETTER IE
>>> <U0435> <U0065>; <U0065>
>>>
>>> should become:
>>>
>>> % CYRILLIC SMALL LETTER IE
>>> <U0435> <U0065>
>>>
>>> - There are a few issues with the definitions:
>>>
>>> % CYRILLIC CAPITAL LETTER U
>>> <U0423> <U0055>; <U0055>
>>> % CYRILLIC UNDEFINED
>>> <U0423><U0423> <U00DA>; "<U0055><U0060>"
>>>
>>> % CYRILLIC SMALL LETTER U
>>> <U0443> <U0075>; <U0075>
>>> % CYRILLIC UNDEFINED
>>> <U0443><U0443> <U00FA>; "<U0075><U0060>"
>>
>> Are the duplicates here because some Cyrillic letters may have multiple
>> Latin transliterations depending on the context, for example Cyrillic IE
>> must be transliterated sometimes as "e", sometimes as "ie", sometimes
>> as "ye" or "je"?  Can we provide rules for groups of characters instead?
> No, the duplicates are just an artifact of my line-generating logic. I
> have removed them. As far as I understand, the varying transcription
> between languages/locales cannot be handled in one file at all.
> 
>>
>>> I wonder whether it would be possible to automate generation of this
>>> file so that issues like the above could be avoided? But perhaps that
>>> could be the next step once this initial patch lands.
> 
> I am generating the content part of translit_cyrillic from a
> LibreOffice spreadsheet. Not sure if you have had time to view it yet?
> https://sourceware.org/bugzilla/attachment.cgi?id=11299
> 
> Anyway, I have just fixed the issues identified by Marko above in that
> spreadsheet. I will make the changes requested below and then upload
> the new translit_cyrillic file to Bugzilla.
> 
> 
>>> - Please add the standard glibc locale header (see the existing
>>> translit_* files for reference)
>>> - Consider wrapping the header lines at or around column 70-72
>>> - Consider describing which characters, character ranges, or blocks are
>>> supported (perhaps also describe why some of those are not included, see
>>> e.g. https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode)
>>> - Please remove trailing whitespace and spaces after ;
>>
>> Thanks for this, Marko.  While at this, in the ChangeLog and in the commit
>> message these paths:
>>
>> 	* locales/aa_DJ: likewise
>>
>> 1. Should be a relative path starting in the root directory of glibc
>>    source, that is: "* localedata/locales/aa_DJ".
>> 2. Should be "Likewise." (starting with an uppercase and ending with a
>>    dot).
> 
> Will do.
> 
> Bests,
> Egor
> 

Attachment: translit_cyrillic
Description: Text document

