Bug 11532

Summary: Support old DOS Lithuanian character sets in iconv
Product: glibc
Reporter: Rimas Kudelis <rimas>
Component: localedata
Assignee: GNU C Library Locale Maintainers <libc-locales>
Status: RESOLVED FIXED
Severity: normal
CC: drepper.fsp, glibc-bugs
Priority: P2
Flags: fweimer: security-
Version: unspecified
Target Milestone: ---
Host:
Target:
Build:
Last reconfirmed:
Attachments:
Mapping of CP770
Mapping of CP771
Mapping of CP772
Mapping of CP773
Mapping of CP774
Mapping of CP773 (corrected)

Description Rimas Kudelis 2010-04-23 14:24:55 UTC
Back in the DOS days, several different character sets were in more or less
common use in Lithuania: 770, 771, 772, 773, 774, and 775. Iconv currently
supports only the last of these. It would be nice to get support for the others
on the list too.

Links:
http://www.likit.lt/nostyle/770.htm
http://www.likit.lt/nostyle/771.htm
http://www.likit.lt/nostyle/772.htm
<no illustration for cp773>
http://www.likit.lt/nostyle/774.htm

If adding these character sets to iconv is generally acceptable, I think I could
try to generate mappings from all these charsets to UTF-8.
Comment 1 Ulrich Drepper 2010-05-03 17:01:50 UTC
Then provide mapping tables.
Comment 2 Rimas Kudelis 2010-05-04 09:12:57 UTC
Created attachment 4762
Mapping of CP770
Comment 3 Rimas Kudelis 2010-05-04 09:13:16 UTC
Created attachment 4763
Mapping of CP771
Comment 4 Rimas Kudelis 2010-05-04 09:13:33 UTC
Created attachment 4764
Mapping of CP772
Comment 5 Rimas Kudelis 2010-05-04 09:13:49 UTC
Created attachment 4765
Mapping of CP773
Comment 6 Rimas Kudelis 2010-05-04 09:14:08 UTC
Created attachment 4766
Mapping of CP774
Comment 7 Rimas Kudelis 2010-05-04 09:28:45 UTC
I've attached five files with mapping tables for each codepage.
Their format is: 
[octal code]: [UTF-8 character]

The lower 128 positions (0000-0177) match ASCII in all cases, so only the
positions starting at 0200 matter.
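
As a rough illustration of the format described above, the following Python
sketch reads lines of the form "[octal code]: [UTF-8 character]" into a
byte-to-character table and uses it to decode a byte string. The exact layout
of the attached files is an assumption (one entry per line, a single space
after the colon), the function names are hypothetical, and the sample entries
are placeholders rather than actual CP77x assignments.

```python
def parse_mapping(text):
    """Parse '<octal code>: <character>' lines into a {byte value: character} dict."""
    table = {}
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line.strip():
            continue  # skip blank lines
        code, sep, rest = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in line {line!r}")
        byte = int(code, 8)  # codes are written in octal, e.g. 0200 == 128
        # Drop only the single separator space: the mapped character itself
        # may be whitespace (for example a no-break space).
        char = rest[1:] if rest.startswith(" ") else rest
        if len(char) != 1:
            raise ValueError(f"expected exactly one character in line {line!r}")
        table[byte] = char
    return table


def decode(data, table):
    """Decode bytes: positions below 0200 are ASCII, the rest come from the table."""
    return "".join(chr(b) if b < 0x80 else table[b] for b in data)


# Placeholder entries for illustration only, not a real code page.
SAMPLE = "0200: Ą\n0201: ą\n0377: \u00a0\n"

table = parse_mapping(SAMPLE)
print(decode(b"A\x80\x81", table))
```

A byte that has no entry in the table raises KeyError in this sketch; a real
converter would have to decide how to report unmapped positions.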

It seems like these charsets are (or maybe were) supported by ICU (see [1]). The
page also has some further descriptions that could be used when forming alias
names for cp77x charsets:

CP770 	Lithuanian Standard RST 1095-89
CP771 	KBL (Lithuanian and Russian characters)
CP772 	Lithuanian Standard LST 1284:1993
CP773 	Lithuanian (Mix of 771 and 775)
CP774 	Lithuanian Standard 1283:1993

Unfortunately, I couldn't find the source files of the ICU mappings for these
character sets at [2], so I can't attach them. Instead, I used a small program
found at [3], developed a few years ago specifically to convert among the
different character sets used in Lithuania (note: I changed one symbol in
CP770.txt to match the actual standard).

If the ICU mappings can be found, I think they should most likely be used as
the basis for conversion. Otherwise, the attached files should be fine.

[1]
http://publib.boulder.ibm.com/infocenter/tivihelp/v24r1/index.jsp?topic=/com.ibm.itcama.doc_6.2.3/itcam_oraclerac63200.htm
[2] http://source.icu-project.org/repos/icu/data/trunk/charset/data/
[3] https://www3.mruni.lt/~rims/kodav/#Diegimas
Comment 8 Rimas Kudelis 2010-05-04 11:08:37 UTC
Created attachment 4767
Mapping of CP773 (corrected)

According to the name mentioned on the ICU page and to cp773.acm (a CP773
mapping for the Linux console) found at [1], this is the more correct mapping of
codepage 773.

[1] http://gedmin.as/lit-con/
Comment 9 Ulrich Drepper 2011-05-10 03:16:46 UTC
The files aren't usable in that form.  It was quite a lot of work to make all the transformations.  Support is in git now.
Comment 10 Rimas Kudelis 2011-05-10 05:05:18 UTC
Wow, thanks!

I've actually got hold of the paper standards, at least some of them, so I should be able to check the validity of our mappings when time permits.