Bug 23048 - iconv: add more macintosh tables and aliases
Summary: iconv: add more macintosh tables and aliases
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.28
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-11 19:33 UTC by Jan Engelhardt
Modified: 2023-05-23 11:48 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description Jan Engelhardt 2018-04-11 19:33:38 UTC
The following names are not recognized by iconv as of glibc 2.27, but I think they should be implemented (some of them are already implemented with other names):

* cp10000 (mac-roman)
* cp10006 (mac-greek)
* cp10029 (mac-latin2/mac-centraleurope)
* cp10079 (mac-iceland)
* cp10081 (mac-turkish)
Comment 1 Andreas Schwab 2018-04-16 10:05:56 UTC
Do you have a reference document?
Comment 2 Jan Engelhardt 2018-04-16 10:32:29 UTC
Something like this?
https://en.wikipedia.org/wiki/Category:Mac_OS_character_encodings
Comment 3 Florian Weimer 2018-04-18 12:29:34 UTC
(In reply to Jan Engelhardt from comment #2)
> Something like this?
> https://en.wikipedia.org/wiki/Category:Mac_OS_character_encodings

Wikipedia is only a tertiary source (derived from secondary sources).  We need the original source.  Unicode has tables for some Mac codepages, but I don't know if they reflect reality.  Apple doesn't seem to publish anything.
Comment 4 Jan Engelhardt 2018-04-18 12:41:45 UTC
>original source

Like, http://downloads.sf.net/cdrtools/cdrtools-3.02a09.tar.bz2 contains cdrtools-3.02a09/libsiconv/tables/cp* ?

If there's no primary source but all secondary and tertiary sources agree with one another, isn't that reason enough to pick the secondary source? Essentially all ISOs were generated that way.
Comment 5 Jascha Eliano Paetzold 2023-03-21 21:04:07 UTC
https://www.gnu.org/software/libiconv supports all these encodings according to its website. Maybe the implementation over there could be used as a reference?

Apart from that, Wikipedia lists the following secondary/primary sources:

mac-roman: https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf#page=89
mac-greek: https://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/GREEK.TXT
mac-centraleurope: https://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CENTEURO.TXT
mac-iceland: https://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ICELAND.TXT
mac-turkish: https://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/TURKISH.TXT
Comment 6 Jascha Eliano Paetzold 2023-03-22 09:13:14 UTC
(In reply to Florian Weimer from comment #3)

Would you trust me if I verified these tables work on an Apple machine?
Comment 7 Florian Weimer 2023-04-03 14:18:51 UTC
(In reply to Jascha Eliano Paetzold from comment #6)
> (In reply to Florian Weimer from comment #3)
> 
> Would you trust me if I verified these tables work on an Apple machine?

I think we could use iconv (the function) on macOS to enumerate the 8-bit range, and then double-check the 20.1-bit Unicode space for any unexpected mappings in the other direction. That would be enough verification for me.
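
A minimal sketch of the first half of this check (enumerating the 8-bit range), written in Python. It assumes Python's bundled `mac_greek` codec as a stand-in for a real iconv call on macOS, so it only illustrates the procedure and says nothing about Apple's actual tables. The helper names `decode_byte`, `forward_table` and `format_table` are hypothetical.

```python
"""Enumerate every byte of an 8-bit charset and record its Unicode mapping."""


def decode_byte(byte_value, charset):
    """Return the decoded string for one byte, or None if it is unmapped.

    Assumption: Python's codec stands in for a single iconv() call
    converting from `charset` to Unicode.
    """
    try:
        return bytes([byte_value]).decode(charset)
    except UnicodeDecodeError:
        return None


def forward_table(charset):
    """Map each byte 0x00..0xFF to its decoded string (None if unmapped)."""
    return {b: decode_byte(b, charset) for b in range(0x100)}


def format_table(table):
    """Render the table as 'byte<TAB>code points' lines for posting or diffing."""
    lines = []
    for byte_value, text in sorted(table.items()):
        if text is None:
            target = "<unmapped>"
        else:
            # One byte may decode to several code points; join them with '+'.
            target = "+".join(f"U+{ord(ch):04X}" for ch in text)
        lines.append(f"0x{byte_value:02X}\t{target}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(forward_table("mac_greek")))
```

On an Apple machine, `decode_byte` would have to call the system iconv instead, so that the resulting table reflects Apple's converter rather than Python's.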
Comment 8 Jascha Eliano Paetzold 2023-05-23 11:07:47 UTC
(In reply to Florian Weimer from comment #7)
> (In reply to Jascha Eliano Paetzold from comment #6)
> > (In reply to Florian Weimer from comment #3)
> > 
> > Would you trust me if I verified these tables work on an Apple machine?
> 
> I think we could use iconv (the function) on macOS to enumerate the 8-bit
> range, and then double-check the 20.1-bit Unicode space for any
> unexpected mappings in the other direction. That would be enough
> verification for me.

Thanks for your offer to collaborate on that topic!

I have little to no experience with encodings, but I assume that it would be sufficient for me to generate a text file spanning all possible characters in the 8-bit range and pass it to the iconv command in the macOS shell (setting it to convert from Unicode to mac-greek etc.) and then post the output here?
Comment 9 Florian Weimer 2023-05-23 11:48:39 UTC
(In reply to Jascha Eliano Paetzold from comment #8)
> I have little to no experience with encodings, but I assume that it would be
> sufficient for me to generate a text file spanning all possible characters
> in the 8-bit range and pass it to the iconv command in the macOS shell
> (setting it to convert from Unicode to mac-greek etc.) and then post the
> output here?

For 8-bit input like that, you'd have to convert *from* mac-greek.

For conversion to mac-greek, you'd have to enumerate all Unicode codepoints (or maybe just the BMP), and somehow skip over unconvertible characters. That is probably best done programmatically because iconv (the shell command) may not provide enough information for skipped/non-convertible Unicode codepoints.
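
A rough sketch of such a programmatic scan, in Python. It converts each code point on its own, so a failed conversion can be skipped without losing track of which code point failed. As an assumption, Python's bundled codecs stand in for the macOS iconv function here, and the names `encode_code_point`, `reverse_table` and `unexpected_mappings` are hypothetical.

```python
"""Scan Unicode for code points that convert to a given 8-bit charset."""


def encode_code_point(code_point, charset):
    """Return the encoded bytes for one code point, or None if unconvertible.

    Assumption: Python's codec stands in for one iconv() call from Unicode
    to `charset`. A failed conversion yields None instead of aborting.
    """
    try:
        return chr(code_point).encode(charset)
    except UnicodeEncodeError:
        return None


def reverse_table(charset, limit=0x110000):
    """Map every convertible code point below `limit` to its encoded bytes."""
    table = {}
    for code_point in range(limit):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogates are not characters; skip them
        encoded = encode_code_point(code_point, charset)
        if encoded is not None:
            table[code_point] = encoded
    return table


def unexpected_mappings(charset, limit=0x110000):
    """List reverse mappings that the forward (byte -> Unicode) table lacks."""
    forward = set()
    for byte_value in range(0x100):
        raw = bytes([byte_value])
        try:
            forward.add((raw.decode(charset), raw))
        except UnicodeDecodeError:
            pass  # byte is unmapped in the forward direction
    return [
        (code_point, encoded)
        for code_point, encoded in sorted(reverse_table(charset, limit).items())
        if (chr(code_point), encoded) not in forward
    ]


if __name__ == "__main__":
    # limit=0x10000 restricts the scan to the BMP; omit it for all of Unicode.
    for code_point, encoded in unexpected_mappings("mac_greek", limit=0x10000):
        print(f"U+{code_point:04X}\t{encoded.hex()}")
```

Passing `limit=0x10000` covers only the BMP, while the default walks the full code space. An empty result means every reverse mapping is already accounted for by the forward table.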