This is the mail archive of the
mailing list for the glibc project.
[Bug libc/20538] New: Update EUC-KR?
- From: "jehan.marmottard at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Tue, 30 Aug 2016 16:16:26 +0000
- Subject: [Bug libc/20538] New: Update EUC-KR?
- Auto-submitted: auto-generated
Bug ID: 20538
Summary: Update EUC-KR?
Assignee: unassigned at sourceware dot org
Reporter: jehan.marmottard at gmail dot com
CC: drepper.fsp at gmail dot com
Target Milestone: ---
I have a bunch of files in EUC-KR which break on iconv with "illegal input
sequence" (tested with glibc master as well).
They convert OK with CP949, so my first guess is that the files are really
Microsoft Code Page 949 and not EUC-KR (Unified Hangul Code/Code Page 949
is said to be a superset of EUC-KR, which would explain why the conversion
otherwise mostly works).
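For reference, the comparison can be tried without any of my files by feeding
iconv the byte pair 0x89 0xC2 directly. This is only a minimal sketch: the
helper name try_convert is made up for illustration, and the outcome depends
on which converters the local iconv installation provides.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `len` bytes of `in` from charset `from`
   to UTF-8.  Returns the number of output bytes written, -1 if the
   converter cannot be opened, or -2 if iconv() rejects the input
   (for example with EILSEQ, "illegal input sequence").  */
static int
try_convert (const char *from, const char *in, size_t len,
             char *out, size_t outsize)
{
  iconv_t cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return -1;

  char *inptr = (char *) in;   /* iconv() takes a non-const pointer.  */
  char *outptr = out;
  size_t inleft = len;
  size_t outleft = outsize;
  size_t rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return -2;
  return (int) (outsize - outleft);
}
```

If the behaviour matches what I describe here, try_convert ("EUC-KR",
"\x89\xc2", 2, buf, sizeof buf) should return -2, while the same call with
"CP949" should return a positive length.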
But I can also find various literature suggesting that glibc's iconv
implementation may not be up to date.
As an example, iconv stopped on the character '됀' (Unicode U+B400), encoded as
0x89c2. I can see that euckr_from_ucs4() would just let the first byte pass
through, so obviously it breaks just after:
> if (ch <= 0x9f)
And clearly the rest of the code does not work either for these 2 bytes. But
according to some references, EUC-KR should actually be able to encode this
character:
* The WhatWG describes a EUC-KR decoding algorithm quite different from glibc's
iconv implementation: https://encoding.spec.whatwg.org/#euc-kr
And this character is in the list:
* I also found this Unicode mapping, apparently a 1992 revision of KSC5601:
It also lists this character, with the same coding as WhatWG.
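
To make the difference concrete, here is a small sketch comparing the two byte
ranges. The strict check assumes that EUC-KR only covers the 94x94 KS X 1001
grid (both bytes in 0xA1..0xFE), which is my reading of the current behaviour
rather than something taken from a spec. The second check uses the lead/trail
ranges and pointer formula from the WhatWG algorithm linked above. Both
function names are made up for illustration.

```c
#include <stdbool.h>

/* Assumption: strict EUC-KR accepts only the 94x94 KS X 1001 grid,
   i.e. both bytes must lie in 0xA1..0xFE.  */
static bool
in_ksx1001_grid (unsigned char lead, unsigned char trail)
{
  return lead >= 0xa1 && lead <= 0xfe && trail >= 0xa1 && trail <= 0xfe;
}

/* WhatWG EUC-KR decoder ranges: lead in 0x81..0xFE, trail in 0x41..0xFE.
   Returns the pointer into the WhatWG index, or -1 if the pair is out
   of range.  The pointer still has to be looked up in that index.  */
static int
whatwg_pointer (unsigned char lead, unsigned char trail)
{
  if (lead < 0x81 || lead > 0xfe || trail < 0x41 || trail > 0xfe)
    return -1;
  return (lead - 0x81) * 190 + (trail - 0x41);
}
```

For 0x89 0xC2, in_ksx1001_grid() returns false, while whatwg_pointer() returns
1649, so the pair is only rejected under the strict reading.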
Now I am a little lost since I cannot find a single official reference
spec for EUC-KR. All official listings cite RFC 1557
(https://tools.ietf.org/html/rfc1557), which gives no real details
about the EUC-KR encoding. So I can't know for sure whether EUC-KR (de)coding
in glibc is right or not, and all the texts I can find about this encoding are
extremely messy and incomplete.
Could you shed some light on this issue please?
If it turns out that the EUC-KR algorithm in glibc should be updated, I would
be happy to write the patch. I'd appreciate a pointer to the right
specification to follow though. :-)