This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?


https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #5 from Mike Frysinger <vapier at gentoo dot org> ---
i don't think we have a choice here.  if the rest of the world is converging on
the unicode standard view of the world, and it says 0, then we should do that
as well.  trying to "take a stand" here won't help as long as the unicode
consortium doesn't change, and i think they've settled the matter in their
eyes.  if you want to debate the topic further, that time would probably be
better spent on their lists.

the unicode FAQ includes this entry [1] (which the korpela page called out):
Q: Unicode now treats the SOFT HYPHEN as format control (Cf) character when
formerly it was a punctuation character (Pd). Doesn't this break ISO 8859-1
compatibility?
A: No. The ISO 8859-1 standard defines the SOFT HYPHEN as "[a] graphic
character that is imaged by a graphic symbol identical with, or similar to,
that representing hyphen" (section 6.3.3), but does not specify details of how
or when it is to be displayed, nor other details of its semantics. The soft
hyphen has had a long history of legacy implementation in two or more
incompatible ways.
Unicode clarifies the semantics of this character for Unicode implementations,
but this does not affect its usage in ISO 8859-1 implementations. Processes
that convert back and forth may need to pay attention to semantic differences
between the standards, just as for any other character.
In a terminal emulation environment, particularly in ISO-8859-1 contexts, one
could display the soft hyphen as a hyphen in all circumstances. The change in
semantics of the Unicode character does not require that implementations of
terminal emulators in other environments, such as ISO 8859-1, make any change
in their current behavior.

[1] http://www.unicode.org/faq/casemap_charprop.html#18

i think that answers the question here: in our UTF-8 charmaps, we should give
U+00AD a width of 0, but in our ISO 8859-1 (and other applicable legacy)
charmaps, we should give it a width of 1.
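
a minimal sketch of that rule, in Python rather than in glibc's charmap syntax.
the names soft_hyphen_width and LEGACY_GRAPHIC_SHY_CHARMAPS are hypothetical and
exist only for illustration; the set lists only ISO-8859-1 because which other
legacy charmaps qualify is not settled here.  the only facts it relies on are
that Unicode classifies U+00AD as Cf and that byte 0xAD in ISO 8859-1 maps to
U+00AD.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

# Hypothetical: legacy charmaps where the soft hyphen is treated as a
# graphic character. Only ISO-8859-1 is named explicitly in this comment.
LEGACY_GRAPHIC_SHY_CHARMAPS = {"ISO-8859-1"}


def soft_hyphen_width(charmap: str) -> int:
    """Return the column width proposed above for U+00AD in a charmap."""
    name = charmap.upper()
    if name == "UTF-8":
        # Unicode view: format control character (Cf), so zero columns.
        return 0
    if name in LEGACY_GRAPHIC_SHY_CHARMAPS:
        # ISO 8859-1 view: imaged like a hyphen, so one column.
        return 1
    raise ValueError(f"no rule sketched for charmap {charmap!r}")


if __name__ == "__main__":
    # Byte 0xAD in ISO 8859-1 converts to U+00AD, whose category is Cf.
    print(unicodedata.category(b"\xad".decode("latin-1")))
    for charmap in ("UTF-8", "ISO-8859-1"):
        print(charmap, soft_hyphen_width(charmap))
```

this only models the policy; what wcwidth() actually returns on a given system
depends on the installed glibc version and locale data.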

