This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



charset documentation patches


I just had a quick look at charset.texi and noticed quite a number of
minor mistakes, which I fixed. Mostly UCS4 -> UCS-4 (that is how it is
written in all the standards), second amendment -> Amendment 1 (there
was no second amendment to ISO C90), etc. I also slightly modernized the
description of the relationship between Unicode and ISO 10646.

Patch attached.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

Index: charset.texi
===================================================================
RCS file: /cvs/glibc/libc/manual/charset.texi,v
retrieving revision 1.23
diff -u -r1.23 charset.texi
--- charset.texi	2000/09/27 00:44:57	1.23
+++ charset.texi	2000/09/30 19:05:40
@@ -15,7 +15,7 @@
 grappled with non-Roman character sets, where not all the characters
 that make up a language's character set can be represented by @math{2^8}
 choices.  This chapter shows the functionality which was added to the C
-library to correctly support multiple character sets.
+library to support multiple character sets.
 
 @menu
 * Extended Char Intro::              Introduction to Extended Characters.
@@ -46,13 +46,13 @@
 representations include files lying in a directory that are going to be
 read and parsed.
 
-Traditionally there was no difference between the two representations.
-It was equally comfortable and useful to use the same one-byte
+Traditionally there has been no difference between the two representations.
+It was equally comfortable and useful to use the same single-byte
 representation internally and externally.  This changes with more and
 larger character sets.
 
 One of the problems to overcome with the internal representation is
-handling text which is externally encoded using different character
+handling text that is externally encoded using different character
 sets.  Assume a program which reads two texts and compares them using
 some metric.  The comparison can be usefully done only if the texts are
 internally kept in a common format.
@@ -69,14 +69,28 @@
 As shown in some other part of this manual,
 @c !!! Ahem, wide char string functions are not yet covered -- drepper
 there exists a completely new family of functions which can handle texts
-of this kind in memory.  The most commonly used character set for such
-internal wide character representations are Unicode and @w{ISO 10646}.
-The former is a subset of the latter and used when wide characters are
-chosen to by 2 bytes (@math{= 16} bits) wide.  The standard names of the
-@cindex UCS2
-@cindex UCS4
-encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
-(@math{= 32} bits).
+of this kind in memory.  The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}
+(also known as UCS for Universal Character Set). Unicode was originally
+planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
+be a 31-bit code space. The two standards are practically identical.
+They have the same character repertoire and code table, but Unicode specifies
+added semantics.  At the moment, only characters in the first @code{0x10000}
+code positions (the so-called Basic Multilingual Plane, BMP) have been
+assigned, but the assignment of more specialized characters outside this
+16-bit space is already in progress. A number of encodings have been
+defined for Unicode and @w{ISO 10646} characters:
+@cindex UCS-2
+@cindex UCS-4
+@cindex UTF-8
+@cindex UTF-16
+UCS-2 is a 16-bit word that can only represent characters
+from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
+and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
+ASCII characters are represented by ASCII bytes and non-ASCII characters
+by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
+of UCS-2 in which pairs of certain UCS-2 words can be used to encode
+non-BMP characters up to @code{0x10ffff}.
 
 To represent wide characters the @code{char} type is not suitable.  For
 this reason the @w{ISO C} standard introduces a new type which is
@@ -93,18 +107,18 @@
 
 The @w{ISO C90} standard, where this type was introduced, does not say
 anything specific about the representation.  It only requires that this
-type is capable to store all elements of the basic character set.
+type is capable of storing all elements of the basic character set.
 Therefore it would be legitimate to define @code{wchar_t} as
 @code{char}.  This might make sense for embedded systems.
 
 But for GNU systems this type is always 32 bits wide.  It is therefore
-capable to represent all UCS4 value therefore covering all of @w{ISO
-10646}.  Some Unix systems define @code{wchar_t} as a 16 bit type and
+capable of representing all UCS-4 values and therefore covering all of
+@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16-bit type and
 thereby follow Unicode very strictly.  This is perfectly fine with the
 standard but it also means that to represent all characters from Unicode
-and @w{ISO 10646} one has to use surrogate character which is in fact a
-multi-wide-character encoding.  But this contradicts the purpose of the
-@code{wchar_t} type.
+and @w{ISO 10646} one has to use UTF-16 surrogate characters which is in
+fact a multi-wide-character encoding.  But this contradicts the purpose
+of the @code{wchar_t} type.
 @end deftp
 
 @comment wchar.h
@@ -119,8 +133,8 @@
 @code{int} due to the parameter promotion.
 
 @pindex wchar.h
-This type is defined in @file{wchar.h} and got introduced in the second
-amendment to @w{ISO C90}.
+This type is defined in @file{wchar.h} and got introduced in
+@w{Amendment 1} to @w{ISO C90}.
 @end deftp
 
 As there are for the @code{char} data type there also exist macros
@@ -133,7 +147,7 @@
 The macro @code{WCHAR_MIN} evaluates to the minimum value representable
 by an object of type @code{wint_t}.
 
-This macro got introduced in the second amendment to @w{ISO C90}.
+This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
 @end deftypevr
 
 @comment wchar.h
@@ -142,7 +156,7 @@
 The macro @code{WCHAR_MIN} evaluates to the maximum value representable
 by an object of type @code{wint_t}.
 
-This macro got introduced in the second amendment to @w{ISO C90}.
+This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
 @end deftypevr
 
 Another special wide character value is the equivalent to @code{EOF}.
@@ -180,7 +194,7 @@
 @end smallexample
 
 @pindex wchar.h
-This macro was introduced in the second amendment to @w{ISO C90} and is
+This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
 defined in @file{wchar.h}.
 @end deftypevr
 
@@ -198,7 +212,7 @@
 @cindex multibyte character
 @cindex EBCDIC
    For all the above reasons, an external encoding which is different
-from the internal encoding is often used if the latter is UCS2 or UCS4.
+from the internal encoding is often used if the latter is UCS-2 or UCS-4.
 The external encoding is byte-based and can be chosen appropriately for
 the environment and for the texts to be handled.  There exist a variety
 of different character sets which can be used for this external
@@ -215,7 +229,7 @@
 
 @itemize @bullet
 @item
-The simplest character sets are one-byte character sets.  There can be
+The simplest character sets are single-byte character sets.  There can be
 only up to 256 characters (for @w{8 bit} character sets) which is not
 sufficient to cover all languages but might be sufficient to handle a
 specific text.  Another reason to choose this is because of constraints
@@ -240,7 +254,7 @@
 sequence of a character one can interpret a text correctly.  Examples of
 character sets using this policy are the various EUC character sets
 (used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
-or SJIS (Shift JIS, a Japanese encoding).
+or SJIS (Shift-JIS, a Japanese encoding).
 
 But there are also character sets using a state which is valid for more
 than one character and has to be changed by another byte sequence.
@@ -257,23 +271,23 @@
 acute'' character.  To get the acute accent character on its on one has
 to write @code{0xc2 0x20} (the non-spacing acute followed by a space).
 
-This type of characters sets is quite frequently used in embedded
-systems such as video text.
+This type of character set is used in some embedded systems such as
+teletex.
 
 @item
 @cindex UTF-8
-Instead of converting the Unicode or @w{ISO 10646} text used internally
+Instead of converting the Unicode or @w{ISO 10646} text used internally,
 it is often also sufficient to simply use an encoding different than
-UCS2/UCS4.  The Unicode and @w{ISO 10646} standards even specify such an
+UCS-2/UCS-4.  The Unicode and @w{ISO 10646} standards even specify such an
 encoding: UTF-8.  This encoding is able to represent all of @w{ISO
-10464} 31 bits in a byte string of length one to seven.
+10646} 31 bits in a byte string of length one to six.
 
 @cindex UTF-7
 There were a few other attempts to encode @w{ISO 10646} such as UTF-7
 but UTF-8 is today the only encoding which should be used.  In fact,
-UTF-8 will hopefully soon be the only external which has to be
+UTF-8 will hopefully soon be the only external encoding that has to be
 supported.  It proves to be universally usable and the only disadvantage
-is that it favor Roman languages very much by making the byte string
+is that it favors Roman languages by making the byte string
 representation of other scripts (Cyrillic, Greek, Asian scripts) longer
 than necessary if using a specific character set for these scripts.
 Methods like the Unicode compression scheme can alleviate these
@@ -324,7 +338,7 @@
 The second family of functions got introduced in the early Unix standards
 (XPG2) and is still part of the latest and greatest Unix standard:
 @w{Unix 98}.  It is also the most powerful and useful set of functions.
-But we will start with the functions defined in the second amendment to
+But we will start with the functions defined in @w{Amendment 1} to
 @w{ISO C90}.
 
 @node Restartable multibyte conversion
@@ -377,7 +391,7 @@
 by the functions we are about to describe.  Each locale uses its own
 character set (given as an argument to @code{localedef}) and this is the
 one assumed as the external multibyte encoding.  The wide character
-character set always is UCS4, at least on GNU systems.
+character set always is UCS-4, at least on GNU systems.
 
 A characteristic of each multibyte character set is the maximum number
 of bytes which can be necessary to represent one character.  This
@@ -456,8 +470,8 @@
 function to another.
 
 @pindex wchar.h
-This type is defined in @file{wchar.h}.  It got introduced in the second
-amendment to @w{ISO C90}.
+This type is defined in @file{wchar.h}.  It got introduced in
+@w{Amendment 1} to @w{ISO C90}.
 @end deftp
 
 To use objects of this type the programmer has to define such objects
@@ -495,7 +509,7 @@
 it is zero.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -559,7 +573,7 @@
 any static state.
 
 @pindex wchar.h
-This function was introduced in the second amendment of @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -608,7 +622,7 @@
 @code{EOF}.
 
 @pindex wchar.h
-This function was introduced in the second amendment of @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -655,7 +669,7 @@
 @code{(size_t) -1}.  The conversion state is afterwards undefined.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -733,7 +747,7 @@
 object local to @code{mbrlen} is used.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -839,7 +853,7 @@
 available, otherwise buffer overruns can occur.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
 declared in @file{wchar.h}.
 @end deftypefun
 
@@ -977,7 +991,7 @@
 following the last converted multibyte character.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
 declared in @file{wchar.h}.
 @end deftypefun
 
@@ -1058,7 +1072,7 @@
 converted.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
 declared in @file{wchar.h}.
 @end deftypefun
 
@@ -1231,8 +1245,8 @@
 @node Non-reentrant Conversion
 @section Non-reentrant Conversion Function
 
-The functions described in the last chapter are defined in the second
-amendment to @w{ISO C90}.  But the original @w{ISO C90} standard also
+The functions described in the last chapter are defined in
+@w{Amendment 1} to @w{ISO C90}.  But the original @w{ISO C90} standard also
 contained functions for character set conversion.  The reason that they
 are not described in the first place is that they are almost entirely
 useless.
@@ -1369,8 +1383,8 @@
 
 For convenience reasons the @w{ISO C90} standard defines also functions
 to convert entire strings instead of single characters.  These functions
-suffer from the same problems as their reentrant counterparts from the
-second amendment to @w{ISO C90}; see @ref{Converting Strings}.
+suffer from the same problems as their reentrant counterparts from
+@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
 
 @comment stdlib.h
 @comment ISO
@@ -1513,7 +1527,7 @@
 specified by the functions.  The multibyte encoding used is specified by
 the currently selected locale for the @code{LC_CTYPE} category.  The
 wide character set is fixed by the implementation (in the case of GNU C
-library it always is UCS4 encoded @w{ISO 10646}.
+library it always is UCS-4 encoded @w{ISO 10646}).
 
 This has of course several problems when it comes to general character
 conversion:
@@ -1806,12 +1820,12 @@
   int result = 0;
   iconv_t cd;
 
-  cd = iconv_open ("UCS4", charset);
+  cd = iconv_open ("UCS-4", charset);
   if (cd == (iconv_t) -1)
     @{
       /* @r{Something went wrong.}  */
       if (errno == EINVAL)
-        error (0, 0, "conversion from `%s' to `UCS4' no available",
+        error (0, 0, "conversion from '%s' to 'UCS-4' not available",
                charset);
       else
         perror ("iconv_open");
@@ -2024,7 +2038,7 @@
 
 Unfortunately, the answer is: there is no general solution.  On some
 systems guessing might help.  On those systems most character sets can
-convert to and from UTF8 encoded @w{ISO 10646} or Unicode text.
+convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text.
 Beside this only some very system-specific methods can help.  Since the
 conversion functions come from loadable modules and these modules must
 be stored somewhere in the filesystem, one @emph{could} try to find them
@@ -2082,7 +2096,7 @@
 
 @cindex triangulation
 This is achieved by providing for each character set a conversion from
-and to UCS4 encoded @w{ISO 10646}.  Using @w{ISO 10646} as an
+and to UCS-4 encoded @w{ISO 10646}.  Using @w{ISO 10646} as an
 intermediate representation it is possible to @dfn{triangulate}, i.e.,
 converting with an intermediate representation.
 
@@ -2210,15 +2224,15 @@
 @code{INTERNAL} mentioned.  From the discussion above and the chosen
 name it should have become clear that this is the name for the
 representation used in the intermediate step of the triangulation.  We
-have said that this is UCS4 but actually it is not quite right.  The
-UCS4 specification also includes the specification of the byte ordering
-used.  Since a UCS4 value consists of four bytes a stored value is
+have said that this is UCS-4 but actually it is not quite right.  The
+UCS-4 specification also includes the specification of the byte ordering
+used.  Since a UCS-4 value consists of four bytes a stored value is
 effected by byte ordering.  The internal representation is @emph{not}
-the same as UCS4 in case the byte ordering of the processor (or at least
-the running process) is not the same as the one required for UCS4.  This
+the same as UCS-4 in case the byte ordering of the processor (or at least
+the running process) is not the same as the one required for UCS-4.  This
 is done for performance reasons as one does not want to perform
 unnecessary byte-swapping operations if one is not interested in actually
-seeing the result in UCS4.  To avoid trouble with endianess the internal
+seeing the result in UCS-4.  To avoid trouble with endianness the internal
 representation consistently is named @code{INTERNAL} even on big-endian
 systems where the representations are identical.
 
@@ -2570,7 +2584,7 @@
 character can consist of one to four bytes.  Therefore the
 @code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
 this way.  The output is always the @code{INTERNAL} character set (aka
-UCS4) and therefore each character consists of exactly four bytes.  For
+UCS-4) and therefore each character consists of exactly four bytes.  For
 the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
 account that escape sequences might be necessary to switch the character
 sets.  Therefore the @code{__max_needed_to} element for this direction

Index: ctype.texi
===================================================================
RCS file: /cvs/glibc/libc/manual/ctype.texi,v
retrieving revision 1.23
diff -u -r1.23 ctype.texi
--- ctype.texi  2000/05/21 21:21:56     1.23
+++ ctype.texi  2000/09/30 19:12:41
@@ -265,8 +265,8 @@
 @node Classification of Wide Characters, Using Wide Char Classes, Case Conversion, Character Handling
 @section Character class determination for wide characters
 
-The second amendment to @w{ISO C89} defines functions to classify wide
-characters.  Although the original @w{ISO C89} standard already defined
+@w{Amendment 1} to @w{ISO C90} defines functions to classify wide
+characters.  Although the original @w{ISO C90} standard already defined
 the type @code{wchar_t}, no functions operating on them were defined.
 
 The general design of the classification functions for wide characters
