This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

regression caused by fix of bug #13691

From: Bruno Haible <bruno at clisp dot org>
To: Tulio Magno Quites Machado Filho <tuliom at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
Date: Mon, 14 May 2012 01:41:58 +0200
Subject: regression caused by fix of bug #13691
Bcc: bruno at haible dot de

Hi,

The "fix" of BZ #13691 committed on 2012-04-02
<http://sourceware.org/bugzilla/show_bug.cgi?id=13691>
<http://sourceware.org/ml/libc-alpha/2012-03/msg01110.html>
<http://sourceware.org/ml/libc-alpha/2012-04/msg00039.html>
<http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=39c59c35723120c32dc42dde4115bba92305179f>

is broken: it introduces a regression in the iconv() function for
this encoding.

Facts to remember when pushing changes into iconvdata/*:

  * The iconvdata/* converters are used by the gconv_* functions, which
    are internal to glibc and used by
      1) iconv(),
      2) the wcsmbs/* directory: wide string conversion functins.

  * The converters support feeding the input and pulling out the output
    in chunks, as this is needed for iconv(). The converters are optimized
    for large chunks, and for getting all input in a single chunk. But
    in the worst case they can get their input 1 byte at a time, and
    the output must be the same as when all the input it fed at once.

========================== regression.c ==========================
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>

#define ASSERT(expr) \
  do                                                                         \
    {                                                                        \
      if (!(expr))                                                           \
        {                                                                    \
          fprintf (stderr, "%s:%d: assertion failed\n",                      \
                   __FILE__, __LINE__);                                      \
          fflush (stderr);                                                   \
          abort ();                                                          \
        }                                                                    \
    }                                                                        \
  while (0)

int
main ()
{
  /* Test that 0x41 0xB2 = U+0041 U+0303 combines to U+00C3.  */
  {
    iconv_t cd = iconv_open ("UTF-8", "TCVN-5712");
    ASSERT (cd != (iconv_t) -1);
    {
      char in[2] = { (char) 0x41, (char) 0xB2 };
      char out[10];
      char *inptr = in;
      size_t inbufsize = sizeof (in);
      char *outptr = out;
      size_t outbufsize = sizeof (out);
      size_t ret = iconv (cd, &inptr, &inbufsize, &outptr, &outbufsize);
      ASSERT (ret == 0);
      ASSERT (inptr == in + 2);
      ASSERT (inbufsize == sizeof (in) - 2);
      ASSERT (outptr == out + 2);
      ASSERT (outbufsize == sizeof (out) - 2);
      ASSERT (out[0] == (char) 0xC3 && out[1] == (char) 0x83);
    }
    iconv_close (cd);
  }
  /* Likewise, in two steps.  */
  {
    iconv_t cd = iconv_open ("UTF-8", "TCVN-5712");
    ASSERT (cd != (iconv_t) -1);
    {
      char in1[1] = { (char) 0x41 };
      char in2[1] = { (char) 0xB2 };
      char out[10];
      char *inptr = in1;
      size_t inbufsize = sizeof (in1);
      char *outptr = out;
      size_t outbufsize = sizeof (out);
      size_t ret = iconv (cd, &inptr, &inbufsize, &outptr, &outbufsize);
      ASSERT (ret == 0);
      ASSERT (inptr == in1 + 1);
      ASSERT (inbufsize == sizeof (in1) - 1);
      ASSERT (outptr == out);
      ASSERT (outbufsize == sizeof (out));
      inptr = in2;
      inbufsize = sizeof (in2);
      ret = iconv (cd, &inptr, &inbufsize, &outptr, &outbufsize);
      ASSERT (ret == 0);
      ASSERT (inptr == in2 + 1);
      ASSERT (inbufsize == sizeof (in2) - 1);
      ASSERT (outptr == out + 2);
      ASSERT (outbufsize == sizeof (out) - 2);
      ASSERT (out[0] == (char) 0xC3 && out[1] == (char) 0x83);
    }
    iconv_close (cd);
  }
  return 0;
}
========================================================================
Before Tulio's "fix":

$ ./regression ; echo $?
0

After Tulio's "fix":

$ ./regression ; echo $?
regression.c:58: assertion failed
Aborted
134

Tulio's "fix" is also totally besides the point because he has not
understood the issue.

The issue is that when the gconv, iconv and wcsmbs functionality was
added to glibc, the assumption was that wcsmbs needs to support only
stateless encodings.

Whereas TCVN-5712 and CP1258 are stateful encodings. You can recognize
stateful encodings in glibc by the fact that they define a nontrivial
EMIT_SHIFT_TO_INIT macro.

But localedata/SUPPORTED lists vi_VN.TCVN/TCVN5712-1 as a supported
locale encoding. Although this is rather academic. The number of users
that use this locale nowadays is likely smaller than 1.

There are two possible fixes:

1) Back out Tulio's changes to iconvdata/ and remove vi_VN.TCVN/TCVN5712-1
   from localedata/SUPPORTED.

2) Back out Tulio's changes to iconvdata/ and add support in wcsmbs/* for
   stateful encodings.

What would it take to go for option 2? Here are test programs for
mbrtowc(), mbsrtowcs(), mbsnrtowcs().

=======================================================================
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main ()
{
  mbstate_t ps;
  wchar_t wc;
  size_t ret;

  setlocale (LC_ALL, "vi_VN.TCVN");
  memset (&ps, '\0', sizeof (ps));

  wc = 0xdddd;
  ret = mbrtowc (&wc, "A", 1, &ps);
  printf ("ret = %zu, wc = %x\n", ret, wc);
  return 0;
}
=======================================================================
Result:
ret = 1, wc = dddd
This is wrong: mbrtowc cannot return 1 but not set the output wide
character.

=======================================================================
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main ()
{
  mbstate_t ps;
  const char *src = "A";
  wchar_t wc[5];
  size_t ret;

  setlocale (LC_ALL, "vi_VN.TCVN");
  memset (&ps, '\0', sizeof (ps));

  wc[0] = 0xdddd;
  wc[1] = 0xeeee;
  ret = mbsrtowcs (wc, &src, 1, &ps);
  printf ("ret = %zu, wc = %x %x\n", ret, wc[0], wc[1]);
  return 0;
}
=======================================================================
Result:
ret = 1, wc = 41 eeee
This is OK.

=======================================================================
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main ()
{
  mbstate_t ps;
  const char *src = "A";
  const char *tmp = src;
  wchar_t wc[5];
  size_t ret;

  setlocale (LC_ALL, "vi_VN.TCVN");
  memset (&ps, '\0', sizeof (ps));

  wc[0] = 0xdddd;
  wc[1] = 0xeeee;
  ret = mbsnrtowcs (wc, &tmp, src + 1 - tmp, 1, &ps);
  printf ("ret = %zu, wc = %x %x\n", ret, wc[0], wc[1]);
  return 0;
}
=======================================================================
Result:
a.out: mbsnrtowcs.c:123: __mbsnrtowcs: Assertion `result > 0' failed.
This is wrong and was the subject of BZ #13691.

Most programs uses either mbrtowc or mbstowcs to convert strings to
wide-char encoding. Few programs use mbsnrtowcs. Therefore *if* someone
wants to support stateful encodings in wcsmbs/*, the most important
place to fix is the mbrtowc() behaviour. But this is also the most
difficult one. I cannot see how to make the following requirements
coexist:

  * mbrtowc(&wc, "A", 1, &ps) shall set wc = L'A'.

  * mbrtowc(&wc, "A\xb0", 2, &ps) shall set wc = 0x00C0
    (LATIN CAPITAL LETTER A WITH GRAVE)

  * mbrtowc can be used to process a string byte for byte; it returns -2
    when a byte sequence is incomplete. In particular this means that the
    sequence of calls
      mbrtowc(&wc, "A", 1, &ps) => -2
      mbrtowc(&wc, "\xb0", 1, &ps) => 1, wc = 0x00C0
    produces an intermediate -2 without setting wc and then sets wc in the
    second call.

The problem is that when a string like "CBA" is being processed byte for byte,
we want to see a wc = L'A' come out at the end. But there is no way for the
programmer to tell the mbrtowc() function that the string is terminated.

In other words, mbrtowc() assumes that the multibyte sequence of an entire
wide character does not start with the multibyte sequence for another wide
character. The TCVN-5712 and CP1258 encodings don't have this property.

So, I don't see a way to implement option 2.

This also means that there are likely zero users of the vi_VN.TCVN locale,
because many programs would misbehave in this locale.

I therefore propose to
  - revert Tulio's "fix",
  - remove vi_VN.TCVN/TCVN5712-1
  - from localedata/SUPPORTED, and resolve BZ #13691 as "Won't fix".

Patch attached.

Bruno

>From 146360c19ffcc0c702b04dc6851242a5c196fdc0 Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Mon, 14 May 2012 01:35:15 +0200
Subject: [PATCH] [BZ #13691], Revert breakage of iconv() converter for TCVN-5712.

---
 ChangeLog              |    7 +++++++
 iconvdata/tcvn5712-1.c |    5 ++---
 localedata/SUPPORTED   |    1 -
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 6f0b685..cf64db4 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2012-05-13  Bruno Haible  <bruno@clisp.org>
+
+	[BZ #13691], Revert breakage of iconv() converter for TCVN-5712.
+	* iconvdata/tcvn5712-1.c (BODY for FROM_LOOP): Don't consider
+	inptr and inend for must_buffer_ch.
+	* localedata/SUPPORTED: Remove vi_VN.TCVN/TCVN5712-1.
+
 2012-05-12  Andreas Schwab  <schwab@linux-m68k.org>
 
 	* sysdeps/powerpc/memmove.c (MEMMOVE): Don't return a value if
diff --git a/iconvdata/tcvn5712-1.c b/iconvdata/tcvn5712-1.c
index 90c8610..f6a0c4f 100644
--- a/iconvdata/tcvn5712-1.c
+++ b/iconvdata/tcvn5712-1.c
@@ -378,9 +378,8 @@ static const struct
     /* Determine whether there is a buffered character pending.  */	      \
     last_ch = *statep >> 3;						      \
 									      \
-    /* We have to buffer ch if it is a possible match in comp_table_data      \
-       and if it isn't the last char of the string.  */			      \
-    must_buffer_ch = (ch >= 0x0041 && ch <= 0x01b0) && (inptr + 1 != inend);  \
+    /* We have to buffer ch if it is a possible match in comp_table_data.  */ \
+    must_buffer_ch = (ch >= 0x0041 && ch <= 0x01b0);                          \
 									      \
     if (last_ch)							      \
       {									      \
diff --git a/localedata/SUPPORTED b/localedata/SUPPORTED
index 1fd7847..b33d5b8 100644
--- a/localedata/SUPPORTED
+++ b/localedata/SUPPORTED
@@ -409,7 +409,6 @@ ur_PK/UTF-8 \
 uz_UZ/ISO-8859-1 \
 uz_UZ@cyrillic/UTF-8 \
 ve_ZA/UTF-8 \
-vi_VN.TCVN/TCVN5712-1 \
 vi_VN/UTF-8 \
 wa_BE/ISO-8859-1 \
 wa_BE@euro/ISO-8859-15 \
-- 
1.6.3.2

Follow-Ups:
- Re: regression caused by fix of bug #13691
  - From: David Miller
- Re: regression caused by fix of bug #13691
  - From: Bruno Haible
- Re: regression caused by fix of bug #13691
  - From: Bruno Haible
- Re: regression caused by fix of bug #13691
  - From: Andreas Jaeger

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]