[PATCH] Reset converter state after second wchar_t output (Bug 25734)

Carlos O'Donell carlos@redhat.com
Mon Mar 30 17:52:24 GMT 2020


On 3/30/20 11:28 AM, Andreas Schwab wrote:
> On Mar 30 2020, Carlos O'Donell via Libc-alpha wrote:
> 
>> On 3/30/20 10:28 AM, Andreas Schwab wrote:
>>> On Mar 30 2020, Florian Weimer wrote:
>>>
>>>> I'm not sure if the C committee wants implementations to be able to
>>>> support Big5 (without Unicode changes first, to add characters which
>>>> avoid the two-codepoint special cases).
>>>
>>> Are you saying mbrtowc should return -1 here?
>>
>> No. That indicates an invalid multibyte sequence was found.
> 
> It is not representable, thus invalid.

Sorry, I think I misunderstood your question.

I think you are actually asking what a hypothetically correct
implementation should do in this case.

If that is your question, then I agree, it should return -1 when it
finds any input that violates the C requirements.

I would *not* change glibc to do this though since BIG5-HKSCS is
supported and in use in glibc.

A simple converter can be written that goes through all input bytes
until the input is at the end or errors out (rather than stopping at
the observed L'\0'), but it requires that you know the length of the
input.
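
As a rough sketch of such a length-driven converter (not glibc code):
convert_all is a hypothetical helper, and the handling of a 0 return
with a non-null wide character is my assumption, based only on the
Linux output quoted in this message.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert exactly LEN bytes of SRC, storing at most
   DSTMAX wide characters in DST.  Returns the number of wide characters
   stored, or (size_t) -1 on an invalid or truncated sequence.  */
static size_t
convert_all (const char *src, size_t len, wchar_t *dst, size_t dstmax)
{
  mbstate_t st;
  size_t in = 0, out = 0;
  int stalled = 0;

  memset (&st, 0, sizeof st);
  while (in < len && out < dstmax)
    {
      wchar_t wc;
      size_t r = mbrtowc (&wc, src + in, len - in, &st);

      if (r == (size_t) -1 || r == (size_t) -2)
        return (size_t) -1;
      dst[out++] = wc;
      if (r > 0)
        {
          in += r;              /* normal case: R bytes consumed */
          stalled = 0;
        }
      else if (wc == L'\0')
        {
          in += 1;              /* a real null byte was consumed */
          stalled = 0;
        }
      else if (stalled++)
        return (size_t) -1;     /* two 0 returns in a row: give up */
      /* Otherwise: assume a second wide character was produced from the
         conversion state and no input bytes were consumed.  */
    }
  return out;
}
```

The loop is bounded by the known length rather than by a 0 return, so
an embedded null does not end the conversion.  It does not drain a
pending second wide character if the pair sits at the very end of the
buffer, because the loop exits once all input bytes are consumed.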

I have seen many examples that loop while result > 0 though, so I
expect such code would immediately stop when encountering BIG5-HKSCS
input that generates a 0 return.

The Microsoft docs have a similar example stopping the conversion when
0 is returned, but using -2 to continue stepping through the input,
advancing by one byte to attempt to put together the incomplete sequence
(expecting the state to accrue).
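
A minimal sketch of that byte-at-a-time pattern, written from the
description above and not copied from the Microsoft example;
convert_bytewise is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: feed SRC to mbrtowc one byte at a time.  Stops at
   the first 0 return or error.  Returns the number of wide characters
   stored in DST (at most DSTMAX).  */
static size_t
convert_bytewise (const char *src, size_t len, wchar_t *dst, size_t dstmax)
{
  mbstate_t st;
  size_t out = 0;

  memset (&st, 0, sizeof st);
  for (size_t in = 0; in < len && out < dstmax; in++)
    {
      wchar_t wc;
      size_t r = mbrtowc (&wc, src + in, 1, &st);

      if (r == (size_t) -2)
        continue;       /* incomplete: the state keeps the partial bytes */
      if (r == (size_t) -1 || r == 0)
        break;          /* invalid input, or 0 treated as end of string */
      dst[out++] = wc;
    }
  return out;
}
```

In this sketch a 0 return ends the loop regardless of which wide
character was stored, which is exactly the assumption that breaks if a
call returns 0 for something other than L'\0'.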

The Microsoft docs are here:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/mbrtowc?view=vs-2019

Example here:
https://rextester.com/UYPGU65292

Windows:
1st mbrtowc call: 0xF325
  result: 2
2nd mbrtowc call: 0x0062
  result: 1
3rd mbrtowc call: 0x0058
  result: 1

Linux:
1st mbrtowc call: 0x00CA
  result: 2
2nd mbrtowc call: 0x0304
  result: 0
3rd mbrtowc call: 0x0058
  result: 1

Note that in the Microsoft implementation you *can't* use the return
value of mbrtowc to walk the input forward, and that seems like a
mistake to me. At least 0 is an honest (if wrong) answer.

-- 
Cheers,
Carlos.


