[PATCH] Reset converter state after second wchar_t output (Bug 25734)
Carlos O'Donell
carlos@redhat.com
Mon Mar 30 17:52:24 GMT 2020
On 3/30/20 11:28 AM, Andreas Schwab wrote:
> On Mär 30 2020, Carlos O'Donell via Libc-alpha wrote:
>
>> On 3/30/20 10:28 AM, Andreas Schwab wrote:
>>> On Mär 30 2020, Florian Weimer wrote:
>>>
>>>> I'm not sure if the C committee wants implementations to be able to
>>>> support Big5 (without Unicode changes first, to add characters which
>>>> avoid the two-codepoint special cases).
>>>
>>> Are you saying mbrtowc should return -1 here?
>>
>> No. That indicates an invalid multibyte sequence was found.
>
> It is not representable, thus invalid.
Sorry, I think I misunderstood your question.
I think you are actually asking what a hypothetically correct
implementation should do in this case.
If that is your question, then I agree, it should return -1 when it
finds any input that violates the C requirements.
I would *not* change glibc to do this though since BIG5-HKSCS is
supported and in use in glibc.
A simple converter can be written that goes through all input bytes
until the input is at the end or it errors out (rather than stopping
at the observed L'\0'), but it requires that you know the length of
the input.
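A minimal sketch of such a length-driven converter follows. The name
convert_all and its error convention are hypothetical, not taken from
glibc or from any patch in this thread. It assumes that when mbrtowc
returns 0 and stores a non-null wide character (the BIG5-HKSCS case
discussed here), no further input bytes were consumed, and that a real
null character occupies exactly one input byte.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert all `len` bytes of `src` into `dst`
   (capacity `cap` wide characters) without stopping on a 0 return.
   Returns the number of wide characters stored, or (size_t)-1 on an
   invalid or incomplete sequence, or if `dst` is too small.  */
size_t convert_all(const char *src, size_t len, wchar_t *dst, size_t cap)
{
    mbstate_t st;
    size_t in = 0, out = 0;

    memset(&st, 0, sizeof st);          /* start in the initial state */
    while (in < len) {
        if (out == cap)
            return (size_t)-1;          /* output buffer exhausted */

        size_t r = mbrtowc(&dst[out], src + in, len - in, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated input */

        if (r == 0) {
            /* Assumption: a stored L'\0' means one null byte was
               consumed; any other stored value means a second wide
               character was produced without consuming input.  */
            if (dst[out] == L'\0')
                in += 1;
        } else {
            in += r;                    /* normal case: r bytes used */
        }
        out++;
    }
    return out;
}
```

Because the loop is bounded by `len` and `cap` rather than by the
return value, a 0 return does not end the conversion early.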
I have seen many examples that loop only while result > 0 though, so I
expect such code would immediately stop when encountering BIG5-HKSCS
input that generates a 0 return.
The Microsoft docs have a similar example stopping the conversion when
0 is returned, but using -2 to continue stepping through the input,
advancing by one byte to attempt to put together the incomplete sequence
(expecting the state to accrue).
The Microsoft docs are here:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/mbrtowc?view=vs-2019
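For comparison, the loop shape described above (stop on 0, step one
byte forward on -2 and let the state accumulate) might look roughly
like the sketch below. This is not the code from the Microsoft page;
convert_bytewise is a hypothetical name and the one-byte-per-call
feeding is an assumption made to keep the example short.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: feed mbrtowc one byte at a time.  Stops at the
   first 0 return, so anything after it in `src` is never converted.
   Returns the number of wide characters stored, or (size_t)-1 on an
   invalid sequence.  */
size_t convert_bytewise(const char *src, size_t len,
                        wchar_t *dst, size_t cap)
{
    mbstate_t st;
    size_t in = 0, out = 0;

    memset(&st, 0, sizeof st);          /* start in the initial state */
    while (in < len && out < cap) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, src + in, 1, &st);

        if (r == (size_t)-1)
            return (size_t)-1;          /* invalid sequence */
        if (r == (size_t)-2) {
            in++;                       /* incomplete: state keeps the */
            continue;                   /* partial character so far    */
        }
        if (r == 0)
            break;                      /* treated as end of string */

        dst[out++] = wc;
        in += r;                        /* r is 1 here: one byte fed */
    }
    return out;
}
```

With a loop like this, a 0 return in the middle of the input ends the
conversion even if more bytes remain.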
Example here:
https://rextester.com/UYPGU65292
Windows:
1st mbrtowc call: 0xF325
result: 2
2nd mbrtowc call: 0x0062
result: 1
3rd mbrtowc call: 0x0058
result: 1
Linux:
1st mbrtowc call: 0x00CA
result: 2
2nd mbrtowc call: 0x0304
result: 0
3rd mbrtowc call: 0x0058
result: 1
Note that in the Microsoft implementation you *can't* use the
return value of mbrtowc to walk the input forward, and that seems
like a mistake to me. At least 0 is an honest (if wrong) answer.
--
Cheers,
Carlos.