This is the mail archive of the
libc-help@sourceware.org
mailing list for the glibc project.
UTF-8: Invalid multibyte sequence
- From: Felix Natter <felix dot natter at smail dot inf dot fh-brs dot de>
- To: libc-help at sourceware dot org
- Date: Sun, 19 Jun 2011 16:47:51 +0200
- Subject: UTF-8: Invalid multibyte sequence
Hello,
I am experimenting with UTF-8 in glibc 2.13 (Debian testing).
For this purpose, I created a simple multibyte UTF-8 sequence using
gedit:
----------
aäaä
----------
(a followed by a-umlaut followed by a followed by a-umlaut)
The following program:
----------
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>
int main(void)
{
    setlocale(LC_ALL, "en_US.UTF-8");
    FILE *f = fopen("utf-8.txt", "r");
    if (f == NULL) {
        perror("utf-8.txt");
        return 1;
    }
    char buffer[1024];
    fscanf(f, "%1023s", buffer);
    //buffer[6] = 0xC0;
    //buffer[7] = 0x80;
    buffer[6] = '\0';
    fclose(f);
    printf("buffer='%s' strlen(buffer)=%ld, numChars=%ld\n",
           buffer,
           (long)strlen(buffer),
           (long)mbstowcs(NULL, buffer, 0));
    return 0;
}
----------
outputs:
----------
buffer='aäaä' strlen(buffer)=6, numChars=-1
----------
mbstowcs(NULL, buffer, 0) is a standard way of getting the number
of characters in a multibyte string. A return value of -1, that is
(size_t)-1, means "An invalid multibyte sequence has been encountered".
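A minimal sketch of that counting idiom with the return values checked. The helper name count_mb_chars is hypothetical. The sketch also assumes a failed setlocale() call is worth ruling out: if the requested locale is not installed, setlocale() returns NULL and the program stays in the "C" locale, where non-ASCII bytes may be rejected as invalid.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: returns the number of characters in the multibyte
 * string s under the given locale, -1 if s is not a valid multibyte
 * string in that locale, or -2 if the locale could not be selected. */
static long count_mb_chars(const char *locale_name, const char *s)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -2;                      /* locale not available */

    size_t n = mbstowcs(NULL, s, 0);    /* count only, convert nothing */
    if (n == (size_t)-1)
        return -1;                      /* invalid multibyte sequence */
    return (long)n;
}
```

With this helper, count_mb_chars("en_US.UTF-8", "a\xC3\xA4" "a\xC3\xA4") would be expected to give 4 when that locale is available. A result of -2 would point at the locale setup rather than at the string. The literal is split in two because "\xA4a" would otherwise be read as a single hex escape.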
Could the problem be the termination sequence? I tried both 0x00 and
0xC0,0x80...
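On the terminator question: 0xC0,0x80 is the overlong two-byte form of NUL used by "modified UTF-8", and standard UTF-8 (RFC 3629) does not allow the bytes 0xC0, 0xC1 or 0xF5..0xFF anywhere, so a plain '\0' is the terminator to use. The sketch below only scans for those forbidden bytes. The helper name is hypothetical, and it is not a full validator.

```c
#include <stddef.h>

/* Hypothetical helper: returns the index of the first byte that can never
 * occur in well-formed UTF-8 (0xC0, 0xC1, 0xF5..0xFF), or -1 if no such
 * byte is found before the terminating '\0'. Not a full UTF-8 validator. */
static long first_forbidden_utf8_byte(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char b = (unsigned char)s[i];
        if (b == 0xC0 || b == 0xC1 || b >= 0xF5)
            return (long)i;
    }
    return -1;
}
```

For the six bytes of the test string followed by 0xC0,0x80, this reports index 6, the position where the program above writes the alternative terminator.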
Next, I tried to generate a wide-character sequence using L"..." and to
convert it to a multibyte sequence with wcsrtombs():
----------
#include <wchar.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <errno.h>
int main(void)
{
    setlocale(LC_ALL, "en_US.UTF-8");
    char buffer[1024];
    const wchar_t *WCS = L"aäaä";
    size_t result = wcsrtombs(buffer, &WCS, 1024, NULL);
    printf("result=%ld, errno=%d\n", (long)result, errno);
    wprintf(WCS);
    printf("buffer='%s' strlen(buffer)=%ld, numChars=%ld\n",
           buffer,
           (long)strlen(buffer),
           (long)mbstowcs(NULL, buffer, 0));
    return 0;
}
----------
The output is:
----------
result=-1, errno=84
buffer='a' strlen(buffer)=1, numChars=1
----------
errno=84 means EILSEQ = "Illegal byte sequence (POSIX.1, C99)"
What am I doing wrong? What's the best way to generate a valid
multibyte sequence?
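A minimal sketch of one way to produce such a sequence from a wide string, under stated assumptions. The helper name wide_to_multibyte and its return codes are hypothetical, and the sketch assumes the locale has to be selected successfully before wcsrtombs() can encode non-ASCII characters.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts the wide string ws into buf under the
 * given locale. Returns the number of bytes written (excluding the
 * terminator), -1 on a conversion error (errno is set, e.g. EILSEQ),
 * -2 if the locale could not be selected, or -3 if buf was too small. */
static long wide_to_multibyte(const char *locale_name, const wchar_t *ws,
                              char *buf, size_t bufsize)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -2;

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    const wchar_t *src = ws;            /* wcsrtombs() advances this copy */
    size_t n = wcsrtombs(buf, &src, bufsize, &state);
    if (n == (size_t)-1)
        return -1;
    if (src != NULL)
        return -3;                      /* truncated, buf not terminated */
    return (long)n;
}
```

A call such as wide_to_multibyte("en_US.UTF-8", L"a\u00E4a\u00E4", buf, sizeof buf) would be expected to return 6 when that locale is available. Writing the umlaut as \u00E4 avoids any dependence on the encoding of the source file, and the caller's pointer is left untouched because the helper passes a copy to wcsrtombs().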
Thanks,
--
Felix Natter