In GNU sed 4.1.4 (whose regex.c is taken from glibc), Japanese half-width katakana characters in the SJIS locale do not match a bracket expression that contains a range.
    LC_ALL=ja_JP.SJIS
    export LC_ALL
    echo ｱｲｳｴｵ | sed -ne '/[ｱ-ｵ]\+/p'
The shell script above prints nothing. Any other Japanese full-width kana character matches correctly.
Note: with the characters listed explicitly instead of as a range, the line is printed correctly:
    echo ｱｲｳｴｵ | sed -ne '/[ｱｲｳｴｵ]\+/p'
You really cannot use character ranges outside the C locale, since their definition depends on the locale description, more specifically on the collation data. That data currently doesn't contain anything for these characters. And even if it did, there is no guarantee that the result would be what you expect. Just don't use ranges.
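To illustrate why a range depends on collation data, here is a minimal sketch that treats `[lo-hi]` as "everything that collates between lo and hi". The helper `in_collation_range` is hypothetical and is not how glibc's regex code is written. It only shows that the answer comes from `strcoll`, and therefore from the current `LC_COLLATE` locale, rather than from byte values.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of glibc: each argument is a string
   holding exactly one character in the locale's encoding.  Membership
   is decided by the collation order of the current LC_COLLATE locale. */
bool
in_collation_range (const char *lo, const char *c, const char *hi)
{
  return strcoll (lo, c) <= 0 && strcoll (c, hi) <= 0;
}
```

In the C locale `strcoll` compares like `strcmp`, so the range follows byte order. In any other locale the result depends on what the locale's collation definition says about the characters involved.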
You say that I should not use character ranges outside the C locale, but I still have a question: why are the characters at the start and end of the range not printed either? Half-width katakana characters in the SJIS locale are one byte wide (their code points are below 0xff) but have large code points in Unicode (above U+0100). In regcomp.c, I guess half-width katakana characters should be registered in the fastmap as single-byte characters. And in regexec.c, half-width katakana characters should be treated as single-byte characters, with bitset_set() called to register them in the bitmap.
(In reply to comment #2)
I think I found the source of the problem. Here is a patch.

--- regcomp.c.1~ 2005-07-18 11:51:43.000000000 +0900
+++ regcomp.c 2006-02-01 13:26:41.078750000 +0900
@@ -397,9 +397,13 @@ re_compile_fastmap_iter (bufp, init_stat
 		}
 # else
 	      if (dfa->mb_cur_max > 1)
-		for (i = 0; i < SBC_MAX; ++i)
-		  if (__btowc (i) == WEOF)
-		    re_set_fastmap (fastmap, icase, i);
+		for (i = 0; i < SBC_MAX; ++i) {
+		  wint_t wc;
+		  wc = __btowc (i);
+
+		  if (wc == WEOF || wc >= SBC_MAX)
+		    re_set_fastmap (fastmap, icase, i);
+		}
 # endif /* not _LIBC */
 	    }
 	  for (i = 0; i < cset->nmbchars; ++i)
--- regexec.c.1~ 2005-07-18 11:51:42.000000000 +0900
+++ regexec.c 2006-02-01 13:26:44.016250000 +0900
@@ -3715,6 +3715,7 @@ check_node_accept_bytes (dfa, node_idx,
   const re_token_t *node = dfa->nodes + node_idx;
   int char_len, elem_len;
   int i;
+  wchar_t wc;
 
   if (BE (node->type == OP_UTF8_PERIOD, 0))
     {
@@ -3784,7 +3785,8 @@ check_node_accept_bytes (dfa, node_idx,
     }
 
   elem_len = re_string_elem_size_at (input, str_idx);
-  if ((elem_len <= 1 && char_len <= 1) || char_len == 0)
+  wc = __btowc(*(input->mbs+str_idx));
+  if (((elem_len <= 1 && char_len <= 1) || char_len == 0) && (wc != WEOF && wc < SBC_MAX))
     return 0;
 
   if (node->type == COMPLEX_BRACKET)

This patch covers only the non-_LIBC part, since I could not follow the flow of the _LIBC part.
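The per-byte test that the patch adds in regcomp.c can be isolated as a small predicate. This is a sketch for illustration only. The function name `byte_needs_multibyte_path` is hypothetical, and the `SBC_MAX` definition here is an assumption meant to mirror the constant used in the regex sources (the number of distinct single-byte values).

```c
#include <limits.h>
#include <stdbool.h>
#include <wchar.h>

/* Assumption: same meaning as SBC_MAX in the regex sources, i.e. the
   number of distinct single-byte values. */
#define SBC_MAX (UCHAR_MAX + 1)

/* WC is the result of btowc() for some byte.  Returns true when that byte
   cannot be represented in the single-byte bitset: either the byte is not
   a complete character on its own (WEOF), or its wide-character value is
   too large to index the bitset. */
bool
byte_needs_multibyte_path (wint_t wc)
{
  return wc == WEOF || wc >= SBC_MAX;
}
```

On glibc, where `wchar_t` holds Unicode code points, byte 0xB1 in an SJIS locale would be expected to convert to U+FF71 (half-width katakana A). That value is not `WEOF`, so the original `__btowc (i) == WEOF` check alone skips it, while the extended condition catches it.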
Patches for non-_LIBC shouldn't be sent here. This is the *libc* bugzilla. Send them to the sed list and let those people look at them.
So you WONTFIX a bug just because the patch sent is not for glibc? Either the bug is invalid, and you mark it as INVALID; or you just ignore the patch. But not WONTFIX. The patch is not OK because it unnecessarily slows down the function, and regex is already slow enough. We should probably cache the results of btowc (at least for the non-_LIBC case).
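One possible shape for the caching suggested above is a 256-entry table filled once, so that the matcher does a table lookup instead of calling `btowc` per input byte. This is only a sketch. The `btowc_cache` structure and its functions are hypothetical names that exist in neither glibc nor sed, and it assumes the locale does not change between filling the table and using it.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

#define BYTE_VALUES 256

/* Hypothetical cache of btowc() results for every byte value. */
struct btowc_cache
{
  bool ready;
  wint_t wc[BYTE_VALUES];
};

/* Fill the table once under the locale currently in effect. */
void
btowc_cache_init (struct btowc_cache *cache)
{
  for (int i = 0; i < BYTE_VALUES; ++i)
    cache->wc[i] = btowc (i);
  cache->ready = true;
}

/* Table lookup that replaces a btowc() call in the matching loop. */
wint_t
btowc_cached (const struct btowc_cache *cache, unsigned char byte)
{
  assert (cache->ready);
  return cache->wc[byte];
}
```

A natural place to fill such a table would be pattern compilation, since the compiled pattern is already tied to the locale it was compiled under. Whether that fits the actual regex data structures is not something this sketch establishes.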
This is glibc's bugzilla. I mark it WONTFIX because I have nothing to do with the non-glibc code. Stop reopening.