This is sources Bugzilla
Bugzilla Version 2.17.5
Bugzilla Bug 1149
  character class with range doesn't match half-width kana in SJIS locale Last modified: 2006-05-02 22:33
Bug#: 1149   Reporter: Koichi Kimura <kimura.koichi@canon.co.jp>
Status: RESOLVED
Resolution: WONTFIX
Assigned To: GOTO Masanori <gotom@debian.or.jp>
Description:   Opened: 2005-08-02 04:37
In GNU sed 4.1.4 (whose regex.c is taken from glibc),
Japanese half-width characters in the SJIS locale do not match
a character class that uses a range.

LC_ALL=ja_JP.SJIS
export LC_ALL
echo ｱｲｳｴｵ | sed -ne '/[ｱ-ｵ]\+/p'

The shell script above prints nothing.
Full-width Japanese kana characters, by contrast, match correctly.

Note:

echo ｱｲｳｴｵ | sed -ne '/[ｱｲｳｴｵ]\+/p'

prints the line correctly.

------- Additional Comment #1 From Ulrich Drepper 2005-09-27 20:05 -------
You really cannot use character ranges outside the C locale since the definition
depends on the locale description, more specifically the collation data.  It
currently doesn't contain anything for these characters.  And even if they
would, there is no guarantee that the result would be as you expect.  Just don't
use ranges.
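
For illustration only, the "no ranges" advice can be followed mechanically by listing every byte between the two endpoints, which is the form the reporter found to work. The helper below is a hypothetical sketch, not part of sed or glibc. It assumes both endpoints are single-byte characters in the target locale and refuses any range that contains a byte with special meaning inside a bracket expression.

```c
#include <stddef.h>

/* Hypothetical helper: write "[b1b2...bn]" for every byte in lo..hi.
   Returns 0 on success, -1 if the range is empty, the buffer is too
   small, or the range contains a byte that is special inside brackets. */
int
expand_byte_range (unsigned char lo, unsigned char hi, char *out, size_t outsz)
{
  size_t n = 0;
  unsigned int c;

  /* Need (hi - lo + 1) bytes plus '[', ']' and the terminating NUL. */
  if (lo > hi || outsz < (size_t) (hi - lo) + 4)
    return -1;

  out[n++] = '[';
  for (c = lo; c <= hi; ++c)
    {
      if (c == 0 || c == '[' || c == ']' || c == '^' || c == '-' || c == '\\')
        return -1;
      out[n++] = (char) c;
    }
  out[n++] = ']';
  out[n] = '\0';
  return 0;
}
```

Called with 0xB1 and 0xB5 (the SJIS bytes of the five characters in the report), this yields the explicit list rather than the range. Whether those bytes are the intended characters still depends on the locale's charset.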

------- Additional Comment #2 From Koichi Kimura 2006-01-27 06:00 -------
You say that I should not use character ranges outside the C locale.
But I still have a question.

Why are the characters at the start and end of the range not printed either?

Half-width katakana characters in the SJIS locale are one byte wide
(their code points are below 0xff), but their Unicode code points are large
(above U+0100).
In regcomp.c, I guess half-width katakana characters should be registered in
the fastmap as single-byte characters.
And in regexec.c, half-width katakana characters should be treated as
single-byte characters, and bitset_set() should be called to register them in
the bitmap.
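
The condition described above can be sketched as a small predicate on the result of btowc(). This is only an illustration: the function names are hypothetical, and SBC_MAX is assumed here to be 256, mirroring the single-byte limit used in the regex sources.

```c
#include <wchar.h>

/* Assumption: mirrors the single-byte character limit in the regex sources. */
#define SBC_MAX 256

/* Nonzero when wc, the btowc() result for some byte, is either WEOF
   (the byte is not a complete character by itself) or a wide character
   too large to index a 256-entry table. */
int
needs_fastmap_entry (wint_t wc)
{
  return wc == WEOF || wc >= SBC_MAX;
}

/* Count the bytes that satisfy the predicate in the current locale. */
int
count_fastmap_bytes (void)
{
  int i, n = 0;

  for (i = 0; i < SBC_MAX; ++i)
    if (needs_fastmap_entry (btowc (i)))
      ++n;
  return n;
}
```

In an SJIS locale, byte 0xB1 would be expected to convert to U+FF71, which is at or above SBC_MAX, so the predicate would select it even though the byte is a complete character.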

------- Additional Comment #3 From Koichi Kimura 2006-02-01 04:48 -------
(In reply to comment #2)
I think I found the source of the problem.
Here is a patch.

--- regcomp.c.1~	2005-07-18 11:51:43.000000000 +0900
+++ regcomp.c	2006-02-01 13:26:41.078750000 +0900
@@ -397,9 +397,13 @@ re_compile_fastmap_iter (bufp, init_stat
 		}
 # else
 	      if (dfa->mb_cur_max > 1)
-		for (i = 0; i < SBC_MAX; ++i)
-		  if (__btowc (i) == WEOF)
-		    re_set_fastmap (fastmap, icase, i);
+                  for (i = 0; i < SBC_MAX; ++i) {
+		    wint_t wc;
+		    wc = __btowc (i);
+
+		    if (wc == WEOF || wc >= SBC_MAX)
+		      re_set_fastmap (fastmap, icase, i);
+		  }
 # endif /* not _LIBC */
 	    }
 	  for (i = 0; i < cset->nmbchars; ++i)

--- regexec.c.1~	2005-07-18 11:51:42.000000000 +0900
+++ regexec.c	2006-02-01 13:26:44.016250000 +0900
@@ -3715,6 +3715,7 @@ check_node_accept_bytes (dfa, node_idx, 
   const re_token_t *node = dfa->nodes + node_idx;
   int char_len, elem_len;
   int i;
+  wchar_t wc;
 
   if (BE (node->type == OP_UTF8_PERIOD, 0))
     {
@@ -3784,7 +3785,8 @@ check_node_accept_bytes (dfa, node_idx, 
     }
 
   elem_len = re_string_elem_size_at (input, str_idx);
-  if ((elem_len <= 1 && char_len <= 1) || char_len == 0)
+  wc = __btowc(*(input->mbs+str_idx));
+  if (((elem_len <= 1 && char_len <= 1) || char_len == 0) && (wc != WEOF && wc < SBC_MAX))
     return 0;
 
   if (node->type == COMPLEX_BRACKET)

This patch covers only the non-_LIBC part, since I could not follow the flow of
the _LIBC part.

------- Additional Comment #4 From Ulrich Drepper 2006-04-25 18:12 -------
Patches for non-_LIBC shouldn't be sent here.  This is the *libc* bugzilla. 
Send them to the sed list and let those people look at them.

------- Additional Comment #5 From Paolo Bonzini 2006-04-26 07:04 -------
So you WONTFIX a bug just because the patch sent is not for glibc?  Either the
bug is invalid, and you mark it as INVALID; or you just ignore the patch.  But
not WONTFIX.

The patch is not OK because it unnecessarily slows down the function, and regex
is already slow enough.  We should probably cache the results of btowc (at least
for the non-_LIBC case).
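
A minimal sketch of what such a cache might look like, assuming a 256-entry table filled once per locale. The names btowc_cache, btowc_cache_fill and cached_btowc are hypothetical and do not come from the regex sources. A real implementation would also have to refill the table whenever the locale changes.

```c
#include <wchar.h>

/* Assumption: one slot per possible byte value. */
#define SBC_MAX 256

static wint_t btowc_cache[SBC_MAX];
static int btowc_cache_valid;

/* Convert every byte once under the current locale and remember the result. */
void
btowc_cache_fill (void)
{
  int i;

  for (i = 0; i < SBC_MAX; ++i)
    btowc_cache[i] = btowc (i);
  btowc_cache_valid = 1;
}

/* Table lookup that replaces a btowc() call on the matching path. */
wint_t
cached_btowc (unsigned char c)
{
  if (!btowc_cache_valid)
    btowc_cache_fill ();
  return btowc_cache[c];
}
```

With this in place, the per-character check in the matcher becomes an array lookup instead of a conversion call.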

------- Additional Comment #6 From Ulrich Drepper 2006-05-02 22:33 -------
This is glibc's bugzilla.  I mark it WONTFIX because I have nothing to do with
the non-glibc code.  Stop reopening.
