Bug 11053 - Wrong results with backreferences
Summary: Wrong results with backreferences
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: regex (show other bugs)
Version: 2.11
: P2 normal
Target Milestone: ---
Assignee: eggert
URL:
Keywords:
Depends on: 25149
Blocks:
  Show dependency treegraph
 
Reported: 2009-12-04 19:35 UTC by Paolo Bonzini
Modified: 2022-11-11 16:29 UTC (History)
10 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
eggert: security+


Attachments
C code to reproduce the bug (257 bytes, text/x-csrc)
2017-01-17 22:01 UTC, Paul Eggert
Details
This test case silently returns the wrong answer (203 bytes, text/x-csrc)
2017-12-08 18:32 UTC, eggert
Details
regex: fix undefined backref behavior (5.83 KB, patch)
2021-02-06 07:37 UTC, eggert
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Paolo Bonzini 2009-12-04 19:35:39 UTC
$ echo 87654321 | \
  grep -E -e '^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$' 
Segmentation fault

Will work on a C reproducer soon.
Comment 1 Paolo Bonzini 2010-04-09 17:46:04 UTC
Minimized testcases (same regex):

$ echo 8 | grep -E -e "$regex"
8          # >>> okay
$ echo 87 | grep -E -e "$regex"
Segmentation fault

$ echo 88 | grep -E -e "$regex"
88         # >>> okay
$ echo 887 | grep -E -e "$regex"
Segmentation fault

Also, everything I tried to feed that is of length 9 or higher and should not
match, gives either a false positive or a segfault:

$ echo 987654321 | grep -E -e "$regex"
887654321
$ echo 484635532 | grep -E -e "$regex"
484635532
$ echo 0123454321 | grep -E -e "$regex"
Segmentation fault
$ echo 0000123454321 | grep -E -e "$regex"
Segmentation fault
Comment 2 Paul Eggert 2014-09-23 02:28:01 UTC
I ran into what appears to be the same bug independently and came up with a simpler (all-C) reproducer; please see Bug#17356.  I tried to merge the two bug reports via the web interface, but failed to do so.
Comment 3 Florian Weimer 2014-09-23 07:55:27 UTC
*** Bug 17356 has been marked as a duplicate of this bug. ***
Comment 4 Paul Eggert 2017-01-17 21:24:04 UTC
This bug causes GNU coreutils Bug#22793 "grep -E assertion failure with back references"; see <https://bugs.gnu.org/22793>. I'm adding comments to both bug reports so that the connection between the two bugs is clearer.

Although this bug's current assignee is Paolo Bonzini (the original reporter), I think Paolo is pretty busy doing other stuff. Is someone else available to work on regex bugs? I suspect the fix for this bug will not be trivial.
Comment 5 Paul Eggert 2017-01-17 22:01:52 UTC
Created attachment 9758 [details]
C code to reproduce the bug

I attached a slightly-simpler C-language reproducer for the bug, derived from the attachment in Bug#17356. If I compile and run this program, it outputs "a.out: regexec.c:1375: pop_fail_stack: Assertion `num >= 0' failed." and then aborts.
Comment 6 eggert 2017-12-08 18:32:01 UTC
Created attachment 10674 [details]
This test case silently returns the wrong answer

Following up on a 'grep' bug report here:

https://debbugs.gnu.org/29613

attached is a seemingly-related test case which illustrates a bug that causes 'grep' to quietly return the wrong answer instead of dumping core. This test case should exit successfully, but because of the bug regexec returns 0 so the test case exits with status 1. I compiled and ran it on Fedora 27 x86-64 with "gcc regbug.c; ./a.out".
Comment 7 eggert 2018-09-22 21:27:12 UTC
Another test case for this bug can be found here:

https://debbugs.gnu.org/32806
Comment 8 eggert 2019-01-31 03:20:02 UTC
Another test case for this bug can be found here:

https://debbugs.gnu.org/34238
Comment 9 Sourceware Commits 2019-11-11 11:27:12 UTC
The master branch has been updated by Andreas Schwab <schwab@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fc141ea78ee3d87c67b18488827fe2d89c9343e7

commit fc141ea78ee3d87c67b18488827fe2d89c9343e7
Author: Andreas Schwab <schwab@suse.de>
Date:   Wed Oct 30 10:38:36 2019 +0100

    Fix array bounds violation in regex matcher (bug 25149)
    
    If the regex has more subexpressions than the number of elements allocated
    in the regmatch_t array passed to regexec then proceed_next_node may
    access the regmatch_t array outside its bounds.
    
    No testcase added because even without this bug it would then crash in
    pop_fail_stack which is bug 11053.
Comment 10 eggert 2021-02-06 07:37:36 UTC
Created attachment 13204 [details]
regex: fix undefined backref behavior

I am attaching a proposed patch for this longstanding bug. I plan to email this to libc-alpha shortly.
Comment 11 Michael Hudson-Doyle 2021-08-25 05:10:18 UTC
Did the patch ever get sent to libc-alpha?
Comment 12 eggert 2021-08-25 18:09:07 UTC
(In reply to Michael Hudson-Doyle from comment #11)
> Did the patch ever get sent to libc-alpha?

Unfortunately I never got around to it.

Someone else can shepherd it if it's urgent; otherwise I suppose it can wait until someone gets around to syncing Gnulib with glibc.
Comment 13 Sourceware Commits 2021-09-21 15:00:49 UTC
The master branch has been updated by Paul Eggert <eggert@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0b5ca7c3e551e5502f3be3b06453324fe8604e82

commit 0b5ca7c3e551e5502f3be3b06453324fe8604e82
Author: Paul Eggert <eggert@cs.ucla.edu>
Date:   Tue Sep 21 07:47:45 2021 -0700

    regex: copy back from Gnulib
    
    Copy regex-related files back from Gnulib, to fix a problem with
    static checking of regex calls noted by Martin Sebor.  This merges the
    following changes:
    
    * New macro __attribute_nonnull__ in misc/sys/cdefs.h, for use later
    when copying other files back from Gnulib.
    
    * Use __GNULIB_CDEFS instead of __GLIBC__ when deciding
    whether to include bits/wordsize.h etc.
    
    * Avoid duplicate entries in epsilon closure table.
    
    * New regex.h macro _REGEX_NELTS to let regexec say that its pmatch
    arg should contain nmatch elts.  Use that for regexec, instead of
    __attr_access (which is incorrect).
    
    * New regex.h macro _Attr_access_ which is like __attr_access except
    portable to non-glibc platforms.
    
    * Add some DEBUG_ASSERTs to pacify gcc -fanalyzer and to catch
    recently-fixed performance bugs if they recur.
    
    * Add Gnulib-specific stuff to port the dynarray- and lock-using parts
    of regex code to non-glibc platforms.
    
    * Fix glibc bug 11053.
    
    * Avoid some undefined behavior when popping an empty fail stack.
Comment 14 Vincent Lefèvre 2022-09-05 23:06:50 UTC
(In reply to cvs-commit@gcc.gnu.org from comment #13)
> The master branch has been updated by Paul Eggert <eggert@sourceware.org>:
> 
> https://sourceware.org/git/gitweb.cgi?p=glibc.git;
> h=0b5ca7c3e551e5502f3be3b06453324fe8604e82
> 
> commit 0b5ca7c3e551e5502f3be3b06453324fe8604e82
> Author: Paul Eggert <eggert@cs.ucla.edu>
> Date:   Tue Sep 21 07:47:45 2021 -0700
[...]
>     * Fix glibc bug 11053.

What is the status of this bug? The comment says that it is fixed, and I could check on an Ubuntu 22.04.1 LTS machine with libc6 2.35-0ubuntu3.1 that regbug.c and rebug2.c no longer fail, but the result is still incorrect with the grep example from Debian bug 884075:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=884075

vinc17@gcc92:~$ echo 11111111111 | grep -E '^(11+)\1+$|^1?$' ; echo $?
11111111111
0
Comment 15 eggert 2022-09-06 00:37:24 UTC
On 9/5/22 18:06, vincent-srcware at vinc17 dot net wrote:
>
> What is the status of this bug? The comment says that it is fixed, and I could
> check on an Ubuntu 22.04.1 LTS machine with libc6 2.35-0ubuntu3.1 that regbug.c
> and rebug2.c no longer fail, but the result is still incorrect with the grep
> example from Debian bug 884075:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=884075
>
> vinc17@gcc92:~$ echo 11111111111 | grep -E '^(11+)\1+$|^1?$' ; echo $?
> 11111111111
> 0
>
It looks like my comment 
<https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=884075#27> was 
incorrect, in that the two bugs are different bugs. glibc bug 11053 is 
fixed, but Debian bug 884075 is not fixed. Perhaps a better match for 
Debian bug 884075 is glibc bug 10844.

It's not an important bug. However, if you have time to fix it please 
feel free to send in a fix.
Comment 16 Vincent Lefèvre 2022-09-06 02:47:38 UTC
(In reply to eggert from comment #15)
> glibc bug 11053 is fixed,

Shouldn't this bug be resolved as fixed, then?

> but Debian bug 884075 is not fixed. Perhaps a better match for 
> Debian bug 884075 is glibc bug 10844.

It seems different. With Debian bug 884075, the "|^1?$" part is important (it yields the incorrect output, even though this part isn't matched), and there is nothing like that in glibc bug 10844:

vinc17@gcc92:~$ echo 11111111111 | grep --color=auto -E '^(11+)\1+$|^1?$'
11111111111
vinc17@gcc92:~$ echo 11111111111 | grep --color=auto -E '^(11+)\1+$'
vinc17@gcc92:~$ 

Note that for the first command, nothing is colored in "11111111111", i.e. the line is output as appeared to be matched, but no matches are shown by colors. As a comparison, with ten 1s instead of eleven, the line is output and the ten 1s are colored.
Comment 17 Vincent Lefèvre 2022-09-06 02:59:06 UTC
This could be simplified a bit:

vinc17@gcc92:~$ echo 11111111111 | grep --color=auto -E '^(11+)\1+$|^$'
11111111111

(nothing colored).
Comment 18 eggert 2022-09-06 18:47:41 UTC
(In reply to Vincent Lefèvre from comment #16)
> (In reply to eggert from comment #15)
> > glibc bug 11053 is fixed,
> 
> Shouldn't this bug be resolved as fixed, then?

OK, done.


> > Perhaps a better match for 
> > Debian bug 884075 is glibc bug 10844.
> 
> It seems different.

In that case it might be better to file a new glibc bug report, one that corresponds more closely to Debian bug 884075.
Comment 19 Vincent Lefèvre 2022-09-06 22:56:39 UTC
Sorry, actually both regbug.c and rebug2.c fail as they return the exit status 1 (with my usual configuration, my prompt shows any non-zero exit status, but this is not the case of the machine on which I had done the test, so that I missed the failure initially):

vinc17@gcc92:~$ ./regbug
vinc17@gcc92:~$ echo $?
1
vinc17@gcc92:~$ ./rebug2
vinc17@gcc92:~$ echo $?
1

However, in the test from Paolo Bonzini's bug report (comment 0), grep no longer crashes (while it still crashes with glibc 2.34, which does not have the fix).

regbug.c is derived from the attachment in Bug#17356 (as said in comment 5). I've tested this original testcase: with glibc 2.34 on x86_64, it crashes (segmentation fault); with glibc 2.35 on riscv64 (host gcc92), it outputs "no match (incorrect)".

So it seems that the fix mentioned in comment 13 fixed the crashes (which was the initial bug report), but not the misbehavior.

Now, with these new details, is it still OK to regard this bug as fixed and that the misbehavior (rebug.c from Bug#17356; regbug.c and rebug2.c from this bug) is actually a new bug?
Comment 20 eggert 2022-09-06 23:41:30 UTC
(In reply to Vincent Lefèvre from comment #19)

> regbug.c is derived from the attachment in Bug#17356 (as said in comment 5).
> I've tested this original testcase: with glibc 2.34 on x86_64, it crashes
> (segmentation fault); with glibc 2.35 on riscv64 (host gcc92), it outputs
> "no match (incorrect)".
> 
> So it seems that the fix mentioned in comment 13 fixed the crashes (which
> was the initial bug report), but not the misbehavior.

OK, so in that case how about if we update Bug#17356 by (1) saying it is no longer a duplicate of Bug#11053 (as we've fixed the latter but not the former), and (2) reopening Bug#17536? If I understand you correctly, that would match the symptoms you describe.
Comment 21 Vincent Lefèvre 2022-09-07 00:17:12 UTC
(In reply to eggert from comment #20)
> OK, so in that case how about if we update Bug#17356 by (1) saying it is no
> longer a duplicate of Bug#11053 (as we've fixed the latter but not the
> former), and (2) reopening Bug#17536? If I understand you correctly, that
> would match the symptoms you describe.

Yes, I think that this is the best solution.
Comment 22 eggert 2022-09-07 04:31:20 UTC
(In reply to Vincent Lefèvre from comment #21)
> (In reply to eggert from comment #20)
> > OK, so in that case how about if we update Bug#17356 by (1) saying it is no
> > longer a duplicate of Bug#11053 (as we've fixed the latter but not the
> > former), and (2) reopening Bug#17536? If I understand you correctly, that
> > would match the symptoms you describe.
> 
> Yes, I think that this is the best solution.

OK, done.
Comment 23 Vincent Lefèvre 2022-09-07 10:31:57 UTC
What about attachment 10674 [details] ("This test case silently returns the wrong answer"), with the pattern "^(11+)\\1+$|^1?$" and the string "1111111111111"?

Should it be regarded as part of Bug#17356 or another bug? This case seems quite different from Bug#10844 and Bug#17356. Unless the intent is to group all the bugs about regexp involving backreferences giving a wrong answer[*] (in which case Bug#10844 and Bug#17356 should be regarded as duplicates to each other), I think that this should be a new bug.

[*] as opposed to a crash like in this bug 11053.
Comment 24 eggert 2022-09-07 20:57:53 UTC
Sure, feel free to file it as a new bug.
Comment 25 Vincent Lefèvre 2022-09-08 11:44:47 UTC
(In reply to eggert from comment #24)
> Sure, feel free to file it as a new bug.

Bug 29560.
Comment 26 Sourceware Commits 2022-11-11 16:29:02 UTC
The release/2.34/master branch has been updated by Florian Weimer <fw@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=86a701a20479dfbc23540b3143fd5b28660a2447

commit 86a701a20479dfbc23540b3143fd5b28660a2447
Author: Paul Eggert <eggert@cs.ucla.edu>
Date:   Tue Sep 21 07:47:45 2021 -0700

    regex: copy back from Gnulib
    
    Copy regex-related files back from Gnulib, to fix a problem with
    static checking of regex calls noted by Martin Sebor.  This merges the
    following changes:
    
    * New macro __attribute_nonnull__ in misc/sys/cdefs.h, for use later
    when copying other files back from Gnulib.
    
    * Use __GNULIB_CDEFS instead of __GLIBC__ when deciding
    whether to include bits/wordsize.h etc.
    
    * Avoid duplicate entries in epsilon closure table.
    
    * New regex.h macro _REGEX_NELTS to let regexec say that its pmatch
    arg should contain nmatch elts.  Use that for regexec, instead of
    __attr_access (which is incorrect).
    
    * New regex.h macro _Attr_access_ which is like __attr_access except
    portable to non-glibc platforms.
    
    * Add some DEBUG_ASSERTs to pacify gcc -fanalyzer and to catch
    recently-fixed performance bugs if they recur.
    
    * Add Gnulib-specific stuff to port the dynarray- and lock-using parts
    of regex code to non-glibc platforms.
    
    * Fix glibc bug 11053.
    
    * Avoid some undefined behavior when popping an empty fail stack.
    
    (cherry picked from commit 0b5ca7c3e551e5502f3be3b06453324fe8604e82)