Created attachment 11665 [details] regexec_bug.c Hi! Not sure if this is a long-known issue, so reporting just in case. Glibc implementation of regexec() violates POSIX longest match rule: http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, matches the longest possible string. For example, consider RE (a|ab)(bc|c) and string abc: first submatch group should match (ab), and second group should match (c). However, glibc match is different, as demonstrated by the following program (regexec_bug.c): #include <sys/types.h> #include <regex.h> #include <stdio.h> #include <stdlib.h> static void test(const char *pattern, const char *string) { regex_t regex; int nmatch = 0, e = 0; regmatch_t *pmatch = 0; e = regcomp(®ex, pattern, REG_EXTENDED); if (e != 0) { fprintf(stderr, "regcomp failed for RE %s\n", pattern); goto end; } nmatch = regex.re_nsub + 1; pmatch = malloc(sizeof(regmatch_t) * nmatch); e = regexec(®ex, string, nmatch, pmatch, 0); if (e != 0) { fprintf(stderr, "regexec failed for RE %s\n", pattern); goto end; } fprintf(stderr, "\nRE %s:\n", pattern); for (int i = 0; i < nmatch; ++i) { const regoff_t x = pmatch[i].rm_so, y = pmatch[i].rm_eo; if (x != -1) { fprintf(stderr, "%d: %.*s\n", i, y - x, string + x); } } end: free(pmatch); regfree(®ex); } int main(void) { test("^(a|ab)(bc|c)$", "abc"); test("^(ab|a)(bc|c)$", "abc"); test("^(a|aa)*$", "aa"); test("^(aa|a)*$", "aa"); return 0; } $ gcc regexec_bug.c -Wall -O2 -g -oregexec_bug Actual result: $ ./regexec_bug RE ^(a|ab)(bc|c)$: 0: abc 1: a 2: bc RE ^(ab|a)(bc|c)$: 0: abc 1: ab 2: c RE ^(a|aa)*$: 0: aa 1: a RE ^(aa|a)*$: 0: aa 1: aa Expected result: $ ./regexec_bug RE ^(a|ab)(bc|c)$: 0: abc 1: ab 2: c RE ^(ab|a)(bc|c)$: 0: abc 1: ab 2: c RE ^(a|aa)*$: 0: aa 1: aa RE ^(aa|a)*$: 0: aa 1: aa Exact glibc version: glibc-2.29.9000-118-g86bdd49d93
Also, is there any formal description of regcomp()/regexec() implementation, other than the source code itself? I can try to reverse-engineer it from the source code, but it would be helpful to understand the original author's design. Commit 3b0bdc723579a7c6df2cace0115a6ca0977d73f9 points to Isamu Hasegawa <isamu@yamato.ibm.com>, but that email address no longer exists.