The regex code sometimes shifts a word by a value greater than the word size, which has undefined behavior. While fixing this, I also fixed a few other porting glitches that are related. I'll attach a patch.
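For readers unfamiliar with the problem: in C, shifting by a count that is not smaller than the width of the promoted operand is undefined. The sketch below shows one generic way to guard such a shift. It is not taken from the patch; `word_t`, `WORD_BITS`, and `shift_left` are hypothetical names, and it assumes the word type has no padding bits.

```c
#include <limits.h>

/* Hypothetical word type; the type used in the regex code may differ. */
typedef unsigned long int word_t;

/* Width of word_t in bits, assuming no padding bits. */
#define WORD_BITS (sizeof (word_t) * CHAR_BIT)

/* Shift W left by N bits.  Return 0 when N is at least the word
   width, instead of invoking undefined behavior. */
static word_t
shift_left (word_t w, unsigned int n)
{
  return n < WORD_BITS ? w << n : 0;
}
```

Whether returning 0 is the right result for an oversized count depends on the caller; the point is only that the count is checked before the shift is evaluated.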
Created attachment 633
shift-related patches for regex
Subject: Re: regex undefined behavior with shifting past word length

The last hunk is surely wrong. I really meant ~0.

Paolo
-1 is better.
The last hunk is purely for ports to ones' complement and signed-magnitude hosts. It has no effect in the normal case.

For example, on a ones' complement host, ~0 has the numeric value zero, i.e., ~0 == 0. Also, ~0 is of type int. When ~0 is converted to unsigned int, it is converted by value, not by bit-pattern. (The C Standard requires this.) Hence ((unsigned) ~0) is equivalent to ((unsigned) 0), which in turn is equivalent to 0u, which is zero. The same problem occurs with signed-magnitude hosts. It also occurs with unsigned short int (the type being used here).

Admittedly this is a minor point since such hosts are rare, but it's easy to do portably so we might as well do it that way.
> For example, on a ones' complement host, ~0 has the numeric value
> zero, i.e., ~0 == 0. Also, ~0 is of type int. When ~0 is converted
> to unsigned int, it is converted by value, not by bit-pattern. (The C
> Standard requires this.) Hence ((unsigned) ~0) is equivalent to
> ((unsigned) 0), which in turn is equivalent to 0u, which is zero.

So you want ~0u, but not -1.

Paolo
-1 when cast to unsigned is exactly the same as ~0u and also works with any other unsigned type regardless of its width, whereas ~0u doesn't.
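A small sketch of that point, with hypothetical helper names: converting -1 to an unsigned type reduces it modulo one more than the type's maximum value, so the result is that maximum (all 1s) whatever the width.

```c
#include <limits.h>

/* Each helper converts the int value -1 to a different unsigned type.
   The conversion is by value, modulo (maximum + 1), so every result
   is the type's maximum value. */
static unsigned char
all_ones_uchar (void)
{
  unsigned char x = -1;
  return x;
}

static unsigned short int
all_ones_ushort (void)
{
  unsigned short int x = -1;
  return x;
}

static unsigned long long int
all_ones_ullong (void)
{
  unsigned long long int x = -1;
  return x;
}
```

Because the rule is stated in terms of values, the same holds on ones' complement and signed-magnitude hosts.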
Andreas is right. For example, "unsigned long int x = ~0u;" will not have an all-1s value on most 64-bit hosts.

In this particular hunk, ~0u would also work since the destination type is unsigned short int. So if you'd really rather use ~0u I guess that would be OK.

However, as a style matter, it is confusing to use ~0u in some unsigned contexts while using -1 in others. Since -1 always works, it's more consistent to use it in all unsigned contexts. For example, suppose someone later changes eps_reachable_subexps_map from unsigned short int to unsigned long int, for performance reasons. If the code used ~0u here, it would have to be changed to ~ (unsigned long int) 0, and it's quite possible that people would forget to make that change. Whereas if we simply change it to -1 now, it will work regardless of later changes like this.

I should mention that the situation is different in signed contexts. In general one must use ~ (SIGNED_TYPE) 0 in that case to get an all-1s pattern. But signed bit-twiddling is trickier (since one must in general worry about ~0 == 0 and overflow issues), and I'd rather that the regex code stuck with unsigned bit-twiddling.
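To make the widening case concrete, here is a minimal sketch with hypothetical helper names. ~0u has type unsigned int, so assigning it to a wider unsigned type only fills the low bits, while -1 and ~ (unsigned long int) 0 fill the whole destination.

```c
#include <limits.h>

/* ~0u is an unsigned int with all bits set; widening it to
   unsigned long int preserves the value UINT_MAX. */
static unsigned long int
from_tilde_0u (void)
{
  unsigned long int x = ~0u;
  return x;
}

/* -1 is converted by value, yielding ULONG_MAX. */
static unsigned long int
from_minus_1 (void)
{
  unsigned long int x = -1;
  return x;
}

/* The complement is taken after the cast, so all bits of the
   wider type are set. */
static unsigned long int
from_tilde_cast (void)
{
  return ~ (unsigned long int) 0;
}
```

On a host where unsigned long int is wider than unsigned int, from_tilde_0u () differs from the other two; where the two types have the same width, all three agree.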
It is ridiculous to care about ones' complement and "signed-magnitude" hosts. I've applied most of the patch.