Heads up: possible bug in cygwin

Tue Dec 31 20:26:00 GMT 2002

Christopher Faylor wrote:
> On Tue, Dec 31, 2002 at 09:43:50AM -0500, Charles Wilson wrote:
> 
>>Did gcc (pre 3.2) automatically initialize data to 0, while gcc-3.2 does 
>>not?  Hmmm...waitaminute, I do have gcc2 installed...
> 
> 
> If gcc/ld is not initializing static data to zero then there are some
> pretty serious problems.  Neither gcc, nor any other compiler that I
> am aware of, is supposed to initialize automatic data to zero.

You're right, as usual.  (Plus empirical evidence: no change when 
compiling with gcc-2.  Still get the SIGSEGV)

You're also right [I think] about the buffer overrun/whatever problem in 
glib.  I haven't found the specific, offending command yet, but it seems 
pretty obvious from the postmortem that that is what has happened.

Single-stepping thru the code shows some interesting things.  Basically, 
we have a g_string (structure that contains a char* field, a length 
field, plus some other fields that are hidden from the public interface. 
  The whole structure is dynamically allocated, and the char* field 
points to additional dynamically allocation storage.

This g_string is initially allocated as a minimum size string, where the 
char* points to a minimum size buffer (4 bytes, it appears), but 
contains "" (e.g. *char = '\0', len = 0).  [as far as glib is concerned, 
the char* points to a chunk of memory 4 bytes long.  But dlmalloc 
actually uses a chunk that is 16 bytes long == 3 32bit words of dlmalloc 
overhead, plus the user data]

Then, the testsuite attempts to append a char* that contains approx 20k 
characters.  This means that glib must realloc a larger buffer for the 
g_string's internal storage.  It rounds up to the nearest power of two 
-- 32768 -- and attempts to realloc the space:

line 220: gstring.c (g_string_maybe_expand)
      string->str = g_realloc (string->str, string->alloc);
        "string" is the g_string, recast to GRealString* to reveal the
         private members of the structure (like 'alloc', which is how
         much storage glib originally requested for this buffer, even
         though it only needed 1 byte for the 0-length string "".)

in g_realloc, we simply thunk to the system realloc
line 338: gmem.c (g_realloc)
      p = (gpointer) realloc (mem, size);
         where mem is the string->str argument above, and size is
         the string->alloc argument (it equals 32768)

mem, at this point (or string->str, if you like) is a viable pointer to 
a previously allocated chunk of memory (4 bytes long) that contains '\0' 
and 3 unknown other values.  It has not previously been freed.

But now, as I said, we're in the bowels of dlmalloc.

line 146: malloc_wrapper.cc (realloc)
      res = dlrealloc (p, size);
       again, p (== mem == string->str) is a valid pointer. size is
       32768.

line 4078: malloc.cc (dlrealloc == rEALLOc)
      newmem = mALLOc(nb - MALLOC_ALIGN_MASK);
we've already determined that the request cannot be satisfied by 
expanding the allocation within the current chunk, nor can it be 
satisfied by expanding the allocation into the next chunk.  So, we fall 
back on the "allocate-copy-free" algorithm, and start by allocating a 
new chunk of memory.

line 3516: malloc.cc (dlmalloc == mALLOc)
      bck->fd = unsorted_chunks(av);
      How we got to this line:  there have been (other pointers) freed, 
so we have to first search the list of recently freed memory for 
appropriately sized chunks.
   1) this request is not a 'fast bin' size, so skip that codeblock
   2) it's not a small request, so skip the smallbin check
   3) so, we consolidate all fastbins.  That succeeds.
   4) We enter a while loop, as described by this comment:

/* Process recently freed or remaindered chunks, taking one only if it 
is exact fit, or, if this a small request, the chunk is remainder from 
the most recent non-exact fit.  Place other traversed chunks in bins. 
Note that this step is the only place in any routine where chunks are 
placed in bins. */

while ( (victim = unsorted_chunks(av)->bk) != unsorted_chunks(av)) {
   bck = victim->bk;
   size = chunksize(victim);
   ...(not a small request, so skip the next codeblock)...
   (now we pull "victim" out of the doubly-linked list.)
   unsorted_chunks(av)->bk = bck;
   bck->fd = unsorted_chunks(av);

And it's here that the error occurs: trying to hook up the forward 
pointer in victim's predecessor to point to victim's successor.

The problem is, 'victim' is full of garbage.  Supposedly this free'd 
chunk of memory is 808,464,432 bytes long.  Its predecessor is also 
808,464,432 bytes long.  And its predecessor and successor are both 
located at 0x30303030.  Which is bogus, becuase dlmalloc hasn't obtained 
that region of memory from the system (and besides, 0x30303030 is 
suspiciously like "    ")

So it looks like glib's test program has clobbered out-of-bounds memory 
in a given malloc'ed block -- but only a few bytes, enough to clobber 
dlmalloc's bookkeeping but not enough to trigger an actual segfaut.  And 
then it freed the (clobbered) block -- but we don't notice THAT until 
the NEXT time we try to malloc/realloc some memory.  E.g. now.

So the error is not in
   string-test.c (main) line 109: g_string_sprintf() --> 
g_string_sprintfa_int() --> g_string_append()...

It might be in
   string-test.c (main) line 109: g_string_sprintf() --> 
g_string_sprintfa_int() --> g_strdup_vprintf(), which is immediately 
before the line above,

or it might be in some earlier test in string-test.c (main).

Have I mentioned how much I hate memory allocation?  I'm starting to 
think, since the problem DOES seem to be in glib, that it'd be faster to 
compile this mess on linux and use a debugging malloc like 
electricfence.  But I am having a hard time believing that of the 
thousands of glib/gtk/gnome developers out there, nobody has noticed a 
fundamental problem in [glib|glib-test-suite] before now.

[...but I can't reproduce the fault on linux.  Even if I link in 
dlmalloc.  Bleah.  ElectricFence on linux couldn't find anything 
suspicious either.]

But I'm still confused as to why this worked in May.  When cgf 
introduced Doug Lea's malloc on 2002/08/16, what were we using before? 
Lea's malloc is interesting, in that memory fragments obtained from the 
systems memorymanager contain both dlmalloc's own bookkeeping inline and 
user data.  So **small** buffer overruns by a userland program will 
screw up dlmalloc's own accounting without triggering a segfault 
immediately.  Other malloc algorithms maintain the bookkeeping 
elsewhere, and buffer overruns will (usually) only clobber other user 
data.  (According to Lea's documentation, this is a feature of dlmalloc, 
not a bug).

So if cygwin's earlier malloc implementation had "separate books" as it 
were, then the buffer overrun (which I have not yet found) could have 
occurred without creating an obvious runtime error.

But that still doesn't explain why it works without fault on linux. 
Mebbe I'll try to build efence on cygwin.  But not tonight; that's 
enough.  Happy New Year, everybody.

--Chuck

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Bug reporting:         http://cygwin.com/bugs.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/

Heads up: *possible* bug in cygwin

Heads up: possible bug in cygwin