This is the mail archive of the gdb@sources.redhat.com mailing list for the GDB project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Checking function calls

From: Fredrik Tolf <fredrik at dolda2000 dot cjb dot net>
To: gdb at sources dot redhat dot com
Date: 06 Dec 2002 00:14:56 +0100
Subject: Re: Checking function calls
References: <200212050451.gB54ppB31800@duracef.shout.net>

On Thu, 2002-12-05 at 05:51, Michael Elizabeth Chastain wrote:
> Hi Fredrik,
> 
> I'm throwing out a bunch of ideas here, take whatever looks useful
> and discard the rest.
> 
> > Therefore, the failure has to be that a called
> > function doesn't restore EBX correctly, on rare occasions, right?
> 
> I have seen this happen in a mixed programming environment,
> with a Cygwin program that used a Windows DLL.  The Windows DLL
> had subtly different calling conventions where it did not preserve
> %ebx, %esi, and %edi across function calls.  Perhaps you have some
> kind of third party library in your program which has a similar
> compatibilty issue?

The only libraries are libc, libpthread, libdl and libpam. In the
affected function, only libc and libpthread are used. Therefore I don't
think it's calling convention incompatibility, unless they in turn call
functions in third party libraries, which I find very unlikely.

> 
> > My question is thus: Is there any way of debugging this with GDB? Can I
> > make GDB check that EBX is the same before and after every function call
> > from that frame in this thread to isolate the failing function? The
> > frame never exits (until the program exits, that is), if that helps.
> 
> You could set a bunch of conditional breakpoints with "break if %ebx !=
> saved_ebx", where you add code to your program to initialize saved_ebx.
> Or you could say "break if %ebx < 0x1000" or some convenient constant.
> 

That would, of course, be a good thing. It's only that I'd have to do
that after every single function call... That would take some time.
Maybe I'll do it, anyway. I was actually thinking of doing something
like that, but with code instead, and making the thread SIGSTOP itself
when EBX is invalid.

> You could also try forcing your variable to be on the stack instead of a
> register.  Remove the "register" attribute from the declaration of "next"
> if you have one.  Then add a "do_nothing(&next)" call to your function,
> to force "next" to be on the stack instead of in a register.  If the
> symptoms go away then it's more likely to really be a register clobber.

That just doesn't feel like a very elegant solution, though. And, this
bug does actually surface in another function as well, only even more
seldomly. There it also affects a variable stored in EBX, but it gets
set to 0 instead. So, I would prefer actually solving the problem, so
that it doesn't show up anywhere else. I have noticed no similarities
between the two places where the bugs shows itself.

> > At first I was expecting that another thread somehow gets there and
> > modifies the storage memory of next.
> 
> I still suspect this.  It's more likely that memory gets clobbered rather
> than a register value.
> 

But next isn't stored in memory at any place, so it cannot be that.

> Perhaps you need a function that locks the whole list and walks it for
> a sanity check, without deleting anything?
> 

I always check the list with gdb when the program crashes, and it's
always correct. That's why I think that it's impossible that next is
loaded when the list is in an unstable state. The only times I actually
set the next element, I always set it to NULL or a pointer returned by
malloc(). If the list was to be made unstable by a buggy function
somewhere, it would have to restored again by the same function (since
it's always consistent when I look at it), and I just don't see that
happening.

> Here is another wild lead: if, somehow, a block gets freed and then
> you read it, many implementations of malloc keep housekeeping information
> in the first word or two of a freed block.  That would explain why the
> value is always 0x10 to 0x30 (that could be block size, especially if it is
> rounded up to a multiple of 4 or 8) and why only 1-2 words are clobbered.
> If you manage your blocks with malloc/free, you could try turning on any
> malloc debugging facilities that you have.

I also suspected that something like that might happen, and therefore I
lock the elements one element ahead of the block I'm currently looking
at, so that the current block and the next are always locked. That's why
I have:

for(cur = list; cur != NULL; cur = next)
{
    if((next = cur->next) != NULL)
        pthread_mutex_lock(&next->mutex);
    ...
}

That is also a reason why the next variable has to be clobbered at some
later point, since pthread_mutex_lock succeeds on it. The program always
on the line "if((next = cur->next) != NULL)", since it segfaults when it
looks up cur->next, i.e. at that point cur has been set to the invalid
next as directed by the loop. Therefore, when the program crashes, next
and cur are equal, and I cannot see what element it was at before.

By the way, if you want to look at the code, it's available at
http://sourceforge.net/projects/dcprod/. I don't know if it's the latest
version, though.

References:
- Re: Checking function calls
  - From: Michael Elizabeth Chastain

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]