This is the mail archive of the glibc-linux@ricardo.ecn.wfu.edu mailing list for the glibc project.



Re: glibc 2.1.3pre1 with LinuxThreads 2.1.3pre2 made my application work again


Kaz Kylheku wrote:
> 
> On Fri, 28 Jan 2000, Kai Engert wrote:
> 
> > condition occurred that looked like a glibc internal inconsistency:
> >
> > pthread_mutex_t Handle = {__m_reserved = 1074995304, __m_count = 0,
> > __m_owner = 0x0, __m_kind = 2, __m_lock = {__status = 1, __spinlock = 0}}
> >
> > Looking at the glibc sources, as far as I understand them, it's not
> > correct to have this combination of values in a pthread_mutex_t struct:
> > while __status == 1, __m_owner should contain a value != 0.
> 
> That is not strictly true. The __m_owner field is set to a non-null value
> (in error-checking or recursive mutexes only) after the new owner acquires
> the fastlock. So there is a brief window of execution during which the lock
> is already held but the __m_owner field is not yet updated. Likewise,
> __m_owner is cleared just before the fastlock is given up.
> 
> Of course, the state combination is not valid under any circumstances
> other than these two!

Ok, thanks for explaining this, I did not dig deeply enough into the
glibc sources.
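
Just to check my own understanding of the ordering you describe, I wrote the
little toy program below. This is of course my own simplified code, not the
glibc sources; toy_mutex, toy_lock and toy_unlock are invented names, and the
real fastlock is not a pthread mutex. It only illustrates why the combination
{__status == 1, __m_owner == 0} can be observed transiently:

#include <pthread.h>
#include <stdio.h>

struct toy_mutex {
    pthread_mutex_t fastlock;   /* stands in for the internal fastlock   */
    pthread_t       owner;      /* 0 while the window described is open  */
};

static struct toy_mutex m = { PTHREAD_MUTEX_INITIALIZER, 0 };

static void toy_lock(void)
{
    pthread_mutex_lock(&m.fastlock);   /* the "__status" part becomes 1 here... */
    /* another thread inspecting m right now sees owner == 0 although the
       lock is already held -- the combination from the gdb dump above */
    m.owner = pthread_self();          /* ...the owner is recorded only now     */
}

static void toy_unlock(void)
{
    m.owner = 0;                       /* owner cleared first...                */
    pthread_mutex_unlock(&m.fastlock); /* ...fastlock released afterwards       */
}

int main(void)
{
    toy_lock();
    printf("lock held by thread %lu\n", (unsigned long) m.owner);
    toy_unlock();
    return 0;
}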

> > I recognized this problem because pthread_mutex_unlock returned "EPERM
> > (the calling thread does not own the mutex)".
> 
> Which strongly suggests that two (or more) threads were fooled by the bug into
> thinking they all have the mutex. The first one of them released it; and the
> subsequent release by one of the other threads produced the error.

That explains my problems.

> The only other way that could happen is if a thread tried to release a mutex
> which it does not own; that can't be the case because you sound like you know
> very well what you are doing, and the application runs fine for weeks with a
> fixed library.

Yes, I'm using some C++ wrapper classes. They help me ensure that I always
lock first and that all locking and unlocking happens in pairs.

> > I was able to reproduce this condition multiple times.
> >
> > Reading the mailing list, I found your recent bugfix
> >        * spinlock.c: __pthread_lock queues back any received restarts
> >          that don't belong to it instead of assuming ownership of lock
> >          upon any restart; fastlock can no longer be acquired by two
> >          threads simultaneously.
> 
> You are most likely having the same problem that is addressed by this
> fix.  The problems that are addressed only occur in programs that:
> 
> -- cancel threads that are waiting on pthread_cond_wait,
>    pthread_cond_timedwait, sem_wait or pthread_join; or

No, but I'm using pthread_cond_timedwait.

Just for your information, let me describe some internals of my application.

Some time ago I read that it's not a good idea to cancel threads, because too
many unforeseeable conditions can arise when using cancellation. I kept that
in mind and tried to program around the need to cancel threads, and in fact I
am not using any thread cancellation.

All my threads are detached threads.

My worker threads check every second whether they should terminate, as
indicated by a boolean termination flag (each thread has its own), which my
manager thread sets when needed.

All my threads register themselves in a global ThreadCounter. On program
termination my master thread sets the termination flag for all threads and
waits until the counter becomes zero.
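
In rough C, the shutdown protocol looks something like this. It's only a
simplified sketch: the real code uses my C++ wrapper classes, and the names
thread_counter, terminate_flag and worker are just illustrative.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t counter_lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  counter_changed = PTHREAD_COND_INITIALIZER;
static int             thread_counter  = 0;   /* the global ThreadCounter     */
static volatile int    terminate_flag[4];     /* one flag per worker thread   */

static void *worker(void *arg)
{
    int id = (int)(long)arg;

    pthread_mutex_lock(&counter_lock);
    thread_counter++;                          /* register in the ThreadCounter */
    pthread_cond_broadcast(&counter_changed);
    pthread_mutex_unlock(&counter_lock);

    while (!terminate_flag[id]) {
        /* ... do one unit of work ... */
        sleep(1);                              /* check the flag each second    */
    }

    pthread_mutex_lock(&counter_lock);
    thread_counter--;                          /* unregister, wake the manager  */
    pthread_cond_broadcast(&counter_changed);
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    long i;

    for (i = 0; i < 4; i++) {
        pthread_create(&tid, NULL, worker, (void *)i);
        pthread_detach(tid);                   /* all threads are detached      */
    }

    /* wait until every worker has registered itself */
    pthread_mutex_lock(&counter_lock);
    while (thread_counter < 4)
        pthread_cond_wait(&counter_changed, &counter_lock);
    pthread_mutex_unlock(&counter_lock);

    for (i = 0; i < 4; i++)                    /* the master sets every flag... */
        terminate_flag[i] = 1;

    pthread_mutex_lock(&counter_lock);
    while (thread_counter != 0)                /* ...and waits for zero         */
        pthread_cond_wait(&counter_changed, &counter_lock);
    pthread_mutex_unlock(&counter_lock);
    return 0;
}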

> -- use pthread_cond_timedwait

Yes, I'm using this one.

My central data manager thread, the one doing all the disk I/O, passes the
result data to the requesting connection handler thread, which then sends the
data out, without ever blocking the data manager thread.

This way I ensure the data manager thread never wastes available disk seek
time waiting for TCP/IP requests to complete. Disk seek time is the
bottleneck in my application.
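
The hand-off on the data manager's side is essentially the following. Again
this is only a simplified C sketch with made-up names (struct connection,
deliver_result); the real code lives in my C++ wrapper classes.

#include <pthread.h>
#include <string.h>

struct connection {
    pthread_mutex_t result_lock;
    pthread_cond_t  result_ready;
    char            result[4096];
    size_t          result_len;
    int             result_available;
};

/* called by the data manager thread once the disk read has finished */
void deliver_result(struct connection *conn, const char *data, size_t len)
{
    if (len > sizeof conn->result)
        len = sizeof conn->result;

    pthread_mutex_lock(&conn->result_lock);
    memcpy(conn->result, data, len);
    conn->result_len = len;
    conn->result_available = 1;
    pthread_cond_signal(&conn->result_ready);  /* wake the handler thread...       */
    pthread_mutex_unlock(&conn->result_lock);  /* ...and go on to the next request */
}

The manager never touches the socket itself; it just stores the result,
signals the handler and moves on to the next disk request.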

Unfortunately, it's not possible (at least I saw no way) to simultaneously
- select on a socket (another asynchronous request to be processed comes in), and
- wait on a thread condition variable (result data is ready to be sent out),
and resume as soon as either event occurs.

Therefore I decided to alternate between select (with a timeout) and
pthread_cond_timedwait within each thread that handles a client connection.

With this approach one of the two events can always be delayed by a small
amount of time, but it seemed to be the easiest solution.
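
Roughly, the loop in each connection handler thread looks like the sketch
below. The 100 ms intervals and the names (connection_loop, result_lock,
result_ready, result_available) are just placeholders for what the real code
does; the actual request handling and sending is elided.

#include <pthread.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* placeholders for the real per-connection state */
static pthread_mutex_t result_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  result_ready = PTHREAD_COND_INITIALIZER;
static int             result_available;   /* set by the data manager thread */

void connection_loop(int sock)
{
    for (;;) {
        /* 1. poll the socket for up to 100 ms (new client request?) */
        fd_set readable;
        struct timeval tv = { 0, 100 * 1000 };
        FD_ZERO(&readable);
        FD_SET(sock, &readable);
        if (select(sock + 1, &readable, NULL, NULL, &tv) > 0) {
            /* ... read the request and hand it over to the data manager ... */
        }

        /* 2. then wait up to 100 ms on the condition variable
              (result data ready to be sent out?) */
        struct timeval now;
        struct timespec deadline;
        gettimeofday(&now, NULL);
        deadline.tv_sec  = now.tv_sec;
        deadline.tv_nsec = (now.tv_usec + 100 * 1000) * 1000;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec  += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&result_lock);
        while (!result_available
               && pthread_cond_timedwait(&result_ready, &result_lock,
                                         &deadline) == 0)
            ;   /* woken without data: wait again until data or timeout */
        if (result_available) {
            result_available = 0;
            /* ... send the result out on the socket ... */
        }
        pthread_mutex_unlock(&result_lock);
    }
}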

To optimize in the future, maybe I should split the two waiting jobs into
different threads, and maybe I should combine all the selecting work
(incoming data) for all my connections into a single thread instead of
selecting in every thread.

But as it's currently working quite satisfactorily, I will keep things as
they are for the moment.

> Is this the case in your application? If so, I'm quite confident that you
> ran into the bugs that were addressed.

As I'm using pthread_cond_timedwait, yes, this is the case.

It's a good feeling to know the cause of a problem has been found.

> A program that does not use timed out condition waits or which does not use
> thread cancellation should not have problems, because it will not trigger the
> race conditions that will cause a dangling restart.
> 
> For example, recycling threads using a thread pool instead of cancelling
> threads and making new ones should make the problem go away (assuming you don't
> also use pthread_cond_timedwait).

Ok, so people not using these mechanisms shouldn't be required to update :-)
However, users of third-party multithreaded software can't be sure whether
they are affected.

> This is probably the reason why I haven't seen crash reports from the
> maintainers of certain high profile projects; for example, I didn't hear about
> any issues related to the AOL multithreaded web server. You'd expect them to
> run into these bugs; but then again they probably create a pool of threads that
> are never canceled, and don't use pthread_cond_timedwait.  So they avoid
> triggering the problems by dumb luck.

:-)

I read that the Oracle database servers use LinuxThreads, too.
I'm curious whether they have experienced problems.

Do you think that, as soon as glibc 2.1.3 is released, someone should post an
announcement to comp.programming.threads, mentioning the fix and recommending
the update?

Cheers
Kai
