When most public API operations on pthread_t's execute, the per-thread lock (struct pthread.lock) is acquired to enforce consistency of the kernel and userspace data structures. This can cause a problem when a thread (T1) lowers its own priority and some other thread (T2, higher priority) immediately becomes runnable as a result of the priority shift. The scenario looks like this:

* There are three threads: T1 (low prio), T2 (mid prio), T3 (high prio).
* T1 initially runs at a priority higher than its permanent one, in order to do some startup work.
* T2 is executing a CPU-bound job that is always runnable.
* T1 finishes initialization and sets itself to its lower (permanent) priority. This requires locking its own per-thread futex ("lock" in struct pthread). The syscall that alters the scheduling parameters immediately puts T2 on the CPU, so the lock is not yet dropped.
* T3 eventually needs to adjust T1's scheduling options. It tries to grab T1's per-thread lock, but can't, since T1 still holds it: T1's scheduling syscall has not yet returned to userspace.
* Priority inversion: T2 continues to run unchallenged.

Can pthread.lock be treated as a PI futex instead of a standard futex, in order to get priority inheritance and work around this inversion?

I'll attach an example program shortly.
Created attachment 2048 [details] Example to illustrate priority inversion in NPTL pthread internals
Example test case (priority-inversion.c) confirmed. A priority inversion occurs, causing the high-priority third thread to wait on the low-priority first thread.
I have a few points worth noting:

(1) I was unable to find anything in the POSIX specification stating that the per-thread mutex must not cause a priority inversion.

(2) Changing the pthread implementation so that every thread has a priority-inheritance mutex instead of a standard mutex would cost some performance, due to the extra overhead associated with priority-inheritance mutexes.

(3) The priority inversion you have described does not cause any thread to hold a resource indefinitely, preventing some other thread from ever making forward progress. All threads eventually make forward progress, so this is a performance issue rather than a correctness issue.

I am therefore marking this bug as an enhancement.
BZ flagged for Ulrich's attention: http://www.sourceware.org/ml/libc-alpha/2008-04/msg00094.html