Pthread mutexes and condition variables are very often used in combination in multithreaded software. In this combination pthread_cond_signal() is used to wake up another thread. Optimally this wakeup would cause only one context switch. Unfortunately pthread_cond_signal() seems to cause two context switches on average when using linuxthreads, and three when using the NPTL. $ uname -a Linux pc-101 2.6.18.6 #8 Sun Feb 4 11:17:43 CET 2007 i686 athlon i386 GNU/Linux $ /home/bart/glibc236/lib/libc.so.6 GNU C Library stable release version 2.3.6, by Roland McGrath et al. Copyright (C) 2005 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Compiled by GNU CC version 4.1.2 20061115 (prerelease) (SUSE Linux). Compiled on a Linux >>2.6.18.6<< system on 2007-03-17. Available extensions: GNU libio by Per Bothner crypt add-on version 2.1 by Michael Glad and others linuxthreads-0.10 by Xavier Leroy BIND-8.2.3-T5B libthread_db work sponsored by Alpha Processor Inc NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk Thread-local storage support included. For bug reporting instructions, please see: <http://www.gnu.org/software/libc/bugs.html>. $ /lib/libc.so.6 GNU C Library stable release version 2.5 (20061011), by Roland McGrath et al. Copyright (C) 2006 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Configured for i686-suse-linux. Compiled by GNU CC version 4.1.2 20061115 (prerelease) (SUSE Linux). Compiled on a Linux 2.6.18 system on 2006-11-26. Available extensions: crypt add-on version 2.1 by Michael Glad and others GNU Libidn by Simon Josefsson GNU libio by Per Bothner NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk NoVersion patch for broken glibc 2.0 binaries Native POSIX Threads Library by Ulrich Drepper et al BIND-8.2.3-T5B Thread-local storage support included. For bug reporting instructions, please see: <http://www.gnu.org/software/libc/bugs.html>. $ LD_LIBRARY_PATH=/home/bart/glibc236/lib: /home/bart/glibc236/lib/ld-linux.so.2 ./glibc236-condvar-perf linuxthreads mutex elapsed: 5287 us; per iteration: 52 ns / 0 context switches. c.v. ping-pong test elapsed: 772543 us; per iteration: 7725 ns / 4 context switches. signal ping-pong test elapsed: 913529 us; per iteration: 9135 ns / 4 context switches. $ ./condvar-perf NPTL mutex elapsed: 4000 us; per iteration: 40 ns / 0 context switches. c.v. ping-pong test elapsed: 461905 us; per iteration: 4619 ns / 6 context switches. signal ping-pong test elapsed: 424947 us; per iteration: 4249 ns / 4 context switches.
Created attachment 1647 [details] Source code of test program.
Why are you testing this with more than 2 years old codebase? This has been fixed in August 2005.
Created attachment 1648 [details] Source code of test program. Added fourth test.
Are you sure that this is already fixed ? These are the results with the latest CVS version (2007-03-25): $ /home/bart/glibc-cvs/lib/libc.so.6 GNU C Library development release version 2.5.90, by Roland McGrath et al. Copyright (C) 2007 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Compiled by GNU CC version 4.1.2 20061115 (prerelease) (SUSE Linux). Compiled on a Linux >>2.6.18.6<< system on 2007-03-25. Available extensions: crypt add-on version 2.1 by Michael Glad and others Native POSIX Threads Library by Ulrich Drepper et al BIND-8.2.3-T5B For bug reporting instructions, please see: <http://www.gnu.org/software/libc/bugs.html>. $ LD_LIBRARY_PATH=/home/bart/glibc-cvs/lib: /home/bart/glibc-cvs/lib/ld-linux.so.2 ./glibc-cvs-condvar-perf NPTL mutex elapsed: 4054 us; per iteration: 40 ns / 5e-05 csw c.v. ping-pong test elapsed: 421526 us; per iteration: 4215 ns / 6 csw signal ping-pong test elapsed: 430622 us; per iteration: 4306 ns / 4 csw signal ping-pong no mtx elapsed: 207861 us; per iteration: 2078 ns / 2 csw
$ gcc -g -o /tmp/condvar-perf{,.c} -lpthread -std=gnu99 -lrt -O2 -m64 -W -Wall $ /tmp/condvar-perf NPTL mutex elapsed: 6790 us; per iteration: 67 ns / 0 csw c.v. ping-pong test elapsed: 1176092 us; per iteration: 11760 ns / 4 csw signal ping-pong test elapsed: 727195 us; per iteration: 7271 ns / 4 csw signal ping-pong no mtx elapsed: 971107 us; per iteration: 9711 ns / 4 csw $ gcc -g -o /tmp/condvar-perf{,.c} -lpthread -std=gnu99 -lrt -O2 -m32 -W -Wall $ /tmp/condvar-perf NPTL mutex elapsed: 7980 us; per iteration: 79 ns / 2e-05 csw c.v. ping-pong test elapsed: 1328961 us; per iteration: 13289 ns / 4 csw signal ping-pong test elapsed: 704607 us; per iteration: 7046 ns / 4 csw signal ping-pong no mtx elapsed: 748424 us; per iteration: 7484 ns / 3.9 csw $ /lib64/libc.so.6 | head -n 7; /lib/libc.so.6 | head -n 7; uname -a GNU C Library stable release version 2.5, by Roland McGrath et al. Copyright (C) 2006 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Compiled by GNU CC version 4.1.1 20061011 (Red Hat 4.1.1-30). Compiled on a Linux 2.6.9 system on 2007-01-05. GNU C Library stable release version 2.5, by Roland McGrath et al. Copyright (C) 2006 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Compiled by GNU CC version 4.1.1 20061011 (Red Hat 4.1.1-30). Compiled on a Linux 2.6.9 system on 2007-01-05. Linux tucnak.ms.mff.cuni.cz 2.6.18-1.2869.fc6 #1 SMP Wed Dec 20 14:51:34 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
Can you strace -f it? If it doesn't use futex (..., 0x5, ...) (aka FUTEX_WAKE_OP), then it is just a badly built glibc or your kernel is missing that functionality.
(In reply to comment #6) > Can you strace -f it? If it doesn't use futex (..., 0x5, ...) (aka > FUTEX_WAKE_OP), then it is just a badly built glibc or your kernel is missing > that functionality. After adding an fflush(stdout) just before the end of run_test I get the following (this test is done with openSUSE's glibc-2.5-25): $ strace -f ./condvar-perf 222 2>&1 | less ... clock_gettime(CLOCK_MONOTONIC, {832, 851599805}) = 0 clock_gettime(CLOCK_MONOTONIC, {832, 851627494}) = 0 ... write(1, "NPTL\nmutex ela"..., 85NPTL mutex elapsed: 28 us; per iteration: 126 ns / 0.11 csw ) = 85 ... [pid 6781] futex(0xbfb95840, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 6783] futex(0xbfb95840, FUTEX_WAKE, 1 <unfinished ...> [pid 6781] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavai lable) [pid 6783] <... futex resumed> ) = 0 [pid 6781] futex(0xbfb95858, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 6783] futex(0xbfb95858, FUTEX_WAKE, 1 <unfinished ...> [pid 6781] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavai lable) [pid 6783] <... futex resumed> ) = 0 [pid 6781] futex(0xbfb9585c, 0x5 /* FUTEX_??? */, 1 <unfinished ...> [pid 6783] futex(0xbfb9585c, FUTEX_WAIT, 871, NULL <unfinished ...> [pid 6781] <... futex resumed> ) = 0 [pid 6783] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavai lable) [pid 6781] futex(0xbfb95840, FUTEX_WAKE, 1 <unfinished ...> [pid 6783] futex(0xbfb95858, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 6781] <... futex resumed> ) = 0 [pid 6781] futex(0xbfb95858, FUTEX_WAKE, 1 <unfinished ...> [pid 6783] <... futex resumed> ) = 0 [pid 6781] <... futex resumed> ) = 1 ... write(1, "c.v. ping-pong test elapsed:"..., 88c.v. ping-pong test elapse d: 82901328 us; per iteration: 373429405 ns / 3.8e+02 csw ) = 88 ...[pid 6796] rt_sigtimedwait([ALRM], <unfinished ...> [pid 6781] rt_sigtimedwait([ALRM], <unfinished ...> [pid 6796] <... rt_sigtimedwait resumed> 0, 0, 8) = 14 [pid 6796] tgkill(6781, 6781, SIGALRM <unfinished ...> [pid 6781] <... rt_sigtimedwait resumed> 0, 0, 8) = 14 [pid 6796] <... tgkill resumed> ) = 0 [pid 6781] futex(0xbfb9586c, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 6796] futex(0xbfb9586c, FUTEX_WAKE, 1 <unfinished ...> [pid 6781] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavai lable) [pid 6796] <... futex resumed> ) = 0 [pid 6781] tgkill(6781, 6796, SIGALRM <unfinished ...> ... write(1, "signal ping-pong test elapsed:"..., 89signal ping-pong test elapse d: 116494476 us; per iteration: 524749891 ns / 5.5e+02 csw ) = 89 ... [pid 6781] rt_sigtimedwait([ALRM], <unfinished ...> [pid 7207] rt_sigtimedwait([ALRM], <unfinished ...> [pid 6781] <... rt_sigtimedwait resumed> 0, 0, 8) = 14 [pid 7207] <... rt_sigtimedwait resumed> 0, 0, 8) = 14 [pid 6781] tgkill(6781, 7207, SIGALRM <unfinished ...> [pid 7207] tgkill(6781, 6781, SIGALRM <unfinished ...> [pid 6781] <... tgkill resumed> ) = 0 [pid 7207] <... tgkill resumed> ) = 0 ... write(1, "signal ping-pong no mtx elapsed:"..., 90signal ping-pong no mtx elapse d: 503406198 us; per iteration: 2267595486 ns / 1.6e+03 csw ) = 90 exit_group(0) = ? Process 6781 detached Similar result with glibc built from CVS sources: $ strace -f bash -c 'LD_LIBRARY_PATH=/home/bart/glibc-cvs/lib: /home/bart/glibc-cvs/lib/ld-linux.so.2 ./glibc-cvs-condvar-perf 2' 2>&1|grep 'FUTEX_???' [pid 7467] futex(0xbfea12fc, 0x5 /* FUTEX_??? */, 1 <unfinished ...> [pid 7465] futex(0xbfea12fc, 0x5 /* FUTEX_??? */, 1) = 0 [pid 7467] futex(0xbfea12fc, 0x5 /* FUTEX_??? */, 1 <unfinished ...> [pid 7465] futex(0xbfea12fc, 0x5 /* FUTEX_??? */, 1) = 0
Whatever you think you are measuring has nothing to do with glibc. Should there actually be extra wake-ups this happens entirely in the kernel. glibc uses FUTEX_WAKE_OP which prevents unnecessary delays. If the kernel schedules freshly woken threads right away before finishing the futex call something of course can go wrong. But all this happens in the kernel. glibc gives the kernel all the information it needs to avoid any problem. Talk to the kernel people.
(In reply to comment #8) > Whatever you think you are measuring has nothing to do with glibc. Should there > actually be extra wake-ups this happens entirely in the kernel. glibc uses > FUTEX_WAKE_OP which prevents unnecessary delays. If the kernel schedules > freshly woken threads right away before finishing the futex call something of > course can go wrong. But all this happens in the kernel. glibc gives the > kernel all the information it needs to avoid any problem. > > Talk to the kernel people. So what we agree about is that one context switch for waking up another thread would be better than two or three. What we don't agree about is where this should be solved. My opinion about this is as follows: - This is a performance issue that can be solved in glibc. - This is a performance issue that should be solved in glibc and not in the kernel. Let me further clarify this by starting from typical POSIX threads code for waking up a thread via a signal: Thread 1: pthread_mutex_lock(&m); pthread_cond_wait(&cv, &m); pthread_mutex_unlock(&m); Thread 2: pthread_mutex_lock(&m); pthread_cond_signal(&cv); pthread_mutex_unlock(&m); Let us assume that thread two obtains a lock on mutex m after thread 1 entered pthread_cond_wait(). In that case thread 2 will wake up thread 1 twice instead of just one time: after pthread_cond_signal() wakes up pthread_cond_wait() via the futex() system call, pthread_cond_wait() will try to lock the mutex m. But since the mutex m is still locked at that time, thread 2 will be scheduled again until it unlocks mutex m via another futex() system call. At that time thread 1 will be rescheduled. Or: three context switches have happened where one context switch would have been sufficient. At least this is what I assume that happens. My proposal is to change the implementation of the POSIX threads mutex and condition variable primitives in such a way that waking up another thread is deferred from pthread_cond_wait(cv) or pthread_cond_broadcast(cv) until pthread_mutex_unlock(m) is called. And this for the mutex m that was associated to condition variable cv via one or more threads calling pthread_cond_wait(). I know this is nontrivial, and that several cases have to be considered: - signal is sent from same process as the process waiting in pthread_cond_wait(). - signal is sent from another process than the one waiting in pthread_cond_wait(). - signal is sent to condition variable cv while the mutex m from the call to pthread_cond_wait(cv, m) is locked. - signal is sent to condition variable cv while the mutex m from the call to pthread_cond_wait(cv, m) is not locked (software that shows such runtime behavior is highly suspicious since this is a race condition, but the POSIX standard does not forbid this kind of behavior). - Note: in POSIX threads software where signals are only sent while the appropriate mutex is held, there is exactly one mutex m associated with each condition variable cv for threads blocked in pthread_cond_wait(cv, m). One way to implement a solution is to keep track for each mutex m of all condition variables cv for which a thread is blocked in pthread_cond_wait(cv, m). In case only one process is involved this could be implemented by modifying the pthread_cond_wait() implementation such that it allocates space for a list element on the stack of the thread it is executed on. These list elements can than be linked together via pointers, with a pointer to the head of the list stored in the mutex data structure. When pthread_cond_signal(cv2) is called with a lock held on mutex m2 by the same thread, this information is stored in the data structures of cv2 and m2. At the time pthread_mutex_unlock() is called this last function can then wake up the appropriate threads blocked in pthread_cond_wait(). Or: no thread wakeup by pthread_cond_signal(), only by pthread_mutex_unlock(). This optimization would have been a lot easier to implement if the POSIX threads standard would have defined one single operation pthread_cond_signal_and_mutex_unlock(cv, m).
If you want to avoid that, simply don't call pthread_cond_signal with the mutex held.
(In reply to comment #10) > If you want to avoid that, simply don't call pthread_cond_signal with the > mutex held. Sorry Jakub, but IMHO that is the wrong way to go. It is not possible to write a reliable POSIX threads program when not holding a mutex while calling pthread_cond_signal(). Signals can easily be lost when you do that.
Calling pthread_cond_signal without holding the mutex is perfectly valid and cannot result in signals being lost unless the implementation is buggy (which glibc may be; see bug #13165). What it can result in, however, is deviation from the expected priority behavior guaranteed by the priority scheduling options of POSIX.
A new condition variable implementation has been committed, which as different performance properties. The statements made in comment #9 do not really consider what is actually going on (e.g., waking another thread does not generally require the kernel to context switch to it immediately; not being able to acquire a mutex does not mean the respective thread has to block using futexes right away; the scenario plays out differently on a multiprocessor that is not oversubscribed; ...). The new condition variable implementation contains code comments that discuss related aspects in some more detail. Therefore, I'm closing this bug. If you observe a potential performance problem with the new condition variable, please open a new bug.