Bug 4274 - Performance issue: pthread_cond_signal() causes three context switches instead of one
Summary: Performance issue: pthread_cond_signal() causes three context switches instea...
Status: REOPENED
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Ulrich Drepper
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-03-25 10:16 UTC by Bart Van Assche
Modified: 2014-07-17 04:50 UTC (History)
2 users (show)

See Also:
Host: i686-suse-linux-gnu
Target: i686-suse-linux-gnu
Build: i686-suse-linux-gnu
Last reconfirmed:
fweimer: security-


Attachments
Source code of test program. (1.50 KB, text/x-c++src)
2007-03-25 10:17 UTC, Bart Van Assche
Details
Source code of test program. (1.63 KB, text/x-csrc)
2007-03-25 15:24 UTC, Bart Van Assche
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Bart Van Assche 2007-03-25 10:16:29 UTC
Pthread mutexes and condition variables are very often used in combination in
multithreaded software. In this combination pthread_cond_signal() is used to
wake up another thread. Optimally this wakeup would cause only one context
switch. Unfortunately pthread_cond_signal() seems to cause two context switches
on average when using linuxthreads, and three when using the NPTL.

$ uname -a
Linux pc-101 2.6.18.6 #8 Sun Feb 4 11:17:43 CET 2007 i686 athlon i386 GNU/Linux

$ /home/bart/glibc236/lib/libc.so.6
GNU C Library stable release version 2.3.6, by Roland McGrath et al.
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.1.2 20061115 (prerelease) (SUSE Linux).
Compiled on a Linux >>2.6.18.6<< system on 2007-03-17.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        BIND-8.2.3-T5B
        libthread_db work sponsored by Alpha Processor Inc
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

$ /lib/libc.so.6
GNU C Library stable release version 2.5 (20061011), by Roland McGrath et al.
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Configured for i686-suse-linux.
Compiled by GNU CC version 4.1.2 20061115 (prerelease) (SUSE Linux).
Compiled on a Linux 2.6.18 system on 2006-11-26.
Available extensions:
        crypt add-on version 2.1 by Michael Glad and others
        GNU Libidn by Simon Josefsson
        GNU libio by Per Bothner
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
        NoVersion patch for broken glibc 2.0 binaries
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

$ LD_LIBRARY_PATH=/home/bart/glibc236/lib: /home/bart/glibc236/lib/ld-linux.so.2
./glibc236-condvar-perf
linuxthreads
mutex                 elapsed:     5287 us; per iteration:   52 ns / 0 context
switches.
c.v. ping-pong test   elapsed:   772543 us; per iteration: 7725 ns / 4 context
switches.
signal ping-pong test elapsed:   913529 us; per iteration: 9135 ns / 4 context
switches.

$ ./condvar-perf
NPTL
mutex                 elapsed:     4000 us; per iteration:   40 ns / 0 context
switches.
c.v. ping-pong test   elapsed:   461905 us; per iteration: 4619 ns / 6 context
switches.
signal ping-pong test elapsed:   424947 us; per iteration: 4249 ns / 4 context
switches.
Comment 1 Bart Van Assche 2007-03-25 10:17:24 UTC
Created attachment 1647 [details]
Source code of test program.
Comment 2 Jakub Jelinek 2007-03-25 12:50:13 UTC
Why are you testing this with more than 2 years old codebase?
This has been fixed in August 2005.
Comment 3 Bart Van Assche 2007-03-25 15:24:18 UTC
Created attachment 1648 [details]
Source code of test program.

Added fourth test.
Comment 4 Bart Van Assche 2007-03-25 15:28:34 UTC
Are you sure that this is already fixed ? These are the results with the latest
CVS version (2007-03-25):

$ /home/bart/glibc-cvs/lib/libc.so.6
GNU C Library development release version 2.5.90, by Roland McGrath et al.
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.1.2 20061115 (prerelease) (SUSE Linux).
Compiled on a Linux >>2.6.18.6<< system on 2007-03-25.
Available extensions:
        crypt add-on version 2.1 by Michael Glad and others
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

$ LD_LIBRARY_PATH=/home/bart/glibc-cvs/lib:
/home/bart/glibc-cvs/lib/ld-linux.so.2 ./glibc-cvs-condvar-perf
NPTL
mutex                   elapsed:     4054 us; per iteration:   40 ns / 5e-05 csw
c.v. ping-pong test     elapsed:   421526 us; per iteration: 4215 ns / 6 csw
signal ping-pong test   elapsed:   430622 us; per iteration: 4306 ns / 4 csw
signal ping-pong no mtx elapsed:   207861 us; per iteration: 2078 ns / 2 csw
Comment 5 Jakub Jelinek 2007-03-28 09:17:50 UTC
$ gcc -g -o /tmp/condvar-perf{,.c} -lpthread -std=gnu99 -lrt -O2 -m64 -W -Wall
$ /tmp/condvar-perf
NPTL
mutex                   elapsed:     6790 us; per iteration:   67 ns / 0 csw
c.v. ping-pong test     elapsed:  1176092 us; per iteration: 11760 ns / 4 csw
signal ping-pong test   elapsed:   727195 us; per iteration: 7271 ns / 4 csw
signal ping-pong no mtx elapsed:   971107 us; per iteration: 9711 ns / 4 csw
$ gcc -g -o /tmp/condvar-perf{,.c} -lpthread -std=gnu99 -lrt -O2 -m32 -W -Wall
$ /tmp/condvar-perf
NPTL
mutex                   elapsed:     7980 us; per iteration:   79 ns / 2e-05 csw
c.v. ping-pong test     elapsed:  1328961 us; per iteration: 13289 ns / 4 csw
signal ping-pong test   elapsed:   704607 us; per iteration: 7046 ns / 4 csw
signal ping-pong no mtx elapsed:   748424 us; per iteration: 7484 ns / 3.9 csw
$ /lib64/libc.so.6 | head -n 7; /lib/libc.so.6 | head -n 7; uname -a
GNU C Library stable release version 2.5, by Roland McGrath et al.
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.1.1 20061011 (Red Hat 4.1.1-30).
Compiled on a Linux 2.6.9 system on 2007-01-05.
GNU C Library stable release version 2.5, by Roland McGrath et al.
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.1.1 20061011 (Red Hat 4.1.1-30).
Compiled on a Linux 2.6.9 system on 2007-01-05.
Linux tucnak.ms.mff.cuni.cz 2.6.18-1.2869.fc6 #1 SMP Wed Dec 20 14:51:34 EST
2006 x86_64 x86_64 x86_64 GNU/Linux
Comment 6 Jakub Jelinek 2007-03-30 12:35:32 UTC
Can you strace -f it?  If it doesn't use futex (..., 0x5, ...) (aka
FUTEX_WAKE_OP), then it is just a badly built glibc or your kernel is missing
that functionality.
Comment 7 Bart Van Assche 2007-03-30 18:06:06 UTC
(In reply to comment #6)
> Can you strace -f it?  If it doesn't use futex (..., 0x5, ...) (aka
> FUTEX_WAKE_OP), then it is just a badly built glibc or your kernel is missing
> that functionality.

After adding an fflush(stdout) just before the end of run_test I get the
following (this test is done with openSUSE's glibc-2.5-25):

$ strace -f ./condvar-perf 222 2>&1 | less
...
clock_gettime(CLOCK_MONOTONIC, {832, 851599805}) = 0
clock_gettime(CLOCK_MONOTONIC, {832, 851627494}) = 0
...
write(1, "NPTL\nmutex                   ela"..., 85NPTL
mutex                   elapsed:       28 us; per iteration:  126 ns / 0.11 csw
) = 85
...
[pid  6781] futex(0xbfb95840, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid  6783] futex(0xbfb95840, FUTEX_WAKE, 1 <unfinished ...>
[pid  6781] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavai
lable)
[pid  6783] <... futex resumed> )       = 0
[pid  6781] futex(0xbfb95858, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid  6783] futex(0xbfb95858, FUTEX_WAKE, 1 <unfinished ...>
[pid  6781] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavai
lable)
[pid  6783] <... futex resumed> )       = 0
[pid  6781] futex(0xbfb9585c, 0x5 /* FUTEX_??? */, 1 <unfinished ...>
[pid  6783] futex(0xbfb9585c, FUTEX_WAIT, 871, NULL <unfinished ...>
[pid  6781] <... futex resumed> )       = 0
[pid  6783] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavai
lable)
[pid  6781] futex(0xbfb95840, FUTEX_WAKE, 1 <unfinished ...>
[pid  6783] futex(0xbfb95858, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid  6781] <... futex resumed> )       = 0
[pid  6781] futex(0xbfb95858, FUTEX_WAKE, 1 <unfinished ...>
[pid  6783] <... futex resumed> )       = 0
[pid  6781] <... futex resumed> )       = 1
...
write(1, "c.v. ping-pong test     elapsed:"..., 88c.v. ping-pong test     elapse
d: 82901328 us; per iteration: 373429405 ns / 3.8e+02 csw
) = 88
...[pid  6796] rt_sigtimedwait([ALRM],  <unfinished ...>
[pid  6781] rt_sigtimedwait([ALRM],  <unfinished ...>
[pid  6796] <... rt_sigtimedwait resumed> 0, 0, 8) = 14
[pid  6796] tgkill(6781, 6781, SIGALRM <unfinished ...>
[pid  6781] <... rt_sigtimedwait resumed> 0, 0, 8) = 14
[pid  6796] <... tgkill resumed> )      = 0
[pid  6781] futex(0xbfb9586c, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid  6796] futex(0xbfb9586c, FUTEX_WAKE, 1 <unfinished ...>
[pid  6781] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavai
lable)
[pid  6796] <... futex resumed> )       = 0
[pid  6781] tgkill(6781, 6796, SIGALRM <unfinished ...>
...
write(1, "signal ping-pong test   elapsed:"..., 89signal ping-pong test   elapse
d: 116494476 us; per iteration: 524749891 ns / 5.5e+02 csw
) = 89
...
[pid  6781] rt_sigtimedwait([ALRM],  <unfinished ...>
[pid  7207] rt_sigtimedwait([ALRM],  <unfinished ...>
[pid  6781] <... rt_sigtimedwait resumed> 0, 0, 8) = 14
[pid  7207] <... rt_sigtimedwait resumed> 0, 0, 8) = 14
[pid  6781] tgkill(6781, 7207, SIGALRM <unfinished ...>
[pid  7207] tgkill(6781, 6781, SIGALRM <unfinished ...>
[pid  6781] <... tgkill resumed> )      = 0
[pid  7207] <... tgkill resumed> )      = 0
...
write(1, "signal ping-pong no mtx elapsed:"..., 90signal ping-pong no mtx elapse
d: 503406198 us; per iteration: 2267595486 ns / 1.6e+03 csw
) = 90
exit_group(0)                           = ?
Process 6781 detached


Similar result with glibc built from CVS sources:

$ strace -f bash -c 'LD_LIBRARY_PATH=/home/bart/glibc-cvs/lib:
/home/bart/glibc-cvs/lib/ld-linux.so.2 ./glibc-cvs-condvar-perf 2' 2>&1|grep
'FUTEX_???'
[pid  7467] futex(0xbfea12fc, 0x5 /* FUTEX_??? */, 1 <unfinished ...>
[pid  7465] futex(0xbfea12fc, 0x5 /* FUTEX_??? */, 1) = 0
[pid  7467] futex(0xbfea12fc, 0x5 /* FUTEX_??? */, 1 <unfinished ...>
[pid  7465] futex(0xbfea12fc, 0x5 /* FUTEX_??? */, 1) = 0
Comment 8 Ulrich Drepper 2007-08-25 05:38:57 UTC
Whatever you think you are measuring has nothing to do with glibc.  Should there
actually be extra wake-ups this happens entirely in the kernel.  glibc uses
FUTEX_WAKE_OP which prevents unnecessary delays.  If the kernel schedules
freshly woken threads right away before finishing the futex call something of
course can go wrong.  But all this happens in the kernel.  glibc gives the
kernel all the information it needs to avoid any problem.

Talk to the kernel people.
Comment 9 Bart Van Assche 2007-08-28 10:14:35 UTC
(In reply to comment #8)
> Whatever you think you are measuring has nothing to do with glibc.  Should there
> actually be extra wake-ups this happens entirely in the kernel.  glibc uses
> FUTEX_WAKE_OP which prevents unnecessary delays.  If the kernel schedules
> freshly woken threads right away before finishing the futex call something of
> course can go wrong.  But all this happens in the kernel.  glibc gives the
> kernel all the information it needs to avoid any problem.
> 
> Talk to the kernel people.

So what we agree about is that one context switch for waking up another thread
would be better than two or three. What we don't agree about is where this
should be solved. My opinion about this is as follows:
- This is a performance issue that can be solved in glibc.
- This is a performance issue that should be solved in glibc and not in the kernel.

Let me further clarify this by starting from typical POSIX threads code for
waking up a thread via a signal:

Thread 1:
  pthread_mutex_lock(&m);
  pthread_cond_wait(&cv, &m);
  pthread_mutex_unlock(&m);

Thread 2:
  pthread_mutex_lock(&m);
  pthread_cond_signal(&cv);
  pthread_mutex_unlock(&m);

Let us assume that thread two obtains a lock on mutex m after thread 1 entered
pthread_cond_wait(). In that case thread 2 will wake up thread 1 twice instead
of just one time: after pthread_cond_signal() wakes up pthread_cond_wait() via
the futex() system call, pthread_cond_wait() will try to lock the mutex m. But
since the mutex m is still locked at that time, thread 2 will be scheduled again
until it unlocks mutex m via another futex() system call. At that time thread 1
will be rescheduled. Or: three context switches have happened where one context
switch would have been sufficient. At least this is what I assume that happens.

My proposal is to change the implementation of the POSIX threads mutex and
condition variable primitives in such a way that waking up another thread is
deferred from pthread_cond_wait(cv) or pthread_cond_broadcast(cv) until
pthread_mutex_unlock(m) is called. And this for the mutex m that was associated
to condition variable cv via one or more threads calling pthread_cond_wait(). I
know this is nontrivial, and that several cases have to be considered:
- signal is sent from same process as the process waiting in pthread_cond_wait().
- signal is sent from another process than the one waiting in pthread_cond_wait().
- signal is sent to condition variable cv while the mutex m from the call to
pthread_cond_wait(cv, m) is locked.
- signal is sent to condition variable cv while the mutex m from the call to
pthread_cond_wait(cv, m) is not locked (software that shows such runtime
behavior is highly suspicious since this is a race condition, but the POSIX
standard does not forbid this kind of behavior).
- Note: in POSIX threads software where signals are only sent while the
appropriate mutex is held, there is exactly one mutex m associated with each
condition variable cv for threads blocked in pthread_cond_wait(cv, m).

One way to implement a solution is to keep track for each mutex m of all
condition variables cv for which a thread is blocked in pthread_cond_wait(cv,
m). In case only one process is involved this could be implemented by modifying
the pthread_cond_wait() implementation such that it allocates space for a list
element on the stack of the thread it is executed on. These list elements can
than be linked together via pointers, with a pointer to the head of the list
stored in the mutex data structure. When pthread_cond_signal(cv2) is called with
a lock held on mutex m2 by the same thread, this information is stored in the
data structures of cv2 and m2. At the time pthread_mutex_unlock() is called this
last function can then wake up the appropriate threads blocked in
pthread_cond_wait(). Or: no thread wakeup by pthread_cond_signal(), only by
pthread_mutex_unlock(). This optimization would have been a lot easier to
implement if the POSIX threads standard would have defined one single operation
pthread_cond_signal_and_mutex_unlock(cv, m).
Comment 10 Jakub Jelinek 2007-08-28 12:53:22 UTC
If you want to avoid that, simply don't call pthread_cond_signal with the
mutex held.
Comment 11 Bart Van Assche 2007-08-28 13:02:34 UTC
(In reply to comment #10)
> If you want to avoid that, simply don't call pthread_cond_signal with the
> mutex held.

Sorry Jakub, but IMHO that is the wrong way to go. It is not possible to write a
reliable POSIX threads program when not holding a mutex while calling
pthread_cond_signal(). Signals can easily be lost when you do that. 
Comment 12 Rich Felker 2012-09-19 08:54:35 UTC
Calling pthread_cond_signal without holding the mutex is perfectly valid and cannot result in signals being lost unless the implementation is buggy (which glibc may be; see bug #13165). What it can result in, however, is deviation from the expected priority behavior guaranteed by the priority scheduling options of POSIX.