This is the mail archive of the mailing list for the pthreads-win32 project.


Re: mutexes: "food for thought"

Hi all,

Summary: mutex speedups.

Belatedly getting around to Alexander Terekhov's sketched enhancements to mutexes (below), I've rewritten the mutex routines in pthreads-win32. The main objective was to remove the need for the extra critical section (wait_cs) in the unlock and timedlock routines.

However, I couldn't get my translation of Alexander's logic to work, so I have instead applied Ulrich Drepper's futex-based mutex algorithms (specifically 'Mutex2') from his paper "Futexes Are Tricky". Some of the other ideas in Alexander's sketch, such as postponing full initialisation of statically declared mutexes until the slow sections of the mutex operations (shown as DCSI() in the sketch below), have not been included yet, and may not be, because it would require recompiling applications before they could use the new dll. Postponing saves a compare op in each call to lock, timedlock or trylock.
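For anyone without the paper to hand, the core of Drepper's 'Mutex2' state machine can be sketched in portable C11 (my sketch, not the pthreads-win32 code). The states are 0 = unlocked, 1 = locked with no waiters, 2 = locked with possible waiters; a sched_yield() spin stands in here for the futex wait/wake calls (or, in pthreads-win32, the Win32 event), since the interesting part is the atomic state transitions:

```c
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_int state; } mutex2;

/* Returns the value *p held before the operation, like cmpxchg in the
   paper (and like Win32 InterlockedCompareExchange). */
static int cmpxchg(atomic_int *p, int oldv, int newv)
{
    atomic_compare_exchange_strong(p, &oldv, newv);
    return oldv;
}

static void mutex2_lock(mutex2 *m)
{
    /* Fast path: a single CAS 0 -> 1 when uncontended. */
    int c = cmpxchg(&m->state, 0, 1);
    if (c != 0) {
        do {
            /* Mark the lock contended (state 2) before sleeping, so the
               holder knows it must wake someone on unlock. */
            if (c == 2 || cmpxchg(&m->state, 1, 2) != 0)
                sched_yield();  /* futex_wait(&state, 2) in the paper */
        } while ((c = cmpxchg(&m->state, 0, 2)) != 0);
    }
}

static void mutex2_unlock(mutex2 *m)
{
    /* If the state was 2, someone may be waiting: reset to 0 and wake. */
    if (atomic_fetch_sub(&m->state, 1) != 1)
        atomic_store(&m->state, 0);  /* + futex_wake(&state, 1) in the paper */
}
```

Note that a waiter always reacquires the lock with state 2, which can cause one spurious wake on the next unlock; Drepper accepts this as the price of a one-CAS fast path.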

The new code is in CVS if anyone wants to inspect/try it. The modified pthreads-win32 dll passes the full testsuite and has achieved some very significant speedups - at least on my single processor machine. In particular, the rwlock7.c test, which intensively exercises reader/writer locks, runs approximately 3 times faster than previously. [The reader/writer locks in pthreads-win32 are built from pthreads-win32 mutexes and condition variables.]

Further speedups were attempted by inlining the [many] calls to InterlockedCompareExchange(), using the library's own assembler version of that routine (x86 only), which was originally included for Win9x systems. But, surprisingly, this canceled out almost all of the speed gains just made. It turns out that the 'lock' prefix on the cmpxchg instruction has this effect even on single-processor systems - as a Google search later confirmed - see:

Interestingly though, the Windows version of InterlockedCompareExchange() doesn't appear to use the 'lock' prefix on single-processor systems, since calling it is only marginally slower (by approximately 10%) than the new pthreads-win32 dll with inlined CMPXCHG minus the 'lock' prefix. I assume the difference is subroutine call overhead. So, rather than build a separate dll for SMP systems, inlining is currently turned off, sacrificing the 10% speed gain for binary portability.
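For reference, the contract of InterlockedCompareExchange() - return the value the destination held before the operation, swapping only if it equaled the comparand - maps onto portable C11 like this (my sketch, not the library's assembler version):

```c
#include <stdatomic.h>

/* Same contract as Win32 InterlockedCompareExchange(dest, exch, comparand):
   atomically, if *dest == comparand then *dest = exch; either way, the value
   *dest held before the operation is returned.  The caller detects success
   by comparing the return value against comparand. */
static long interlocked_compare_exchange(atomic_long *dest,
                                         long exch, long comparand)
{
    /* C11's CAS writes the observed value back into 'comparand'. */
    atomic_compare_exchange_strong(dest, &comparand, exch);
    return comparand;
}
```

On x86 a compiler lowers this to the same `lock cmpxchg` sequence discussed above; dropping the `lock` prefix leaves cmpxchg atomic only with respect to interrupts on the one CPU executing it, which is why the unprefixed inlined version is safe solely on uniprocessors.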

[If anyone wants to turn inlining on - after checking out the code from CVS, change the "#if 0" to "#if 1" at the bottom of ptw32_InterlockedCompareExchange.c, and build the dll by running "nmake clean VC-inlined", or "make clean GC-inlined" for MinGW.]

With all changes included, the performance of pthreads mutexes approaches, and in the case of trylock apparently exceeds, the performance of Win32 Critical Section calls - based on tests\benchtest1.c. But, by avoiding Win32 Critical Sections, there is now the possibility that pthreads-win32 mutexes can live in process-shared memory, which may then allow PROCESS_SHARED mutexes and other objects to be implemented.

Unless I'm mistaken, the one negative in all of this is that threads are no longer guaranteed strict FIFO access to the lock. That is, a thread newly requesting the lock can sometimes steal it from an already waiting thread.
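The barging window is visible in a single-threaded trace of the sketch's state machine (my illustration; states as in the sketch below: -1 free, 0 locked, 1 locked with contention). Unlocking makes the lock free and signals the waiter, but waking up takes time, and in that window any newcomer's fast-path swap wins:

```c
#include <stdatomic.h>

/* Replays the unlock/barge interleaving on the sketch's state word.
   Returns the status the barger observed: -1 means it saw 'free' and
   therefore acquired the lock ahead of the still-waking waiter. */
static int unlock_then_barge(void)
{
    atomic_int status = 1;  /* locked, with a thread already waiting */

    /* Owner unlocks: swap in -1 (free); the old value 1 tells it to
       signal the waiter, which now has to be scheduled back in. */
    int prev = atomic_exchange(&status, -1);
    (void)prev;             /* prev > 0: one waiter will be released */

    /* Before the waiter runs, a newcomer tries the fast path: swap in
       0 (locked) and observe -1 (free) -- it takes the lock, and the
       woken waiter will find it held and go back to sleep.  No FIFO. */
    return atomic_exchange(&status, 0);
}
```

With the extra wait_cs critical section, by contrast, the releaser handed the lock directly to a queued waiter, which is what enforced the old FIFO ordering.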


Alexander Terekhov wrote 1 year ago:


here's "ala futex based" mutex stuff using XCHG.

No need for CAS. I hope that it will work just fine.

Can you see any harmful race condition(s) here?



struct swap_based_mutex_for_windows {

  atomic<int> m_lock_status;                // -1: free, 0: locked, 1: lock-contention
  atomic<auto_reset_event *> m_retry_event; // DCSI'd

  void DCSI(); // double-checked serialized initialization
  void slow_lock();
  bool slow_trylock();
  bool slow_timedlock(absolute_timeout const & timeout);
  void release_one_waiter_if_any();

  void lock() {
    if (m_lock_status.swap(0, msync::acq) >= 0) slow_lock();
  }

  bool trylock() {
    return (m_lock_status.swap(0, msync::acq) < 0) ? true : slow_trylock();
  }

  bool timedlock(absolute_timeout const & timeout) {
    return (m_lock_status.swap(0, msync::acq) < 0) ? true : slow_timedlock(timeout);
  }

  void unlock() {
    if (m_lock_status.swap(-1, msync::rel) > 0) release_one_waiter_if_any();
  }
};

void swap_based_mutex_for_windows::slow_lock() {
  DCSI();
  while (m_lock_status.swap(1, msync::acq) >= 0)
    m_retry_event.load(msync::none)->wait();
}

bool swap_based_mutex_for_windows::slow_trylock() {
  return m_lock_status.swap(1, msync::acq) < 0;
}

bool swap_based_mutex_for_windows::slow_timedlock(absolute_timeout const & timeout) {
  DCSI();
  while (m_lock_status.swap(1, msync::acq) >= 0)
    if (!m_retry_event.load(msync::none)->timedwait(timeout)) return false;
  return true;
}

void swap_based_mutex_for_windows::release_one_waiter_if_any() {
  DCSI(); // contention implies a waiter has already created the event
  m_retry_event.load(msync::none)->set();
}

void swap_based_mutex_for_windows::DCSI() {
  if (!m_retry_event.load(msync::none)) {
    named_windows_mutex_trick guard(this);
    if (!m_retry_event.load(msync::none))
      m_retry_event.store(new auto_reset_event(), msync::rel);
  }
}

P.S. I've never run it. Just a sketch.
