Bug 25847 - pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Summary: pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl
Version: 2.27
Importance: P2 normal
Target Milestone: ---
Assignee: Carlos O'Donell
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-18 08:04 UTC by Qin Li
Modified: 2022-11-07 18:23 UTC
CC List: 33 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
the c repro mentioned in the description (2.37 KB, text/plain)
2020-04-18 08:04 UTC, Qin Li
mitigation patch mentioned in the description (1.15 KB, patch)
2020-04-18 08:07 UTC, Qin Li
mitigation patch mentioned in the description (do actual broadcast) (471 bytes, patch)
2020-04-20 17:52 UTC, Qin Li
Patch fix by broadening the scope of g_refs, making stealing impossible (8.07 KB, text/plain)
2020-11-15 02:45 UTC, Malte Skarupke
proposed lost wakeup fix with g_signals relative to g1_start and no signal stealing (7.38 KB, patch)
2021-05-04 22:48 UTC, Frank Barrus
simple reproducer for issue (806 bytes, text/x-csrc)
2022-09-26 14:28 UTC, Eric Hagberg

Description Qin Li 2020-04-18 08:04:10 UTC
Created attachment 12480 [details]
the c repro mentioned in the description

Hi, I will try to be concise, though this issue is hard to explain briefly. The problem is reproducible (with ~100 lines of C code that uses only pthreads) on Ubuntu 18.04 glibc 2.27, and having looked through the recent changes, I believe it applies to the latest version of glibc as well.

Our application is written in C# and runs on .net core on Linux, and .net core implements a critical section structure based on pthread_cond. The pthread_cond is used to maintain the waiters for the critical section, and the owner of the critical section calls pthread_cond_signal when leaving, to wake up one waiter that had blocked while entering it.
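
For context, the general shape of such a condvar-based critical section is roughly the following. This is a minimal sketch of the pattern only, not .net core's actual implementation; the critsec_* names are made up for illustration.

#include <pthread.h>
#include <stdbool.h>

/* Minimal sketch of a critical section built on a condvar (illustrative,
   not .net core's actual code).  */
typedef struct
{
  pthread_mutex_t lock;
  pthread_cond_t cond;
  bool held;
} critsec_t;

static void critsec_enter (critsec_t *cs)
{
  pthread_mutex_lock (&cs->lock);
  while (cs->held)                      /* contended: park on the condvar */
    pthread_cond_wait (&cs->cond, &cs->lock);
  cs->held = true;
  pthread_mutex_unlock (&cs->lock);
}

static void critsec_leave (critsec_t *cs)
{
  pthread_mutex_lock (&cs->lock);
  cs->held = false;
  pthread_cond_signal (&cs->cond);      /* wake one waiter, if any */
  pthread_mutex_unlock (&cs->lock);
}

If a pthread_cond_signal is ever lost, a thread parked in critsec_enter never wakes up even though held is false, which is exactly the hang described below.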

And it happens regularly that our application hangs; the culprit of the hang was traced to one or more threads stuck in pthread_cond_wait (called by the EnterCriticalSection function due to contention). The evidence that made us believe this is a bug in pthread_cond, rather than in .net core or our application, is this signature that we can always confirm from our application core dump:

(lldb) p (__pthread_cond_s)0x000055af919331c8
(__pthread_cond_s) $7 = {
{ __wseq = 5171942476 __wseq32 = (__low = 876975180, __high = 1) }
{ __g1_start = 5171942452 __g1_start32 = (__low = 876975156, __high = 1) }
__g_refs = ([0] = 4, [1] = 0)
__g_size = ([0] = 0, [1] = 0)
__g1_orig_size = 40
__wrefs = 16
__g_signals = ([0] = 0, [1] = 2)
}

The above internal state of pthread_cond is problematic because:
1) current G1 is at slot index 1 (LSB is 0 in wseq, so G2 at index 0)
2) there is 1 signal in G1, because __g_signals[G1] == 2
3) there are no futex waiters in G1, because __g_refs[G1] == 0
4) the last signaler just delivered the last signal to G1 of this pthread_cond, because __g_size[G1] == 0.
5) at the same time there are 2 waiters in G2, as __g_refs[G2] = 4

So the problem is that while there are 2 waiters in G2, the last signal was delivered to G1 at a point where __g_size[G1] == 1 but __g_refs[G1] == 0.
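
For readers unfamiliar with the field encodings, the observations above can be read directly off the dump: the low-order bit of each per-group counter is used as a flag, so the actual counts are the values shifted right by one. A small sketch of the decoding (illustrative only):

#include <stdio.h>

int main (void)
{
  unsigned long long wseq = 5171942476ULL;   /* __wseq from the dump */
  int g2 = (int) (wseq & 1);                 /* LSB selects G2: slot 0 */
  int g1 = g2 ^ 1;                           /* so G1 is slot 1 */
  printf ("G1 is slot %d, G2 is slot %d\n", g1, g2);
  printf ("signals pending in G1: %d\n", 2 >> 1);  /* __g_signals[1] == 2 */
  printf ("futex waiters in G1:   %d\n", 0 >> 1);  /* __g_refs[1]    == 0 */
  printf ("waiters in G2:         %d\n", 4 >> 1);  /* __g_refs[0]    == 4 */
  return 0;
}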

Because, based on lines 77 to 89 of pthread_cond_signal.c, before we increment the last signal on __g_signals[G1] from 0 to 2 (or some value N to N + 2), __g_size[G1] was 1, and then __g_size[G1] got decremented to 0 after the signal count was incremented.

  if ((cond->__data.__g_size[g1] != 0)
      || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
    {
      /* Add a signal.  Relaxed MO is fine because signaling does not need to
	 establish a happens-before relation (see above).  We do not mask the
	 release-MO store when initializing a group in
	 __condvar_quiesce_and_switch_g1 because we use an atomic
	 read-modify-write and thus extend that store's release sequence.  */
      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
      cond->__data.__g_size[g1]--;
      /* TODO Only set it if there are indeed futex waiters.  */
      do_futex_wake = true;
    }

The critical section uses only pthread_cond_wait, so cancellation is not applicable here, which makes pthread_cond_signal the only place __g_size[G1] is decremented. And because __g_refs[G1] = 0, we essentially delivered a signal to a G1 where there is no futex waiter waiting on it. The result is a lost signal, while there are still 2 more waiters waiting for a signal in G2.

While there seems to be a lot of speculation in the above reasoning, since the dump only shows the final state and doesn't necessarily prove the condvar was in the same state as described when the last signaler entered pthread_cond_signal, we do have more supporting evidence.

First, I wrote a C repro that uses only pthread_cond/mutex, mimicking the implementation of .net core's critical section and the usage pattern of the critical section that causes the hang.

Here is the link to the repro: https://gist.github.com/tqinli/db1b892a97cfa0bc41fb8b0b0b156b7e

On my 4 core, 8 hyperthread machine, the repro creates 12 threads (1.5 x hyperthreads) that are entering and leaving the same critical section doing some dummy computation. Each round, each thread will try to enter and leave the critical section once, then wait for the rest of the threads to finish entering and leaving the critical section once as well, before starting the next round, and it goes on forever.

Every 2 seconds, a monitor thread prints the current value of the total number of rounds and the dummy variable used during the calculation. When you see the same numbers being printed over and over every 2 seconds, the issue has reproduced.

It takes < 20 minutes on my machine to reproduce when I run just 1 instance of the repro program. I found it most effective to start 3 instances of the repro simultaneously, which tends to speed up reproduction significantly so it can happen within 1 minute, sometimes within ten seconds. Mileage may vary, especially depending on the number of cores you have. We have 64-core machines, and I've seen it take 10 hours to reproduce.
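
The overall shape of the repro is roughly as follows. This is a condensed sketch rather than the attached file: it reuses the critsec_* sketch from above for the condvar-based critical section, and the thread count, the pthread_barrier-based round synchronization, and the monitoring interval are simplifications of what the attached repro actually does.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_THREADS 12                 /* ~1.5 x hyperthreads in the report */

static critsec_t cs;                   /* condvar-based critical section (sketched above) */
static pthread_barrier_t round_barrier;
static volatile unsigned long rounds, dummy;

static void *worker (void *arg)
{
  (void) arg;
  for (;;)
    {
      critsec_enter (&cs);
      dummy += 1;                      /* dummy computation inside the CS */
      critsec_leave (&cs);
      /* Wait for every thread to finish this round before starting the next.  */
      if (pthread_barrier_wait (&round_barrier) == PTHREAD_BARRIER_SERIAL_THREAD)
        rounds += 1;
    }
}

static void *monitor (void *arg)
{
  (void) arg;
  for (;;)
    {
      sleep (2);
      /* If these values stop changing, the hang has reproduced.  */
      printf ("rounds=%lu dummy=%lu\n", rounds, dummy);
    }
}

int main (void)
{
  pthread_t tids[NUM_THREADS], mon;
  pthread_mutex_init (&cs.lock, NULL);
  pthread_cond_init (&cs.cond, NULL);
  cs.held = false;
  pthread_barrier_init (&round_barrier, NULL, NUM_THREADS);
  for (int i = 0; i < NUM_THREADS; i++)
    pthread_create (&tids[i], NULL, worker, NULL);
  pthread_create (&mon, NULL, monitor, NULL);
  pthread_join (tids[0], NULL);        /* runs until killed (or hung) */
  return 0;
}

If the lost-signal state is hit, one worker stays blocked in pthread_cond_wait while all other workers block on the round barrier, so no further signals ever arrive and the printed counters stop advancing.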

Second, I believe I found the explanation for why we would ever enter a situation where __g_size[G1] is 1 more than __g_refs[G1]. It is related to how signal stealing is handled.

Right after a signal is successfully taken by a pthread_cond waiter, we have to check __g1_start to confirm whether the waiter took the signal from the G1 it actually belongs to, or from a later G1 that happened to alias onto the same slot; see the code snippets below:

line 548 to 560 from pthread_cond_wait.c:
  /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
     of __g1_start below (see spinning above for a similar case).  In
     particular, if we steal from a more recent group, we will also see a
     more recent __g1_start below.  */
  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
						&signals, signals - 2));

  /* We consumed a signal but we could have consumed from a more recent group
     that aliased with ours due to being in the same group slot.  If this
     might be the case our group must be closed as visible through
     __g1_start.  */
  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
  if (seq < (g1_start >> 1))

starting around line 594 of pthread_cond_wait.c:

	      /* Try to add a signal.  We don't need to acquire the lock
		 because at worst we can cause a spurious wake-up.  If the
		 group is in the process of being closed (LSB is true), this
		 has an effect similar to us adding a signal.  */
	      if (((s & 1) != 0)
		  || atomic_compare_exchange_weak_relaxed
		       (cond->__data.__g_signals + g, &s, s + 2))

This is the place where a signal is posted to G1, but __g_size[G1] isn't decremented at the same time, which is correct if the waiter is indeed stealing a signal, as the decrementing of __g_size[G1] had already happened when the signal is originally posted in pthread_cond_signal.

However, the two operations, a) taking a signal and b) checking the current __g1_start, cannot happen atomically. If, unfortunately, a thread is scheduled out right after this line, after it took the signal:
  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
						&signals, signals - 2));

And it resumed running only a while later, during which time several group rotations happened; the thread could now observe a newer g1_start at the line below. If the new G1 happens to be aliased onto the same slot, the code incorrectly thinks the waiter is stealing a signal from the new G1, when in fact it took the signal from an old G1:
  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);

When we then post the signal back to the new and current G1, we will spuriously wake up a futex waiter in this group, causing __g_refs[G1] to be 1 less than __g_size[G1]. And we know __g_size[G1] isn't decremented because the original signal didn't come from this new G1. At this point the damage to this G1 is already done, although this is not the final state yet.

As the G1 might still have several remaining waiters, when new signals come, waiters from this damaged G1 will still be woken up. Once the last signal is delivered to this G1, we observe what was shown in the dump above: we posted a signal to a G1, but no futex waiter woke up, as __g_refs[G1] reached 0 before __g_size[G1] did, and the signal remains untaken. But in the meantime there are one or more waiters in G2. The signal is lost, when we could instead have woken up a waiter in G2.

I think the above code does give a plausible explanation for the dump and does suggest there might be a bug for handling signal stealing this way.

Third, with both the repro and a theory about why it happened, we tried to hack the code to verify whether the theory is correct. Here is a link to the hack:
https://github.com/thetradedesk/glibc/commit/900792b317b5b72a27bb4ec2aec1cd794dca62a5

Basically, if we suspect the current waiter stole the signal, posting a signal back to the current G1 must happen first, because if we are right, it is also possible that a signaler thread is blocked in quiesce_and_switch() while acquiring the private lock of the pthread_cond. The next step is to mitigate the case where we are wrong about stealing, which I can't think of an easy way to detect. If we can't detect when we are wrong about stealing, the only way I can think of is to dismantle G1 entirely and let every waiter in G1 spuriously wake up. And this broadcast is only possible after we have potentially unblocked a thread blocked in quiesce_and_switch while it holds the pthread_cond private lock.

This is by no means suggesting that this is the actual fix, if this bug is confirmed here. But when I tried the repro with a "fixed" libpthread.so, what used to reproduce in 15-20 minutes no longer reproduces. I have been able to run my repro for 2 days straight without a hang, which gives me good confidence that the theory above holds and is what is causing the problem we saw here.

So this is the issue. There are still a lot of details that I intentionally left out to keep this concise: details about the repro, our application and .net core's critical section implementation based on pthread_cond, and why we seem to be the only ones hitting this issue. Please let me know if you have any questions and I will be happy to answer. Looking forward to a quick confirmation and a fix for this; let me know if there is anything I can do to help make it happen faster.

Best Regards.
Qin Li
Comment 1 Qin Li 2020-04-18 08:07:57 UTC
Created attachment 12481 [details]
mitigation patch mentioned in the description

This is the mitigation patch mentioned in the description. A link was used in the description, but I am also attaching it here in case the link breaks in the future.
Comment 2 Qin Li 2020-04-20 17:52:24 UTC
Created attachment 12484 [details]
mitigation patch mentioned in the description (do actual broadcast)

Looks like I have to do an actual broadcast of both G1 and G2, not just G1.
Comment 3 Arun 2020-05-05 11:50:46 UTC
I am probably facing the same issue with Python as well, where the GIL (Global Interpreter Lock) is the one locking the critical region of code.

My setup involves spawning 4 to 8 Python threads and each thread would be making lots of Redis socket calls. Python releases the GIL whenever a socket call is made so that other threads can make some progress.

Below is the backtrace of the thread which was supposed to release the GIL by doing a condition signal first.

-------------------

(gdb) bt

#0  0x00007f12c99e04b0 in futex_wait (private=<optimized out>, expected=4, futex_word=0xa745b0 <gil_cond+16>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:61

#1  0x00007f12c99e04b0 in futex_wait_simple (private=<optimized out>, expected=4, futex_word=0xa745b0 <gil_cond+16>)
    at ../sysdeps/nptl/futex-internal.h:135
#2  0x00007f12c99e04b0 in __condvar_quiesce_and_switch_g1 (private=<optimized out>,
                                                           g1index=<synthetic pointer>,
                                                           wseq=<optimized out>,
                                                           cond=0xa745a0 <gil_cond>) at pthread_cond_common.c:412

#3  0x00007f12c99e04b0 in __pthread_cond_signal (cond=0xa745a0 <gil_cond>) at pthread_cond_signal.c:78

#4  0x000000000050bfc8 in drop_gil.lto_priv (tstate=0x146ee3f0) at ../Python/ceval_gil.h:187
#5  0x000000000050c0ed in PyEval_SaveThread () at ../Python/ceval.c:356
#6  0x00000000005c04bb in sock_call_ex (s=0x7f125b97a9a8, writing=1, sock_func=

    0x5bdbf0 <sock_send_impl>, data=0x7f12594650a0, connect=0, err=0x0, timeout=-1000000000) at ../Modules/socketmodule.c:899
#7  0x00000000005c0659 in sock_sendall (s=0x7f125b97a9a8, args=<optimized out>) at ../Modules/socketmodule.c:3833
#8  0x000000000050a8af in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=<optimized out>, func_obj=<built-in method sendall of socket object at remote 0x7f125b97a9a8>) at ../Objects/methodobject.c:234
#9  0x000000000050a8af in _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, stack=<optimized out>, func=<optimized out>) at ../Objects/methodobject.c:294
#10 0x000000000050a8af in call_function.lto_priv (pp_stack=0x7f1259465250, oparg=<optimized out>, kwnames=<optimized out>)
    at ../Python/ceval.c:4851
#11 0x000000000050c5b9 in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3335
#12 0x0000000000509d48 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x16cdcbf8, for file /usr/lib/python3/dist-packages/redis/connection.py, line 590, in send_packed_command (self=<Connection(pid=12668, host='10.64.219.4', port=6379, db=0, password=None, socket_timeout=None, socket_connect_timeout=None, socket_keepalive=None, socket_keepalive_options={}, retry_on_timeout=False, encoder=<Encoder(encoding='utf-8', encoding_errors='strict', decode_responses=True) at remote 0x7f1259c84be0>, _sock=<socket at remote 0x7f125b97a9a8>, _parser=<HiredisParser(socket_read_size=65536, _buffer=<bytearray at remote 0x7f1259c848b8>, _sock=<...>, _reader=<hiredis.Reader at remote 0x7f1259c8b0c0>, _next_response=False) at remote 0x7f1259c84518>, _description_args={'host': '10.64.219.4', 'port': 6379, 'db': 0}, _connect_callbacks=[]) at remote 0x7f1259c84ba8>, command=[b'*2\r\n$3\r\nGET\r\n$54\r\nproj_00|mock1|cps|07b6e3d7-5ed1-36e8-81ef-e53777298405\r\n'], item=b'*2\r\n$3\r\nGET\r\n$54\r\nproj_00|mock1|cps|07b6e3d7-5ed1-36e8-81ef-e53777298405\r\n')) at ../Python/ceval.c:754
#13 0x0000000000509d48 in _PyFunction_FastCall (globals=<optimized out>, nargs=382585848, args=<optimized out>, co=<optimized out>)
    at ../Python/ceval.c:4933


-------------------

Part of the python code that drops the GIL:

--------------------
static void
drop_gil(struct _ceval_runtime_state *ceval, PyThreadState *tstate)
{
    struct _gil_runtime_state *gil = &ceval->gil;
    if (!_Py_atomic_load_relaxed(&gil->locked)) {
        Py_FatalError("drop_gil: GIL is not locked");
    }

    /* tstate is allowed to be NULL (early interpreter init) */
    if (tstate != NULL) {
        /* Sub-interpreter support: threads might have been switched
           under our feet using PyThreadState_Swap(). Fix the GIL last
           holder variable so that our heuristics work. */
        _Py_atomic_store_relaxed(&gil->last_holder, (uintptr_t)tstate);
    }

    MUTEX_LOCK(gil->mutex);
    _Py_ANNOTATE_RWLOCK_RELEASED(&gil->locked, /*is_write=*/1);
    _Py_atomic_store_relaxed(&gil->locked, 0);
    COND_SIGNAL(gil->cond);
    MUTEX_UNLOCK(gil->mutex);

#ifdef FORCE_SWITCHING
    if (_Py_atomic_load_relaxed(&ceval->gil_drop_request) && tstate != NULL) {
        MUTEX_LOCK(gil->switch_mutex);
        /* Not switched yet => wait */
        if (((PyThreadState*)_Py_atomic_load_relaxed(&gil->last_holder)) == tstate)
        {
            assert(is_tstate_valid(tstate));
            RESET_GIL_DROP_REQUEST(tstate->interp);
            /* NOTE: if COND_WAIT does not atomically start waiting when
               releasing the mutex, another thread can run through, take
               the GIL and drop it again, and reset the condition
               before we even had a chance to wait for it. */
            COND_WAIT(gil->switch_cond, gil->switch_mutex);
        }
        MUTEX_UNLOCK(gil->switch_mutex);
    }
#endif
}
--------------------
Comment 4 Arun 2020-05-05 11:52:13 UTC
Forgot to mention the platform. The OS and libc version are the same as Qin reported. It's amd64, on a VM on OpenStack.
Comment 5 Ding Xiang Fei 2020-05-14 08:52:43 UTC
This bug can be reproduced with OCaml applications running on Debian 10 (buster), where the glibc version is 2.28-10, though I think more recent versions of glibc will be affected as well since the current construction has not been revised since 2016.

It has a similar symptom with regard to the master lock of the OCaml runtime system. With OCaml version 4.09, the bug is triggered, producing the following state of the condition variable.

{
  __data = {
    {
      __wseq = 105780454, 
      __wseq32 = {
        __low = 105780454, 
        __high = 0
      }
    }, 
    {
      __g1_start = 105780438, 
      __g1_start32 = {
        __low = 105780438, 
        __high = 0
      }
    }, 
    __g_refs = {12, 0}, 
    __g_size = {0, 0}, 
    __g1_orig_size = 8, 
    __wrefs = 48, 
    __g_signals = {0, 2}
  }, 
  __size = "\346\024N\006\000\000\000\000\326\024N\006\000\000\000\000\f", '\000' <repeats 15 times>, "\b\000\000\000\060\000\000\000\000\000\000\000\002\000\000", 
  __align = 105780454
}

One signal is pending for G1, while G1 is "empty".

The repro code from Qin also locks up on a Skylake processor, restricted to run on 4 cores, 100 threads.
Comment 6 Carlos O'Donell 2020-05-19 21:15:39 UTC
Thanks for reporting this. As you know, reviewing this is quite complex. The initial analysis contains a lot of good detail, thank you for this. Correctness issues like this are going to take some time to resolve.
Comment 7 Georgy Kirichenko 2020-10-16 20:23:35 UTC
I did a little experiment building a modified version of glibc; the modification adds usleep(0) just before line 561 (uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);) in pthread_cond_wait.c, and I hit the issue right after running the repro.
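
In other words, the experiment widens the race window described in the report by injecting a delay between consuming the signal and reading __g1_start, roughly like this (a sketch of the modification, not an actual glibc patch; comment 10 below suggests usleep(10) reproduces faster):

  /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
     of __g1_start below (see spinning above for a similar case).  */
  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
						&signals, signals - 2));

  usleep (0);   /* experiment only: widen the window between consuming the
		   signal and reading __g1_start (not part of glibc) */

  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
  if (seq < (g1_start >> 1))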
Comment 8 Michael Bacarella 2020-10-18 22:55:15 UTC
I ran into this issue recently as well.  It took a *very* long time to isolate.  I'm glad to see a patch was posted already.

I confirm on my systems that the deadlock was introduced in 2.27 and appears to resolve when the one-line patch is applied to at least glibc-2.31.
Comment 9 Georgy Kirichenko 2020-10-20 20:12:37 UTC
I have another idea for how this could be fixed: what if we make g_signals 64 bits in size, where the high half contains a group rotation count and the low half the count of signals in a group (as it was before)? This would protect us from signal stealing and allow each waiter to detect whether its group was closed.
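
As a rough sketch of that idea (purely illustrative; the types and helper names are made up and this is not a worked-out glibc patch), a waiter would remember the rotation count of the group it entered and refuse to consume a signal once the slot has been reused for a newer group:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: a 64-bit per-slot word with the group-rotation count in
   the high 32 bits and the signal count in the low 32 bits.  */
typedef _Atomic uint64_t g_signals64_t;

static inline uint32_t rotation_of (uint64_t v) { return (uint32_t) (v >> 32); }
static inline uint32_t signals_of (uint64_t v) { return (uint32_t) v; }

/* Waiter side: only consume a signal if the slot still belongs to the
   rotation this waiter started in, so stealing from a newer group that
   aliased onto the same slot becomes impossible.  */
static bool try_consume_signal (g_signals64_t *slot, uint32_t my_rotation)
{
  uint64_t v = atomic_load_explicit (slot, memory_order_acquire);
  while (rotation_of (v) == my_rotation && signals_of (v) > 0)
    {
      if (atomic_compare_exchange_weak_explicit (slot, &v, v - 1,
						 memory_order_acquire,
						 memory_order_relaxed))
	return true;              /* consumed a signal from our own group */
      /* v was reloaded by the failed CAS; re-check rotation and count.  */
    }
  return false;                   /* group was rotated away, or no signal */
}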
Comment 10 Georgy Kirichenko 2020-10-20 20:17:36 UTC
In reply to my comment #7, I should say that usleep(10) should be used instead, to reproduce the issue faster.
Comment 11 Malte Skarupke 2020-11-01 13:58:23 UTC
I blogged about trying to understand this bug using TLA+ here:
https://probablydance.com/2020/10/31/using-tla-in-the-real-world-to-understand-a-glibc-bug/

TLA+ is a tool for doing program verification. I'll make the code available under any license you want. (I presume LGPL is good?) I think it would be good to have a TLA+ implementation of the condition variables in glibc.

My solution for this bug based on the investigation would be to broaden the scope of g_refs, so that it gets incremented in line 411 in pthread_cond_wait.c, and gets decremented in line 54. That makes it align exactly with the increment and decrement of wrefs, which might make wrefs unnecessary. (though I didn't verify that) If it does make wrefs unnecessary, this fix could both simplify the code, speed it up a little, and also get rid of the "potential stealing" case.

I haven't yet done the work to actually implement and try that change. Let me know if there is interest in me trying that. (or also let me know if it's unlikely to work because I missed something)
Comment 12 Michael Bacarella 2020-11-01 15:40:30 UTC
(In reply to Malte Skarupke from comment #11)
> I blogged about trying to understand this bug using TLA+ here:
> https://probablydance.com/2020/10/31/using-tla-in-the-real-world-to-
> understand-a-glibc-bug/

This is an awesome write-up.
 
> TLA+ is a tool for doing program verification. I'll make the code available
> in any license you want. (I presume LGPL is good?) I think it would be good
> to have a TLA+ implementation of the condition variables in glibc.

Agree.  I will happily contribute to this cause.

> My solution for this bug based on the investigation would be to broaden the
> scope of g_refs, so that it gets incremented in line 411 in
> pthread_cond_wait.c, and gets decremented in line 54. That makes it align
> exactly with the increment and decrement of wrefs, which might make wrefs
> unnecessary. (though I didn't verify that) If it does make wrefs
> unnecessary, this fix could both simplify the code, speed it up a little,
> and also get rid of the "potential stealing" case.
> 
> I haven't yet done the work to actually implement and try that change. Let
> me know if there is interest in me trying that. (or also let me know if it's
> unlikely to work because I missed something)

I'm fairly confident that Qin Li's patch stops the deadlock from happening (see my previous comment). I understand it may not be the most efficient fix, but I consider not deadlocking to outweigh any possible inefficiency introduced only in this rather rare signal-stealing edge case.

Given how hard this bug is to isolate (it took about two man-months on my side), and that other orgs have gone through and are going through similar experiences, it seems right to apply the fix now and discuss improving the performance of the fix in a new bug.

From Malte's excellent blog post it appears people have been struggling with this since at least 2019 https://github.com/mratsim/weave/issues/56
Comment 13 Malte Skarupke 2020-11-15 02:45:06 UTC
Created attachment 12956 [details]
Patch fix by broadening the scope of g_refs, making stealing impossible

I've attached a patch file (Fix-pthread_cond_wait-signal-bug-related-to-stealing.patch) that contains a fix for this bug. I haven't verified this fix with TLA+ yet, but I still intend to do that soon. I'm posting this early because I'm about to go on vacation for a week, so I didn't want to keep people waiting. Plus, it's not like other patches are verified in TLA+. So for now this follows the usual standard of passing all the tests, not triggering the bug with the reproducer attached to this ticket, and of me spending a lot of time reasoning through it and convincing myself that it's correct.

The fix works by making stealing of signals from later groups impossible, allowing me to get rid of the code that tries to handle the stealing case, thus getting rid of the bug in that code.

The fix got a bit bigger in scope. While a simple fix would have just broadened the scope of g_refs (that simpler fix I did verify in TLA+), it leads to inelegant code where both wrefs and g_refs serve pretty much the same purpose. So I decided to get rid of the reference count in wrefs with this change. Unfortunately I couldn't completely do that, so the change got more complex.

This definitely needs some reviewing before people use it. But I think the direction is good, since it gets rid of a very complex edge case. Yes, it also introduces a new edge case, but that new edge case only happens for people who use pthread_cancel, and even then it should be simpler to reason through. (to the point that I'm not going to explain it here, hoping that you'll understand it just from the comment in the code)

Let me know if there is anything else that you need. Like is this even the right place to submit a patch? Or should I send an email to the mailing list?
Comment 14 Adhemerval Zanella 2020-11-23 16:46:27 UTC
Hi Malte Skarupke, thanks for working on this. There is no active patch review in the bug report (as for PRs/defects on other tools/projects); patches are sent and reviewed on libc-alpha [1]. Keep in mind that the length of this change will require a copyright assignment, as indicated in the contribution checklist [2]. Could you send it upstream for review?

This is a really tricky issue and it might take some time for other developers to validate and review this change. You may also CC the creator of the new pthread_cond_wait implementation, Torvald Riegel <triegel@redhat.com>.

[1] https://sourceware.org/mailman/listinfo/libc-alpha
[2] https://sourceware.org/glibc/wiki/Contribution%20checklist
Comment 15 Balint Reczey 2020-12-07 15:41:08 UTC
Thank you for both the bug report and the proposed fixes.
For the record, I've cherry-picked the one-line fix into the Ubuntu development release and I plan to backport it to stable releases in the next few weeks unless a different patch is accepted here earlier.
Comment 16 Malte Skarupke 2020-12-09 03:22:16 UTC
Sorry I was dragging my feet on submitting the patch I created above. I finally submitted it to the mailing list. I also sent off the email that I think I have to send in order to get the copyright assignment in order.

Oh and I did end up verifying my patch with TLA+. The above patch actually had a bug in pthread_cond_destroy that TLA+ found for me. It's fixed in the version that I sent to the mailing list.

Link to mailing list patch in the archive:
https://sourceware.org/pipermail/libc-alpha/2020-December/120563.html
Comment 17 Torvald Riegel 2020-12-24 20:02:48 UTC
(In reply to Arun from comment #3)
>     MUTEX_LOCK(gil->mutex);
>     _Py_ANNOTATE_RWLOCK_RELEASED(&gil->locked, /*is_write=*/1);
>     _Py_atomic_store_relaxed(&gil->locked, 0);
>     COND_SIGNAL(gil->cond);
>     MUTEX_UNLOCK(gil->mutex);
> 
> #ifdef FORCE_SWITCHING
>     if (_Py_atomic_load_relaxed(&ceval->gil_drop_request) && tstate != NULL)
> {
>         MUTEX_LOCK(gil->switch_mutex);
>         /* Not switched yet => wait */
>         if (((PyThreadState*)_Py_atomic_load_relaxed(&gil->last_holder)) ==
> tstate)
>         {
>             assert(is_tstate_valid(tstate));
>             RESET_GIL_DROP_REQUEST(tstate->interp);
>             /* NOTE: if COND_WAIT does not atomically start waiting when
>                releasing the mutex, another thread can run through, take
>                the GIL and drop it again, and reset the condition
>                before we even had a chance to wait for it. */
>             COND_WAIT(gil->switch_cond, gil->switch_mutex);
>         }
>         MUTEX_UNLOCK(gil->switch_mutex);
>     }
> #endif
> }

Can you please check whether you are using condvars correctly in your code, in particular whether your code handles spurious wake-ups of COND_WAIT correctly?  The bits of code you have posted do not have a loop that checks the wait condition again; there is just an if statement and you unlock the mutex right after the COND_WAIT.
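
For reference, the point about spurious wake-ups is that a condvar wait must always re-check its predicate in a loop while holding the mutex; a generic sketch of the expected pattern (not the Python GIL code, names are illustrative):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool ready;                     /* the wait condition, protected by m */

static void wait_until_ready (void)
{
  pthread_mutex_lock (&m);
  while (!ready)                       /* loop: wake-ups may be spurious */
    pthread_cond_wait (&cv, &m);
  pthread_mutex_unlock (&m);
}

static void make_ready (void)
{
  pthread_mutex_lock (&m);
  ready = true;
  pthread_cond_signal (&cv);
  pthread_mutex_unlock (&m);
}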

Also, the two critical sections seem to use different mutexes and different conditions.  It would be more helpful if you could show code examples for pairs of related signals and waits.

The use of atomic access to the condition within the critical section (gil->last_holder) can make sense, but it should not be required because that's what the mutex / critical section takes care of in a typical use of condvars. Perhaps check that as well.

Ideally, a small reproducer would be best.  (I'm aware of the first reproducer posted, but I'm currently looking at it and am not yet convinced that it is correct; it sends out more signals than the number of wake-ups it allows through the wait condition, AFAICT, which I find surprising.)
Comment 18 Torvald Riegel 2020-12-25 16:19:07 UTC
(In reply to Qin Li from comment #0)
> As the G1 might still have several remaining waiters, when new signals come,
> waiters from this damaged G1 will still be woke up. Until the last signal is
> delivered on this G1, we would observe what was shown in the dump above:
> that we posted a signal to a G1, no futex waiter woke up, as __g_refs[G1]
> was already 0 before __g_size[G1] did, and the signal remains not taken. But
> in the meantime there are one or more waiters in G2. Signal is lost, when
> indeed we could have wake up a waiter in G2.

Qin Li, thank you for investigating, isolating the issue, and providing the reproducer.  I agree that there is a bug in that just incrementing a group's number of available signals isn't sufficient because it doesn't adjust the group size (ie, the number of waiters that signalers think are still in this group) accordingly.

The result is that signalers can put a correct signal into a group that is already effectively empty even though its size doesn't show that, which leads to this signal being "lost". __condvar_cancel_waiting also updates the group size, for example.

I also think that a solution would be to handle potential stealing in such a way that it takes all the steps a pthread_cond_signal would, so including updating the group's size.  I believe we don't need a full broadcast for that, which would reset everything basically.

Malte Skarupke's proposed solution of "locking" groups through __g_refs more broadly could also work in principle.  I'll respond to the patch directly on libc-alpha.  I'm wondering about the performance implications of this approach, even though a full pthread_cond_signal could also be costly because it adds contention to signalers.

(In reply to Torvald Riegel from comment #17)
> (I'm aware of the first
> reproducer posted, but I'm currently looking at it and am not yet convinced
> that it is correct; it sends out more signals than the number of wake-ups it
> allows through the wait condition, AFAICT, which I find surprising.)

I haven't fully wrapped my head around how the critical section implementation covered in the reproducer works, but I also don't see any concrete red flags anymore.  What had me surprised is that it really depends on blocking through the condvar, which I'd say is different from how many concurrent algorithms treat blocking through OS facilities like futexes as "optional" because the real functional blocking happens through shared-memory synchronization.  Second, the AWAKENED_WAITER flag seems to be intended to leak into the waiter ref count (ie, you can't have the waiter ref count start at a higher bit, the "overflow" of the flag seems to be intended).
Comment 19 Manoj Gupta 2021-01-07 01:09:49 UTC
We are also hitting this bug in Chrome OS builders (https://bugs.chromium.org/p/chromium/issues/detail?id=1101625)

@Malte, is your upstream patch accepted?
Comment 20 Balint Reczey 2021-01-07 07:31:29 UTC
This bug is also tracked in Ubuntu at https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800 .
Qin Li's patch is already included in Ubuntu's development series, on the basis that a minor performance regression is preferable to the occasional deadlock.
The patch is also being back-ported to Ubuntu 20.10 and will be back-ported to Ubuntu 20.04 LTS after the performance impact is known better, unless a better fix is developed.

Measurement results on the performance impact of Qin Li's patch are highly welcome, either on Ubuntu 20.10 or on any other system.
Comment 21 Malte Skarupke 2021-01-07 13:54:32 UTC
Sorry, I continue to have very little energy to work on this. I put a little more work into it yesterday though, and I now have more doubts that the mitigation patch is a long-term solution.

I still think it's better than doing nothing, but it might introduce an even more rare deadlock instead. It's even more rare, because while the current deadlock triggers only in an edge case of an edge case, the new one would only trigger in an edge case of said edge case of an edge case.

My suspicions come from trying to make one of the alternative solutions mentioned in the patch email chain work, which was to do something similar to the mitigation patch, but to signal for the correct group instead of broadcasting. I couldn't make that alternative change work, and I reduced the problem down to calling this right in front of the "We potentially stole a signal" check:

  __condvar_acquire_lock (cond, private);
  __condvar_release_lock (cond, private);

Just taking and releasing the internal lock can deadlock in some of the tests. (most often in sysdeps/pthread/tst-cond16.c) The mitigation patch will call these functions. Most of the time that doesn't deadlock, and I haven't found out what exactly causes the deadlock (I'll instead put my limited energy back into a second version of my emailed patch), but the fact that unconditionally calling the above will deadlock should give us suspicion that calling the above conditionally might also sometimes deadlock. It's impossible to reproduce in the glibc tests because it's even more rare than the original problem, so I think the mitigation patch is still a little better than doing nothing.

Otherwise I think the patch I emailed is still good, (unlike my patch attached to this issue) and I aim to have a second version out, maybe on the weekend, that addresses the comments in the email conversation.
Comment 22 Torvald Riegel 2021-01-07 20:43:31 UTC
(In reply to Manoj Gupta from comment #19)
> We are also hitting this bug in Chrome OS builders
> (https://bugs.chromium.org/p/chromium/issues/detail?id=1101625)

Have you confirmed that there is a thread that's stuck in a condvar wait, and with a faulty state as described in this BZ?  Looking through this issue, I only saw a futex waiter being mentioned (which could be any other synchronization primitive that uses futexes, and could be due to buggy use of that primitive by the program).

> @Malte, is your upstream patch accepted?

AFAIA, no.  (And I'm not an active member of the glibc community anymore, so I don't get to contribute to acceptance decisions.)
Comment 23 Torvald Riegel 2021-01-07 23:31:05 UTC
(In reply to Malte Skarupke from comment #21)
> My suspicions come from trying to make one of the alternative solutions
> mentioned in the patch email chain work, which was to do something similar
> to the mitigation patch, but to signal for the correct group instead of
> broadcasting. I couldn't make that alternative change work, and I reduced
> the problem down to calling this right in front of the "We potentially stole
> a signal" check:
> 
>   __condvar_acquire_lock (cond, private);
>   __condvar_release_lock (cond, private);
> 
> Just taking and releasing the internal lock can deadlock in some of the
> tests. (most often in sysdeps/pthread/tst-cond16.c)

I believe that this just shows that stealing can happen.  Cancellation and timeouts, which appear to be working fine, also acquire this lock in __condvar_cancel_waiting -- but do so before decreasing the number of available signals.
Consider the case where a signal is stolen by thread A from a group 1 that is then fully signaled according to the group size, and then another thread B tries to quiesce and switch group 1.  The internal lock is held by B, which waits for a waiter W whose signal has been stolen by A.  W can't make progress because A has "its" signal.  And A waits for B if one adds the empty critical section, which is how I think this can deadlock.

> The mitigation patch
> will call these functions.

It does, but after giving back the stolen signal (or at least trying to do so, incorrectly).
Comment 24 Malte Skarupke 2021-01-08 03:45:05 UTC
(In reply to Torvald Riegel from comment #23)
> Consider the case where a signal is stolen by thread A from a group 1 that
> is then fully signaled according to the group size, and then another thread
> B tries to quiesce and switch group 1.  The internal lock is held by B,
> which waits for a waiter W whose signal has been stolen by A.  W can't make
> progress because A has "its" signal.  And A waits for B if one adds the
> empty critical section, which is how I think this can deadlock.
> 
> > The mitigation patch
> > will call these functions.
> 
> It does, but after giving back the stolen signal (or at least trying to do
> so, incorrectly).

That makes sense, and it would mean that the mitigation patch should be good.

So if stealing happens, a waiter is blocked. And that causes a signaler to block because it can't close the group. And because of that, I couldn't undo the stealing by signaling, because two threads can't signal at the same time because they have to hold the internal mutex.

Instead we have to do the partial signal that we do right now. But that one fails if stealing didn't happen. Because then we get into a weird state where g_size is incorrect because we can't change g_size without taking the internal mutex.

This means we have to do something different depending on whether stealing happened or didn't happen. But we only have this code to begin with because we can't tell one from the other.

Which means the mitigation patch might be the only way to do this properly: Do the partial signal in case stealing happens, and then in case stealing didn't happen do a broadcast to reset the state. And replacing the broadcast with a signal will be difficult because signaling doesn't reset the state of the condition variable, and therefore can't get us out of the weird state in case stealing didn't happen.

I personally think this is unsatisfying. It's very complex, even if we can find a way to do this correctly. So I won't attempt to work in this direction again, and I'll instead try to clean up the patch that makes stealing impossible, which reduces complexity.
Comment 25 Torvald Riegel 2021-01-16 00:21:17 UTC
(In reply to Qin Li from comment #0)
> [...]
> As the G1 might still have several remaining waiters, when new signals come,
> waiters from this damaged G1 will still be woke up. Until the last signal is
> delivered on this G1, we would observe what was shown in the dump above:
> that we posted a signal to a G1, no futex waiter woke up, as __g_refs[G1]
> was already 0 before __g_size[G1] did, and the signal remains not taken. But
> in the meantime there are one or more waiters in G2. Signal is lost, when
> indeed we could have wake up a waiter in G2.

So the stealing mitigation that adds a signal by just incrementing the respective slot in __g_signals can effectively mask another, real signal, but ...

> I think the above code does give a plausible explanation for the dump and
> does suggest there might be a bug for handling signal stealing this way.
> 
> Third, with both the repro, and a theory about why it happened, we tried to
> hack the code to verify whether the theory is correct with the repro. Here a
> link to the hack:
> https://github.com/thetradedesk/glibc/commit/
> 900792b317b5b72a27bb4ec2aec1cd794dca62a5

... why would a broadcast be required instead of just adding one more signal to fix/replay the masked signal?  (You later said that you found out the patch above that just closes group 1 isn't sufficient, and then proposed to use the broadcast.)

I don't have an explanation why the broadcast would be required, but I can confirm that just using one signal does not work for your test program.  When it hangs, I get such a state, for example:

__wseq = 10815175, __g1_start = 10815161, __g_refs = {0, 8},
__g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 32, __g_signals = {2, 0}

with the 4 remaining waiters all being in group 2 (ie, slot 1), and group 1 having one more signal than necessary according to its size (which would be consistent with one being added unnecessarily).

So maybe adding one additional signal is insufficient; if I add two signals, I haven't observed a failure with the reproducer yet. But why would it have to be more than one?

> So this is the issue. There are still a lot of details that I intentionally
> left out to make it concise, details about the repro, our application and
> .net core's critical section implementation based on pthread_cond, and why
> we seems to be the only ones hitting this issue.

Are you sure that your reproducer can correctly handle spurious wakeups?  I'm asking because if there might be a bug in the reproducer too, this could explain why a single additional signal does not work (assuming that this would indeed be a correct solution): The spuriously woken waiter goes to the back of the queue, but the additional signal goes in before any waiter including the spuriously woken one is in G2. 
The stealing and the additional signals resulting from that are the only sources of spurious wakeups given that you don't use cancellation or timeouts in the reproducer.  A broadcast or more signals could potentially just make it unlikely that a fault in the reproducer triggers.

Thus, if you could point me to docs or a version of your mutex implementation with comments, this would help me wrap my head around the details of it.

Malte, if you should have some cycles, perhaps your TLA+ model could also help to find executions in which just adding one signal would not be sufficient?
Comment 26 Torvald Riegel 2021-01-16 00:23:42 UTC
(In reply to Michael Bacarella from comment #8)
> I ran into this issue recently as well.  It took a *very* long time to
> isolate.  I'm glad to see a patch was posted already.
> 
> I confirm on my systems that the deadlock was introduced in 2.27 and appears
> to resolve when the one-line patch is applied to at least glibc-2.31.

Do you happen to have a reproducer (that is different from Qin Li's)?  Alternatively, is a call to pthread_cond_signal instead of the broadcast sufficient to prevent triggering errors in the cases that you observed?
Comment 27 Michael Bacarella 2021-01-30 00:59:55 UTC
> Torvald Riegel 2021-01-16 00:23:42 UTC
> (In reply to Michael Bacarella from comment #8)
> > I ran into this issue recently as well.  It took a *very* long time to
> > isolate.  I'm glad to see a patch was posted already.
> > 
> > I confirm on my systems that the deadlock was introduced in 2.27 and appears
> > to resolve when the one-line patch is applied to at least glibc-2.31.
>
> Do you happen to have a reproducer (that is different from Qin Li's)?
> Alternatively, is a call to pthread_cond_signal instead of the broadcast
> sufficient to prevent triggering errors in the cases that you observed?

Sadly no.  The existence of the bug was inferred through "masterlock" corruption in the OCaml runtime, and using the one-line patch to glibc above stopped the application from deadlocking.

I wasn't writing any C code directly that I could experiment with.

If it's helpful, there was some discussion here on how it was tracked down:
https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime/6542/3

A pthread_cond_signal appears to simply be missed by threads waiting in pthread_cond_wait. If I can force another thread to try to acquire the runtime lock again (briefly, by causing a thread that was blocked on I/O to complete (hitting Enter on stdin), which dispatches a new call to pthread_cond_wait), the deadlock clears.
Comment 28 Frank Barrus 2021-04-06 00:37:49 UTC
I have also recently been chasing this bug and couldn't find anything about it online before, so I ended up having to figure it out and develop a workaround for now.   While trying to find a place to post what I had found, my search now led to this thread.

So in case it's still helpful in providing more details or confirming what others have found, here's what I had already written up and was trying to find a place to post:


During the course of my work over the past couple of years, in software using pthreads on multi-core and multi-CPU systems, I’ve been encountering obscure lost wakeups that sometimes led to threads being completely stuck. I could not pinpoint the source of the problem previously, and could find nothing online that matched what I was seeing, so I had implemented watchdogs to “kick” certain threads once the condition was detected to prevent them from getting stuck, but that was very specific to a certain threadpool/taskqueue usage pattern in the code.

After encountering this problem again in a general usage pattern that didn’t lend itself well to the watchdog approach, I finally was able to make investigating this problem a priority and just recently got to the bottom of it (after previously looking online but not finding anything matching what I was seeing at the time).

Here is the series of steps that leads to the bug:

- a waiter thread W1 has been running long enough (CPU-bound) to be eligible for involuntary pre-emption by the kernel soon since its quantum is about to expire.
- thread W1 is immediately signaled before it blocks, and it consumes the signal, but is then pre-empted before it can read g1_start

- one or more G1/G2 switches occur such that the current index for G2 is back to where it was when W1 was pre-empted, but this is no longer effectively the same group anymore even though it has the same index

- thread W1 sees the change in g1_start and assumes it consumed a signal that was not meant for it. As a result, it conservatively sends a replacement signal to the current G1 group, which turns out to be an unnecessary spurious signal.  From the comments in the code, it would appear this is not considered harmful (although it turns out it is).

- the extra signal wakes up a waiter W2 from group G1. However, since this wakeup came from an extra replacement signal that should not have been sent, the size of the group in g_size[G1] was never decremented, so now there is a mismatch between the size of the group in g_size[G1] and the actual number of remaining waiters.
(If multiple waiters (W1) follow this entire pattern at once on multiple cores, there can be a mismatch of more than 1.)

- once the actual count of remaining waiters in G1 hits 0, the next signal sent will increment g_signals[G1] and decrement g_size[G1] but will not wake up any threads, and will not yet run the quiesce_and_switch since it thinks there are still G1 waiters to consume signals.  (this does not take wrefs into account to check how many waiters could possibly remain in G1)  This becomes a lost signal, which will show up in g_signals[G1] since it can never be consumed.  The condvar can now be in this state for any amount of time, and can accumulate additional waiters in G2, but will still lose the very next signal.  (or multiple next signals if g_size[G1] is greater than 1)
At this point, it's possible for software using the condvar to be stuck completely if there are no additional signals ever sent.

- if additional signals are sent, then once g_size[G1] reaches 0, the next signal will work properly, but the value of g_signals[G1] at that point represents how many signals were lost completely and how many additional wakeups never occurred. These will never directly be made up for unless this is observed and fixed before the quiesce_and_switch occurs, since that will then clear the g_signals count.  Note that this means it can require more than one signal to properly fix the problem.  There need to be enough signals to cause g_size[G1] to reach 0 just to become “unstuck”, but in addition there should be true replacement signals for however many unconsumed signals are in g_signals[G1], in order to deliver those to all waiters who were already in G2 by the time the last (otherwise lost) signal was sent.  It could be argued that we already woke that many extra waiters earlier, since those were the ones that caused the bug to arise in the first place, but if those waiters already awoke earlier, then they do not count as the required number of wakes, since they observed an earlier state of whatever the mutex protects, and it could have changed before the signals that were lost.  So they only count as spurious wakeups, and we still require the expected number of new wakeups, which is g_signals[G1], and to accomplish that we need g_size[G1] signals to be unstuck, plus g_signals[G1] more to actually perform the wakeups of the waiters still in G2.
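
As a worked example of that arithmetic (with hypothetical numbers): if g_size[G1] is 2 and g_signals[G1] holds 3 unconsumed signals once no waiters remain in G1, then 2 signals are needed just to drive g_size[G1] to 0 and get unstuck, plus 3 more to deliver the wakeups that the stranded signals represent, i.e. 5 signals in total.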


In practice, most multi-threaded software that can get into this situation in the first place also has enough threads to send additional signals, so most of the time the lost wakeup just results in an additional latency until the next signal is sent, or in a slight loss of concurrency if a specific number of workers was meant to be awakened.  But every once in a while, a scenario can be set up where there can’t be another signal until the wakeup occurs, and then the users of the condvar are completely stuck.


As a temporary workaround I have wrapped pthread_cond_wait(), pthread_cond_timedwait(), and pthread_cond_signal() with code that performs a quick post-check on the state of the condvar and determines whether there’s a lost signal (by checking g_size and g_signals, calculating the size of G2, and checking wrefs to see how many waiters could possibly be left in G1 after accounting for all the ones in G2).  It then sends additional signals as needed to fix the state and to replace any lost signals, once that can be determined.  This post-check has to be done in pthread_cond_wait()/pthread_cond_timedwait() as well as pthread_cond_signal(), since there are many cases where wrefs is temporarily higher until a waiter exits, so it cannot be conclusively determined that the condvar is in the lost-wakeup state until then.  Once the last G1 waiter exits, wrefs only represents G2 waiters, so (wrefs - remaining_G2_size) tells whether there are any G1 waiters left who can potentially consume the signals there, and g_size[G1] can then be safely checked for a non-zero value if there are no G1 waiters left.  The post-check can also quickly bail out in several cases where it's clear that G1 waiters still exist (i.e. it didn't detect the bug), such as when g_refs[G1] is non-zero.  (all of this is written in terms of just the count bits, with the LSB already shifted out, to make it easier to describe)
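To give a rough idea of the shape of that post-check, here is a detection-only sketch (no repair), written against the 2.27-era internal layout shown in the dumps in this report.  The field names, the bit shifts, and the way the G2 size is estimated are for illustration only; the real check also has to deal with waiters that already left G2 via timeout or cancellation, which this sketch glosses over.

  #include <pthread.h>

  /* Sketch of the post-check described above (detection only, no repair).
     It peeks at glibc's internal condvar fields, so it is inherently
     implementation- and version-specific.  */
  static int
  lost_signal_suspected (const pthread_cond_t *cv)
  {
    const struct __pthread_cond_s *c = &cv->__data;

    unsigned int g2 = (unsigned int) c->__wseq & 1;   /* LSB of wseq is the index of G2 */
    unsigned int g1 = g2 ^ 1;

    /* Quick bail-out: a waiter is still inside the G1 futex-wait region,
       so it can still consume any pending signals.  */
    if ((c->__g_refs[g1] >> 1) != 0)
      return 0;

    /* Every waiter with a sequence number >= g1_start + g1_orig_size is
       in G2, so this estimates how many waiters have entered G2 since
       the last group switch.  */
    unsigned long long g2_entered = (c->__wseq >> 1) - (c->__g1_start >> 1)
                                    - (c->__g1_orig_size >> 2);
    unsigned int wrefs = c->__wrefs >> 3;             /* total remaining waiters */
    if (wrefs > g2_entered)
      return 0;   /* a waiter may still be in G1; the state is ambiguous */

    /* No waiter can be left in G1: a non-zero g_size[G1] means the next
       signal(s) will be swallowed, and any unconsumed g_signals[G1] are
       wakeups that were already lost.  */
    return c->__g_size[g1] != 0 || (c->__g_signals[g1] >> 1) != 0;
  }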

I’m looking for a proper pthreads fix to use instead as well as considering how I might go about solving this myself or contributing to the solution if nothing is available soon.
Comment 29 Frank Barrus 2021-04-06 01:17:32 UTC
I should add one more thing about my workaround fix:

Although it's currently implemented as a post-fix wrapper on pthread_cond_signal() as well as after waiting, it doesn't need to be.  That's just because it also tries to detect the bug condition *before* the lost wakeup occurs whenever possible.  Practically speaking, that almost never occurs anyway since wrefs usually leaves the state ambiguous until the last waiter exits, and then most of the time, the last exiting G1 waiter is the one who ends up detecting it first, whether it's before or after the signal was lost.   Also, checking the condvar in this way in pthread_cond_signal() assumes you know for sure that the wrapper code is being called while the mutex is held, which can't be guaranteed in a general-purpose fix for pthread_cond_signal.

So the post-fix check I'm doing could be done instead entirely from the wrappers for pthread_cond_wait/pthread_cond_timedwait().  That also means it could go in the common signal wait code, after the mutex is re-acquired.  As long as the common cases can bail out of the checks fast, there should be minimal overhead.  The only heavyweight cases will be when it actually detects that signals have already been lost, or the condvar is in a state that will lose the next signal.
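As an illustration of that placement, the wrapper could look roughly like this; postcheck_and_repair() is just a hypothetical name for the detection/repair routine (see the sketch in my previous comment), and the real workaround does considerably more than this:

  #include <pthread.h>

  /* Hypothetical detection/repair routine, sketched in my previous comment. */
  void postcheck_and_repair (pthread_cond_t *cv);

  /* The check runs after pthread_cond_wait returns, i.e. with the mutex
     re-acquired, as described above.  */
  int
  wrapped_cond_wait (pthread_cond_t *cv, pthread_mutex_t *m)
  {
    int rc = pthread_cond_wait (cv, m);   /* returns with the mutex held */
    postcheck_and_repair (cv);            /* cheap bail-out in the common case */
    return rc;
  }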
Comment 30 Frank Barrus 2021-04-06 14:32:29 UTC
(In reply to Frank Barrus from comment #29)
> I should add one more thing about my workaround fix:
> 
> Although it's currently implemented as a post-fix wrapper on
> pthread_cond_signal() as well as after waiting, it doesn't need to be. 
> That's just because it also tries to detect the bug condition *before* the
> lost wakeup occurs whenever possible.  Practically speaking, that almost
> never occurs anyway since wrefs usually leaves the state ambiguous until the
> last waiter exits, and then most of the time, the last exiting G1 waiter is
> the one who ends up detecting it first, whether it's before or after the
> signal was lost.   Also, checking the condvar in this way in
> pthread_cond_signal() assumes you know for sure that the wrapper code is
> being called while the mutex is held, which can't be guaranteed in a
> general-purpose fix for pthread_cond_signal.
> 
> So the post-fix check I'm doing could be done instead entirely from the
> wrappers for pthread_cond_wait/pthread_cond_timedwait().  That also means it
> could go in the common signal wait code, after the mutex is re-acquired.  As
> long as the common cases can bail out of the checks fast, there should be
> minimal overhead.  The only heavyweight cases will be when it actually
> detects that signals have already been lost, or the condvar is in a state
> that will lose the next signal.

I should also add that this is only true (that it can be detected in just the waiters after they wake up) if the goal is only to detect and fix the lost-wakeup conditions that can result in getting stuck.  But if the goal is also to replace the correct count of lost wakeups, then it's important to add the detection at the end of pthread_cond_signal() as well.  If enough additional signals are sent before the last waiter exits to get out of this state and trigger the quiesce_and_switch, the count of unconsumed signals in g_signals[G1] will get cleared without that number of wakeups ever being performed, unless this state is detected and that number of additional signals is sent to the G2 waiters once they become G1.
Comment 31 Frank Barrus 2021-04-06 16:49:57 UTC
(In reply to Michael Bacarella from comment #12)
> [...]
> Given how hard this bug is to isolate (it took about two man-months on my
> side), and that other orgs have gone through and are going through similar
> experiences, it seems right to apply the fix now and discuss improving the
> performance of the fix in a new bug.
> 
> From Malte's excellent blog post it appears people have been struggling with
> this since at least 2019 https://github.com/mratsim/weave/issues/56

We've been hitting it enough to start filing internal bugs as of at least June 2019 (perhaps earlier but that's the oldest I was able to quickly dig up).  I'm also pretty sure we had been seeing cases of it for at least a year before that.

At the time the suspicion was that our own C code using pthreads had to be the cause: glibc and pthreads are so widely used that it seemed a bug like this couldn't possibly have gone unnoticed, and we could find nothing online about it when searching back then.  I did notice, though, that I was never able to reproduce the problem when using an alternate futex-based synchronization library.  We ended up making a watchdog workaround solution for the majority of cases where the lost wakeup was showing up.  It was only very recently that this became enough of a priority to warrant a deeper investigation into pthreads, after first re-verifying the correctness of our logic above it and logging the pthreads condvar state with a circular in-core event log.

Now that we can detect it, with our software just running a normal, constant multi-threaded I/O test load, I see about 250 to 500 cases of the bug a day (on different hardware).  But the majority of these are cases that would never have been noticed normally, except as a small latency increase, since there's almost always another signal coming along soon anyway to "fix" the problem.  Of the ones that aren't resolved this way, our previous watchdog solution catches the rest.  The ones that still slip through the cracks (without the detector) are exceedingly rare and not really reproducible, but they've been happening from time to time for at least a couple of years now.
Comment 32 Carlos O'Donell 2021-04-11 12:13:33 UTC
I'm working through the review for this issue.
Comment 33 Frank Barrus 2021-04-13 12:21:01 UTC
(In reply to Carlos O'Donell from comment #32)
> I'm working through the review for this issue.

FYI, I'm currently testing a different pthreads fix for this issue that does it without the suggested "broadcast" solution that some distros appear to be adopting for now.  It also eliminates the need for the signal-stealing handling that was leading to the bug in the first place.  While broadcasting might work as a temporary workaround solution for correctness, it certainly has performance implications.

My previous workaround I mentioned (detecting the lost wakeup state or the potential for it after calling pthread_cond_wait/pthread_cond_timedwait and pthread_cond_signal) is not showing any cases detected when this new pthreads fix is applied.  However, when I run with the broadcast fix instead, it still detects cases that look like a potential lost wakeup.  I'm suspecting that those are just due to the state the condvar is temporarily being left in, and that all waiters are actually being woken up by the broadcast and just racing against my detection code which is catching the intermediate state, but it makes me wonder if there's still a rare edge case the broadcast solution isn't actually fixing and which could still get stuck.  (consider all threads that are not actually blocked and are outside the g_refs counted region and possibly have been pre-empted so they'll make their attempt to grab g_signals at a later time)  When I was running these tests my other post-detection logic still contained its fix to send extra signals as necessary also.  I'll have to re-run that as detection-only and no fix to see if it would have gotten stuck, but my priority has been the work on the new pthreads fix instead, not proving the correctness of the fix that uses broadcast.

So far the testing for my new fix is looking good for both correctness and performance, but of course there are many correctness issues to properly and thoroughly test.  I'm hoping to be able to post a patch in the near future.
Comment 34 Qin Li 2021-04-14 16:57:07 UTC
Hi Frank, I am the original reporter of this bug. Could you share a sneak peek version of your alternate fix that you mentioned below?

> FYI, I'm currently testing a different pthreads fix for this issue that does it without the suggested "broadcast" solution that some distros appear to be adopting for now.

The reason I am asking is that several months after applying the broadcast fix we started to observe a different hang caused either by pthread_cond_signal/pthread_cond_wait or by the constructs they rely on, e.g. futex. Originally I feared it was related to or caused by the broadcast fix, but I later realized this might be a separate issue, as it has also been independently reported in this bug by Arun without the broadcast fix applied: https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c3

The signature of this different issue is this: 1 thread blocked "infinitely" in the pthread_cond_signal when calling __condvar_quiesce_and_switch_g1:

#0  futex_wait
#1  futex_wait_simple
#2  __condvar_quiesce_and_switch_g1
#3  __pthread_cond_signal

And all the other threads are blocked in pthread_cond_wait waiting for the signal.

Another interesting observation is that the "infinitely" blocked thread in pthread_cond_signal can be unblocked if I send a SIGSEGV to the Linux .Net Core process that hit this issue, which has a segfault handler that calls another external executable to take a core dump of the process. I am not exactly sure how much of the special signal handling logic is important to get pthread_cond_signal unblocked. It is possible that such a signal causes a spurious wakeup from futex_wait that actually unblocks __condvar_quiesce_and_switch_g1 and later __pthread_cond_signal, but this is pure speculation.

It would be nice to know if the ^^^ hanging pthread_cond_signal signature has also been discovered by the community and whether there might be any investigation/fix available.
Comment 35 Frank Barrus 2021-04-15 14:13:55 UTC
(In reply to Qin Li from comment #34)
> Hi Frank, I am the original reporter of this bug. Could you share a sneak
> peak version of your alternate fix that you mentioned below?
> 
> > FYI, I'm currently testing a different pthreads fix for this issue that does it without the suggested "broadcast" solution that some distros appear to be adopting for now.
> 
> The reason I am asking is that several months after applied the broadcast
> fix we started to observe a different hang caused by either
> pthread_cond_signal/pthread_cond_wait, or the constructs it relied on, e.g.
> futex. Original I feared it is related or caused by the broadcast fix, but
> later realized this might be another issue as it has also been independently
> reported in this bug by Arun without applying the broadcast fix:
> https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c3
> 
> The signature of this different issue is this: 1 thread blocked "infinitely"
> in the pthread_cond_signal when calling __condvar_quiesce_and_switch_g1:
> 
> #0  futex_wait
> #1  futex_wait_simple
> #2  __condvar_quiesce_and_switch_g1
> #3  __pthread_cond_signal
> 
> And all the other threads are blocked in pthread_cond_wait waiting for the
> signal.
> 
> Another interesting observation is the "infinitely" blocked thread on
> pthread_cond_signal can be unblocked if I send a SIG_SEGV to the linux .Net
> Core process that hit this issue which has a segfault handler that will call
> another external executable to take a core dump of this process. I am not
> exactly sure how much of the special signal handling logic is important to
> get pthread_cond_signal unblocked. It is possible that such signal would
> cause a spurious wakeup from futex_wait that actually unblocks
> __condvar_quiesce_and_switch_g1 and later __pthread_cond_signal, but this is
> pure speculation.
> 
> It would be nice to know if the ^^^ hanging pthread_cond_signal signature
> has also been discovered by the community and whether there might be any
> investigation/fix available.

Hi Qin,

I believe you may be right about the second bug.  I have also seen such a signature ever since the fix went in that ORs in the 1 bit so it doesn't spin while waiting for g_refs to reach 0:
          r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;

Whenever I've encountered threads "stuck" in that futex_wait, they have always become unblocked as soon as I resumed from gdb (which makes sense if they're just missing a futex wake since the signal handling unblocks them from the futex) so I was never quite sure if they were truly stuck there (which they seemed to be) or whether I was just catching them in a transient state.   However, since this was always at a time when other parts of our system had also become stuck, it seemed quite likely that the threads really were stuck in the futex_wait.

I have not seen this happen anymore ever since applying my new fix and testing with it, but that's purely anecdotal evidence of fixing the secondary issues, since it was quite rare to begin with.

If there's a race in the wait/wake logic itself for the grefs counting, I haven't found it yet, unless there's some very obscure bug in the underlying atomics themselves allowing the atomic or and atomic subtract to cross paths somehow thus missing the wakeup flag.  i.e. these two atomics:

  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;
  (in the quiesce_and_switch) 
and:
  if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
  (in the gref_decrefs used from the cond_wait)

I think the more likely cause is related to still having an extra waiter that didn't get woken yet and didn't see the group being closed.   Note that when the group is closed, the LSB of g_signals is set, but there's no futex wakeup:

  atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);

This should be fine if there are no other bugs, since every waiter should have been woken already.  But since we're already dealing with a mix of both current waiters and older ones that were pre-empted first and which are now resuming, and possibly not yet aware that they are in an older group, there might be some race there.  In particular, if a signal did actually get stolen and not properly detected and replaced, there could be an extra waiter stuck in a futex wait on g_signals still.  Without an explicit wakeup when closing the group, it won't see the "closed" bit until it wakes from the futex, which won't happen in that case if there are no more signals sent to that group. However, an interrupt from normal signal handling (SIG* signals) will break it out of the futex, which is why a gdb attach and resume would get it unstuck in that case.  And then it will see the closed group on g_signals and proceed as if nothing was ever wrong.
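For reference, the spot in question in __condvar_quiesce_and_switch_g1 looks like this, and an explicit wake there (shown here only as a hypothetical illustration in the style of the surrounding glibc code, not a reviewed or proposed fix) is exactly what such a stuck waiter would be missing:

  /* Close G1: setting the LSB of g_signals is what the existing code does.  */
  atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);
  /* Hypothetical addition: also wake the g_signals futex, so a pre-empted
     waiter that is still blocked there does not have to rely on an
     unrelated POSIX signal to notice that the group is closed.  */
  futex_wake (cond->__data.__g_signals + g1, INT_MAX, private);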

Until we figure out the exact cause, I cannot guarantee that my new fix also addresses this issue.  Although as I said, I have not yet seen it occur with the fix.  Also since I eliminated the need for handling stolen signals and signal replacement, it might remove the cause of this second bug as a side effect.  (if indeed there truly is a second bug)

Do you happen to still have the condvar state when you've hit this bug?  Or can it be reproduced often enough to capture this state?   Could you instrument your calls to pthread_cond_signal to capture the condvar state before you send the signal also?  (and perhaps also in your waiters?).  I ended up adding circular logging of the condvar state (and a timestamp and TID) before and after every pthread_cond_wait and pthread_cond_signal to diagnose the lost wakeup so I could see how the events interleaved, including the often 2ms to 3ms pre-emption time in the older waiters that were coming back and leading to the bug.  I also called rusage() and captured the voluntary and involuntary context-switch counts and added those to the condvar event logs to make sure I was always seeing a pre-emptive involuntary context switch when the bug occurred.  You might not need to go that far, but it would help to at least find out the current condvar state for all the threads involved when you see the futex_wait get stuck.
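In case it helps, a minimal version of that state capture could look like the sketch below.  It prints to stderr rather than to an in-core circular log, uses getrusage(RUSAGE_THREAD) for the context-switch counters, and reads glibc's internal condvar fields directly, so it is tied to the 2.27-era layout shown in this report:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>
  #include <sys/resource.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Sketch only: dump the condvar state plus per-thread context-switch
     counters, e.g. right before/after pthread_cond_wait and
     pthread_cond_signal.  */
  static void
  log_condvar_state (const char *tag, pthread_cond_t *cv)
  {
    struct __pthread_cond_s *c = &cv->__data;
    struct timespec ts;
    struct rusage ru;

    clock_gettime (CLOCK_MONOTONIC, &ts);
    getrusage (RUSAGE_THREAD, &ru);   /* ru_nvcsw/ru_nivcsw: (in)voluntary switches */

    fprintf (stderr,
             "%ld.%09ld tid=%ld %s wseq=%llu g1_start=%llu refs={%u,%u} "
             "size={%u,%u} signals={%u,%u} wrefs=%u nvcsw=%ld nivcsw=%ld\n",
             (long) ts.tv_sec, ts.tv_nsec, (long) syscall (SYS_gettid), tag,
             (unsigned long long) c->__wseq, (unsigned long long) c->__g1_start,
             c->__g_refs[0], c->__g_refs[1],
             c->__g_size[0], c->__g_size[1],
             c->__g_signals[0], c->__g_signals[1],
             c->__wrefs, ru.ru_nvcsw, ru.ru_nivcsw);
  }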

I'm working on trying to get my patch available soon for further review and testing.

If you have a self-contained test case that you can release that can even occasionally show this new "second bug" (if it is), let me know.  Thanks!
Comment 36 Frank Barrus 2021-04-15 14:34:14 UTC
(In reply to Frank Barrus from comment #35)
> [...]

I should clarify what I'm looking for here in particular with the condvar state in this case, as well as knowing the state of all the other waiters and where they are stuck (or spinning):

(assuming this second bug exists) it seems that it is stuck on the futex_wait for g_refs to become 0 and that an interrupt to the futex_wait from signal handlers "fixes" the problem, which implies the underlying condvar state is currently correct and that there's just a missing/lost futex_wake.  However, is the wakeup to this particular futex_wait the one that fixes it?  Or is it an extra wakeup to a waiter blocked on g_signals that is needed to fix it?   We need to see if the condvar state shows that the g_refs waiter already has what it needs to continue and just missed a wakeup, or whether it is legitimately still waiting for g_refs to reach 0, and it's still at a refcount of 1 or more because there's another waiter that is still blocked on g_signals and has not been awakened to see that the group is now closed.  (the more likely cause)  Seeing the condvar state would greatly help in figuring out which case it is.
Comment 37 venkat 2021-04-30 17:41:02 UTC
Hi Carlos and Team,
Could you please let us know when the fix is planned and share us some ETA. 
Thanks,
Venkat
Comment 38 Frank Barrus 2021-05-04 22:48:54 UTC
Created attachment 13419 [details]
proposed lost wakeup fix with g_signals relative to g1_start and no signal stealing

Here is my proposed solution to the lost wakeup problem.  Further details are in the attachment, but the high level summary is that it turns g_signals[] into an always-advancing value for each new G1 group (using the low 31 bits of the current g1_start value as the relative base for the signal count), thus avoiding aliasing issues from G1/G2 re-use and avoiding A/B/A issues on the signal count.  This eliminates the signal stealing, as well as the need to block and wait for G1 to fully quiesce when signaling.  This provides a performance benefit at the same time as fixing the lost wakeup issue, since pthread_cond_signal no longer has to wait for remaining G1 waiters to wake up and run, or for pre-empted waiter threads to resume running.  (the latter was introducing rather high latencies for signaling when waiters got pre-empted)

My own testing (in a multi-core setup that was showing the lost wakeup quite frequently) has not shown any problems yet, but I'd welcome others to try the patch and give their feedback/results.  This hasn't been through TLA+ testing yet.

Note that this patch is against the current master version of glibc/pthreads.  If you need to patch an older version, make sure the additional futex waits in pthread_cond_wait_common() all have their "0" value changed to "signals".

This patch could still use some additional cleanup of the comments and some minor optimizations, but I'd like to get feedback on it (both the specific patch and the general path this solution is taking) first before polishing it further.
Comment 39 Frank Barrus 2021-05-04 22:51:43 UTC
(In reply to Qin Li from comment #34)
> [...]

Hi Qin,

I just posted the current version of my proposed fix.  (see previous comment and attachment)
This will address your second concern as well, since it removes the need for condvar_quiesce_and_switch_g1() to block.

Please try the patch and let me know how it goes for your testing.
Thanks!
- Frank
Comment 40 Frank Barrus 2021-05-04 22:58:55 UTC
Since posting my diff as a patch obscures the commit comment that has a description of this solution, here is another copy of it:


This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch.  This greatly reduces the average
and maximum latency for pthread_cond_signal.

The g_signals[] field now contains a signal count that is relative to
the current g1_start value.  Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)

By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same.  This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change.  This removes the race condition and A/B/A
problems that otherwise occurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signals, since
there was otherwise no last opportunity to re-check things like whether
the current G1 group was already closed.

By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore.  The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit.  If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.

The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters.  Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2.  They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.

Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block.  This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.
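Purely as an illustration of the encoding described above (this is a restatement, not code from the patch), the value a waiter expects to find in g_signals for its own group is tied to that group's g1_start:

  /* Illustration only: bits 31:1 of g_signals hold bits 31:1 of g1_start
     plus the signal count; bit 0 stays reserved.  'g1_start' here is the
     64-bit sequence value whose LSB is the G2-index flag.  */
  unsigned int group_base = (unsigned int) g1_start & ~1U;  /* low 31 bits of g1_start */
  unsigned int no_signals = group_base;                     /* base plus 0 signals */
  unsigned int one_signal = group_base + 2;                 /* base plus 1 signal  */
  /* After a G1/G2 switch, g1_start advances, so the base itself changes;
     a stale waiter's compare-and-swap against the old expected value can
     no longer succeed, which is what removes the possibility of stealing
     (up to the 31-bit wrap-around caveat noted above).  */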
Comment 41 Evgeny Morozov 2021-08-16 09:41:07 UTC
Our .NET Core application is freezing quite regularly, which is apparently due to this bug: https://github.com/dotnet/runtime/issues/47700

It seems like a serious issue and has been open for a long time. Is there any ETA on the fix being merged?
Comment 42 Evgeny Morozov 2021-09-22 19:03:45 UTC
This is still causing freezes for us. Is it safe to at least apply the "mitigation patch" on top of glibc 2.27 (the version that Ubuntu 18.04 ships with)? As an "outsider" with no knowledge of glibc it's hard to be sure of this from the comment history.
Comment 43 Tulio Magno Quites Machado Filho 2021-09-24 00:18:54 UTC
(In reply to Evgeny Morozov from comment #42)
> This is still causing freezes for us. Is it safe to at least apply the
> "mitigation patch" on top of glibc 2.27 (the version that Ubuntu 18.04 ships
> with)? As an "outsider" with no knowledge of glibc it's hard to be sure of
> this from the comment history.

The patch from comment 2 (the same one in Ubuntu 18.04) does not solve the issue.
I can still reproduce the issue after applying it.
I've been studying and evaluating the patches from comments 13 and 38, but the progress has been slower than I was hoping.
Comment 44 Michael Bacarella 2021-09-24 00:58:17 UTC
> The patch from comment 2 (the same one in Ubuntu 18.04) does not solve the issue.
> I can still reproduce the issue after applying it.
> I've been studying and evaluating the patches from comments 13 and 38, but the > progress has been slower than I was hoping.

Please see my Ubuntu specific report at https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800

My conclusion, based on some tests of various distros and loads, is that there's a different bug also causing deadlocks in Ubuntu 18.04 that is fixed by 20.04, though even including the one-line mitigation patch here reduces the frequency of deadlocks by orders of magnitude in the repro case.

Applied to Ubuntu 20.04 this one-line mitigation patch solves deadlocks completely.
Comment 45 Malte Skarupke 2022-09-18 05:38:39 UTC
I looked into this bug again because people kept on running into it. I ran a bigger TLA+ analysis and found an interleaving where the mitigation patch doesn't help. I wrote it up on my blog again here:

https://probablydance.com/2022/09/17/finding-the-second-bug-in-glibcs-condition-variable/

The short of it is that the pthread_cond_broadcast() in the mitigation patch can early-out when no thread is sleeping at the same time, in which case it doesn't wake anyone and doesn't change any state. Meaning the mitigation patch might do nothing. But the leftover signal from the wait will stick around. After that you can get a similar interleaving as before which will result in an incorrect g_size, causing pthread_cond_signal() to wake in the wrong group.

It's less likely than before, because you need three rare things to happen, but it can still happen:
1. Trigger the potential steal
2. The pthread_cond_broadcast() from the mitigation patch has to early-out without doing anything
3. You need to consume that extra signal from the potential steal at a time where it will cause quiesce_and_switch_g1() to increase g_size to at least 2

My patch from a year ago should fix this:
https://sourceware.org/pipermail/libc-alpha/2021-September/130840.html
Comment 46 Carlos O'Donell 2022-09-18 20:06:56 UTC
(In reply to Malte Skarupke from comment #45)
> I looked into this bug again because people kept on running into it. I ran a
> bigger TLA+ analysis and found an interleaving where the mitigation patch
> doesn't help. I wrote it up on my blog again here:
> 
> https://probablydance.com/2022/09/17/finding-the-second-bug-in-glibcs-
> condition-variable/

I'll have a look at this when I'm back from GNU Tools Cauldron. It's disappointing that there is still an interleaving that doesn't work. I've had a community member testing Frank Barrus' patch and that did seem to be correct.
 
> The short of it is that the pthread_cond_broadcast() in the mitigation patch
> can early-out when no thread is sleeping at the same time, in which case it
> doesn't wake anyone and doesn't change any state. Meaning the mitigation
> patch might do nothing. But the leftover signal from the wait will stick
> around. After that you can get a similar interleaving as before which will
> result in an incorrect g_size, causing pthread_cond_signal() to wake in the
> wrong group.
> 
> It's less likely than before, because you need three rare things to happen,
> but it can still happen:
> 1. Trigger the potential steal
> 2. The pthread_cond_broadcast() from the mitigation patch has to early-out
> without doing anything
> 3. You need to consume that extra signal from the potential steal at a time
> where it will cause quiesce_and_switch_g1() to increase g_size to at least 2
> 
> My patch from a year ago should fix this:
> https://sourceware.org/pipermail/libc-alpha/2021-September/130840.html

Could you expand on this a bit? Do you mean to say your patch from last September resolves all the issues you have seen, including the new one?
Comment 47 Malte Skarupke 2022-09-19 03:38:32 UTC
(In reply to Carlos O'Donell from comment #46)
> I'll have a look at this when I'm back from GNU Tools Cauldron. It's
> disappointing that there is still an interleaving that doesn't work. I've
> had a community member testing Frank Barrus' patch and that did seem to be
> correct.

I like Frank Barrus' patch, too. I think it takes the condition variable in a good direction, that might allow more cleanups. I was actually thinking of also submitting a patch that goes in  that direction. Where my patch from last year would force signalers to wait more often, his patch claims to make it unnecessary for signalers to ever wait. So I like his direction better, I just haven't taken the time yet to properly understand/review it.

> Could you expand on this a bit? Do you mean to say your patch from last
> September resolves all the issues you have seen, including the new one?

The new issue turned out to be the same as the old issue. The mitigation patch just makes it much less likely to happen. So yes, I think that my patch would resolve all the issues. That being said someone immediately showed up in my comments to say that they tried my patch and quickly got some unexplained hangs. I have not yet been able to check why that might be. It's one of those "I only proved it correct" situations where I could have made a subtle mistake while translating the change from TLA+ back to C. Or maybe I need to re-check the patch against a newer version of glibc, or maybe the person in my comments made a mistake. At this point I know only this:
- It solves the issue in TLA+ when run with the same number of threads
- It passes the glibc tests
- I have carefully thought about it and have good reasons to think that it should solve the issue because it can't happen any more that a waiter steals a signal from a future waiter, which was the source of the problem

I'll try to find the time this week to look at the patch again in detail.
Comment 48 Eric Hagberg 2022-09-26 14:28:29 UTC
Created attachment 14360 [details]
simple reproducer for issue

Used to reproduce problem on AWS instance with specific config (both with and without Ubuntu one-line patch).
Comment 49 Eric Hagberg 2022-09-26 14:32:03 UTC
This is a description of how to run the reproducer (https://sourceware.org/bugzilla/attachment.cgi?id=14360) “pthread_condition_issue.c” on AWS:

Boot a c5.metal instance running the AMI Rocky-8-ec2-8.6-20220515.0.x86_64-d6577ceb-8ea8-4e0e-84c6-f098fc302e82.
Build with “gcc pthread_condition_issue.c -lpthread”.
Run the following as root: tuned-adm profile latency-performance. This is required to obtain a timely reproduction.
Run the resulting executable 80 times concurrently.

With the stock glibc the problem will occur fairly quickly - within about an hour, there should be 7 or more aborts.

If the Ubuntu patch (https://sourceware.org/bugzilla/attachment.cgi?id=12484) has been applied to glibc, the following additional changes should be made to obtain a timely reproduction:

sysctl -w kernel.sched_min_granularity_ns=300000
sysctl -w kernel.sched_wakeup_granularity_ns=400000

Within about an hour, you should expect to see 3 or 4 aborts.
Comment 50 Malte Skarupke 2022-10-06 21:58:25 UTC
I sent a new version of my patch series:
https://sourceware.org/pipermail/libc-alpha/2022-October/142562.html

It fixes the issue where the first patch applied on its own didn't work. (it had a bug that was fixed by the third patch in the series) I also rebased the patch to work on top of release/2.36/master.

I'm pretty confident that these fix the issue, but they do still need review. I am also still thinking of doing the following two things:
1. Properly review Frank Barrus' patch, because from the little I looked at, I like it and it might be a better solution than the patch I submitted here.
2. Maybe submit a patch that just allows more spurious wakeups. The original problem was that if a waiter takes too long to wake, and the group switched twice, it could steal a signal from a later group and a waiter from a later group would go back to sleep. And then the code that was supposed to handle that caused this bug. But my current thinking is "what if the thread whose signal got stolen just doesn't go back to sleep?" Meaning you'd be fine with slightly more spurious wakeups in edge cases. I think it would simplify a lot.

Still don't have a lot of time to work on this though. So I first just wanted to get my patch back into a good condition. I think the first patch in the series is a low-risk fix, after it has been reviewed and tested a bit more. It doesn't change much and I actually did the work of verifying it in TLA+.
Comment 51 Malte Skarupke 2022-10-15 19:57:56 UTC
I submitted another option for a fix for this:

https://sourceware.org/pipermail/libc-alpha/2022-October/142724.html

I'm quite fond of this one. It fixes the problem while also simplifying the condition variable implementation a lot, making the code much easier to reason about. I much prefer this fix over my last submission. Let me know if there is something I need to do to get this reviewed. (I got no feedback on my last submission) I think the review on this will be much easier than on any of the other patches so far, because the new condvar implementation is so much simpler.
Comment 52 Daniel Richman 2022-11-07 18:23:06 UTC
Hello,

For full disclosure, I work at the same company as Eric and Malte.

Eric and I were investigating reproducing this issue after we started hitting it regularly (~once a day) in one of our production systems, despite said system running against a glibc that has the patch in attachment 12484 [details] (i.e., the patch that Ubuntu has adopted); this bug was preventing us from upgrading production to a more recent version of the Linux distribution we use (the previous version used a glibc older than 2.25).

Unfortunately, said system is large and proprietary, so we couldn’t simply provide a copy of it, hence the reproducer Eric shared. We hope the reproducer (a) is simple enough that you can convince yourself that the C is correct, (b) makes it possible for you to observe the deadlocks for yourself (on the specific hardware and kernel settings that Eric mentions), and (c) exhibits said deadlock in a reasonable amount of time so that you don’t have to wait too long.

I just wanted to tell this story to try and explain that this wasn’t just an academic exercise: we were regularly running into the bug, and would be very happy to see a patch reviewed and merged so that we don’t need to maintain a separately patched glibc internally.

Thanks,
Daniel