Bug 21031 - pthread_key_delete() race with thread finalization
Summary: pthread_key_delete() race with thread finalization
Status: REOPENED
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl (show other bugs)
Version: 2.24
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-07 19:29 UTC by account disabled by myself since useless
Modified: 2024-10-21 18:10 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2023-02-08 00:00:00
fweimer: security-


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description account disabled by myself since useless 2017-01-07 19:29:09 UTC
A race condition could occur between the pthread_key_delete() and the __nptl_deallocate_tsd().

For instance, __nptl_deallocate_tsd() could call a destructor for the key, immediately before the pthread_key_delete() invalidates it (from an another thread), and will continue destructor execution after the completion of pthread_key_delete().

From a user code this looks as if the corresponding destructor executes after the key has been removed by pthread_key_delete(), and there is no way to know whether was destructor called/executed or not.

Suggest add pthread_rwlock_rdlock() for __nptl_deallocate_tsd() and pthread_rwlock_wrlock() for pthread_key_delete().
Comment 1 account disabled by myself since useless 2018-03-28 12:17:11 UTC
Related to bugs 18136, 21032.
Comment 2 account disabled by myself since useless 2023-02-08 12:06:23 UTC
Not a bug for glibc, but the feature )
Comment 3 Adhemerval Zanella 2023-02-08 16:23:31 UTC
Reopened because it still needs discussion.
Comment 4 Barnabás Pőcze 2024-10-18 16:51:10 UTC
This "issue" has been encountered in pipewire[0]. The fact that destructors may be called after `tss_delete()` / `pthread_key_delete()` returns makes it quite hard to use. Is there maybe a simple to way to safely delete the TSS key, and then destroy all remaining objects that weren't processed by the TSS destructor?

One solution appears to be moving the TSS key into a reference counted allocation and deleting the TSS key after the last thread that has data in that slot has exited. But this seems pretty suboptimal since it takes up one TSS slot completely unnecessarily for a potentially unbounded amount of time. Another one is building one's own TSS API on top, just like chromium does (although my reading of the code suggests that it has the same problem, but of course they can easily change it if it turns out to be an actual issue for them). Neither options seem appealing for pipewire, though.

I also realize that neither POSIX nor the C standard says anything about this, still, I can't help but think that the current status quo makes this TSS API nigh unusable if one wants to actually avoid resource leaks. Unless I am (hopefully) missing something obvious?

[0]: https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/4356
Comment 5 Florian Weimer 2024-10-18 17:49:42 UTC
The current glibc implementation treats keys as a very scarce resource, there are just 1024 of them in total. They are really not designed to be allocated on a per-data-structure basis. I don't know what pipewire tries to accomplish here. I found a reference to an issue:

  multiple data loops vs. invokes
  <https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/3983>

But there's no discussion about the design of the commit that addresses this issue.
Comment 6 Barnabás Pőcze 2024-10-18 18:03:06 UTC
Yes, arguably pipewire's use of TSS slots is not optimal. But even even if we assumed that pipewire wanted to allocate just a single slot in `pw_init()` (or at library load time), and free it in `pw_deinit()` (or at library unload time), the issue is still there.
Comment 7 Florian Weimer 2024-10-21 09:46:43 UTC
(In reply to Barnabás Pőcze from comment #6)
> Yes, arguably pipewire's use of TSS slots is not optimal. But even even if
> we assumed that pipewire wanted to allocate just a single slot in
> `pw_init()` (or at library load time), and free it in `pw_deinit()` (or at
> library unload time), the issue is still there.

Unloading can be disabled by linking with -Wl,-z,nodelete. That's appropriate for libraries which cannot safely be unloaded. Keep in mind that pthread_key_delete will not be able to invoke key destructors on threads that have previously called pthread_setspecific. This aspect is not something that can change.

Based on the limited information I have, what the library tries to do looks like an anti-pattern to me: There's no proper lifetime management for some resource, so its lifetime is extended to the lifetime of the calling thread. That tends to have surprising consequences once thread pools are involved. We see this internally with glibc for our malloc arenas, for example. This is a conceptual problem, it persists even if we change pthread_key_delete.
Comment 8 Barnabás Pőcze 2024-10-21 18:10:04 UTC
The core issue here is how one is expected to safely use `pthread_key_delete()`. The current glibc and musl implementations do not synchronize destructor calls caused by thread termination with calls to `pthread_key_delete()`. This makes it possible for a destructor to be called in a terminating thread after `pthread_key_delete()` returns. For this reason, it is not feasible to delete the key, then iterate over a list of all thread local objects to free the remaining ones because that might race with thread termination and lead to double frees. So I am wondering what the expected usage of `pthread_key_delete()` is, because to me this behaviour seems to be a significant oversight in the current design, unless I am (hopefully) missing something.