[RFC PATCH] Fix: intl: use nestable locking for reentrancy

Mathieu Desnoyers mathieu.desnoyers@efficios.com
Thu Aug 19 19:35:07 GMT 2021


----- On Aug 19, 2021, at 12:11 PM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers via Libc-alpha:
> 
>> ----- On Aug 12, 2021, at 3:24 PM, Mathieu Desnoyers
>> mathieu.desnoyers@efficios.com wrote:
>>
>>> malloc interposers calling failing dl API (e.g. dlopen/dlsym) end up
>>> calling i18n translation. It can corrupt the i18n locking state when
>>> malloc is called internally from within i18n code with i18n locking
>>> held. This is an issue for libasan in gcc 8, 9, 10, 11.
>>> 
>>> This patch applies on top of glibc 2.31.
>>> 
>>> This rather crude patch is provided as RFC only: I'm not sure whether we
>>> want to convert all intl locks to recursive locks, and whether a non-rw
>>> recursive lock is appropriate.
>>
>> Hi Carlos, Florian,
>>
>> Before this issue falls off from my backlog pile, is this patch OK as is
>> as a glibc fix, in which case I could re-send it without the "RFC" tag,
>> or should I consider making changes either to this patch or to a patch
>> on top ? For instance, it's a bit weird to keep the "gl_rwlock_" macro
>> names mapped to non-rw recursive locks. However, renaming all "gl_rwlock_*"
>> macros to "gl_lock_*_recursive" implies more code churn, which might not
>> be welcome in a minimal fix.
> 
> I have yet to see a case where a recursive lock reliably avoids
> deadlocks in a callback-based interface.  The self-deadlock with a
> single thread is gone, but more complex patterns of deadlocks remain.

Agreed.

> The loader lock is particularly prone to this, and I think it would be
> involved here as well.

What makes you think the loader lock is involved here ?

The specific issue I am trying to solve is caused by i18n textdomain code using
malloc with i18n locks held. When a malloc interposer library ends up failing on
dlsym, the call to __dcgettext takes the lock recursively.

There are of course many possible solutions to this issue, some requiring rather
more work than others. A few ideas that come to mind:

1) Redesign the i18n textdomain code to move all memory allocation outside of the
   i18n lock,

2) Redesign textdomain/gettext data structures to allow RCU updates/lookups, and
   introduce RCU synchronization into glibc.

3) Modify __dcgettext to detect when it is used recursively with the textdomain
   code through a TLS variable. When this is detected, skip the lookup entirely
   and return the input string. (this is what I do in my workaround .so)

4) Use a recursive lock rather than non-recursive RW lock (proposed patch). It has
   some downsides though: it requires that all interposed functions (e.g. malloc/free...)
   are placed in the textdomain algorithm in code locations where the protected data
   structures are in a consistent state for lookups.

As you point out, recursive locks is not such a good design, because their recursive
nature do not propagate across a lock dependency chain. Therefore, as soon as those locks are
nested with other locks, deadlocks still happen, but they are harder to reproduce, which is
quite bad.

Given that this is an issue that occurs in the field today, I suspect we may want to have
two distinct discussions: what should we do for a fix which can be shipped in existing
distributions, and what should we do moving forward with the next glibc release ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


More information about the Libc-alpha mailing list