This is the mail archive of the mailing list for the glibc project.


Re: PowerPC: libc single-thread lock optimization

On Tue, 2014-04-29 at 15:05 -0300, Adhemerval Zanella wrote:
> On 29-04-2014 14:53, Torvald Riegel wrote:
> > On Tue, 2014-04-29 at 13:49 -0300, Adhemerval Zanella wrote:
> >> On 29-04-2014 13:22, Torvald Riegel wrote:
> >>> On Mon, 2014-04-28 at 19:33 -0300, Adhemerval Zanella wrote:
> >>>> I bring up x86 because it is usually the reference implementation, and it sometimes puzzles
> >>>> me that copying the same idea to another platform raises architectural questions.  But I concede
> >>>> that the reference itself may not have opted for the best solution in the first place.
> >>>>
> >>>> So if I have understood correctly, is the optimization to check for single-thread and avoid
> >>>> locks focused solely on lowlevellock?  If so, how do you suggest other archs
> >>>> mimic the x86 optimization in the atomic.h primitives?  Should other archs follow x86_64 and
> >>>> check the __libc_multiple_threads value instead?  This could be a way, however it is mostly
> >>>> redundant: the TCB definition already contains the required information, so there is no
> >>>> need to keep track of it in another memory reference.  Also, following the x86_64 idea, it checks
> >>>> the TCB header information in sysdeps/CPU/bits/atomic.h, but __libc_multiple_threads
> >>>> in lowlevellock.h.  Which is the correct guideline for other archs?
> >>> From a synchronization perspective, I think any single-thread
> >>> optimizations belong in the specific concurrent algorithms (e.g.,
> >>> mutexes, condvars, ...):
> >>> * Doing the optimization at the lowest level (i.e., the atomics) might be
> >>> insufficient because if there's indeed just one thread, then lots of
> >>> synchronization code can be a lot simpler than just avoiding
> >>> atomics (e.g., avoiding loops, checks, ...).
> >>> * The mutexes, condvars, etc. are what's exposed to the user, so
> >>> assumptions about whether there really is no concurrency only make
> >>> sense there.  For example, a single-thread program can still have a
> >>> process-shared condvar, so the condvar would need to use
> >>> synchronization.
> >> Following the x86_64 idea, this optimization is only for internal atomic usage in
> >> libc itself: for a process-shared condvar, one will use pthread code, which
> >> is *not* built with this optimization.
> > pthread code uses the same atomics we use for libc internally.
> > Currently, the x86_64 condvar, for example, doesn't use the atomics --
> > but this is what we'd need it to do if we ever want to use unified
> > implementations of condvars (e.g., like we did for pthread_once
> > recently).
> If you check my patch, the SINGLE_THREAD_P is defined as:
> #ifndef NOT_IN_libc
> # define SINGLE_THREAD_P \
>   (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
> #else
> # define SINGLE_THREAD_P   0
> #endif
> So for libpthread, the code path that avoids atomics will be eliminated.  x86_64 is
> not that careful in some atomic primitives though.

I think that's not sufficient, nor are the low-level atomics the right
place for this kind of optimization.

First, there are several sources of concurrency affecting shared memory:
* Threads created by nptl.
* Other processes we're interacting with via shared memory.
* Reentrancy.
* The kernel, if we should synchronize with it via shared memory (e.g.,
recent perf does so, IIRC).

We control the first.  The second case is, I suppose, only reachable by
using pthreads pshared sync operations (or not?).

In the case of reentrancy, there is concurrency between a signal handler
and a process consisting of a single thread, so we might want to use
atomics to synchronize.  I haven't checked whether we actually do (Alex
might know after doing the MT-Safety documentation) -- but I would not
want to prevent us from using atomics for that, so a check on just
multiple_threads is not sufficient IMO.
Something similar applies to the kernel case.  Or if, in the future, we
should want to sync with any accelerators or similar.

Therefore, I think we need to have atomics that always synchronize, even
if we've just got one thread so far.

If we want to do the optimization you want to do, I think we need the
user to clarify which sources of concurrency cannot be present in the
respective piece of code.  This also drives how we can weaken the atomics:
* If there's only reentrancy as source of concurrency, we still need
read-modify-write ops, atomic writes, etc., but can skip all HW
barriers.  (Assuming that signal handler execution actually establishes
a happens-before with the interrupted thread.)
* If there's no source of concurrency at all, we can just write
sequential code.

Those weaker versions should, IMO, be available as wrappers or separate
functions, so that the full atomics are always available.  Something
like catomic_add and similar on x86_64, except that we might want to
have a more descriptive name, and distinguish between the
no-concurrency-at-all case and the no-concurrency-except-reentrancy
cases (the latter being what catomic_add falls into).

The next issue I see is whether we'd actually want the sequential code
(i.e., for no-concurrency-at-all) to be provided in the form of a variant
of atomics or directly in the code using it.  Based on a quick grep for
the atomics, I see about 20 files that use atomics (ignoring nptl).
Quite a few of those seem to use atomics in concurrent code that goes
beyond a single increment of a counter or such, so they might benefit
performance-wise from having an actually sequential version of the code,
not just "sequential atomics".
The number of lines of code defining the atomics is significantly larger
than the number of uses.

This indicates that it might be better to put those optimizations into
the code that uses atomics.  The code would have to be reviewed anyway
to see which sources of concurrency it faces, and would probably have to
be changed to add the selection of the appropriate atomics for that (if
we should add the optimization to the atomics).

Finally, any such optimization will add a branch on each use of atomics,
and depending on how they are built (e.g., asm or C), the compiler might
or might not be able to merge those.  I'd like to see at least some
indication that we're not significantly slowing down multi-threaded code
to get some benefit on single-threaded code.
We also might want to consider whether we'd want to put a __glibc_likely
on those branches, favoring either multi-threaded code (in my opinion) or
single-threaded code (considering the amount of single-threaded code we
have today).
