This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: PowerPC: libc single-thread lock optimization

From: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
To: Torvald Riegel <triegel at redhat dot com>
Cc: libc-alpha at sourceware dot org
Date: Tue, 29 Apr 2014 15:05:34 -0300
Subject: Re: PowerPC: libc single-thread lock optimization
Authentication-results: sourceware.org; auth=none
References: <5343F8F1 dot 4000400 at linux dot vnet dot ibm dot com> <535ECADE dot 2050004 at linux dot vnet dot ibm dot com> <20140428214938 dot 3B10F2C3A13 at topped-with-meat dot com> <535ED72A dot 5060203 at linux dot vnet dot ibm dot com> <1398788543 dot 32485 dot 1139 dot camel at triegel dot csb> <535FD835 dot 1090702 at linux dot vnet dot ibm dot com> <1398794007 dot 32485 dot 1490 dot camel at triegel dot csb>

On 29-04-2014 14:53, Torvald Riegel wrote:
> On Tue, 2014-04-29 at 13:49 -0300, Adhemerval Zanella wrote:
>> On 29-04-2014 13:22, Torvald Riegel wrote:
>>> On Mon, 2014-04-28 at 19:33 -0300, Adhemerval Zanella wrote:
>>>> On 28-04-2014 18:49, Roland McGrath wrote:
>>>>> Heretofore sysdeps/CPU/bits/atomic.h is for pure CPU-based implementations.
>>>>> In a few cases there exists a sysdeps/unix/sysv/linux/CPU/bits/atomic.h as
>>>>> well because it needs to use kernel support.
>>>>>
>>>>> This is something somewhere in between: you are not depending directly on
>>>>> specific facilities outside the pure CPU facilities; but you are depending
>>>>> on library infrastructure and associated assumptions that do not hold in
>>>>> the general case of using the atomic macros in arbitrary contexts.
>>>>> Furthermore, you are defining SINGLE_THREAD_P to depend on NPTL
>>>>> implementation details.  IMHO neither of these things belong in a
>>>>> sysdeps/CPU/bits/atomic.h file.
>>>>>
>>>>> The lowlevellock.h change doesn't have those issues, so I'd suggest you
>>>>> send that separately and it should go in easily.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Roland
>>>>>
>>>> I tend to agree with you, however the x86_64 implementation (sysdeps/x86_64/bits/atomic.h)
>>>> itself relies on NPTL definitions (the check using (offsetof (tcbhead_t, multiple_threads)))).
>>>> And the idea of changing the atomic.h it to simplify the logic to add this optimization:
>>>> instead of changing the macros in lowlevellock.h and other atomic usage, it implements
>>>> the optimization on the atomic itself.
>>> I agree with Roland that atomic.h shouldn't have the optimization; to
>>> me, the strongest reason is that we might need atomics that actually
>>> synchronize independently of whether we have spawned a thread or used
>>> cancellation.  Also, having this optimization in the atomics will make
>>> it harder to move to, say, C11 atomics; we'd have to keep the wrappers.
>> It does not solve the case for internal usage of atomic operations (for instance, 
>> atomic_compare_and_exchange_val_acq in malloc code).
> In that case, this optimization should be in the malloc code.

I think is feasible to add this optimization based either on __libc_multiple_threads or in
TCB header information.

>
>> The idea of this patch is to
>> mimic x86 optimization, not to refactor atomic case. So the general idea I got so
>> far is: this should not be in atomic primitives, but x8_64 does it anyway. 
> I don't expect you to clean that up, it's just that I think the way x86
> does it is not quite right, so duplicating this wouldn't be the right
> thing to do.

In that rationale I agree. 

>
>>>> I bring this about x86 because usually it is the reference implementation and sometimes puzzles
>>>> me that copying the same idea in another platform raise architectural question.  But I concede
>>>> that the reference itself maybe had not opted for best solution in first place.
>>>>
>>>> So if I have understand correctly, is the optimization to check for single-thread and opt to
>>>> use locks is to focused on lowlevellock solely?  If so, how do you suggest to other archs to 
>>>> mimic x86 optimization on atomic.h primitives?  Should other arch follow the x86_64 and
>>>> check for __libc_multiple_threads value instead?  This could be a way, however it is redundant
>>>> in mostly way: the TCP definition already contains the information required, so there it no
>>>> need to keep track of it in another memory reference.  Also, following x86_64 idea, it check
>>>> for TCB header information for sysdeps/CPU/bits/atomic.h, but for __libc_multiple_threads
>>>> in lowlevellock.h.  Which is correct guideline for other archs?
>>> >From a synchronization perspective, I think any single-thread
>>> optimizations belong into the specific concurrent algorithms (e.g.,
>>> mutexes, condvars, ...)
>>> * Doing the optimization at the lowest level (ie, the atomics) might be
>>> insufficient because if there's indeed just one thread, then lots of
>>> synchronization code can be a lot more simpler than just avoiding
>>> atomics (e.g., avoiding loops, checks, ...).
>>> * The mutexes, condvars, etc. are what's exposed to the user, so the
>>> assumptions of whether there really no concurrency or not just make
>>> sense there.  For example, a single-thread program can still have a
>>> process-shared condvar, so the condvar would need to use
>>> synchronization.
>> Follow x86_64 idea, this optimization is only for internal atomic usage for
>> libc itself: for a process-shared condvar, one will use pthread code, which
>> is *not* built with this optimization.
> pthread code uses the same atomics we use for libc internally.
> Currently, the x86_64 condvar, for example, doesn't use the atomics --
> but this is what we'd need it do to if we ever want to use unified
> implementations of condvars (e.g., like we did for pthread_once
> recently).

If you check my patch, the SINGLE_THREAD_P is defined as:

#ifndef NOT_IN_libc
# define SINGLE_THREAD_P \
  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
#else
# define SINGLE_THREAD_P   0
#endif

So for libpthread, the code path to use non atomic will be eliminated.  x86_64 is
not that careful in some atomic primitives though.

>>> * We currently don't have other intra-process sources of concurrency
>>> than NPTL threads, but if we get another source (e.g., due to trying to
>>> support accelerators), we'd likely need low-level synchronization to
>>> communicate with the other thing -- and this would be independent of
>>> whether we have threads.
>> This optimization does not apply to pthread code.
> If it uses atomics, it does.  condvars, mutexes, rwlocks don't use
> atomics or barriers IIRC, but semaphores do, for example.  We have a
> custom assembler implementation on x86 for semaphores, so the problem
> isn't currently present.  If we don't want to keep those forever, we'll
> need the atomics to be useful for process-shared synchronization as
> well.
>

References:
- PowerPC: libc single-thread lock optimization
  - From: Adhemerval Zanella
- Re: PowerPC: libc single-thread lock optimization
  - From: Adhemerval Zanella
- Re: PowerPC: libc single-thread lock optimization
  - From: Roland McGrath
- Re: PowerPC: libc single-thread lock optimization
  - From: Adhemerval Zanella
- Re: PowerPC: libc single-thread lock optimization
  - From: Torvald Riegel
- Re: PowerPC: libc single-thread lock optimization
  - From: Adhemerval Zanella
- Re: PowerPC: libc single-thread lock optimization
  - From: Torvald Riegel

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]