This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: PowerPC: libc single-thread lock optimization


On 02-05-2014 11:04, Torvald Riegel wrote:
> On Tue, 2014-04-29 at 15:05 -0300, Adhemerval Zanella wrote:
>> On 29-04-2014 14:53, Torvald Riegel wrote:
>>> On Tue, 2014-04-29 at 13:49 -0300, Adhemerval Zanella wrote:
>>>> On 29-04-2014 13:22, Torvald Riegel wrote:
>>>>> On Mon, 2014-04-28 at 19:33 -0300, Adhemerval Zanella wrote:
>>>>>> I bring up x86 because it is usually the reference implementation, and it sometimes puzzles
>>>>>> me that copying the same idea to another platform raises architectural questions.  But I concede
>>>>>> that the reference itself may not have opted for the best solution in the first place.
>>>>>>
>>>>>> So if I have understood correctly, is the optimization of checking for a single thread before
>>>>>> using locks meant to be focused on lowlevellock only?  If so, how do you suggest other archs
>>>>>> mimic the x86 optimization in the atomic.h primitives?  Should other archs follow x86_64 and
>>>>>> check the __libc_multiple_threads value instead?  That could be a way, but it is mostly redundant:
>>>>>> the TCB already contains the required information, so there is no
>>>>>> need to keep track of it in another memory reference.  Also, following the x86_64 approach, it checks
>>>>>> the TCB header information in sysdeps/CPU/bits/atomic.h, but __libc_multiple_threads
>>>>>> in lowlevellock.h.  Which is the correct guideline for other archs?
>>>>> From a synchronization perspective, I think any single-thread
>>>>> optimizations belong into the specific concurrent algorithms (e.g.,
>>>>> mutexes, condvars, ...)
>>>>> * Doing the optimization at the lowest level (i.e., the atomics) might be
>>>>> insufficient because, if there's indeed just one thread, lots of
>>>>> synchronization code can be made much simpler than by just avoiding
>>>>> atomics (e.g., avoiding loops, checks, ...).
>>>>> * The mutexes, condvars, etc. are what's exposed to the user, so
>>>>> assumptions about whether there really is no concurrency only make
>>>>> sense there.  For example, a single-threaded program can still have a
>>>>> process-shared condvar, so the condvar would need to use
>>>>> synchronization.
>>>> Following the x86_64 idea, this optimization is only for libc's internal atomic usage:
>>>> for a process-shared condvar, one will use the pthread code, which
>>>> is *not* built with this optimization.
>>> pthread code uses the same atomics we use for libc internally.
>>> Currently, the x86_64 condvar, for example, doesn't use the atomics --
>>> but this is what we'd need it to do if we ever want to use unified
>>> implementations of condvars (e.g., like we did for pthread_once
>>> recently).
>> If you check my patch, the SINGLE_THREAD_P is defined as:
>>
>> #ifndef NOT_IN_libc
>> # define SINGLE_THREAD_P \
>>   (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
>> #else
>> # define SINGLE_THREAD_P   0
>> #endif
>>
>> So for libpthread, the non-atomic code path will be eliminated.  x86_64 is
>> not that careful in some of its atomic primitives, though.
> I think that's not sufficient, nor are the low-level atomics the right
> place for this kind of optimization.
>
> First, there are several sources of concurrency affecting shared-memory
> synchronization:
> * Threads created by nptl.
> * Other processes we're interacting with via shared memory.
> * Reentrancy.
> * The kernel, if we should synchronize with it via shared memory (e.g.,
> recent perf does so, IIRC).
>
> We control the first.  The second case is, I suppose, only reachable by
> using pthreads pshared sync operations (or not?).
>
> In the case of reentrancy, there is concurrency between a signal handler and
> a process consisting of a single thread, so we might want to use atomics
> to synchronize.  I haven't checked whether we actually do (Alex might
> know after doing the MT-Safety documentation) -- but I would not want to
> prevent us from using atomics for that, so a check on just
> multiple_threads is not sufficient, IMO.
> Something similar applies to the kernel case.  Or if, in the future, we
> should want to sync with any accelerators or similar.
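
To make the reentrancy case concrete: even in a single-threaded process, a
counter updated both in a signal handler and in the interrupted code needs an
atomic read-modify-write, though no hardware barrier, since both run on the
same thread.  A minimal sketch (the counter and functions are illustrative;
catomic_increment is the existing internal macro):

/* A plain ++ may compile to a load/add/store sequence that the signal
   handler can interrupt, losing one of the increments, so both update
   sites use the atomic RMW.  No HW barrier is needed: the handler and
   the interrupted code run on the same thread.  */
static int signals_seen;

static void
handler (int sig)
{
  catomic_increment (&signals_seen);
}

static void
note_event_from_mainline (void)
{
  catomic_increment (&signals_seen);
}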

As I stated previously, I have dropped the atomic.h modification in favor of
changing just lowlevellock.h.
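
For reference, the shape of the change at a lock call site would be roughly the
following (a minimal sketch, not the actual patch; the __libc_maybe_lock name
is made up for illustration):

/* Sketch: skip the atomic lock path when the process is known to be
   single-threaded, otherwise fall back to the normal low-level lock.  */
#define __libc_maybe_lock(futex)                          \
  do                                                      \
    {                                                     \
      if (SINGLE_THREAD_P)                                \
        (futex) = 1;           /* sequential fast path */ \
      else                                                \
        lll_lock (futex, LLL_PRIVATE);                    \
    }                                                     \
  while (0)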

And I think we then need to reevaluate the x86_64 code, which does exactly what
you consider wrong (adding the single-thread optimization to the atomics).
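
In rough C terms (the real x86_64 code is inline assembly), that pattern looks
something like the sketch below; the catomic_add_sketch name is only for
illustration:

/* Test the multiple_threads flag in the TCB and avoid the atomic
   (LOCK-prefixed) operation when it is clear.  The real assembly still
   uses a single instruction in the unlocked case, so the operation
   stays atomic with respect to signal handlers.  */
#define catomic_add_sketch(mem, value)                                  \
  do                                                                    \
    {                                                                   \
      if (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)   \
        *(mem) += (value);                                              \
      else                                                              \
        atomic_add (mem, value);                                        \
    }                                                                   \
  while (0)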


>
> Therefore, I think we need to have atomics that always synchronize, even
> if we've just got one thread so far.
>
> If we want to do the optimization you want to do, I think we need the
> user to clarify which sources of concurrency cannot be present in the
> respective piece of code.  This also drives how we can weaken the
> atomics:
> * If there's only reentrancy as a source of concurrency, we still need
> read-modify-write ops, atomic writes, etc., but can skip all HW
> barriers.  (Assuming that signal handler execution actually establishes
> a happens-before with the interrupted thread.)
> * If there's no source of concurrency at all, we can just write
> sequential code.
>
> Those weaker versions should, IMO, be available as wrappers or separate
> functions, so that the full atomics are always available.  Something
> like catomic_add and similar on x86_64, except that we might want to
> have a more descriptive name, and distinguish between the
> no-concurrency-at-all case and the no-concurrency-except-reentrancy
> case (the latter being what catomic_add falls into).
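
Spelled out at a use site, the three strengths would look like this (the
counter is illustrative; atomic_increment and catomic_increment are the
existing internal macros):

atomic_increment (&counter);   /* full atomic: other threads, pshared memory, the kernel */
catomic_increment (&counter);  /* no concurrency except reentrancy: atomic RMW, HW barriers skipped */
++counter;                     /* no concurrency at all: plain sequential code */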
>
>
> The next issue I see is whether we'd actually want the sequential code
> (i.e., for no-concurrency-at-all) to be provided in the form of a variant
> of atomics or directly in the code using it.  Based on a quick grep for
> the atomics, I see about 20 files that use atomics (ignoring nptl).
> Quite a few of those seem to use atomics in concurrent code that goes
> beyond a single increment of a counter or such, so they might benefit
> performance-wise from having an actually sequential version of the code,
> not just "sequential atomics".
> The amount of code defining the atomics is significantly larger than
> the number of uses.
>
> This indicates that it might be better to put those optimizations into
> the code that uses atomics.  The code would have to be reviewed anyway
> to see which sources of concurrency it faces, and would probably have to
> be changed to add the selection of the appropriate atomics for that (if
> we should add the optimization to the atomics).

That's the idea behind the malloc patch I sent: instead of changing the atomics,
change the way the atomics are used.
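
A minimal sketch of that shape (the function and counter are illustrative, not
the actual malloc patch): the branch lives at the call site and the atomic
primitives stay untouched.

/* Select the sequential or the atomic path once at the use site instead
   of weakening the atomic operation itself.  */
static void
note_free_chunk (struct malloc_state *av)
{
  if (SINGLE_THREAD_P)
    ++av->n_free_chunks;                    /* sequential code */
  else
    atomic_increment (&av->n_free_chunks);  /* unchanged atomic path */
}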

>
>
> Finally, any such optimization will add a branch on each use of atomics,
> and depending on how they are built (e.g., asm or C), the compiler might
> or might not be able to merge those.  I'd like to see at least some
> indication that we're not significantly slowing down multi-threaded code
> to get some benefit on single-threaded code.
> We also might want to consider whether we'd want to put a glibc_likely
> on those branches, favoring multi-threaded code (in my opinion) or
> single-threaded code (considering the amount of single-threaded code we
> have today).
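
With the existing __glibc_likely/__glibc_unlikely macros, the two options only
differ in which way the hint points, e.g. (the counter is illustrative):

if (__glibc_unlikely (SINGLE_THREAD_P))   /* hint favors the multi-threaded path */
  ++counter;
else
  atomic_increment (&counter);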
>

