This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: PowerPC: libc single-thread lock optimization


On 02-05-2014 11:37, Torvald Riegel wrote:
> On Fri, 2014-05-02 at 11:15 -0300, Adhemerval Zanella wrote:
>> On 02-05-2014 11:04, Torvald Riegel wrote:
>>> On Tue, 2014-04-29 at 15:05 -0300, Adhemerval Zanella wrote:
>>>> On 29-04-2014 14:53, Torvald Riegel wrote:
>>>>> On Tue, 2014-04-29 at 13:49 -0300, Adhemerval Zanella wrote:
>>>>>> On 29-04-2014 13:22, Torvald Riegel wrote:
>>>>>>> On Mon, 2014-04-28 at 19:33 -0300, Adhemerval Zanella wrote:
>>>>>>>> I bring this about x86 because usually it is the reference implementation and sometimes puzzles
>>>>>>>> me that copying the same idea in another platform raise architectural question.  But I concede
>>>>>>>> that the reference itself maybe had not opted for best solution in first place.
>>>>>>>>
>>>>>>>> So if I have understand correctly, is the optimization to check for single-thread and opt to
>>>>>>>> use locks is to focused on lowlevellock solely?  If so, how do you suggest to other archs to 
>>>>>>>> mimic x86 optimization on atomic.h primitives?  Should other arch follow the x86_64 and
>>>>>>>> check for __libc_multiple_threads value instead?  This could be a way, however it is redundant
>>>>>>>> in mostly way: the TCP definition already contains the information required, so there it no
>>>>>>>> need to keep track of it in another memory reference.  Also, following x86_64 idea, it check
>>>>>>>> for TCB header information for sysdeps/CPU/bits/atomic.h, but for __libc_multiple_threads
>>>>>>>> in lowlevellock.h.  Which is correct guideline for other archs?
>>>>>>> >From a synchronization perspective, I think any single-thread
>>>>>>> optimizations belong into the specific concurrent algorithms (e.g.,
>>>>>>> mutexes, condvars, ...)
>>>>>>> * Doing the optimization at the lowest level (ie, the atomics) might be
>>>>>>> insufficient because if there's indeed just one thread, then lots of
>>>>>>> synchronization code can be a lot more simpler than just avoiding
>>>>>>> atomics (e.g., avoiding loops, checks, ...).
>>>>>>> * The mutexes, condvars, etc. are what's exposed to the user, so the
>>>>>>> assumptions of whether there really no concurrency or not just make
>>>>>>> sense there.  For example, a single-thread program can still have a
>>>>>>> process-shared condvar, so the condvar would need to use
>>>>>>> synchronization.
>>>>>> Follow x86_64 idea, this optimization is only for internal atomic usage for
>>>>>> libc itself: for a process-shared condvar, one will use pthread code, which
>>>>>> is *not* built with this optimization.
>>>>> pthread code uses the same atomics we use for libc internally.
>>>>> Currently, the x86_64 condvar, for example, doesn't use the atomics --
>>>>> but this is what we'd need it do to if we ever want to use unified
>>>>> implementations of condvars (e.g., like we did for pthread_once
>>>>> recently).
>>>> If you check my patch, the SINGLE_THREAD_P is defined as:
>>>>
>>>> #ifndef NOT_IN_libc
>>>> # define SINGLE_THREAD_P \
>>>>   (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
>>>> #else
>>>> # define SINGLE_THREAD_P   0
>>>> #endif
>>>>
>>>> So for libpthread, the code path to use non atomic will be eliminated.  x86_64 is
>>>> not that careful in some atomic primitives though.
>>> I think that's not sufficient, nor are the low-level atomics the right
>>> place for this kind of optimization.
>>>
>>> First, there are several source of concurrency affecting shared-memory
>>> synchronization:
>>> * Threads created by nptl.
>>> * Other processes we're interacting with via shared memory.
>>> * Reentrancy.
>>> * The kernel, if we should synchronize with it via shared memory (e.g.,
>>> recent perf does so, IIRC).
>>>
>>> We control the first.  The second case is, I suppose, only reachable by
>>> using pthreads pshared sync operations (or not?).
>>>
>>> In case of reentrancy, there is concurrency between a signal handler and
>>> a process consisting of a single thread, so we might want to use atomics
>>> to synchronize.  I haven't checked whether we actually do (Alex might
>>> know after doing the MT-Safety documentation) -- but I would not want us
>>> preventing from using atomics for that, so a check on just
>>> multiple_threads is not sufficient IMO.
>>> Something similar applies to the kernel case.  Or if, in the future, we
>>> should want to sync with any accelerators or similar.
>> As I stated previously, I have dropped to modify the atomic.h in favor of just
>> the lowlevellock.h.  
>>
>> And I think we need to reevaluate then the x86_64 code that does exactly what
>> you think is wrong (add the single-thread opt on atomics). 
> Note that those are different: They drop the "lock" prefix, but they are
> not sequential code like what you add.
> I agree that it's worth documenting them, but those should work in case
> of reentrancy.
>
In fact I am arguing about the x86_64 atomic.h implementation because from this
discussion I understood the good practice for an architecture implementation is
not to add such optimization on atomic itself, but rather than in the atomics
usage (such as lowlevellock.h).

And correct if I'm wrong, but I think it would be concurrency in reentrancy only
if pthread_create (which is the function that will change tcb::multiple_threads 
or __libc_multiple_threads) would be called inside the signal handler.  However
pthread_create (and majority of pthread functions) are not signal-safe. 



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]