This is the mail archive of the mailing list for the glibc project.


Re: [PATCH] Unify pthread_once (bug 15215)

On Mon, 2014-04-07 at 17:17 -0300, Adhemerval Zanella wrote:
> On 28-03-2014 20:29, Carlos O'Donell wrote:
> > David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
> >
> > We would like to unify all C-based pthread_once implementations
> > per the plan in bug 15215 for glibc 2.20.
> >
> > Your machines are on the list of C-based pthread_once implementations.
> >
> > See this for the initial discussions on the unified pthread_once:
> >
> >
> > The goal is to provide a single and correct C implementation of 
> > pthread_once. Architectures can then build on that if they need more 
> > optimal implementations, but I don't encourage that and I'd rather
> > see deep discussions on how to make one unified solution where
> > possible.
> >
> > I've also just reviewed Torvald's new pthread_once microbenchmark which
> > you can use to compare your previous C implementation with the new
> > standard C implementation (measures pthread_once latency). The primary
> > use of this test is to help provide objective proof for or against the
> > i386 and x86_64 assembly implementations.
> >
> > We are not presently converting any of the machines with custom
> > implementations, but that will be a next step after testing with the
> > help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
> >
> > If we don't hear any objections we will go forward with this change
> > in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
> > and aarch64 on a single pthread_once implementation based on sparc's C
> > implementation.
> >
> > Any objections to this cleanup for 2.20?
> >
> I tested the patch and benchmark on PowerPC (POWER7) and the results look good:
> * Current code:
> "duration": 5.08322e+09, "iterations": 2.2037e+08, "max": 244.863, "min": 22.08, "mean": 23.0668
> "duration": 5.08316e+09, "iterations": 2.20479e+08, "max": 771.08, "min": 21.984, "mean": 23.0551
> "duration": 5.08306e+09, "iterations": 2.21093e+08, "max": 53.966, "min": 22.052, "mean": 22.9906
> "duration": 5.0833e+09, "iterations": 2.20062e+08, "max": 347.895, "min": 22.15, "mean": 23.0994
> "duration": 5.08277e+09, "iterations": 2.20699e+08, "max": 632.479, "min": 21.997, "mean": 23.0303
> * Optimization:
> "duration": 4.97694e+09, "iterations": 8.42834e+08, "max": 134.181, "min": 5.405, "mean": 5.90501
> "duration": 4.9758e+09, "iterations": 8.66952e+08, "max": 29.163, "min": 5.405, "mean": 5.73941
> "duration": 4.9778e+09, "iterations": 8.51788e+08, "max": 40.819, "min": 5.405, "mean": 5.84394
> "duration": 4.97413e+09, "iterations": 8.52432e+08, "max": 37.089, "min": 5.488, "mean": 5.83523
> "duration": 4.97795e+09, "iterations": 8.43376e+08, "max": 163.703, "min": 5.405, "mean": 5.90241
> I am running on an 18-core machine, so I guess the 'max' outliers are due to timing noise from OS jitter.

There's no spinning in the algorithm currently (or most of glibc AFAIK,
except the rather simplistic attempt in adaptive mutexes), so even small
initialization routines may go straight to blocking via futexes.  (And
AFAIK, futexes don't spin before actually trying to block.)

> Digging into the current code and the unified one, I noted the PowerPC one currently does a load + and + store-conditional
> sequence followed by a full barrier.

Hmm, the code I see is using __lll_acq_instr, which is an isync, which
AFAICT is not a full barrier (hwsync).

> This is overkill for some scenarios: if the initialization is already
> done (val & 2), we don't need a full barrier (isync), since the requirement (as Torvald has explained)
> is just a load acquire.

The atomic_read_barrier used in the unified code seems to result in an
lwsync.  Is that faster than an isync?  If not, the overhead may be due
to something else.

> I ran make check and the results look good. I also checked against the GCC issue that originated the fix
> that added the release barrier in the code ( and
> it does not show the issue with the new implementation.

Thanks for doing the tests.  Would you want to remove the powerpc
implementation after this patch has been committed?

I've also seen that there is a separate s390 implementation as well,
which seems similar to the powerpc algorithm.  Has someone tested this
one as well?

> Finally, I also checked whether replacing the 'atomic_read_barrier' and 'atomic_write_barrier' with
> a load acquire and a store release on POWER would show any difference. The results are not conclusive:
> "duration": 4.97196e+09, "iterations": 8.79836e+08, "max": 22.874, "min": 5.405, "mean": 5.651
> "duration": 4.97144e+09, "iterations": 8.55294e+08, "max": 270.72, "min": 5.4, "mean": 5.81256
> "duration": 4.97496e+09, "iterations": 8.67163e+08, "max": 27.274, "min": 5.405, "mean": 5.73705
> "duration": 4.97603e+09, "iterations": 8.61568e+08, "max": 41.631, "min": 5.405, "mean": 5.77556
> "duration": 4.97347e+09, "iterations": 8.54189e+08, "max": 41.457, "min": 5.405, "mean": 5.82244

Interesting.  Which HW barrier / code did you use for the load acquire?
