Re: [PATCH] Unify pthread_once (bug 15215)
- From: Torvald Riegel <triegel@redhat.com>
- To: Adhemerval Zanella <azanella@linux.vnet.ibm.com>
- Cc: libc-alpha@sourceware.org
- Date: Thu, 10 Apr 2014 21:57:40 +0200
- Subject: Re: [PATCH] Unify pthread_once (bug 15215)
- Authentication-results: sourceware.org; auth=none
- References: <1368024237.7774.794.camel@triegel.csb> <519D97E4.4030808@redhat.com> <1381018836.8757.3598.camel@triegel.csb> <Pine.LNX.4.64.1310071604090.31470@digraph.polyomino.org.uk> <1381182784.18547.138.camel@triegel.csb> <533605BF.9000005@redhat.com> <534307E1.5050904@linux.vnet.ibm.com>
On Mon, 2014-04-07 at 17:17 -0300, Adhemerval Zanella wrote:
> On 28-03-2014 20:29, Carlos O'Donell wrote:
> > David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
> >
> > We would like to unify all C-based pthread_once implementations
> > per the plan in bug 15215 for glibc 2.20.
> >
> > Your machines are on the list of C-based pthread_once implementations.
> >
> > See this for the initial discussions on the unified pthread_once:
> > https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
> >
> > The goal is to provide a single and correct C implementation of
> > pthread_once. Architectures can then build on that if they need more
> > optimal implementations, but I don't encourage that and I'd rather
> > see deep discussions on how to make one unified solution where
> > possible.
> >
> > I've also just reviewed Torvald's new pthread_once microbenchmark which
> > you can use to compare your previous C implementation with the new
> > standard C implementation (measures pthread_once latency). The primary
> > use of this test is to help provide objective proof for or against the
> > i386 and x86_64 assembly implementations.
> >
> > We are not presently converting any of the machines with custom
> > implementations, but that will be a next step after testing with the
> > help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
> >
> > If we don't hear any objections we will go forward with this change
> > in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
> > and aarch64 on a single pthread_once implementation based on sparc's C
> > implementation.
> >
> > Any objections to this cleanup for 2.20?
> >
> I tested the patch and benchmark on PowerPC (POWER7) and the results look good:
>
> * Current code:
> "duration": 5.08322e+09, "iterations": 2.2037e+08, "max": 244.863, "min": 22.08, "mean": 23.0668
> "duration": 5.08316e+09, "iterations": 2.20479e+08, "max": 771.08, "min": 21.984, "mean": 23.0551
> "duration": 5.08306e+09, "iterations": 2.21093e+08, "max": 53.966, "min": 22.052, "mean": 22.9906
> "duration": 5.0833e+09, "iterations": 2.20062e+08, "max": 347.895, "min": 22.15, "mean": 23.0994
> "duration": 5.08277e+09, "iterations": 2.20699e+08, "max": 632.479, "min": 21.997, "mean": 23.0303
>
> * Optimization:
> "duration": 4.97694e+09, "iterations": 8.42834e+08, "max": 134.181, "min": 5.405, "mean": 5.90501
> "duration": 4.9758e+09, "iterations": 8.66952e+08, "max": 29.163, "min": 5.405, "mean": 5.73941
> "duration": 4.9778e+09, "iterations": 8.51788e+08, "max": 40.819, "min": 5.405, "mean": 5.84394
> "duration": 4.97413e+09, "iterations": 8.52432e+08, "max": 37.089, "min": 5.488, "mean": 5.83523
> "duration": 4.97795e+09, "iterations": 8.43376e+08, "max": 163.703, "min": 5.405, "mean": 5.90241
>
> I am running on an 18-core machine, so I guess the 'max' is due to a timing issue from OS jitter.
There's no spinning in the algorithm currently (or most of glibc AFAIK,
except the rather simplistic attempt in adaptive mutexes), so even small
initialization routines may go straight to blocking via futexes. (And
AFAIK, futexes don't spin before actually trying to block.)
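To make that concrete, here is a rough sketch of the general shape of
such a once implementation (a state word plus a futex).  The state
encoding, helper names, and error handling are purely illustrative and
deliberately simplified; the real code additionally has to deal with
fork generations and cancellation.

#include <limits.h>          /* INT_MAX */
#include <linux/futex.h>     /* FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE */
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative state encoding: 0 = not started, 1 = in progress, 2 = done.  */

static void
futex_wait_sketch (void *addr, int expected)
{
  /* Sleep until *ADDR is observed to differ from EXPECTED (or we are woken).
     Note: no spinning before blocking.  */
  syscall (SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void
futex_wake_all_sketch (void *addr)
{
  syscall (SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
}

int
once_sketch (_Atomic int *ctrl, void (*init_routine) (void))
{
  for (;;)
    {
      /* Fast path: acquire load, so that if initialization is done we also
         see all of its side effects.  */
      int val = atomic_load_explicit (ctrl, memory_order_acquire);
      if (val == 2)
        return 0;

      int expected = 0;
      if (val == 0
          && atomic_compare_exchange_strong_explicit (ctrl, &expected, 1,
                                                      memory_order_acquire,
                                                      memory_order_acquire))
        {
          /* We won the race: run the initializer, publish completion with a
             release store, and wake any blocked waiters.  */
          init_routine ();
          atomic_store_explicit (ctrl, 2, memory_order_release);
          futex_wake_all_sketch (ctrl);
          return 0;
        }

      /* Somebody else is initializing: block right away via the futex; if it
         finished in the meantime, FUTEX_WAIT returns immediately.  */
      if (atomic_load_explicit (ctrl, memory_order_relaxed) == 1)
        futex_wait_sketch (ctrl, 1);
    }
}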
> Digging into the current code and the unified one, I noted the PowerPC one is currently doing a load + and + store-conditional
> sequence followed by a full barrier.
Hmm, the code I see is using __lll_acq_instr, which is an isync, which
AFAICT is not a full barrier (hwsync).
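For reference, and with the caveat that these are illustrative inline-asm
definitions rather than the actual glibc macros, the three POWER barriers
in play here behave roughly as follows:

/* Illustrative only -- not glibc's definitions.  */

/* isync: context synchronization.  Not a memory barrier by itself, but
   placed after a load plus a conditional branch that depends on the loaded
   value it yields acquire-like ordering.  */
#define ppc_isync()   __asm__ __volatile__ ("isync"  ::: "memory")

/* lwsync: lightweight sync.  Orders load-load, load-store and store-store,
   but not store-load.  */
#define ppc_lwsync()  __asm__ __volatile__ ("lwsync" ::: "memory")

/* sync (hwsync): heavyweight sync.  Full barrier, including store-load
   ordering.  */
#define ppc_hwsync()  __asm__ __volatile__ ("sync"   ::: "memory")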
> This is overkill for some scenarios: if the initialization is
> done (val & 2) we don't need a full barrier (isync), since the requirement (as Torvald has explained)
> is just a load acquire.
The atomic_read_barrier used in the unified code seems to result in an
lwsync. Is that faster than an isync? If not, the overhead may be due
to something else.
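To spell out what is being compared (as a sketch only, not the actual
sources): the unified code effectively does a plain load of the once-control
word followed by a read barrier, whereas the same acquire semantics could be
written as a single C11 acquire load.  The first function below reuses the
ppc_lwsync macro sketched above; the names are made up for illustration.

#include <stdatomic.h>

/* (a) What the unified C code effectively does: plain load followed by a
   read barrier, with atomic_read_barrier assumed to expand to lwsync on
   POWER as discussed above.  */
static int
read_state_with_read_barrier (int *once_control)
{
  int val = *(volatile int *) once_control;
  ppc_lwsync ();   /* stand-in for atomic_read_barrier */
  return val;
}

/* (b) A single acquire load expressed with C11 atomics; on POWER a compiler
   may implement this as load + lwsync or as load + cmp + bne + isync.  */
static int
read_state_acquire (_Atomic int *once_control)
{
  return atomic_load_explicit (once_control, memory_order_acquire);
}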
> I ran make check and the results look good. I also checked the GCC issue that originated the fix
> that added the release barrier in the code (gcc.gnu.org/bugzilla/show_bug.cgi?id=52839#c10), and
> it does not show the issue with the new implementation.
Thanks for doing the tests. Would you want to remove the powerpc
implementation after this patch has been committed?
I've also seen that there is a separate s390 implementation as well,
which seems similar to the powerpc algorithm. Has someone tested this
one as well?
> Finally, I also checked whether replacing the 'atomic_read_barrier' and 'atomic_write_barrier' with
> a load acquire and a store release on POWER shows any difference. The results are not conclusive:
>
> "duration": 4.97196e+09, "iterations": 8.79836e+08, "max": 22.874, "min": 5.405, "mean": 5.651
> "duration": 4.97144e+09, "iterations": 8.55294e+08, "max": 270.72, "min": 5.4, "mean": 5.81256
> "duration": 4.97496e+09, "iterations": 8.67163e+08, "max": 27.274, "min": 5.405, "mean": 5.73705
> "duration": 4.97603e+09, "iterations": 8.61568e+08, "max": 41.631, "min": 5.405, "mean": 5.77556
> "duration": 4.97347e+09, "iterations": 8.54189e+08, "max": 41.457, "min": 5.405, "mean": 5.82244
>
Interesting. Which HW barrier / code did you use for the load acquire?
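For what it's worth, one common way to write such a load acquire on POWER
(purely illustrative, and not necessarily what was measured here) is the
classic load / cmp / bne / isync idiom:

/* Acquire load built from the cmp/bne/isync idiom: the never-taken branch
   depends on the loaded value, and the isync keeps later memory accesses
   from being performed before the load.  Sketch only.  */
static inline int
load_acquire_isync (const int *p)
{
  int v = *(const volatile int *) p;
  __asm__ __volatile__ ("cmpw %0,%0\n\t"
                        "bne- 1f\n"
                        "1: isync"
                        : /* no outputs */
                        : "r" (v)
                        : "cr0", "memory");
  return v;
}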