This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 5/6][BZ #11588] x86_64: Remove assembly implementations for pthread_cond_*
- From: Torvald Riegel <triegel at redhat dot com>
- To: gratian dot crisan at ni dot com
- Cc: libc-alpha at sourceware dot org, Darren Hart <dvhart at linux dot intel dot com>, "Carlos O'Donell" <carlos at redhat dot com>, Joseph Myers <joseph at codesourcery dot com>, Jeff Law <law at redhat dot com>, Scot Salmon <scot dot salmon at ni dot com>, Siddhesh Poyarekar <spoyarek at redhat dot com>, Thomas Gleixner <tglx at linutronix dot de>, Clark Williams <williams at redhat dot com>, "Paul E. McKenney" <paulmck at linux dot vnet dot ibm dot com>, Will Newton <will dot newton at linaro dot org>, gratian at gmail dot com
- Date: Wed, 13 Aug 2014 18:36:40 +0200
On Tue, 2014-07-29 at 19:31 -0500, gratian.crisan@ni.com wrote:
> From: Gratian Crisan <gratian.crisan@ni.com>
>
> Switch x86_64 from using assembly implementations for pthread_cond_signal,
> pthread_cond_broadcast, pthread_cond_wait, and pthread_cond_timedwait to
> using the generic C implementation. Based on benchmark results (see below),
> the C implementation is comparable in performance, easier to maintain, less
> bug-prone, and supports priority inheritance for associated mutexes.
> Note: the bench-pthread_cond output was edited to fit within 80 columns by
> removing some white space and the 'variance' column.
>
> C implementation, quad core Intel(R) Xeon(R) CPU E5-1620 @ 3.60GHz, gcc 4.7.3
> pthread_cond_[test]      iter/threads     mean    min        max  std. dev
> --------------------------------------------------------------------------
> signal (w/o waiters)     1000000/100    93.002     57    6519657    2679.6
> broadcast (w/o waiters)  1000000/100   96.6929     57   10231506   2996.06
> signal                   1000000/1     2833.97    532      92328   1348.39
> broadcast                1000000/1     3317.85    704     172804   1108.65
> signal/wait              100000/100    7726.83   3388   23269308   22286.5
> signal/timedwait         100000/100    8148.47   3888   23172368   18712.9
> broadcast/wait           100000/100    7895.33   3888   14886020   14894.2
> broadcast/timedwait      100000/100    8362.07   3924   18439204   19950.1
>
> Assembly implementation, quad core Intel(R) Xeon(R) CPU E5-1620 @ 3.60GHz
> pthread_cond_[test]      iter/threads     mean    min        max  std. dev
> --------------------------------------------------------------------------
> signal (w/o waiters)     1000000/100   94.1301     57   69489528   8016.01
> broadcast (w/o waiters)  1000000/100   104.562     57  300175497   39393.4
> signal                   1000000/1     2868.11    510     157149   1363.98
> broadcast                1000000/1     3057.23    688     180376   1192.49
> signal/wait              100000/100    7676.12   3340   24017028   20393.1
> signal/timedwait         100000/100    8157.42   3856   28700448     22368
> broadcast/wait           100000/100    7871.86   3648   27913676   21203.7
> broadcast/timedwait      100000/100    8300.47   4188   27813444   24769.8
>
> C implementation, dual core Intel(R) Atom(TM) CPU E3825 @ 1.33GHz, gcc 4.7.3
> pthread_cond_[test]      iter/threads     mean    min        max  std. dev
> --------------------------------------------------------------------------
> signal (w/o waiters)     1000000/100    95.077     90      28960   33.3326
> broadcast (w/o waiters)  1000000/100   114.874     90      13820   78.6426
> signal                   1000000/1     6704.17   3510      49390   3537.21
> broadcast                1000000/1     6726.35   3850      55430   3297.21
> signal/wait              100000/100    16888.2  12240    6682020   15045.4
> signal/timedwait         100000/100    19246.6  13560    6874950   15969.5
> broadcast/wait           100000/100    17228.5  12390    6461480   14780.2
> broadcast/timedwait      100000/100    19414.5  13910    6656950   15681.8
>
> Assembly implementation, dual core Intel(R) Atom(TM) CPU E3825 @ 1.33GHz
> pthread_cond_[test]      iter/threads     mean    min        max  std. dev
> --------------------------------------------------------------------------
> signal (w/o waiters)     1000000/100    263.81     70  120171680     90138
> broadcast (w/o waiters)  1000000/100   264.213     70  160178010   91861.4
> signal                   1000000/1     15851.7   3800   13372770     13889
> broadcast                1000000/1     16095.2   5900   14940170   16346.7
> signal/wait              100000/100      33151   7930  252746080    475402
> signal/timedwait         100000/100    34921.1  10950  147023040    270191
> broadcast/wait           100000/100    33400.2  11810  247194720    455105
> broadcast/timedwait      100000/100    35022.1  13610  161552720     30328
It seems the assembly implementation (or at least the runs where you used
it) suffers from very large delays that look like outliers: on the Atom, max
is up to several orders of magnitude higher than in the C runs. The Xeon
shows the same effect to a lesser extent.
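
To put a rough number on that gap, here is a small Python sketch. It is not
part of the patch or of bench-pthread_cond; the names ATOM_MAX, XEON_MAX,
orders_of_magnitude and summarize are made up for illustration. It only takes
the max columns from the tables quoted above and reports, per test, how many
orders of magnitude (log10 of the ratio) the assembly max lies above the C max.

```python
import math

# Max latencies copied from the quoted tables, as (C max, assembly max).
ATOM_MAX = {
    "signal (w/o waiters)": (28960, 120171680),
    "broadcast (w/o waiters)": (13820, 160178010),
    "signal": (49390, 13372770),
    "broadcast": (55430, 14940170),
    "signal/wait": (6682020, 252746080),
    "signal/timedwait": (6874950, 147023040),
    "broadcast/wait": (6461480, 247194720),
    "broadcast/timedwait": (6656950, 161552720),
}
XEON_MAX = {
    "signal (w/o waiters)": (6519657, 69489528),
    "broadcast (w/o waiters)": (10231506, 300175497),
    "signal": (92328, 157149),
    "broadcast": (172804, 180376),
    "signal/wait": (23269308, 24017028),
    "signal/timedwait": (23172368, 28700448),
    "broadcast/wait": (14886020, 27913676),
    "broadcast/timedwait": (18439204, 27813444),
}


def orders_of_magnitude(c_max, asm_max):
    """Return log10(asm_max / c_max); hypothetical helper, not a glibc API."""
    return math.log10(asm_max / c_max)


def summarize(rows):
    """Map each test name to the log10 ratio of assembly max over C max."""
    return {name: orders_of_magnitude(c, a) for name, (c, a) in rows.items()}


for label, rows in (("Atom", ATOM_MAX), ("Xeon", XEON_MAX)):
    for name, orders in summarize(rows).items():
        print(f"{label:5s} {name:24s} {orders:5.2f}")
```

A value near 0 means the two max figures are about equal, 1 means the assembly
max is ten times larger, and so on. This compares single worst-case samples
only, so it says nothing about how often such delays occur.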