This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Benchmarking __libc_single_threaded

From: Florian Weimer <fweimer at redhat dot com>
To: libc-alpha at sourceware dot org
Cc: jwakely at redhat dot com
Date: Fri, 28 Jun 2019 23:57:58 +0200
Subject: Benchmarking __libc_single_threaded

Back in February, I posted a proof-of-concept patch which exposed a
__libc_single_threaded variable (for avoiding atomics in std::shared_ptr
and the like):

  <https://sourceware.org/ml/libc-alpha/2019-02/msg00073.html>
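To illustrate the idea, here is a minimal sketch of a reference counter
that skips atomic read-modify-write operations while a flag says the
process is single-threaded.  The flag demo_single_threaded and the class
RefCount are hypothetical names made up for this sketch.  They are not
the proof-of-concept patch and not what libstdc++ does.  In the real
proposal, libc would own the flag and would have to clear it before a
second thread can run.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the proposed __libc_single_threaded variable.
// Here it is an ordinary global that the program flips by hand.
bool demo_single_threaded = true;

// Illustrative reference counter, not the libstdc++ implementation.
class RefCount {
public:
    void acquire() noexcept {
        if (demo_single_threaded) {
            // Separate relaxed load and store: no atomic
            // read-modify-write instruction is required.
            count_.store(count_.load(std::memory_order_relaxed) + 1,
                         std::memory_order_relaxed);
        } else {
            count_.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // Returns true when the last reference was dropped.
    bool release() noexcept {
        assert(value() > 0);
        if (demo_single_threaded) {
            long n = count_.load(std::memory_order_relaxed) - 1;
            count_.store(n, std::memory_order_relaxed);
            return n == 0;
        }
        return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    long value() const noexcept {
        return count_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<long> count_{1};  // starts with one owner
};
```

The branch on the flag is the part whose cost depends on how the flag is
addressed, which is what the rest of this message tries to measure.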

The question came up whether the hidden per-DSO symbol makes sense
because it trades a lot of complexity for some performance gain.

I have now faked both implementations, by rewriting the <memory> header
to use either approach:

  <https://pagure.io/fweimer/shared_ptr-single-threaded>

Basically, it measures the overhead of passing a std::shared_ptr by
value down a somewhat arbitrarily nested call chain.  Only
single-threaded mode is benchmarked; the multi-threaded mode is quite
slow no matter what.
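The kind of workload meant here can be sketched as follows.  This is not
the code from the repository above; pass_down and run_chain are
hypothetical names, and the sketch assumes the pointer is passed by
value so that every call level copies it (one reference count increment
on entry, one decrement on return).

```cpp
#include <cassert>
#include <memory>

// Hypothetical workload: copy the shared_ptr at each of `depth` nested
// calls and report the use count seen at the innermost level.
long pass_down(std::shared_ptr<int> p, int depth) {
    if (depth == 0)
        return p.use_count();
    return pass_down(p, depth - 1);  // copies p again
}

// Repeat the call chain so that the reference count traffic dominates.
// The sum is returned so the work cannot be discarded as unused.
long run_chain(int iterations, int depth) {
    auto sp = std::make_shared<int>(42);
    long sum = 0;
    for (int i = 0; i < iterations; ++i)
        sum += pass_down(sp, depth);
    return sum;
}
```

A small driver calling run_chain with a large iteration count could then
be run under an external cycle counter.  An optimizing compiler may
inline part of the chain, so a real benchmark would need to guard
against that.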

I ran both variants under “perf stat” to count CPU cycles, which seemed
to me like a relatively stable measurement.

The results on i386 and ppc64le are as expected: the difference between
the hidden and global symbol is in the noise because these architectures
do not have PC-relative loads.
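For reference, the two variants differ only in symbol visibility.  The
sketch below uses the GCC/Clang visibility attribute with hypothetical
variable names; it is not the declaration from the patch.  The comments
describe typical code generation for position-independent code and are
an assumption about the toolchain, not something this sketch verifies.

```cpp
#include <cassert>

// Global (default visibility) variant: in a shared object built as
// position-independent code, the address is typically loaded from the
// GOT first, because the symbol may be defined in another DSO.
__attribute__((visibility("default"))) char demo_flag_global = 1;

// Hidden variant: the symbol is local to this DSO, so on targets with
// PC-relative addressing the compiler can typically read it directly.
// Each DSO then needs its own copy that is kept up to date.
__attribute__((visibility("hidden"))) char demo_flag_hidden = 1;

bool single_threaded_global() { return demo_flag_global != 0; }
bool single_threaded_hidden() { return demo_flag_hidden != 0; }
```

Both functions behave identically at the source level; only the
instruction sequence used to reach the flag differs.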

For x86-64, it depends on the CPU.  I get statistically significant
results (!) either way.  On my laptop, the hidden symbol is 2% faster,
but on one server CPU, it's around 4% slower, and 7% slower on another
server CPU.  This result is quite surprising.  The number of branch
misses is substantially higher in the cases where the hidden symbol is
slower, so something must be confusing the branch predictor.  (A
mispredicted branch into the atomic case is probably quite costly.)
It's conceivable that just the different code layout causes this.

For aarch64, I couldn't get current CPUs during the Beaker roulette.  On
the single CPU I tested, the hidden symbol was around 14% faster, but
I'm not sure if that means much.  (It's an outdated CPU generation,
possibly pre-production silicon.)

Anyway, so it looks like this is basically impossible to benchmark
properly.  I think we can therefore use the simpler approach with a
global symbol.

Thanks,
Florian
