This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH v2] Single threaded stdio optimization
On 30/06/17 14:16, Carlos O'Donell wrote:
> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>> i didn't dig into the root cause of the regression (or
>> why is static linking slower?), i would not be too
>> worried about it since the common case for hot stdio
>> loops is in single thread processes where even on x86
>> the patch gives >2x speedup.
> Regardless of the cause, the 15% regression on x86 MT performance
> is serious, and I see no reason to push this into glibc 2.26.
> We can add it any time in 2.27, or the distros can pick it up with
> a backport.
> I would like to see a better characterization of the regression before
> accepting this patch.
> While I agree that the common case for hot stdio loops is non-MT, there
> are still MT cases, and 15% is a large double-digit loss.
> Have you looked at the assembly differences? What is the compiler
> doing differently?
> When a user asks "Why is my MT stdio 15% slower?" we owe them an
> answer that is clear and concise.
sorry, the x86 measurement was bogus: only the high
level code thought the process was multithreaded, while
the lowlevellock code thought it was single threaded, so
no atomic ops were executed in the stdio_mt case.
with atomics the orig performance is significantly
slower, so the regression relative to that is small in %.
if i create a dummy thread (to measure true mt
behaviour, same loop count):
time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
0inputs+0outputs (0major+180minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
0inputs+0outputs (0major+179minor)pagefaults 0swaps
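the source of getchar_mt is not included in this mail. a minimal sketch of what such a test could look like is below; the function names are hypothetical, and it is an assumption about libc internals that creating (and joining) one thread is enough to switch both the stdio code and the lowlevellock code to their multithreaded paths.

```c
#include <pthread.h>
#include <stdio.h>

/* Dummy thread body: does no work and returns immediately. */
void *dummy_thread(void *arg)
{
    return arg;
}

/* Create and join one dummy thread so the process is no longer
   treated as single threaded (assumption: libc does not switch
   back once a thread has existed). Returns 0 on success. */
int make_process_multithreaded(void)
{
    pthread_t t;
    int err = pthread_create(&t, NULL, dummy_thread, NULL);
    if (err != 0)
        return err;
    return pthread_join(t, NULL);
}

/* Hot loop under test: read the stream one character at a time
   with the locking getc(). Returns the number of characters read. */
unsigned long drain_stream(FILE *fp)
{
    unsigned long n = 0;
    while (getc(fp) != EOF)
        n++;
    return n;
}
```

a main() for this would call make_process_multithreaded() once and then drain_stream(stdin) with stdin redirected from a large file, built with -pthread and run under time as above.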
the relative diff is 2% now, but notice that the
abs diff went down too (which points to a uarch issue
in the previous measurement).
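for reference, the 2% figure follows from the user times of the two runs (20.31s vs 20.72s). a small sketch of the arithmetic, with hypothetical helper names:

```c
#include <stdio.h>

/* Relative change of `patched` against `orig`, in percent. */
double relative_diff_percent(double orig, double patched)
{
    return (patched - orig) / orig * 100.0;
}

/* Print absolute and relative difference of the user times
   reported by the two getchar_mt runs. */
void print_getchar_mt_diff(void)
{
    const double orig_user = 20.31;    /* seconds, $orig run */
    const double patched_user = 20.72; /* seconds, $stdio run */
    printf("abs diff: %.2f s, rel diff: %.1f%%\n",
           patched_user - orig_user,
           relative_diff_percent(orig_user, patched_user));
}
```

this gives an absolute difference of about 0.41s and a relative difference of about 2.0%.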
perf stat indicates that there are 15 vs 16 branches
in the loop (so my patch indeed adds one branch, but
there are plenty of branches already). the instruction
count goes from 43 to 45 per loop iteration
(flag check + branch).
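to illustrate what "flag check + branch" means here, a toy sketch follows. it is not the patch and not glibc's FILE layout; struct toy_stream, need_lock and the toy_* functions are hypothetical names, and a pthread mutex stands in for the stream lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a stdio stream. */
struct toy_stream {
    const unsigned char *pos;  /* next byte to return */
    const unsigned char *end;  /* one past the last byte */
    bool need_lock;            /* set once the process has threads */
    pthread_mutex_t lock;
};

void toy_stream_init(struct toy_stream *s, const void *buf,
                     size_t len, bool need_lock)
{
    s->pos = buf;
    s->end = s->pos + len;
    s->need_lock = need_lock;
    pthread_mutex_init(&s->lock, NULL);
}

/* Unlocked read: returns the next byte, or -1 at end of buffer. */
int toy_getc_unlocked(struct toy_stream *s)
{
    return s->pos < s->end ? *s->pos++ : -1;
}

/* Locking read with a single threaded fast path. */
int toy_getc(struct toy_stream *s)
{
    if (!s->need_lock)               /* the extra flag check + branch */
        return toy_getc_unlocked(s); /* non-mt: no lock, no atomics */

    pthread_mutex_lock(&s->lock);    /* mt: same locking as before */
    int c = toy_getc_unlocked(s);
    pthread_mutex_unlock(&s->lock);
    return c;
}
```

in the mt case the only added cost in this sketch is the load and branch on need_lock before the lock is taken, which matches the +1 branch and +2 instructions seen above.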
how +1 branch could decrease the performance by >10%
in my previous measurements, when there are already
>10 branches (and several other insns) in the loop,
is something the x86 uarchitects could explain.
in summary, the patch trades 2% mt performance for
2x non-mt performance on this x86 cpu.