This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v2] Single threaded stdio optimization
On 06/30/2017 10:39 AM, Szabolcs Nagy wrote:
> On 30/06/17 14:16, Carlos O'Donell wrote:
>> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>>> i didn't dig into the root cause of the regression (or
>>> why is static linking slower?), i would not be too
>>> worried about it since the common case for hot stdio
>>> loops is in single thread processes where even on x86
>>> the patch gives >2x speedup.
>>
>> Regardless of the cause, the 15% regression on x86 MT performance
>> is serious, and I see no reason to push this into glibc 2.26.
>> We can add it any time in 2.27, or the distros can pick it up with
>> a backport.
>>
>> I would like to see a better characterization of the regression before
>> accepting this patch.
>>
>> While I agree that the common case for hot stdio loops is non-MT, there
>> are still MT cases, and 15% is a large double-digit loss.
>>
>> Have you looked at the assembly differences? What is the compiler
>> doing differently?
>>
>> When a user asks "Why is my MT stdio 15% slower?" we owe them an
>> answer that is clear and concise.
>>
>
> sorry the x86 measurement was bogus because only
> the high level code thought it's multithreaded, the
> lowlevellock code thought it's single threaded so
> there were no atomic ops executed in the stdio_mt case
OK.
> with atomics the orig performance is significantly
> slower so the regression relative to that is small in %.
>
> if i create a dummy thread (to measure true mt
> behaviour, same loop count):
>
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
> 20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
> 0inputs+0outputs (0major+180minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
> 20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
> 0inputs+0outputs (0major+179minor)pagefaults 0swaps
>
> the relative diff is 2% now, but notice that the
> abs diff went down too (which points to uarch issue
> in the previous measurement).
OK. This is much better.
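
For readers who want to reproduce this kind of measurement, here is a minimal sketch of what a getchar_mt-style test might look like. The actual benchmark source is not shown in this thread, so the names dummy_thread and count_chars are hypothetical. It reads through getc(fp) rather than getchar() so it can run on any stream (getchar() is equivalent to getc(stdin)). It also joins the dummy thread immediately, on the assumption that having created a thread at all is what moves stdio onto the locked path.

```c
#include <pthread.h>
#include <stdio.h>

/* Thread body that does nothing; its only purpose is to make the
   process multithreaded before the stdio loop starts. */
static void *dummy_thread(void *arg)
{
    return arg;
}

/* Read fp to EOF one character at a time and return the count.
   If make_mt is nonzero, create and join a dummy thread first.
   Returns -1 if the thread cannot be created. */
long count_chars(FILE *fp, int make_mt)
{
    if (make_mt) {
        pthread_t t;
        if (pthread_create(&t, NULL, dummy_thread, NULL) != 0)
            return -1;
        pthread_join(t, NULL);
    }

    long n = 0;
    while (getc(fp) != EOF)   /* the hot loop being timed */
        n++;
    return n;
}
```

Calling count_chars(stdin, 1) from main on a large input and running it under time, once per glibc build as in the quoted commands, would give a comparison along these lines.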
> perf stat indicates that there are 15 vs 16 branches
> in the loop (so my patch indeed adds one branch
> but there are plenty branches already) the instruction
> count goes from 43 to 45 per loop iteration
> (flag check + branch).
>
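
To make the "flag check + branch" concrete, here is a rough sketch of the general idea using only public POSIX stdio calls. It is not the code from the patch: need_lock and getc_maybe_locked are hypothetical names, and the real change lives inside glibc's stdio internals rather than in a wrapper like this.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Hypothetical flag: nonzero once the process may have more than
   one thread touching stdio. */
static int need_lock = 0;

/* Read one character, taking the stream lock only when need_lock
   is set. */
int getc_maybe_locked(FILE *fp)
{
    if (!need_lock)                 /* the extra flag check + branch */
        return getc_unlocked(fp);   /* single-threaded fast path */

    flockfile(fp);                  /* multithreaded path: lock, read, unlock */
    int c = getc_unlocked(fp);
    funlockfile(fp);
    return c;
}
```

On the multithreaded path the check is pure overhead on top of the existing locking, which matches the one extra branch per iteration reported above.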
> in my previous measurements, how can +1 branch
> decrease the performance >10% when there are
> already >10 branches (and several other insns)
> is something the x86 uarchitects could explain.
>
> in summary the patch trades 2% mt performance for
> 2x non-mt performance on this x86 cpu.
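
The percentages can be checked directly from the figures quoted above. The following is a small sketch of that arithmetic. percent_change and print_summary are illustrative names, and the inputs are the user times and per-iteration counts from the time and perf stat output in this message.

```c
#include <stdio.h>

/* Percentage change going from base to changed. */
double percent_change(double base, double changed)
{
    return (changed - base) / base * 100.0;
}

/* Print the relative changes for the figures quoted in the thread. */
void print_summary(void)
{
    printf("user time:    %+.1f%%\n", percent_change(20.31, 20.72));
    printf("instructions: %+.1f%%\n", percent_change(43.0, 45.0));
    printf("branches:     %+.1f%%\n", percent_change(15.0, 16.0));
}
```

This gives about +2.0% for user time, +4.7% for instructions and +6.7% for branches per iteration, so the measured slowdown is smaller than the growth in instruction count.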
Excellent, this is exactly the analysis I was looking for, and this kind
of result is something that can make sense to our users.
I'm OK with the patch for 2.26.
--
Cheers,
Carlos.