This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v2] Single threaded stdio optimization


On 29/06/17 13:12, Carlos O'Donell wrote:
> On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
>> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>>> The patch looks OK except for the duplication (and a missing comment
>>> below), which looks a bit clumsy.  How about something like this instead:
>>>
>>>   bool need_lock = _IO_need_lock (fp);
>>>
>>>   if (need_lock)
>>>     _IO_flockfile (fp);
>>>   result = _IO_ferror_unlocked (fp);
>>>   if (need_lock)
>>>     _IO_funlockfile (fp);
>>>
>>>   return result;
>>>
>>> You could probably make some kind of a macro out of this, I haven't
>>> looked that hard.
>>
>> I forgot that Torvald had commented (off-list, the thread broke somehow)
>> that it would be important to try and measure how much worse this makes
>> the multi-threaded case.
> 
> +1
> 
> If we are going to optimize the single threaded case we need to know what
> impact this has on the multi-threaded case.
> 
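(for illustration, the lock/unlock pattern suggested above could be folded
into a macro roughly like this; fake_file, fake_lock and fake_unlock are
stand-ins for the glibc internals (_IO_FILE, _IO_flockfile, _IO_funlockfile),
so this is a sketch of the pattern only, not the real glibc code:)

```c
/* stand-in for the real FILE object: just a 'needs lock' flag
   and an error flag to query.  */
struct fake_file { int need_lock; int error_flag; };

static void fake_lock (struct fake_file *fp) { (void) fp; }
static void fake_unlock (struct fake_file *fp) { (void) fp; }

/* evaluate EXPR into RESULT, taking the lock only when the
   single-thread fast path does not apply.  */
#define WITH_OPTIONAL_LOCK(fp, expr, result)	\
  do						\
    {						\
      int __need = (fp)->need_lock;		\
      if (__need)				\
	fake_lock (fp);				\
      (result) = (expr);			\
      if (__need)				\
	fake_unlock (fp);			\
    }						\
  while (0)

/* an ferror-like query written with the macro */
int
fake_ferror (struct fake_file *fp)
{
  int result;
  WITH_OPTIONAL_LOCK (fp, fp->error_flag, result);
  return result;
}
```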

$orig == current
$stdio == my patch
$stdio_mt == my patch but 'needs lock' flag is set so multithread path is taken

on two particular aarch64 cpus with a particular loop count:

cpu1
time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
8.08user 0.04system 0:08.12elapsed 100%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
1.07user 0.04system 0:01.11elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
7.87user 0.00system 0:07.88elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps

cpu2
time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
8.11user 0.04system 0:08.16elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
2.29user 0.06system 0:02.35elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
8.12user 0.03system 0:08.16elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps

on a particular x86_64 cpu with particular loop count:

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
0inputs+0outputs (0major+153minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
6.76user 0.08system 0:06.87elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps

in summary: on aarch64 i see no regression (in some cases stdio_mt
even became faster, which can happen since the code layout changed);
on this particular x86 cpu stdio_mt shows close to a 15% regression.

i don't believe the big regression on x86 is meaningful; it could
be that the benchmark just got past some cpu internal limit
or the code got aligned differently. in fact, if i statically link
the exact same code, then on the same cpu i get

time ./getchar_static-orig
6.60user 0.05system 0:06.66elapsed 99%CPU (0avgtext+0avgdata 912maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps
time ./getchar_static-stdio
2.24user 0.08system 0:02.33elapsed 99%CPU (0avgtext+0avgdata 896maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps
time ./getchar_static-stdio_mt
6.50user 0.06system 0:06.57elapsed 99%CPU (0avgtext+0avgdata 896maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps

i.e. now the version with the extra branch is faster! (both
measurements are repeatable)

i didn't dig into the root cause of the regression (or into
why static linking is slower), and i would not be too
worried about it, since the common case for hot stdio
loops is single-threaded processes, where even on x86
the patch gives a >2x speedup.

