This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][AArch64] Single thread lowlevellock optimization

On 20/06/17 19:10, Torvald Riegel wrote:
> On Tue, 2017-06-20 at 16:05 +0100, Szabolcs Nagy wrote:
>> On 20/06/17 14:47, Torvald Riegel wrote:
>>> On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
>>>> Differences compared to the current x86_64 behaviour:
>>>> - The optimization is not silently applied to shared locks, in that
>>>> case the build fails.
>>>> - Unlock assumes the futex value is 0 or 1, there are no waiters to
>>>> wake (that would not work in single thread and libc does not use
>>>> such locks, to be sure lll_cond* is undefed).
>>>> This speeds up a getchar loop about 2-4x depending on the cpu,
>>>> while only cause around 5-10% regression for the multi-threaded case
>>> What measurement of what benchmark resulted in that number (the latter
>>> one)?  Without details of what you are measuring this isn't meaningful.
>> these are all about getchar in a loop
>> for (i=0; i<N; i++) getchar();
>> and then time ./a.out </dev/zero
>> it is i think idiomatic input processing code for a number
>> of cmdline tools and those tools tend to be single threaded.
> Can you measure any CPU time difference for these tools?

gnu dc with some generated input:

$ time taskset -c 1 $NOLOCK/lib64/ --library-path $NOLOCK/lib64 ./dc <dcinput

real	0m21.083s
user	0m21.040s
sys	0m0.040s
$ time taskset -c 1 $ORIG/lib64/ --library-path $ORIG/lib64 ./dc <dcinput

real	0m29.734s
user	0m29.716s
sys	0m0.016s

this also affects $customer tool

(most gnu tools have their own silly buffering
exactly to avoid the slow libc stdio, some tools
use _unlocked interfaces directly which are less
portable, so there are plenty of maintenance issues
caused by leaving this unfixed)

>> the multi-threaded case is just creating a dummy thread to
>> disable the optimization.
> Note that half of the overhead will be in the unlock code, and so is
> executed during the critical section.  That means that you make the
> sequential parts of a program longer, and that will limit the maximum
> amount of parallelism you can have.
> Also, more and more programs will be multi-threaded (though maybe they
> don't do tight getchar() loops like the one above), so it's not quite
> clear whether the 5-10% are less important overall or not.

if this optimization is so bad, then remove it
from x86_64, it affects a lot of users.

>>>> (other libc internal locks are not expected to be performance
>>>> critical or significantly affected by this change).
>>> Why do you think this is the case?
>> there is only an extra branch in the lock and unlock
>> code, i don't see locks in libc that can be hot enough
>> to make that matter, except for stdio and malloc locks.
> If it's just a few of the higher-level clients that you think would
> benefit, this is another reason to optimize there and leave the
> low-level lock unchanged.

i can simplify the stdio patch a bit so it is only
applied to getc/putc/.. then malloc interposition
is not an issue, nor printf hooks.

that should be a safe first step.

>> (it does add some code bloat to libc though)
>> in stdio only getc/putc/getchar/putchar +w variants are
>> short enough to make the optimization practically relevant.
>> the effect on malloc is already much smaller since it has
>> more surrounding code beyond the lock/unlock (instead of
>> 2-4x speed up you get 10% or so with a naive free(malloc(1))
>> in a loop, with more complex workloads i'd expect smaller
>> effect as that would probably go through more branches in
>> malloc/free)
> What about multi-threaded malloc?

<= 5% (value depends on cpu)

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]