This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH][AArch64] Single thread lowlevellock optimization
- From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: nd at arm dot com, GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 21 Jun 2017 10:22:15 +0100
- Subject: Re: [PATCH][AArch64] Single thread lowlevellock optimization
On 20/06/17 19:10, Torvald Riegel wrote:
> On Tue, 2017-06-20 at 16:05 +0100, Szabolcs Nagy wrote:
>> On 20/06/17 14:47, Torvald Riegel wrote:
>>> On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
>>>> Differences compared to the current x86_64 behaviour:
>>>> - The optimization is not silently applied to shared locks, in that
>>>> case the build fails.
>>>> - Unlock assumes the futex value is 0 or 1, there are no waiters to
>>>> wake (that would not work in single thread and libc does not use
>>>> such locks, to be sure lll_cond* is undefed).
>>>> This speeds up a getchar loop about 2-4x depending on the cpu,
>>>> while causing only around a 5-10% regression for the multi-threaded case
>>> What measurement of what benchmark resulted in that number (the latter
>>> one)? Without details of what you are measuring this isn't meaningful.
>> these are all about getchar in a loop
>> for (i=0; i<N; i++) getchar();
>> and then time ./a.out </dev/zero
>> it is, i think, idiomatic input-processing code for a number
>> of cmdline tools, and those tools tend to be single-threaded.
> Can you measure any CPU time difference for these tools?
gnu dc with some generated input:
$ time taskset -c 1 $NOLOCK/lib64/ld-linux-aarch64.so.1 --library-path $NOLOCK/lib64 ./dc <dcinput
$ time taskset -c 1 $ORIG/lib64/ld-linux-aarch64.so.1 --library-path $ORIG/lib64 ./dc <dcinput
this also affects $customer tool
(most gnu tools have their own silly buffering
exactly to avoid the slow libc stdio, and some tools
use the _unlocked interfaces directly, which are less
portable; so there are plenty of maintenance issues
caused by leaving this unfixed)
>> the multi-threaded case is just creating a dummy thread to
>> disable the optimization.
> Note that half of the overhead will be in the unlock code, and so is
> executed during the critical section. That means that you make the
> sequential parts of a program longer, and that will limit the maximum
> amount of parallelism you can have.
> Also, more and more programs will be multi-threaded (though maybe they
> don't do tight getchar() loops like the one above), so it's not quite
> clear whether the 5-10% are less important overall or not.
if this optimization is so bad, then remove it
from x86_64; it affects a lot of users.
>>>> (other libc internal locks are not expected to be performance
>>>> critical or significantly affected by this change).
>>> Why do you think this is the case?
>> there is only an extra branch in the lock and unlock
>> code; i don't see locks in libc that can be hot enough
>> for that to matter, except for the stdio and malloc locks.
> If it's just a few of the higher-level clients that you think would
> benefit, this is another reason to optimize there and leave the
> low-level lock unchanged.
i can simplify the stdio patch a bit so it is only
applied to getc/putc/..; then neither malloc interposition
nor the printf hooks are an issue.
that should be a safe first step.
>> (it does add some code bloat to libc though)
>> in stdio only getc/putc/getchar/putchar and their wide (w)
>> variants are short enough to make the optimization practically
>> relevant.
>> the effect on malloc is already much smaller since it has
>> more surrounding code beyond the lock/unlock (instead of a
>> 2-4x speed up you get 10% or so with a naive free(malloc(1))
>> in a loop; with more complex workloads i'd expect a smaller
>> effect as that would probably go through more branches in
>> malloc anyway)
> What about multi-threaded malloc?
<= 5% (value depends on cpu)