This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386


On Sat, Oct 12, 2013 at 2:10 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Sat, Oct 12, 2013 at 11:40:36PM +0300, Torvald Riegel wrote:
>> On Fri, 2013-10-11 at 22:40 -0700, pinskia@gmail.com wrote:
>> >
>> > > On Oct 11, 2013, at 1:28 PM, Torvald Riegel <triegel@redhat.com> wrote:
>> > >
>> > > Assuming the pthread_once unification I sent recently is applied, we
>> > > still have custom x86_64 and i386 variants of pthread_once.  The
>> > > algorithm they use is the same as the unified variant, so we would be
>> > > able to remove the custom variants if this doesn't affect performance.
>> > >
>> > > The common case when pthread_once is executed is that the initialization
>> > > has already been performed; thus, this is the fast path that we can
>> > > focus on.  (I haven't looked specifically at the generated code for the
>> > > slow path, but the algorithm is the same and I assume that the overhead
>> > > of the synchronizing instructions and futex syscalls determines the
>> > > performance of it, not any differences between compiler-generated code
>> > > and the custom code.)
>> > >
>> > > The fast path of the custom assembler version:
>> > >    testl    $2, (%rdi)
>> > >    jz    1f
>> > >    xorl    %eax, %eax
>> > >    retq
>> > >
>> > > The fast path of the generic pthread_once C code, as it is after the
>> > > pthread_once unification patch:
>> > >  20:   48 89 5c 24 e8          mov    %rbx,-0x18(%rsp)
>> > >  25:   48 89 6c 24 f0          mov    %rbp,-0x10(%rsp)
>> > >  2a:   48 89 fb                mov    %rdi,%rbx
>> > >  2d:   4c 89 64 24 f8          mov    %r12,-0x8(%rsp)
>> > >  32:   48 89 f5                mov    %rsi,%rbp
>> > >  35:   48 83 ec 38             sub    $0x38,%rsp
>> > >  39:   41 b8 ca 00 00 00       mov    $0xca,%r8d
>> > >  3f:   8b 13                   mov    (%rbx),%edx
>> > >  41:   f6 c2 02                test   $0x2,%dl
>> > >  44:   74 16                   je     5c <__pthread_once+0x3c>
>> > >  46:   31 c0                   xor    %eax,%eax
>> > >  48:   48 8b 5c 24 20          mov    0x20(%rsp),%rbx
>> > >  4d:   48 8b 6c 24 28          mov    0x28(%rsp),%rbp
>> > >  52:   4c 8b 64 24 30          mov    0x30(%rsp),%r12
>> > >  57:   48 83 c4 38             add    $0x38,%rsp
>> > >  5b:   c3                      retq
>> >
>> > Seems like this is a good case where shrink wrapping should have
>> > helped.  What version of GCC did you try this with, and if it was
>> > 4.8 or later, can you file a bug for this missed optimization?
>>
>> I used gcc 4.4.
>>
>> If 4.8 generates a leaner fast path, would people think that this is
>> enough reason not to split out the fast path manually?
>
> I doubt gcc 4.8 does any better. gcc is _REALLY_ bad at optimizing
> functions which have "fast paths" (is this what "shrink wrapping"
> means?); it always ends up with a huge, bloated stack-frame setup in
> the prologue and epilogue even when the fast path does not need a
> stack frame at all. This seems to have gotten worse, not better, over
> time; I remember gcc 2.95 and/or 3.x being much less offensive in
> this regard. This is one place the libfirm/cparser project is already
> blowing gcc away...

I think you should be filing bugs about this rather than complaining
to this list.  Also, shrink wrapping works in some places and has been
improving with every GCC release since 4.8; GCC 4.9 has some
improvements too.  I know because I filed the original bug requesting
shrink wrapping (before I knew it was called shrink wrapping :) ).

Thanks,
Andrew


>
> Rich

