This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386


On Sat, Oct 12, 2013 at 11:40:36PM +0300, Torvald Riegel wrote:
> On Fri, 2013-10-11 at 22:40 -0700, pinskia@gmail.com wrote:
> > 
> > > On Oct 11, 2013, at 1:28 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > > 
> > > Assuming the pthread_once unification I sent recently is applied, we
> > > still have custom x86_64 and i386 variants of pthread_once.  The
> > > algorithm they use is the same as the unified variant, so we would be
> > > able to remove the custom variants if this doesn't affect performance.
> > > 
> > > The common case when pthread_once is executed is that the initialization
> > > has already been performed; thus, this is the fast path that we can
> > > focus on.  (I haven't looked specifically at the generated code for the
> > > slow path, but the algorithm is the same and I assume that the overhead
> > > of the synchronizing instructions and futex syscalls determines the
> > > performance of it, not any differences between compiler-generated code
> > > and the custom code.)
> > > 
> > > The fast path of the custom assembler version:
> > >    testl    $2, (%rdi)     # "init done" bit set in *once_control?
> > >    jz    1f                # no: fall through to the slow path
> > >    xorl    %eax, %eax      # yes: return 0
> > >    retq
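
(For comparison, that four-instruction fast path corresponds roughly to
the C shape below.  This is a minimal sketch, not the actual glibc
source: it assumes bit 1 of the once-control word means "initialization
finished", matching the "testl $2" above, and __pthread_once_slow is a
hypothetical name for the out-of-line slow path.)

    #include <pthread.h>

    /* Hypothetical out-of-line slow path; the name is invented for
       this sketch.  */
    extern int __pthread_once_slow (pthread_once_t *once_control,
                                    void (*init_routine) (void));

    int
    pthread_once (pthread_once_t *once_control,
                  void (*init_routine) (void))
    {
      /* Assumption: bit 1 set means the initialization already ran,
         mirroring the "testl $2" check above.  */
      if (*(volatile int *) once_control & 2)
        return 0;                   /* fast path: nothing left to do */
      return __pthread_once_slow (once_control, init_routine);
    }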
> > > 
> > > The fast path of the generic pthread_once C code, as it is after the
> > > pthread_once unification patch:
> > >  20:   48 89 5c 24 e8          mov    %rbx,-0x18(%rsp)
> > >  25:   48 89 6c 24 f0          mov    %rbp,-0x10(%rsp)
> > >  2a:   48 89 fb                mov    %rdi,%rbx
> > >  2d:   4c 89 64 24 f8          mov    %r12,-0x8(%rsp)
> > >  32:   48 89 f5                mov    %rsi,%rbp
> > >  35:   48 83 ec 38             sub    $0x38,%rsp
> > >  39:   41 b8 ca 00 00 00       mov    $0xca,%r8d
> > >  3f:   8b 13                   mov    (%rbx),%edx
> > >  41:   f6 c2 02                test   $0x2,%dl
> > >  44:   74 16                   je     5c <__pthread_once+0x3c>
> > >  46:   31 c0                   xor    %eax,%eax
> > >  48:   48 8b 5c 24 20          mov    0x20(%rsp),%rbx
> > >  4d:   48 8b 6c 24 28          mov    0x28(%rsp),%rbp
> > >  52:   4c 8b 64 24 30          mov    0x30(%rsp),%r12
> > >  57:   48 83 c4 38             add    $0x38,%rsp
> > >  5b:   c3                      retq   
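
(For contrast with the hand-written version: before the "test $0x2"
even executes, this compiled fast path has already saved %rbx, %rbp,
and %r12, reserved a 0x38-byte frame, and preloaded $0xca into %r8d,
which is 202, the x86-64 futex syscall number, presumably hoisted for
the slow path's futex calls.  The already-initialized case thus runs
sixteen instructions where the assembler version needs four.)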
> > 
> > Seems like this is a good case where shrink wrapping should have
> > helped. What version of GCC did you try this with, and if it was
> > 4.8 or later, can you file a bug for this missed optimization?
> 
> I used gcc 4.4.
> 
> If 4.8 generates a leaner fast path, would people consider that
> enough reason not to split out the fast path manually?

I doubt gcc 4.8 does any better. gcc is _REALLY_ bad at optimizing
functions which have "fast paths" (is this what "shrink wrapping"
means?); it always ends up with a huge, bloated stack-frame setup in
the prologue and epilogue even when the fast path does not need a
stack frame at all. This seems to have gotten worse, not better, over
time; I remember gcc 2.95 and/or 3.x being much less offensive in
this regard. This is one place where the libfirm/cparser project is
already blowing gcc away...
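
(To make "shrink wrapping" concrete: it is the optimization in which
the compiler emits the register-saving prologue only on the paths that
actually need it, rather than unconditionally at function entry.  One
can get much the same effect by splitting the fast path out by hand,
along the lines of the minimal sketch below; the names are
hypothetical, and this is not glibc code.)

    /* Keep the heavyweight part out of line so that the hot function
       needs no spills and no stack frame.  */
    static int __attribute__ ((noinline))
    do_once_slow (int *state)
    {
      /* ... slow path: atomics, futex wait/wake, run the init ... */
      return 0;
    }

    int
    do_once (int *state)
    {
      if (__builtin_expect ((*state & 2) != 0, 1))  /* already done */
        return 0;                         /* no frame needed here */
      return do_once_slow (state);  /* sibling call stays frameless */
    }

With this split, even a compiler without shrink wrapping compiles
do_once down to a test and a return on the common path.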

Rich

