This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386


On Sat, Oct 12, 2013 at 2:10 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Sat, Oct 12, 2013 at 11:40:36PM +0300, Torvald Riegel wrote:
>> On Fri, 2013-10-11 at 22:40 -0700, pinskia@gmail.com wrote:
>> >
>> > > On Oct 11, 2013, at 1:28 PM, Torvald Riegel <triegel@redhat.com> wrote:
>> > >
>> > > Assuming the pthread_once unification I sent recently is applied, we
>> > > still have custom x86_64 and i386 variants of pthread_once.  The
>> > > algorithm they use is the same as the unified variant, so we would be
>> > > able to remove the custom variants if this doesn't affect performance.
>> > >
>> > > The common case when pthread_once is executed is that the initialization
>> > > has already been performed; thus, this is the fast path that we can
>> > > focus on.  (I haven't looked specifically at the generated code for the
>> > > slow path, but the algorithm is the same and I assume that the overhead
>> > > of the synchronizing instructions and futex syscalls determines the
>> > > performance of it, not any differences between compiler-generated code
>> > > and the custom code.)
>> > >
>> > > The fast path of the custom assembler version:
>> > >    testl    $2, (%rdi)
>> > >    jz    1f
>> > >    xorl    %eax, %eax
>> > >    retq
>> > >
>> > > The fast path of the generic pthread_once C code, as it is after the
>> > > pthread_once unification patch:
>> > >  20:   48 89 5c 24 e8          mov    %rbx,-0x18(%rsp)
>> > >  25:   48 89 6c 24 f0          mov    %rbp,-0x10(%rsp)
>> > >  2a:   48 89 fb                mov    %rdi,%rbx
>> > >  2d:   4c 89 64 24 f8          mov    %r12,-0x8(%rsp)
>> > >  32:   48 89 f5                mov    %rsi,%rbp
>> > >  35:   48 83 ec 38             sub    $0x38,%rsp
>> > >  39:   41 b8 ca 00 00 00       mov    $0xca,%r8d
>> > >  3f:   8b 13                   mov    (%rbx),%edx
>> > >  41:   f6 c2 02                test   $0x2,%dl
>> > >  44:   74 16                   je     5c <__pthread_once+0x3c>
>> > >  46:   31 c0                   xor    %eax,%eax
>> > >  48:   48 8b 5c 24 20          mov    0x20(%rsp),%rbx
>> > >  4d:   48 8b 6c 24 28          mov    0x28(%rsp),%rbp
>> > >  52:   4c 8b 64 24 30          mov    0x30(%rsp),%r12
>> > >  57:   48 83 c4 38             add    $0x38,%rsp
>> > >  5b:   c3                      retq
>> >
>> > Seems like this is a good case where shrink wrapping should have
>> > helped.  What version of GCC did you try this with, and if it was
>> > 4.8 or later, can you file a bug for this missed optimization?
>>
>> I used gcc 4.4.
>>
>> If 4.8 generates a leaner fast path, would people think that this is
>> enough reason not to split out the fast path manually?
>
> I doubt gcc 4.8 does any better. gcc is _REALLY_ bad at optimizing
> functions which have "fast paths" (is this what "shrink wrapping"
> means?); it always ends up with a huge, bloated stack-frame setup in
> the prologue and epilogue even when the fast path does not need a
> stack frame at all. This seems to have gotten worse, not better, over
> time; I remember gcc 2.95 and/or 3.x being much less offensive in
> this regard. This is one place the libfirm/cparser project is already
> blowing gcc away...

I think you should be filing bugs about this rather than complaining
to this list.  Also, shrink wrapping works in some places and has been
improving with every GCC release since 4.8; GCC 4.9 has some
improvements too.  I know because I filed the original bug requesting
shrink wrapping (before I knew it was called shrink wrapping :) ).

Thanks,
Andrew


>
> Rich

