Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386
- From: Rich Felker <dalias@aerifal.cx>
- To: Torvald Riegel <triegel@redhat.com>
- Cc: pinskia@gmail.com, GLIBC Devel <libc-alpha@sourceware.org>, andi <andi@firstfloor.org>
- Date: Sat, 12 Oct 2013 17:10:01 -0400
- Subject: Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386
- References: <1381523328.18547.3422.camel@triegel.csb> <6DC33685-5DC2-4449-ADFE-C1696B949465@gmail.com> <1381610436.18547.3660.camel@triegel.csb>
On Sat, Oct 12, 2013 at 11:40:36PM +0300, Torvald Riegel wrote:
> On Fri, 2013-10-11 at 22:40 -0700, pinskia@gmail.com wrote:
> >
> > > On Oct 11, 2013, at 1:28 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > >
> > > Assuming the pthread_once unification I sent recently is applied, we
> > > still have custom x86_64 and i386 variants of pthread_once. The
> > > algorithm they use is the same as the unified variant, so we would be
> > > able to remove the custom variants if this doesn't affect performance.
> > >
> > > The common case when pthread_once is executed is that the initialization
> > > has already been performed; thus, this is the fast path that we can
> > > focus on. (I haven't looked specifically at the generated code for the
> > > slow path, but the algorithm is the same and I assume that the overhead
> > > of the synchronizing instructions and futex syscalls determines the
> > > performance of it, not any differences between compiler-generated code
> > > and the custom code.)
> > >
> > > The fast path of the custom assembler version:
> > > testl $2, (%rdi)
> > > jz 1f
> > > xorl %eax, %eax
> > > retq
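> > >
> > > For concreteness, the algorithm behind both variants is roughly the
> > > following. This is an illustrative sketch using C11 atomics, not the
> > > actual glibc source (the real code also handles fork generations and
> > > blocks on a futex instead of spinning); bit 1 of the state word, the
> > > "2" in the testl above, means initialization is done, and bit 0
> > > means it is in progress:
> > >
> > > #include <stdatomic.h>
> > >
> > > static int
> > > once_sketch (atomic_int *state, void (*init_routine) (void))
> > > {
> > >   /* Fast path: the "done" bit is already set.  */
> > >   if (atomic_load_explicit (state, memory_order_acquire) & 2)
> > >     return 0;
> > >
> > >   /* Slow path: try to claim the "in progress" bit.  */
> > >   int expected = 0;
> > >   if (atomic_compare_exchange_strong (state, &expected, 1))
> > >     {
> > >       /* We won the race: run the initializer, then publish "done".  */
> > >       init_routine ();
> > >       atomic_store_explicit (state, 2, memory_order_release);
> > >       /* The real code would futex-wake any waiters here.  */
> > >     }
> > >   else
> > >     /* Another thread is initializing; wait until it finishes
> > >        (the real code futex-waits instead of spinning).  */
> > >     while (!(atomic_load_explicit (state, memory_order_acquire) & 2))
> > >       ;
> > >   return 0;
> > > }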
> > >
> > > The fast path of the generic pthread_once C code, as it is after the
> > > pthread_once unification patch:
> > > 20: 48 89 5c 24 e8 mov %rbx,-0x18(%rsp)
> > > 25: 48 89 6c 24 f0 mov %rbp,-0x10(%rsp)
> > > 2a: 48 89 fb mov %rdi,%rbx
> > > 2d: 4c 89 64 24 f8 mov %r12,-0x8(%rsp)
> > > 32: 48 89 f5 mov %rsi,%rbp
> > > 35: 48 83 ec 38 sub $0x38,%rsp
> > > 39: 41 b8 ca 00 00 00 mov $0xca,%r8d
> > > 3f: 8b 13 mov (%rbx),%edx
> > > 41: f6 c2 02 test $0x2,%dl
> > > 44: 74 16 je 5c <__pthread_once+0x3c>
> > > 46: 31 c0 xor %eax,%eax
> > > 48: 48 8b 5c 24 20 mov 0x20(%rsp),%rbx
> > > 4d: 48 8b 6c 24 28 mov 0x28(%rsp),%rbp
> > > 52: 4c 8b 64 24 30 mov 0x30(%rsp),%r12
> > > 57: 48 83 c4 38 add $0x38,%rsp
> > > 5b: c3 retq
> >
> > Seems like this is a good case where shrink wrapping should have
> > helped. What version of GCC did you try this with, and if it was
> > 4.8 or later, can you file a bug for this missed optimization?
>
> I used gcc 4.4.
>
> If 4.8 generates a leaner fast path, would people think that this is
> enough reason to not split out the fast path manually?
I doubt gcc 4.8 does any better. gcc is _REALLY_ bad at optimizing
functions which have "fast paths" (is this what "shrink wrapping"
means?); it always ends up with a huge, bloated stack-frame setup in
the prologue and epilogue even when the fast path does not need a
stack frame at all. This seems to have gotten worse, not better, over
time; I remember gcc 2.95 and/or 3.x being much less offensive in this
regard. This is one place where the libfirm/cparser project is already
blowing gcc away...
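Illustratively, splitting the fast path out by hand would look
something like this; a sketch with made-up names, not the actual
glibc patch, and the real thing would need a proper atomic acquire
load rather than a volatile read:

#include <pthread.h>

/* Keep the cold path out of line so the hot function needs no frame.  */
__attribute__ ((noinline))
static int
once_slow (pthread_once_t *once_control, void (*init_routine) (void))
{
  /* Stand-in body for the sketch: delegate to the full implementation.  */
  return pthread_once (once_control, init_routine);
}

int
once_fast (pthread_once_t *once_control, void (*init_routine) (void))
{
  /* Simplified: assumes pthread_once_t is an int whose value-2 bit
     means "initialization done", as on x86_64 Linux.  */
  if (*(volatile int *) once_control & 2)
    return 0;                                     /* no frame needed */
  return once_slow (once_control, init_routine);  /* sibling call */
}

With -O2 the hot function should then compile down to essentially the
four instructions of the custom assembler version, since nothing on
that path forces a prologue.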
Rich