This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Save and restore xmm0-xmm7 in _dl_runtime_resolve
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: OndÅej BÃlka <neleai at seznam dot cz>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Date: Mon, 27 Jul 2015 06:14:07 -0700
- Subject: Re: [PATCH] Save and restore xmm0-xmm7 in _dl_runtime_resolve
- Authentication-results: sourceware.org; auth=none
- References: <CAMe9rOo81zoKpt+QmmVYjfjV2=KwLkYiqeADv3kMyeouM+9uug at mail dot gmail dot com> <0EFAB2BDD0F67E4FB6CCC8B9F87D75696A9235F6 at IRSMSX101 dot ger dot corp dot intel dot com> <CAMe9rOoXLPUr_LUexoRKjrCdNhP0J8EMY+1XNAaLnpW1qknb7w at mail dot gmail dot com> <20150709142827 dot GA18030 at domone> <CAMe9rOoXCwiPdQVP7_tV7599f6y9w_n1P+SXsE7urb69f3v7gA at mail dot gmail dot com> <20150711104654 dot GA26570 at domone> <20150711202742 dot GA9074 at gmail dot com> <20150711235002 dot GA7543 at gmail dot com> <20150726131622 dot GA10623 at domone> <CAMe9rOre_GQimKou2PXjp95xcfN1jYO5-tkEAB7eMbP1HMO+FQ at mail dot gmail dot com> <20150727101015 dot GA489 at domone>
On Mon, Jul 27, 2015 at 3:10 AM, OndÅej BÃlka <neleai@seznam.cz> wrote:
> On Sun, Jul 26, 2015 at 12:38:20PM -0700, H.J. Lu wrote:
>> On Sun, Jul 26, 2015 at 6:16 AM, OndÅej BÃlka <neleai@seznam.cz> wrote:
>> > On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
>> >> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
>> >> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, OndÅej BÃlka wrote:
>> >> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
>> >> > > > On Thu, Jul 9, 2015 at 7:28 AM, OndÅej BÃlka <neleai@seznam.cz> wrote:
>> >> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
>> >> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
>> >> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
>> >> > > > >> >> wrote:
>> >> > > > >> >> > Fixed in the attached patch
>> >> > > > >> >> >
>> >> > > > >> >>
>> >> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
>> >> > > > >> >> HAVE_MPX_SUPPORT. Please verify both with HAVE_MPX_SUPPORT and
>> >> > > > >> >> without on i386 and x86-64.
>> >> > > > >> >
>> >> > > > >> > Done, all works fine
>> >> > > > >> >
>> >> > > > >>
>> >> > > > >> I checked it in for you.
>> >> > > > >>
>> >> > > > > These are nice but you could have same problem with lazy tls allocation.
>> >> > > > > I wrote patch to merge trampolines, which now conflicts. Could you write
>> >> > > > > similar patch to solve that? Original purpose was to always save xmm
>> >> > > > > registers so we could use sse2 routines which speeds up lookup time.
>> >> > > >
>> >> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
>> >> > > > much gain it will give us?
>> >> > > >
>> >> > > I couldn't measure that without patch. Gain now would be big as we now
>> >> > > use byte-by-byte loop to check symbol name which is slow, especially
>> >> > > with c++ name mangling. Would be following benchmark good to measure
>> >> > > speedup or do I need to measure startup time which is bit harder?
>> >> > >
>> >> >
>> >> > Please try this.
>> >> >
>> >>
>> >> We have to use movups instead of movaps due to
>> >>
>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>> >>
>> >>
>> > Thanks, this looks promising.
>> >
>> > I think how to do definite benchmark, Now I have evidence that its
>> > likely improvement but not definite.
>> >
>> > I found that benchmark that i intended causes too much noise and I
>> > didn't get useful from that yet. It was creating 1000 functions in
>> > library and calling them from main where performance between runs vary
>> > by factor of 3 for same implementation.
>> >
>> > I have indirect evidence. With attached patch to use sse2 routines I
>> > decreased startup time of running binaries when you run "make bench"
>> > by ~6000 cycles and dlopen time by 4% on haswell and ivy bridge.
>> >
>> > See results on haswell of
>> >
>> > LD_DEBUG=statistics make bench &> old_rtld
>> >
>> > that are large so you could browse these here
>> >
>> > http://kam.mff.cuni.cz/~ondra/old_rtld
>> > http://kam.mff.cuni.cz/~ondra/new_rtld
>> >
>> > For dlopen benchmark I measure ten times performance of
>> > dlopen(RTLD_DEFAULT,"memcpy");
>> > dlopen(RTLD_DEFAULT,"strlen");
>> >
>> > Without patch I get
>> > 624.49 559.58 556.6 556.04 558.42 557.86 559.46 555.17 556.93 555.32
>> > and with patch
>> > 604.71 536.74 536.08 535.78 534.11 533.67 534.8 534.8 533.46 536.08
>> >
>> > I attached vip patches, I didn't change memcpy yet.
>> >
>> > So if you have idea how directly measure fixup change it would be
>> > welcome.
>> >
>>
>> There is a potential performance issue. This won't change parameters
>> passed in S256-bit/512-bit vector registers because SSE load will only
>> update the lower 128 bits of 256-bit/512-bit vector registers while
>> preserving the upper bits. But these SSE load operations may not be
>> fast on all current and future processors. To load the entire
>> 256-bit/512-bit vector registers, we need to check CPU feature in
>> each symbol lookup. On the other hand, we can compile x86-64 ld.so
>> with -msse2. I don't know what the final performance impact is.
>>
> Yes, these should be saved due problems with modes. There could be
> problem that saving these takes longer. You don't need
> check cpu features on each call.
> Make _dl_runtime_resolve a function pointer and on
> startup initialize it to correct variant.
One more indirect call.
> This depends if more complicated code is worth gain. On other hand I
> read original code to find it was also about fixing bug 15128 where
> ifunc resolver clobbers xmm registers, similar problem is with LD_AUDIT.
>
> These could be solved with this or by saving xmm registers only when
> ifunc is called. I don't know whats best maintainable solution.
That is true. Can you try compile ld.so with -msse2?
--
H.J.