Re: [PATCH] Save and restore xmm0-xmm7 in _dl_runtime_resolve


On Mon, Jul 27, 2015 at 06:14:07AM -0700, H.J. Lu wrote:
> On Mon, Jul 27, 2015 at 3:10 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Sun, Jul 26, 2015 at 12:38:20PM -0700, H.J. Lu wrote:
> >> On Sun, Jul 26, 2015 at 6:16 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> > On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
> >> >> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
> >> >> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
> >> >> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
> >> >> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> >> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> >> >> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> >> >> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> >> >> > > > >> >> wrote:
> >> >> > > > >> >> > Fixed in the attached patch
> >> >> > > > >> >> >
> >> >> > > > >> >>
> >> >> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
> >> >> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> >> >> > > > >> >> without on i386 and x86-64.
> >> >> > > > >> >
> >> >> > > > >> > Done, all works fine
> >> >> > > > >> >
> >> >> > > > >>
> >> >> > > > >> I checked it in for you.
> >> >> > > > >>
> >> >> > > > > These are nice, but you could have the same problem with lazy TLS
> >> >> > > > > allocation. I wrote a patch to merge trampolines, which now conflicts.
> >> >> > > > > Could you write a similar patch to solve that? The original purpose was
> >> >> > > > > to always save the xmm registers so that we could use SSE2 routines,
> >> >> > > > > which speeds up lookup time.
> >> >> > > >
> >> >> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
> >> >> > > > much gain will it give us?
> >> >> > > >
> >> >> > > I couldn't measure that without a patch. The gain now would be big, as we
> >> >> > > currently use a byte-by-byte loop to check the symbol name, which is slow,
> >> >> > > especially with C++ name mangling. Would the following benchmark be good
> >> >> > > for measuring the speedup, or do I need to measure startup time, which is
> >> >> > > a bit harder?
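(Illustrative aside, not the actual glibc lookup code: the name check boils down to a string comparison like the byte loop below, which an SSE2 strcmp comparing 16 bytes per iteration would replace.)

/* Sketch only: the symbol lookup ends up comparing the wanted name
   against candidate names from the string table.  A plain byte loop
   like this is what an optimized SSE2 strcmp would replace; long C++
   mangled names make the difference matter.  */
static int
byte_strcmp (const char *a, const char *b)
{
  while (*a != '\0' && *a == *b)
    {
      ++a;
      ++b;
    }
  return (unsigned char) *a - (unsigned char) *b;
}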
> >> >> > >
> >> >> >
> >> >> > Please try this.
> >> >> >
> >> >>
> >> >> We have to use movups instead of movaps due to
> >> >>
> >> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
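(Illustrative aside, in intrinsics terms rather than the trampoline assembly: the aligned form faults when the address is not 16-byte aligned, which that bug can cause for the stack pointer, while the unaligned form always works.)

#include <emmintrin.h>

/* Sketch only.  An aligned store (movaps/movdqa) requires a 16-byte
   aligned address and faults otherwise; because of the GCC bug above
   the stack may not be 16-byte aligned on entry to the trampoline, so
   the save/restore has to use unaligned stores/loads (movups/movdqu),
   which work for any address.  */
void
save_one_xmm (void *p, __m128i value)
{
  /* _mm_store_si128 ((__m128i *) p, value);   aligned: may fault  */
  _mm_storeu_si128 ((__m128i *) p, value);  /* unaligned: always safe */
}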
> >> >>
> >> >>
> >> > Thanks, this looks promising.
> >> >
> >> > I am thinking about how to do a definitive benchmark. For now I have
> >> > evidence that it is likely an improvement, but it is not yet definite.
> >> >
> >> > I found that the benchmark I intended to use causes too much noise, and I
> >> > haven't gotten anything useful from it yet. It created 1000 functions in a
> >> > library and called them from main, where performance between runs varies
> >> > by a factor of 3 for the same implementation.
> >> >
> >> > I have indirect evidence. With the attached patch to use SSE2 routines, I
> >> > decreased the startup time of the binaries run by "make bench" by ~6000
> >> > cycles, and dlopen time by 4%, on Haswell and Ivy Bridge.
> >> >
> >> > See the results on Haswell of
> >> >
> >> > LD_DEBUG=statistics make bench &> old_rtld
> >> >
> >> > which are large, so you can browse them here:
> >> >
> >> > http://kam.mff.cuni.cz/~ondra/old_rtld
> >> > http://kam.mff.cuni.cz/~ondra/new_rtld
> >> >
> >> > For the dlopen benchmark I measure, ten times, the performance of
> >> > dlsym(RTLD_DEFAULT, "memcpy");
> >> > dlsym(RTLD_DEFAULT, "strlen");
> >> >
> >> > Without the patch I get
> >> >  624.49  559.58  556.6 556.04  558.42  557.86  559.46  555.17  556.93  555.32
> >> > and with the patch
> >> >   604.71  536.74  536.08  535.78  534.11  533.67  534.8 534.8 533.46 536.08
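(For reference, a minimal harness of that shape; hypothetical code, not the exact one used for the numbers above, timing the two lookups with clock_gettime:)

#define _GNU_SOURCE            /* for RTLD_DEFAULT */
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

int
main (void)
{
  for (int run = 0; run < 10; run++)
    {
      struct timespec start, end;
      clock_gettime (CLOCK_MONOTONIC, &start);

      /* Each lookup walks the dynamic loader's symbol search, which is
         where the name comparisons discussed earlier happen.  */
      void *p = dlsym (RTLD_DEFAULT, "memcpy");
      void *q = dlsym (RTLD_DEFAULT, "strlen");

      clock_gettime (CLOCK_MONOTONIC, &end);
      printf ("%ld ns (%p %p)\n",
              (end.tv_sec - start.tv_sec) * 1000000000L
              + (end.tv_nsec - start.tv_nsec),
              p, q);
    }
  return 0;
}

Built with something like gcc -O2 bench.c -ldl.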
> >> >
> >> > I attached WIP patches; I didn't change memcpy yet.
> >> >
> >> > So if you have an idea of how to directly measure the fixup change, it
> >> > would be welcome.
> >> >
> >>
> >> There is a potential performance issue.  This won't change parameters
> >> passed in 256-bit/512-bit vector registers, because an SSE load will only
> >> update the lower 128 bits of a 256-bit/512-bit vector register while
> >> preserving the upper bits.  But these SSE load operations may not be
> >> fast on all current and future processors.  To load the entire
> >> 256-bit/512-bit vector registers, we need to check CPU features in
> >> each symbol lookup.  On the other hand, we can compile x86-64 ld.so
> >> with -msse2.  I don't know what the final performance impact is.
> >>
> > Yes, these should be saved, due to problems with the modes. There could be a
> > problem that saving them takes longer. You don't need to
> > check CPU features on each call.
> > Make _dl_runtime_resolve a function pointer and initialize it to the
> > correct variant at startup.
> 
> One more indirect call.
>
No, my proposal is different. We could do this:

/* has_avx() is a placeholder for a one-time CPU feature check.  */
extern void _dl_runtime_resolve_avx (void);
extern void _dl_runtime_resolve_sse (void);

void (*_dl_runtime_resolve) (void);

void
startup (void)
{
  if (has_avx ())
    _dl_runtime_resolve = _dl_runtime_resolve_avx;
  else
    _dl_runtime_resolve = _dl_runtime_resolve_sse;
}

Then we will assign the correct variant once, at startup.
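For the has_avx check itself, a one-time detection could be as simple as this sketch (using GCC's __builtin_cpu_supports; glibc would read its own cpu-features data, so the name here is illustrative):

#include <stdbool.h>

/* Sketch of a one-time check; a complete check would also verify via
   XGETBV that the OS saves and restores the ymm state.  */
static bool
has_avx (void)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx") != 0;
}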
 

