This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
> > On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
> > > On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
> > > > On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > > > > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> > > > >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> > > > >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> > > > >> >> wrote:
> > > > >> >> > Fixed in the attached patch
> > > > >> >>
> > > > >> >> I fixed some typos and updated sysdeps/i386/configure for
> > > > >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> > > > >> >> without on i386 and x86-64.
> > > > >> >
> > > > >> > Done, all works fine
> > > > >>
> > > > >> I checked it in for you.
> > > > >>
> > > > > These are nice, but you could have the same problem with lazy TLS
> > > > > allocation.  I wrote a patch to merge the trampolines, which now
> > > > > conflicts.  Could you write a similar patch to solve that?  The
> > > > > original purpose was to always save the xmm registers so that we
> > > > > could use SSE2 routines, which speeds up lookup time.
> > > >
> > > > So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve?  How
> > > > much gain will it give us?
> > > >
> > > I couldn't measure that without a patch.  The gain now would be big,
> > > since we currently use a byte-by-byte loop to check the symbol name,
> > > which is slow, especially with C++ name mangling.  Would the following
> > > benchmark be good for measuring the speedup, or do I need to measure
> > > startup time, which is a bit harder?
> > >
> > Please try this.
>
> We have to use movups instead of movaps due to
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066

Thanks, this looks promising. I am still thinking about how to do a
definitive benchmark; for now I have evidence that this is likely an
improvement, but nothing definite.
I found that the benchmark I originally intended causes too much noise,
and I haven't gotten anything useful from it yet. It created 1000
functions in a library and called them from main, but performance varied
between runs by a factor of 3 for the same implementation.

I do have indirect evidence, though. With the attached patch to use SSE2
routines, I decreased the startup time of the binaries run by "make
bench" by ~6000 cycles, and dlopen time by 4%, on Haswell and Ivy
Bridge. The Haswell results of LD_DEBUG=statistics make bench &> old_rtld
are large, so you can browse them here:

http://kam.mff.cuni.cz/~ondra/old_rtld
http://kam.mff.cuni.cz/~ondra/new_rtld

For the dlopen benchmark I measure, ten times, the performance of:

dlsym(RTLD_DEFAULT, "memcpy");
dlsym(RTLD_DEFAULT, "strlen");

Without the patch I get:

624.49 559.58 556.6 556.04 558.42 557.86 559.46 555.17 556.93 555.32

and with the patch:

604.71 536.74 536.08 535.78 534.11 533.67 534.8 534.8 533.46 536.08

I attached the WIP patches; I didn't change memcpy yet. If you have an
idea how to measure the fixup change directly, it would be welcome.
Attachment: 0002-dlopen-benchmark.patch (text document)
Attachment: 0004-rtld.patch (text document)