This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]


On 2017.10.20 at 04:11 -0700, H.J. Lu wrote:
> On Fri, Oct 20, 2017 at 12:24 AM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > On 2017.10.19 at 15:36 -0700, H.J. Lu wrote:
> >> On Thu, Oct 19, 2017 at 2:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> >> > On 10/19/2017 10:41 AM, H.J. Lu wrote:
> >> >> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
> >> >> mask and bound registers.  It simplifies _dl_runtime_resolve and supports
> >> >> different calling conventions.  ld.so code size is reduced by more than
> >> >> 1 KB.  However, using fxsave/xsave/xsavec takes a few more cycles
> >> >> than saving and restoring vector and bound registers individually.
> >> >>
> >> >> Latency for _dl_runtime_resolve to look up the function, foo, from one
> >> >> shared library plus libc.so:
> >> >>
> >> >>                              Before    After     Change
> >> >>
> >> >> Westmere (SSE)/fxsave         345      866       151%
> >> >> IvyBridge (AVX)/xsave         420      643       53%
> >> >> Haswell (AVX)/xsave           713      1252      75%
> >> >> Skylake (AVX+MPX)/xsavec      559      719       28%
> >> >> Skylake (AVX512+MPX)/xsavec   145      272       87%
> >> >
> >> > This is a good baseline, but as you note, the change may not be observable
> >> > in any real world programs.
> >> >
> >> > The case I made to David Kreitzer here:
> >> > https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
> >> > ~~~
> >> >   ... Alternatively a more detailed performance analysis of
> >> >   the impact on applications that don't use __regcall is required before adding
> >> >   instructions to the hot path of the average application (or removing their use
> >> >   in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
> >> >   on hardware that supports those vector registers).
> >> > ~~~
> >> >
> >> >> This is the worst case, where the portion of time spent saving and
> >> >> restoring registers is larger than in the majority of cases.  With the
> >> >> smaller _dl_runtime_resolve code size, the overall performance impact
> >> >> is negligible.
> >> >>
> >> >> On IvyBridge, differences in build and test time of binutils with lazy
> >> >> binding GCC and binutils are within the noise.  On Westmere, differences
> >> >> in bootstrap and "make check" time of GCC 7 with lazy binding GCC and
> >> >> binutils are also within the noise.
> >> > Do you have any statistics on the timing for large applications that
> >> > use a lot of libraries? I don't see gcc, binutils, or glibc as indicative
> >> > of the complexity of shared libraries in terms of loaded shared libraries.
> >>
> >> _dl_runtime_resolve is only called once, when an external function is
> >> called the first time.  Having many shared libraries isn't a problem
> >> unless all execution time is spent in _dl_runtime_resolve.  I don't
> >> believe this is typical behavior.
> >>
> >> > Something like libreoffice's soffice.bin has 142 DSOs, or chrome's
> >> > 103 DSOs. It might be hard to measure if the lazy resolution is impacting
> >> > the performance or if you are hitting some other performance boundary, but
> >> > a black-box test showing performance didn't get *worse* for startup and
> >> > exit would mean it isn't the bottleneck (but might be some day). To test
> >> > this you should be able to use libreoffice's CLI arguments to batch process
> >> > some files and time that (or the --cat files option).
> >
> > I did some testing on my old SSE only machine and everything is in the
> > noise. For example:
> >
> >  ~ % ldd /usr/lib64/libreoffice/program/soffice.bin | wc -l
> > 105
> >  ~ % hardening-check /usr/lib64/libreoffice/program/soffice.bin
> > /usr/lib64/libreoffice/program/soffice.bin:
> >  Position Independent Executable: no, normal executable!
> >  Stack protected: no, not found!
> >  Fortify Source functions: no, not found!
> >  Read-only relocations: yes
> >  Immediate binding: no, not found!
> 
> I have
> 
> [hjl@gnu-6 tmp]$ readelf -d  /usr/lib64/libreoffice/program/soffice.bin
> 
> Dynamic section at offset 0xdb8 contains 27 entries:
>   Tag        Type                         Name/Value
>  0x0000000000000001 (NEEDED)             Shared library: [libuno_sal.so.3]
>  0x0000000000000001 (NEEDED)             Shared library: [libsofficeapp.so]
>  0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
>  0x000000000000000f (RPATH)              Library rpath: [$ORIGIN]
>  0x000000000000000c (INIT)               0x710
>  0x000000000000000d (FINI)               0x904
>  0x0000000000000019 (INIT_ARRAY)         0x200da0
>  0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
>  0x000000000000001a (FINI_ARRAY)         0x200da8
>  0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
>  0x000000006ffffef5 (GNU_HASH)           0x298
>  0x0000000000000005 (STRTAB)             0x478
>  0x0000000000000006 (SYMTAB)             0x2e0
>  0x000000000000000a (STRSZ)              301 (bytes)
>  0x000000000000000b (SYMENT)             24 (bytes)
>  0x0000000000000015 (DEBUG)              0x0
>  0x0000000000000003 (PLTGOT)             0x200fa8
>  0x0000000000000007 (RELA)               0x608
>  0x0000000000000008 (RELASZ)             264 (bytes)
>  0x0000000000000009 (RELAENT)            24 (bytes)
>  0x0000000000000018 (BIND_NOW)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   _dl_runtime_resolve isn't
> used at all.

Yes. That is why I posted the hardening-check output:
"Immediate binding: no, not found!" means that "-z lazy" was used in my
case.
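
The distinction above can be checked mechanically. What follows is a minimal
sketch, assuming the text layout that "readelf -d" prints for dynamic entries;
the helper name is_bind_now and the two sample strings are hypothetical and
only illustrate the idea. An object is treated as immediately bound if its
dynamic section carries a BIND_NOW entry, a FLAGS entry containing BIND_NOW,
or a FLAGS_1 entry containing NOW. Otherwise lazy binding applies, and calls
through the PLT reach _dl_runtime_resolve on first use.

```python
import re

# Matches one dynamic-section line as printed by "readelf -d", e.g.
#  0x0000000000000018 (BIND_NOW)
#  0x000000006ffffffb (FLAGS_1)            Flags: NOW
ENTRY = re.compile(r"\s*0x[0-9a-fA-F]+\s+\((\w+)\)\s*(.*)$")


def is_bind_now(readelf_dynamic: str) -> bool:
    """Return True if the dynamic section requests immediate binding."""
    for line in readelf_dynamic.splitlines():
        m = ENTRY.match(line)
        if not m:
            continue  # header lines, blank lines, etc.
        tag, value = m.group(1), m.group(2)
        if tag == "BIND_NOW":
            return True
        if tag == "FLAGS" and "BIND_NOW" in value.split():
            return True
        if tag == "FLAGS_1" and "NOW" in value.replace("Flags:", "").split():
            return True
    return False


# Hypothetical samples: the first mirrors the BIND_NOW entry quoted above,
# the second is a dynamic section without any such entry.
now_sample = """\
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000018 (BIND_NOW)
"""
lazy_sample = """\
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000003 (PLTGOT)             0x200fa8
"""

for name, text in (("now_sample", now_sample), ("lazy_sample", lazy_sample)):
    print(name, "immediate" if is_bind_now(text) else "lazy")
```

To use it on a real binary, feed it the output of "readelf -d <file>". Note
that this only reflects how the object was linked: setting LD_BIND_NOW in the
environment forces immediate binding at run time regardless.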

-- 
Markus

