This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]


On 10/19/2017 10:41 AM, H.J. Lu wrote:
> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
> mask and bound registers.  It simplifies _dl_runtime_resolve and supports
> different calling conventions.  ld.so code size is reduced by more than
> 1 KB.  However, using fxsave/xsave/xsavec takes slightly more cycles
> than saving and restoring vector and bound registers individually.
> 
> Latency for _dl_runtime_resolve to lookup the function, foo, from one
> shared library plus libc.so:
> 
>                              Before    After     Change
> 
> Westmere (SSE)/fxsave         345      866       151%
> IvyBridge (AVX)/xsave         420      643       53%
> Haswell (AVX)/xsave           713      1252      75%
> Skylake (AVX+MPX)/xsavec      559      719       28%
> Skylake (AVX512+MPX)/xsavec   145      272       87%

This is a good baseline, but as you note, the change may not be observable
in any real-world programs.

The case I made to David Kreitzer here:
https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
~~~
  ... Alternatively a more detailed performance analysis of
  the impact on applications that don't use __regcall is required before adding
  instructions to the hot path of the average application (or removing their use
  in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
  on hardware that supports those vector registers).
~~~
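One concrete way to gather such numbers is to ask the dynamic loader itself. A sketch (this assumes a glibc system; the exact report format varies between glibc versions):

```shell
# Ask glibc's ld.so for its internal timing statistics; the report
# includes total startup time in the dynamic loader and the time
# spent processing relocations.  The report goes to stderr, so
# redirect it to inspect or filter it.
LD_DEBUG=statistics /bin/true 2>&1 | grep -i 'time'
```

Running this against an application built with and without the patch would show whether the loader-side cost moves outside the noise.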

> This is the worst case, where the portion of time spent saving and
> restoring registers is larger than in the majority of cases.  With the
> smaller _dl_runtime_resolve code size, the overall performance impact is
> negligible.
> 
> On IvyBridge, differences in build and test time of binutils with lazy
> binding GCC and binutils are in the noise.  On Westmere, differences in
> bootstrap and "make check" time of GCC 7 with lazy binding GCC and
> binutils are also in the noise.

Do you have any statistics on the timing for large applications that
use a lot of libraries? I don't see gcc, binutils, or glibc as
representative of applications that load a large number of shared libraries.

Something like libreoffice's soffice.bin loads 142 DSOs, and chrome loads
103. It might be hard to tell whether lazy resolution is impacting
performance or whether you are hitting some other performance boundary,
but a black-box test showing that startup and exit performance didn't get
*worse* would mean it isn't the bottleneck (though it might be some day).
To test this you should be able to use libreoffice's CLI arguments to
batch process some files and time that (or the --cat files option).
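A rough black-box comparison along those lines might look like the following sketch, where /bin/ls stands in for soffice.bin (a real test would batch-process documents as described above):

```shell
# Count the DSOs the dynamic loader will map for a binary, then
# compare wall-clock time with lazy binding (the default) against
# LD_BIND_NOW=1, which forces all PLT relocations at startup.
BIN=/bin/ls                          # stand-in for soffice.bin

# ldd lists each resolved dependency with a '=>' arrow.
ndso=$(ldd "$BIN" | grep -c '=>')
echo "$BIN maps $ndso DSOs"

# Lazy binding vs. eager binding of the same workload.
time "$BIN" >/dev/null
time LD_BIND_NOW=1 "$BIN" >/dev/null
```

If the two timings are indistinguishable for a many-DSO application, lazy resolution (and hence _dl_runtime_resolve) is not where the time is going.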

If we can show that the above latency is in the noise for real applications
using many DSOs, then that makes a stronger case for supporting the
alternate calling conventions.

-- 
Cheers,
Carlos.

