This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]


On Fri, Oct 20, 2017 at 12:24 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2017.10.19 at 15:36 -0700, H.J. Lu wrote:
>> On Thu, Oct 19, 2017 at 2:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
>> > On 10/19/2017 10:41 AM, H.J. Lu wrote:
>> >> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
>> >> mask and bound registers.  It simplifies _dl_runtime_resolve and supports
>> >> different calling conventions.  ld.so code size is reduced by more than
>> >> 1 KB.  However, using fxsave/xsave/xsavec takes a few more cycles
>> >> than saving and restoring vector and bound registers individually.
>> >>
>> >> Latency for _dl_runtime_resolve to look up the function, foo, from one
>> >> shared library plus libc.so:
>> >>
>> >>                              Before    After     Change
>> >>
>> >> Westmere (SSE)/fxsave         345      866       151%
>> >> IvyBridge (AVX)/xsave         420      643       53%
>> >> Haswell (AVX)/xsave           713      1252      75%
>> >> Skylake (AVX+MPX)/xsavec      559      719       28%
>> >> Skylake (AVX512+MPX)/xsavec   145      272       87%
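
The Change column above is consistent with the relative increase, (After - Before) / Before, truncated to a whole percent. The short Python check below reproduces it, assuming truncation rather than rounding (truncation is what matches all five rows). The helper name percent_increase is made up for this illustration; the figures are copied from the table.

```python
def percent_increase(before, after):
    """Relative increase of `after` over `before`, truncated to a whole percent."""
    return (after - before) * 100 // before


# (Before, After) pairs copied from the latency table above.
measurements = {
    "Westmere (SSE)/fxsave": (345, 866),
    "IvyBridge (AVX)/xsave": (420, 643),
    "Haswell (AVX)/xsave": (713, 1252),
    "Skylake (AVX+MPX)/xsavec": (559, 719),
    "Skylake (AVX512+MPX)/xsavec": (145, 272),
}

for name, (before, after) in measurements.items():
    print(f"{name:30s} {before:5d} {after:5d} {percent_increase(before, after):4d}%")
```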
>> >
>> > This is a good baseline, but as you note, the change may not be observable
>> > in any real world programs.
>> >
>> > The case I made to David Kreitzer here:
>> > https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
>> > ~~~
>> >   ... Alternatively a more detailed performance analysis of
>> >   the impact on applications that don't use __regcall is required before adding
>> >   instructions to the hot path of the average application (or removing their use
>> >   in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
>> >   on hardware that supports those vector registers).
>> > ~~~
>> >
>> >> This is the worst case, where the portion of time spent saving and
>> >> restoring registers is larger than in the majority of cases.  With the
>> >> smaller _dl_runtime_resolve code size, the overall performance impact
>> >> is negligible.
>> >>
>> >> On IvyBridge, the differences in build and test time of binutils with
>> >> lazy-binding GCC and binutils are in the noise.  On Westmere, the
>> >> differences in bootstrap and "make check" time of GCC 7 with
>> >> lazy-binding GCC and binutils are also in the noise.
>> >
>> > Do you have any statistics on the timing for large applications that
>> > use a lot of libraries? I don't see gcc, binutils, or glibc as indicative
>> > of the complexity of shared libraries in terms of loaded shared libraries.
>>
>> _dl_runtime_resolve is only called once per external function, the
>> first time that function is called.  Having many shared libraries isn't
>> a problem unless all execution time is spent in _dl_runtime_resolve.
>> I don't believe this is typical behavior.
>>
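
The "called once per function" behavior described above can be sketched with a toy model in Python. This is only an illustration of the idea, not how ld.so is implemented: the class LazyLinker, its got dictionary, and the resolve_calls counter are all made-up names. Each symbol's slot starts out pointing at a stub that plays the role of the slow resolver path; the first call patches the slot, so later calls go straight to the target.

```python
class LazyLinker:
    """Toy model of lazy binding: one resolver call per symbol, ever."""

    def __init__(self, symbols):
        # symbols maps a name to its real function (stands in for the
        # definitions found in the loaded shared libraries).
        self._symbols = symbols
        self.resolve_calls = 0
        # Every "GOT slot" initially points at a stub that resolves on demand.
        self.got = {name: self._make_stub(name) for name in symbols}

    def _make_stub(self, name):
        def stub(*args):
            # Slow path: the analogue of entering the runtime resolver.
            self.resolve_calls += 1
            target = self._symbols[name]
            self.got[name] = target  # patch the slot for all later calls
            return target(*args)
        return stub

    def call(self, name, *args):
        # Calls always go through the slot, like a call through the PLT.
        return self.got[name](*args)


linker = LazyLinker({"foo": lambda x: x + 1, "bar": lambda x: x * 2})
results = [linker.call("foo", i) for i in range(1000)]
linker.call("bar", 3)
print(linker.resolve_calls)  # 2: one resolution each for foo and bar
```

In this model the resolver cost scales with the number of distinct functions called, not with the number of calls or the number of libraries, which is the point made in the paragraph above.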
>> > Something like libreoffice's soffice.bin has 142 DSOs, or chrome's
>> > 103 DSOs. It might be hard to measure if the lazy resolution is impacting
>> > the performance or if you are hitting some other performance boundary, but
>> > a black-box test showing performance didn't get *worse* for startup and
>> > exit would mean it isn't the bottleneck (but might be some day). To test
>> > this you should be able to use libreoffice's CLI arguments to batch process
>> > some files and time that (or the --cat files option).
>
> I did some testing on my old SSE-only machine and everything is in the
> noise. For example:
>
>  ~ % ldd /usr/lib64/libreoffice/program/soffice.bin | wc -l
> 105
>  ~ % hardening-check /usr/lib64/libreoffice/program/soffice.bin
> /usr/lib64/libreoffice/program/soffice.bin:
>  Position Independent Executable: no, normal executable!
>  Stack protected: no, not found!
>  Fortify Source functions: no, not found!
>  Read-only relocations: yes
>  Immediate binding: no, not found!

I have

[hjl@gnu-6 tmp]$ readelf -d  /usr/lib64/libreoffice/program/soffice.bin

Dynamic section at offset 0xdb8 contains 27 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libuno_sal.so.3]
 0x0000000000000001 (NEEDED)             Shared library: [libsofficeapp.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN]
 0x000000000000000c (INIT)               0x710
 0x000000000000000d (FINI)               0x904
 0x0000000000000019 (INIT_ARRAY)         0x200da0
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x200da8
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x298
 0x0000000000000005 (STRTAB)             0x478
 0x0000000000000006 (SYMTAB)             0x2e0
 0x000000000000000a (STRSZ)              301 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x200fa8
 0x0000000000000007 (RELA)               0x608
 0x0000000000000008 (RELASZ)             264 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x0000000000000018 (BIND_NOW)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   _dl_runtime_resolve isn't
used at all.
 0x000000006ffffffb (FLAGS_1)            Flags: NOW ORIGIN PIE
 0x000000006ffffffe (VERNEED)            0x5c8
 0x000000006fffffff (VERNEEDNUM)         2
 0x000000006ffffff0 (VERSYM)             0x5a6
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0
[hjl@gnu-6 tmp]$
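
The same check can be scripted. The Python sketch below looks for the three markers that request immediate binding in the readelf output above: a DT_BIND_NOW entry, DF_BIND_NOW in DT_FLAGS, or DF_1_NOW in DT_FLAGS_1. The tag and flag values are the standard ELF constants; the function names read_dynamic and binds_now are made up for this illustration, and the sketch assumes a little-endian ELF64 file such as an x86-64 binary.

```python
import struct

# Standard ELF constants (see <elf.h>).
PT_DYNAMIC = 2
DT_NULL, DT_BIND_NOW, DT_FLAGS, DT_FLAGS_1 = 0, 24, 30, 0x6FFFFFFB
DF_BIND_NOW, DF_1_NOW = 0x8, 0x1

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")          # Elf64_Phdr, 56 bytes
DYN = struct.Struct("<qQ")                 # Elf64_Dyn: d_tag, d_val


def read_dynamic(path):
    """Return the raw bytes of the PT_DYNAMIC segment, or b"" if absent."""
    with open(path, "rb") as f:
        data = f.read()
    ehdr = EHDR.unpack_from(data, 0)
    ident, phoff, phentsize, phnum = ehdr[0], ehdr[5], ehdr[9], ehdr[10]
    # ident[4] == 2 is ELFCLASS64, ident[5] == 1 is little-endian.
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    for i in range(phnum):
        p_type, _, p_offset, _, _, p_filesz, _, _ = PHDR.unpack_from(
            data, phoff + i * phentsize)
        if p_type == PT_DYNAMIC:
            return data[p_offset:p_offset + p_filesz]
    return b""


def binds_now(dynamic):
    """True if the dynamic section requests immediate (non-lazy) binding."""
    usable = len(dynamic) - len(dynamic) % DYN.size
    for tag, val in DYN.iter_unpack(dynamic[:usable]):
        if tag == DT_NULL:
            break  # end of the dynamic array
        if tag == DT_BIND_NOW:
            return True
        if tag == DT_FLAGS and val & DF_BIND_NOW:
            return True
        if tag == DT_FLAGS_1 and val & DF_1_NOW:
            return True
    return False
```

Calling binds_now(read_dynamic(path)) on a binary linked like the one above should return True, in which case calls are resolved at load time and the lazy path is not exercised.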

> (with H.J.'s patch)
>  Performance counter stats for '/var/tmp/glibc-build/elf/ld.so /usr/lib64/libreoffice/program/soffice.bin --convert-to pdf kandide.odt' (4 runs):
>
>        2463.681675      task-clock (msec)         #    1.040 CPUs utilized            ( +-  0.06% )
>                414      context-switches          #    0.168 K/sec                    ( +-  8.88% )
>                 10      cpu-migrations            #    0.004 K/sec                    ( +- 11.98% )
>             28,227      page-faults               #    0.011 M/sec                    ( +-  0.04% )
>      7,823,762,346      cycles                    #    3.176 GHz                      ( +-  0.15% )  (67.30%)
>      1,360,335,356      stalled-cycles-frontend   #   17.39% frontend cycles idle     ( +-  0.51% )  (66.78%)
>      2,090,675,875      stalled-cycles-backend    #   26.72% backend cycles idle      ( +-  1.02% )  (66.70%)
>      8,984,501,079      instructions              #    1.15  insn per cycle
>                                                   #    0.23  stalled cycles per insn  ( +-  0.11% )  (66.96%)
>      1,866,843,047      branches                  #  757.745 M/sec                    ( +-  0.28% )  (67.25%)
>         73,973,482      branch-misses             #    3.96% of all branches          ( +-  0.15% )  (67.37%)
>
>        2.368775642 seconds time elapsed                                          ( +-  0.21% )
>
> (without)
>  Performance counter stats for '/usr/lib64/libreoffice/program/soffice.bin --convert-to pdf kandide.odt' (4 runs):
>
>        2467.698417      task-clock (msec)         #    1.040 CPUs utilized            ( +-  0.23% )
>                540      context-switches          #    0.219 K/sec                    ( +- 17.02% )
>                 12      cpu-migrations            #    0.005 K/sec                    ( +- 14.85% )
>             28,245      page-faults               #    0.011 M/sec                    ( +-  0.02% )
>      7,806,607,838      cycles                    #    3.164 GHz                      ( +-  0.09% )  (67.06%)
>      1,338,588,952      stalled-cycles-frontend   #   17.15% frontend cycles idle     ( +-  0.30% )  (66.99%)
>      2,103,802,012      stalled-cycles-backend    #   26.95% backend cycles idle      ( +-  0.77% )  (66.92%)
>      9,012,688,271      instructions              #    1.15  insn per cycle
>                                                   #    0.23  stalled cycles per insn  ( +-  0.14% )  (67.02%)
>      1,870,634,478      branches                  #  758.048 M/sec                    ( +-  0.31% )  (67.19%)
>         73,921,605      branch-misses             #    3.95% of all branches          ( +-  0.13% )  (67.08%)
>
>        2.373621006 seconds time elapsed                                          ( +-  0.27% )
>
>
> Compile times using a clang that was built with shared libs also don't
> change at all.
>
> --
> Markus



-- 
H.J.

