This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]
On Fri, Oct 20, 2017 at 12:24 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2017.10.19 at 15:36 -0700, H.J. Lu wrote:
>> On Thu, Oct 19, 2017 at 2:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
>> > On 10/19/2017 10:41 AM, H.J. Lu wrote:
>> >> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
>> >> mask and bound registers.  This simplifies _dl_runtime_resolve and supports
>> >> different calling conventions.  ld.so code size is reduced by more than
>> >> 1 KB.  However, using fxsave/xsave/xsavec takes slightly more cycles
>> >> than saving and restoring vector and bound registers individually.
>> >>
>> >> Latency for _dl_runtime_resolve to lookup the function, foo, from one
>> >> shared library plus libc.so:
>> >>
>> >>                               Before  After  Change
>> >>
>> >> Westmere (SSE)/fxsave            345    866    151%
>> >> IvyBridge (AVX)/xsave            420    643     53%
>> >> Haswell (AVX)/xsave              713   1252     75%
>> >> Skylake (AVX+MPX)/xsavec         559    719     28%
>> >> Skylake (AVX512+MPX)/xsavec      145    272     87%
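[Which of the three instructions ends up being used follows the CPU's
feature bits: xsavec where available, otherwise xsave, otherwise fxsave
on SSE-only machines.  A rough shell sketch of that selection, reading
the kernel-exported feature flags -- this only approximates the
cpuid-based probing ld.so actually does, and the `save` variable name
is arbitrary:]

```shell
# Pick the register-save instruction roughly the way ld.so would:
# prefer xsavec, fall back to xsave, then to fxsave (SSE-only CPUs).
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null)
case "$flags" in
  *xsavec*) save=xsavec ;;   # compacted xsave, smallest state area
  *xsave*)  save=xsave  ;;   # standard-format xsave
  *)        save=fxsave ;;   # legacy x87/SSE-only save
esac
echo "$save"
```

On a machine without /proc/cpuinfo (or without any xsave support) this
falls through to the fxsave case.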
>> >
>> > This is a good baseline, but as you note, the change may not be observable
>> > in any real world programs.
>> >
>> > The case I made to David Kreitzer here:
>> > https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
>> > ~~~
>> > ... Alternatively a more detailed performance analysis of
>> > the impact on applications that don't use __regcall is required before adding
>> > instructions to the hot path of the average application (or removing their use
>> > in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
>> > on hardware that supports those vector registers).
>> > ~~~
>> >
>> >> This is the worst case, where the portion of time spent saving and
>> >> restoring registers is larger than in the majority of cases.  With the
>> >> smaller _dl_runtime_resolve code size, the overall performance impact is
>> >> negligible.
>> >>
>> >> On IvyBridge, differences in build and test time of binutils with lazy
>> >> binding GCC and binutils are in the noise.  On Westmere, differences in
>> >> bootstrap and "make check" time of GCC 7 with lazy binding GCC and
>> >> binutils are also in the noise.
>> > Do you have any statistics on the timing for large applications that
>> > use a lot of libraries? I don't see gcc, binutils, or glibc as indicative
>> > of the complexity of shared libraries in terms of loaded shared libraries.
>>
>> _dl_runtime_resolve is only called once, the first time an external
>> function is called.  Having many shared libraries isn't a problem unless
>> a large share of execution time is spent in _dl_runtime_resolve, and I
>> don't believe that is typical behavior.
>>
>> > Something like libreoffice's soffice.bin has 142 DSOs, or chrome's
>> > 103 DSOs. It might be hard to measure if the lazy resolution is impacting
>> > the performance or if you are hitting some other performance boundary, but
>> > a black-box test showing performance didn't get *worse* for startup and
>> > exit would mean it isn't the bottleneck (but might be some day). To test
>> > this you should be able to use libreoffice's CLI arguments to batch process
>> > some files and time that (or the --cat files option).
>
> I did some testing on my old SSE only machine and everything is in the
> noise. For example:
>
> ~ % ldd /usr/lib64/libreoffice/program/soffice.bin | wc -l
> 105
> ~ % hardening-check /usr/lib64/libreoffice/program/soffice.bin
> /usr/lib64/libreoffice/program/soffice.bin:
> Position Independent Executable: no, normal executable!
> Stack protected: no, not found!
> Fortify Source functions: no, not found!
> Read-only relocations: yes
> Immediate binding: no, not found!
I have
[hjl@gnu-6 tmp]$ readelf -d /usr/lib64/libreoffice/program/soffice.bin
Dynamic section at offset 0xdb8 contains 27 entries:
  Tag                Type                 Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libuno_sal.so.3]
 0x0000000000000001 (NEEDED)             Shared library: [libsofficeapp.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN]
 0x000000000000000c (INIT)               0x710
 0x000000000000000d (FINI)               0x904
 0x0000000000000019 (INIT_ARRAY)         0x200da0
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x200da8
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x298
 0x0000000000000005 (STRTAB)             0x478
 0x0000000000000006 (SYMTAB)             0x2e0
 0x000000000000000a (STRSZ)              301 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x200fa8
 0x0000000000000007 (RELA)               0x608
 0x0000000000000008 (RELASZ)             264 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x0000000000000018 (BIND_NOW)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ _dl_runtime_resolve isn't used at all.
 0x000000006ffffffb (FLAGS_1)            Flags: NOW ORIGIN PIE
 0x000000006ffffffe (VERNEED)            0x5c8
 0x000000006fffffff (VERNEEDNUM)         2
 0x000000006ffffff0 (VERSYM)             0x5a6
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0
[hjl@gnu-6 tmp]$
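[Whether a given binary will exercise _dl_runtime_resolve at all can be
checked the same way: a BIND_NOW dynamic tag (or a NOW flag in
FLAGS/FLAGS_1) means every PLT entry is resolved at startup, so the
lazy path is never taken.  A sketch of such a check against freshly
built binaries; the /tmp file names are arbitrary:]

```shell
# Build the same trivial program with eager and lazy binding
# and see which one carries the BIND_NOW / NOW dynamic tags.
cat > /tmp/bindcheck.c <<'EOF'
int main(void) { return 0; }
EOF
cc -Wl,-z,now  -o /tmp/bind_now  /tmp/bindcheck.c
cc -Wl,-z,lazy -o /tmp/bind_lazy /tmp/bindcheck.c
readelf -d /tmp/bind_now  | grep -E 'BIND_NOW|FLAGS.*NOW'
readelf -d /tmp/bind_lazy | grep -E 'BIND_NOW|FLAGS.*NOW' || echo "lazy binding"
```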
> (with H.J.'s patch)
> Performance counter stats for '/var/tmp/glibc-build/elf/ld.so /usr/lib64/libreoffice/program/soffice.bin --convert-to pdf kandide.odt' (4 runs):
>
> 2463.681675 task-clock (msec) # 1.040 CPUs utilized ( +- 0.06% )
> 414 context-switches # 0.168 K/sec ( +- 8.88% )
> 10 cpu-migrations # 0.004 K/sec ( +- 11.98% )
> 28,227 page-faults # 0.011 M/sec ( +- 0.04% )
> 7,823,762,346 cycles # 3.176 GHz ( +- 0.15% ) (67.30%)
> 1,360,335,356 stalled-cycles-frontend # 17.39% frontend cycles idle ( +- 0.51% ) (66.78%)
> 2,090,675,875 stalled-cycles-backend # 26.72% backend cycles idle ( +- 1.02% ) (66.70%)
> 8,984,501,079 instructions # 1.15 insn per cycle
> # 0.23 stalled cycles per insn ( +- 0.11% ) (66.96%)
> 1,866,843,047 branches # 757.745 M/sec ( +- 0.28% ) (67.25%)
> 73,973,482 branch-misses # 3.96% of all branches ( +- 0.15% ) (67.37%)
>
> 2.368775642 seconds time elapsed ( +- 0.21% )
>
> (without)
> Performance counter stats for '/usr/lib64/libreoffice/program/soffice.bin --convert-to pdf kandide.odt' (4 runs):
>
> 2467.698417 task-clock (msec) # 1.040 CPUs utilized ( +- 0.23% )
> 540 context-switches # 0.219 K/sec ( +- 17.02% )
> 12 cpu-migrations # 0.005 K/sec ( +- 14.85% )
> 28,245 page-faults # 0.011 M/sec ( +- 0.02% )
> 7,806,607,838 cycles # 3.164 GHz ( +- 0.09% ) (67.06%)
> 1,338,588,952 stalled-cycles-frontend # 17.15% frontend cycles idle ( +- 0.30% ) (66.99%)
> 2,103,802,012 stalled-cycles-backend # 26.95% backend cycles idle ( +- 0.77% ) (66.92%)
> 9,012,688,271 instructions # 1.15 insn per cycle
> # 0.23 stalled cycles per insn ( +- 0.14% ) (67.02%)
> 1,870,634,478 branches # 758.048 M/sec ( +- 0.31% ) (67.19%)
> 73,921,605 branch-misses # 3.95% of all branches ( +- 0.13% ) (67.08%)
>
> 2.373621006 seconds time elapsed ( +- 0.27% )
>
>
> Compile times using clang, which was built with shared libs, also don't
> change at all.
>
> --
> Markus
--
H.J.