This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]
On 10/20/2017 09:31 AM, Florian Weimer wrote:
> * Florian Weimer:
>
>> On 10/20/2017 03:21 PM, Florian Weimer wrote:
>>> LD_DEBUG=statistics shows this:
>>>
>>> 9506:
>>> 9506: runtime linker statistics:
>>> 9506: total startup time in dynamic loader: 19960074 cycles
>>> 9506: time needed for relocation: 19105814 cycles (95.7%)
>>> 9506: number of relocations: 87
>>> 9506: number of relocations from cache: 3
>>> 9506: number of relative relocations: 1226
>>> 9506: time needed to load objects: 701382 cycles (3.5%)
>>> 9506:
>>> 9506: runtime linker statistics:
>>> 9506: final number of relocations: 781589
>>> 9506: final number of relocations from cache: 3
>>>
>>> This is a main program which contains 1,500 function calls. The
>>> functions are defined in a single DSO, and each function calls 520 other
>>> functions, giving a total number of 781,500 relocations from the test.
>>>
>>> On my laptop (Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz), I get this
>>> (ten runs, real time measured in seconds):
>>>
>>> > t.test(prev_laptop, after_laptop)
>>>
>>> Welch Two Sample t-test
>>>
>>> data: prev_laptop and after_laptop
>>> t = -14.932, df = 18, p-value = 1.392e-11
>>> alternative hypothesis: true difference in means is not equal to 0
>>> 95 percent confidence interval:
>>> -0.05749145 -0.04330855
>>> sample estimates:
>>> mean of x mean of y
>>> 0.2345 0.2849
>>>
>>> So it's definitely not in the noise. The penalty appears to be around
>>> 65ns per relocation.
>>
>> I did some more benchmarks. Ryzen takes a bit of a hit (100ns or ~30%).
>> Purley looks very good (35ns or ~5%). The outlier is KNL, with 750ns
>> added per relocation (at lower clock rates admittedly) and ~50% longer
>> relocation times overall. (The caveat is that this is lab hardware,
>> which may or may not match production silicon.)
>
> I found another piece of perhaps interesting hardware, an Intel(R)
> Core(TM) m7-6Y75. The performance hit is around 3% or 16ns. This is
> a lower-power mobile CPU in a tablet. So even there, the numbers are
> good.
Florian, H.J., and Markus,
Thank you very much for doing some more thorough tests on the patch set.
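For anyone following along, the laptop figures quoted above can be cross-checked with a few lines of arithmetic. This is only a rough sketch in Python, not part of the patch or of the original benchmark. It assumes the 781,500 total is the 1,500 calls in the main program plus 520 further calls from each of those functions, and that the whole difference between the two sample means is attributable to lazy relocation. All variable and function names are made up for illustration.

```python
# Back-of-the-envelope check of the quoted laptop numbers.
# Assumption: total = calls in main + (calls in main * calls per function).
CALLS_IN_MAIN = 1500
CALLS_PER_FUNCTION = 520
relocations = CALLS_IN_MAIN + CALLS_IN_MAIN * CALLS_PER_FUNCTION

# Sample means from the quoted t-test output (real time, seconds).
mean_before = 0.2345
mean_after = 0.2849

# Bounds of the quoted 95 percent confidence interval (seconds, x - y).
ci_low, ci_high = -0.05749145, -0.04330855


def per_relocation_ns(delta_seconds, count):
    """Spread a wall-clock difference evenly over `count` relocations."""
    return delta_seconds / count * 1e9


penalty_ns = per_relocation_ns(mean_after - mean_before, relocations)
penalty_min_ns = per_relocation_ns(-ci_high, relocations)
penalty_max_ns = per_relocation_ns(-ci_low, relocations)

print(f"relocations from the test: {relocations}")
print(f"penalty per relocation: {penalty_ns:.1f} ns "
      f"(interval {penalty_min_ns:.1f} to {penalty_max_ns:.1f} ns)")
```

Under those assumptions the point estimate comes out at a bit under 65ns per relocation, which agrees with the figure Florian reported for the i7-4810MQ.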
--
Cheers,
Carlos.