This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]


On 10/20/2017 03:21 PM, Florian Weimer wrote:
LD_DEBUG=statistics shows this:

       9506:
       9506:     runtime linker statistics:
       9506:       total startup time in dynamic loader: 19960074 cycles
       9506:                 time needed for relocation: 19105814 cycles (95.7%)
       9506:                      number of relocations: 87
       9506:           number of relocations from cache: 3
       9506:             number of relative relocations: 1226
       9506:                time needed to load objects: 701382 cycles (3.5%)
       9506:
       9506:     runtime linker statistics:
       9506:                final number of relocations: 781589
       9506:     final number of relocations from cache: 3

This is a main program which contains 1,500 function calls.  The functions are defined in a single DSO, and each function calls 520 other functions, giving a total of 781,500 relocations (1,500 + 1,500 × 520) from the test.

On my laptop (Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz), I get this (ten runs, real time measured in seconds):

 > t.test(prev_laptop, after_laptop)

     Welch Two Sample t-test

data:  prev_laptop and after_laptop
t = -14.932, df = 18, p-value = 1.392e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -0.05749145 -0.04330855
sample estimates:
mean of x mean of y
    0.2345    0.2849

So it's definitely not in the noise.  The penalty appears to be around 65ns per relocation.
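
For reference, the per-relocation figure can be reproduced from the two sample means and the relocation count. This is only a rough sketch: it assumes the whole difference in real time is due to the 781,500 relocations from the test, and the variable and function names are made up for illustration.

```python
# Figures copied from the t-test output and the test description above.
mean_before_s = 0.2345   # "mean of x" (prev_laptop), seconds
mean_after_s = 0.2849    # "mean of y" (after_laptop), seconds
relocations = 781_500    # 1,500 calls in main + 1,500 * 520 calls in the DSO


def penalty_ns_per_relocation(before_s, after_s, count):
    """Return the added real time per relocation, in nanoseconds.

    Assumes the entire difference in means is caused by the relocations.
    """
    return (after_s - before_s) / count * 1e9


penalty_ns = penalty_ns_per_relocation(mean_before_s, mean_after_s, relocations)
print(f"{penalty_ns:.1f} ns per relocation")
```

This comes out at roughly 64.5ns, which is where the "around 65ns" figure comes from.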

I did some more benchmarks. Ryzen takes a bit of a hit (100ns or ~30%). Purley looks very good (35ns or ~5%). The outlier is KNL, with 750ns added per relocation (at lower clock rates admittedly) and ~50% longer relocation times overall. (The caveat is that this is lab hardware, which may or may not match production silicon.)

I still think these numbers are okay. To put this into perspective, the total number of relocations that are processed when running yum, a non-trivial Python application, is less than 22,000. So there is an expected additional startup overhead of about 16.5ms for KNL, but that is still in the noise for a simple yum command such as “yum repolist”.
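
The startup estimate is just the per-relocation penalty multiplied by the relocation count. A small sketch of that arithmetic follows, using the per-relocation figures quoted in this message and 22,000 as an upper bound for yum. The dictionary and function names are illustrative only, and the result assumes every relocation pays the full penalty.

```python
# Added cost per relocation in nanoseconds, as quoted in this message.
penalty_ns = {
    "i7-4810MQ laptop": 65,
    "Ryzen": 100,
    "Purley": 35,
    "KNL": 750,
}

# Upper bound on the relocations processed when running yum.
yum_relocations = 22_000


def startup_overhead_ms(ns_per_relocation, relocation_count):
    """Return the estimated additional startup time in milliseconds."""
    return ns_per_relocation * relocation_count / 1e6


overhead_ms = {
    machine: startup_overhead_ms(ns, yum_relocations)
    for machine, ns in penalty_ns.items()
}

for machine, ms in overhead_ms.items():
    print(f"{machine}: at most {ms:.2f} ms")
```

For KNL this gives the 16.5ms upper bound mentioned above; the other machines stay below about 2.2ms.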

Any other ideas what else we could benchmark?

Thanks,
Florian

