This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]


On 10/20/2017 02:58 PM, Florian Weimer wrote:
On 10/20/2017 01:09 PM, H.J. Lu wrote:
When there are many DSOs, it takes more time to look up a symbol,
and the time to save/restore vector registers becomes noise.  The only
case where the time to save/restore vector registers becomes non-trivial is when:

1. There are a few DSOs so that symbol lookup takes fewer cycles.  And
2. There are many external function calls which are executed only once.  And
3. These external functions take very few cycles.

I can create such a testcase.  But I don't think it is a typical case.

Completely agree.  Basically, a program which is affected would have to (a) call many functions, (b) have short symbol lookup chains, and (c) do very little actual work.  This seems to be a very unlikely scenario.

I have a test case.  GCC scales poorly with many function calls, and it is difficult to get --export-dynamic to work with recent GCC/binutils. I will try to run it on various machines.

LD_DEBUG=statistics shows this:

      9506:
      9506:     runtime linker statistics:
      9506:       total startup time in dynamic loader: 19960074 cycles
      9506:                 time needed for relocation: 19105814 cycles (95.7%)
      9506:                      number of relocations: 87
      9506:           number of relocations from cache: 3
      9506:             number of relative relocations: 1226
      9506:                time needed to load objects: 701382 cycles (3.5%)
      9506:
      9506:     runtime linker statistics:
      9506:                final number of relocations: 781589
      9506:     final number of relocations from cache: 3

This is a main program which contains 1,500 function calls. The functions are defined in a single DSO, and each function calls 520 other functions, giving a total number of 781,500 relocations from the test.
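
As a quick sanity check on that total, here is a minimal Python sketch of the arithmetic. It assumes every call binds a distinct symbol lazily, exactly once; the function name and parameters are illustrative and not taken from the test case itself.

```python
def expected_lazy_relocations(main_calls: int, calls_per_function: int) -> int:
    """Count lazy symbol resolutions for the described test layout.

    Assumption: each call from the main program and each call made by a
    DSO function resolves its own symbol once.
    """
    from_main = main_calls                      # calls made by the main program
    from_dso = main_calls * calls_per_function  # calls made by the called functions
    return from_main + from_dso


total = expected_lazy_relocations(1500, 520)
print(total)  # 781500
```

The "final number of relocations" reported by LD_DEBUG (781589) is 89 higher than this; presumably most of that difference is the 87 relocations already processed at startup.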

On my laptop (Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz), I get this (ten runs, real time measured in seconds):

> t.test(prev_laptop, after_laptop)

	Welch Two Sample t-test

data:  prev_laptop and after_laptop
t = -14.932, df = 18, p-value = 1.392e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05749145 -0.04330855
sample estimates:
mean of x mean of y
   0.2345    0.2849

So it's definitely not in the noise. The penalty appears to be around 65 ns per relocation.
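
The per-relocation figure follows from the two sample means and the relocation count. A rough sketch of the calculation, assuming the whole difference in wall-clock time is attributable to the 781,500 relocations from the test (variable and function names are illustrative):

```python
mean_before = 0.2345    # seconds, "mean of x" (prev_laptop)
mean_after = 0.2849     # seconds, "mean of y" (after_laptop)
relocations = 781_500   # relocations attributed to the test


def penalty_ns(before_s: float, after_s: float, count: int) -> float:
    """Average extra time per relocation, in nanoseconds."""
    return (after_s - before_s) / count * 1e9


per_reloc_ns = penalty_ns(mean_before, mean_after, relocations)
print(f"{per_reloc_ns:.1f} ns per relocation")  # 64.5 ns per relocation
```

This is only an average over the whole run; it says nothing about how the cost splits between the register save/restore and the rest of the trampoline.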

I said in the past that we should use XSAVE in the trampoline, so that we do not have to touch the dynamic linker for each new CPU generation, and I think that alone is worth the slight additional cost.

I'll check a few additional machines over the coming hours.

Note that XSAVE will still not allow us to support *arbitrary* calling conventions, so we shouldn't advertise it as such. But hopefully, it will be sufficient to get the ABI-violating binaries mentioned in bug 21265 back into working order.

Thanks,
Florian

