This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: RFC: Should x86-64 support arbitrary calling conventions?


On 03/17/2017 02:03 PM, Kreitzer, David L wrote:
> H.J. is correct. The __regcall calling convention may use up to 16 vector
> registers for passing arguments. And when not used for passing arguments,
> registers xmm8-xmm15 are callee-save. The convention doesn't pass arguments
> in mask registers nor treat them as callee-save, but there still might be
> situations where it would be useful to pass arguments in mask registers for
> performance reasons.
> 
> Ideally, _dl_runtime_resolve should preserve any registers that it uses,
> similar to an interrupt handler.
> 
> https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5ed3cc7b66af4758f7849ed6f65f4365be8223be
> 
> It is not strictly necessary to use xsave/xrstor for this purpose, though that
> is a convenient way to do it. An alternative if xsave/xrstor is deemed too
> costly is to avoid using vector registers at all within _dl_runtime_resolve.
> 
> Otherwise, we leave significant performance potential on the table in
> situations where the "one size fits all" calling convention is inefficient.

David,

Thanks for your input and experience on the matter.

Performance spectrum:
---------------------

I absolutely agree that performance is left on the table. How much depends on
the choices made by the developer and on the choices made by the runtime
and developer tooling.

Trade-offs are made at every level: performance versus debuggability, and
the special case versus the general case.

I see a spectrum of optimization choices here:

(1) Static linking.

    - No dynamic loader involved (unless using dlopen)
    - Developer can use any regparm or __regcall options they want.
    - There are some natural consequences to not using dynamic loading.

(2) Whole program optimization (in the abstract)

    - Could use special call sequences like those used with -fno-plt to
      make direct calls to functions and bypass the PLT.
    - Likely requires the runtime to be exactly that which was used at build time.
    - Depending on the framework you could have inter-module ABI differences, e.g.
      the caller might know that a given implementation of a shared library
      routine doesn't clobber certain registers and optimize for that.

(3) Dynamic linking with special options.

    - Use -fno-plt or -Wl,-z,now
    - Degraded developer tooling features because of current lack of support for
      alternate function call ABIs.
    - Inability to use LD_AUDIT audit framework without PLT entries.
    - ELF interposition still preserved.

(4) Dynamic linking

    - Following a published ABI.
    - Intra-module function calls may use non-standard procedure call ABIs:
      - Kernel syscalls are an example of a special call ABI (intra-module)
      - Use of regparm and __regcall for certain functions (intra-module)
      Note: Observable only by a debugger. Not observable by an audit module (LD_AUDIT).
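
To illustrate the intra-module case in (4), here is a minimal sketch, assuming
GCC or Clang on an ELF target. The function names are hypothetical and are not
taken from glibc. The point is only that a hidden-visibility helper is never
reached through the PLT, so the lazy-binding trampoline never sees its
arguments, while the exported entry point keeps the published ABI.

```c
/* Sketch only: scale_internal and scale_public are hypothetical names. */
#include <assert.h>

/* Intra-module helper. Hidden visibility makes calls from inside this
   module bind locally, so they do not go through the PLT and are not
   visible to _dl_runtime_resolve or to an LD_AUDIT module. A non-standard
   convention (regparm on i386, __regcall with ICC) would be applied to a
   function like this one. */
__attribute__((visibility("hidden"), noinline))
int scale_internal(int value, int factor)
{
    return value * factor;
}

/* Exported entry point. Default visibility means other modules may call
   it through the PLT, so it keeps the platform's published calling
   convention. This sketch only accepts non-negative input. */
__attribute__((visibility("default")))
int scale_public(int value)
{
    assert(value >= 0);
    return scale_internal(value, 3);
}
```

Built into a shared object (for example with -shared -fPIC), only scale_public
appears in the dynamic symbol table; scale_internal stays a local detail of the
module.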

You are positioning ICC's __regcall as something which should fit into (4).

I argue it fits into (3) and will not be supported out of the box.

glibc's position:
-----------------

In https://sourceware.org/bugzilla/show_bug.cgi?id=21265#c7 I state the
general principles that glibc should follow:

(a) Optimize for the special local case.

    - In the special local case glibc uses internal_function for all non-PLT internal
      function calls and that may include using regparm.

(b) Optimize for the global average case.

    - In the global average case glibc strives to make (3/4) as fast as
      possible while still following ELF. The dynamic loader is responsible for running
      a large number of applications, not all of which are compiled with
      __regcall or other arbitrary calling conventions (like stack alignment at function
      entry).

Again, you are positioning ICC's __regcall as something that fits into (b) without
any impact on the global average case.

I argue ICC's __regcall is in (a) and does not warrant changes in the dynamic loader's
runtime resolution trampoline.

Benchmarking:
-------------

If one argues that enabling ICC's __regcall does not slow down
(4) in a statistically significant way, then I would like to see a
microbenchmark contributed that demonstrates this, so that we have an
objective, measurable position on the topic.
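
As a starting point only, here is a rough sketch of the kind of measurement I
mean, assuming a POSIX system with clock_gettime. The names call_timing and
measure_first_vs_later are hypothetical. It times the first call to a libc
function, which may go through the lazy-binding resolver, against a later call
that does not. If the binary is linked with -Wl,-z,now or -fno-plt, or the
symbol was already resolved, both numbers measure the same thing.

```c
/* Sketch only: a single-sample timing, not a statistically sound benchmark. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical result type: nanoseconds for the first and a later call. */
struct call_timing {
    long long first_ns;
    long long later_ns;
};

static long long elapsed_ns(const struct timespec *start,
                            const struct timespec *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
           + (long long)(end->tv_nsec - start->tv_nsec);
}

struct call_timing measure_first_vs_later(void)
{
    struct timespec t0, t1;
    struct call_timing result;
    volatile long sink;   /* keeps the strtol calls from being optimized out */
    int rc;

    /* Warm up clock_gettime so its own lazy binding is not measured. */
    rc = clock_gettime(CLOCK_MONOTONIC, &t0);
    assert(rc == 0);

    /* First call to strtol: may enter the resolver under lazy binding. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = strtol("42", NULL, 10);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    result.first_ns = elapsed_ns(&t0, &t1);

    /* Later call: the symbol is already bound. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = strtol("42", NULL, 10);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    result.later_ns = elapsed_ns(&t0, &t1);

    (void)sink;
    return result;
}
```

A real contribution would repeat this over many fresh processes and many
symbols, and compare two glibc builds (with and without the extra register
save/restore in the trampoline) before drawing any conclusion.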

The use of -fno-plt (as Florian Weimer is suggesting), non-lazy binding, or LTO
(in the future) can make it possible to optimize more of the call ABI.

Florian Weimer noticed that we do use internal_function on 
__libc_pthread_init@GLIBC_PRIVATE, which means that glibc is inconsistent
about (b) above. I could not justify adding more support for alternate calling 
conventions just to satisfy a GLIBC_PRIVATE requirement. In fact I think that
the use of regparm on __libc_pthread_init is a mistake that should be fixed.

Lastly, the use of all of these alternate ABIs can impact the developer's
ability to use developer tooling such as systemtap. Developer tooling should
be able to expect that external global symbols follow the published ABI for
the architecture.

In summary
==========

- Optimizing for the general case in the dynamic loader means that we don't
  support __regcall functions in the PLT with lazy binding. Thus no out-of-the-box
  support for __regcall.

- Application developers who want these high-performance features have to
  choose to compile for (3) above, for example with -fno-plt or -Wl,-z,now,
  and accept the cost in debugging (no LD_AUDIT support, and problems with
  uprobe-using tooling like systemtap, which expects a given ABI).

- The support for regparm on i386 in the dynamic loader trampoline is historical,
  and glibc should remove the one usage in __libc_pthread_init@GLIBC_PRIVATE that
  has external linkage.

- I suggest bug 21265 be RESOLVED as WONTFIX because of the impact on applications
  that don't use __regcall. Alternatively a more detailed performance analysis of
  the impact on applications that don't use __regcall is required before adding
  instructions to the hot path of the average application (or removing their use
  in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
  on hardware that supports those vector registers).

Comments?

-- 
Cheers,
Carlos.

