This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: RFC: Should x86-64 support arbitrary calling conventions?


On 03/21/2017 04:30 AM, Carlos O'Donell wrote:
On 03/17/2017 02:03 PM, Kreitzer, David L wrote:
H.J. is correct. The __regcall calling convention may use up to 16 vector
registers for passing arguments. And when not used for passing arguments,
registers xmm8-xmm15 are callee-save. The convention doesn't pass arguments
in mask registers nor treat them as callee-save, but there still might be
situations where it would be useful to pass arguments in mask registers for
performance reasons.

Ideally, _dl_runtime_resolve should preserve any registers that it uses,
similar to an interrupt handler.

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5ed3cc7b66af4758f7849ed6f65f4365be8223be

It is not strictly necessary to use xsave/xrstor for this purpose, though that
is a convenient way to do it. An alternative if xsave/xrstor is deemed too
costly is to avoid using vector registers at all within _dl_runtime_resolve.

Otherwise, we leave significant performance potential on the table in
situations where the "one size fits all" calling convention is inefficient.

David,

Thanks for your input and experience on the matter.

Performance spectrum:
---------------------

I absolutely agree that performance is left on the table. How much depends on
the choices made by the developer and on the choices made by the runtime
and developer tooling.

Trade-offs are made at all levels: performance versus debugging support,
or the special case versus the general case.

I consider a spectrum of optimizations here that range from:

(1) Static linking.

    - No dynamic loader involved (unless using dlopen)
    - Developer can use any regparm or __regcall options they want.
    - There are some natural consequences to not using dynamic loading.
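
As a rough illustration of the second bullet, here is a minimal sketch of a function that opts into a register-heavy convention. REGCALL and sum12 are hypothetical names, not taken from this thread. The sketch assumes Clang's regcall attribute; GCC has no equivalent, so there the macro expands to nothing and the default ABI is used.

```c
/* REGCALL is a hypothetical macro, not part of any library.  It selects
   the regcall attribute where the compiler offers it and otherwise
   falls back to the default calling convention. */
#if defined(__has_attribute)
# if __has_attribute(regcall)
#  define REGCALL __attribute__((regcall))
# endif
#endif
#ifndef REGCALL
# define REGCALL
#endif

/* Twelve floating-point arguments.  Under the default x86-64 psABI only
   the first eight travel in xmm0-xmm7 and the rest are passed on the
   stack.  A convention that may use up to 16 vector registers, as
   described above for __regcall, can keep all twelve in registers. */
REGCALL double sum12(double a, double b, double c, double d,
                     double e, double f, double g, double h,
                     double i, double j, double k, double l)
{
    return a + b + c + d + e + f + g + h + i + j + k + l;
}
```

In a statically linked program every caller of sum12 sees this declaration at build time, so no lazy resolver ever sits between the caller and the callee.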

(2) Whole program optimization (in the abstract)

    - Could use special call sequences like those used with -fno-plt to
      make direct calls to functions and bypass the PLT.
    - Likely requires the runtime to be exactly that which was used at build time.
    - Depending on the framework you could have inter-module ABI differences e.g.
      the caller might know a given implementation of a shared library
      routine doesn't clobber certain registers and optimize for that.

(3) Dynamic linking with special options.

    - Use -fno-plt or -Wl,-z,now
    - Degraded developer tooling features because of current lack of support for
      alternate function call ABIs.
    - Inability to use LD_AUDIT audit framework without PLT entries.
    - ELF interposition still preserved.
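
To make the options in (3) concrete, here is a small sketch of one call into a shared library and how each build mode typically treats it on x86-64 with GCC and GNU ld. parse_decimal is a hypothetical helper, and strtol merely stands in for any function defined in a shared object; the exact instruction sequences are what these tools usually emit, not a guarantee.

```c
#include <stdlib.h>

/* parse_decimal is a hypothetical helper used only for illustration. */
long parse_decimal(const char *text)
{
    /* How the call below is typically bound:
     *
     *   cc -O2 -fPIC          call strtol@PLT; with lazy binding the
     *                         first call enters _dl_runtime_resolve.
     *   cc -O2 -fPIC -fno-plt call *strtol@GOTPCREL(%rip); there is no
     *                         PLT stub, the GOT slot is filled at load.
     *   cc ... -Wl,-z,now     the PLT stub usually remains, but every
     *                         symbol is bound at startup, so the lazy
     *                         resolver is not entered for this call.
     */
    return strtol(text, NULL, 10);
}
```

In the last two modes the lazy resolver is out of the picture for this call, so which registers it would have preserved no longer matters.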

(4) Dynamic linking

    - Following a published ABI.
    - Intra-module function calls may use non-standard procedure call ABIs:
      - Kernel syscalls are an example of a special call ABI (intra-module)
      - Use of regparm and __regcall for certain functions (intra-module)
      Note: Observable only by a debugger. Not observable by an audit module (LD_AUDIT).
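
The kernel syscall case can be sketched as follows. raw_getpid is a hypothetical name; the inline assembly assumes Linux on x86-64, where the syscall number and the return value both travel in rax and the syscall instruction itself clobbers rcx and r11. On any other target the sketch falls back to the ordinary libc wrapper.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* raw_getpid is a hypothetical helper used only for illustration. */
long raw_getpid(void)
{
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    /* The kernel entry convention differs from the function-call ABI:
       rcx and r11 are destroyed by the syscall instruction, so they are
       listed as clobbers and the compiler keeps no live values there. */
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "0" ((long) SYS_getpid)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback for other targets: use the libc wrapper. */
    return (long) getpid();
#endif
}
```

The special convention is confined to the body of one function. No PLT entry is involved, so neither the lazy resolver nor an audit module ever sees it.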

You are positioning ICC's __regcall as something which should fit into (4).

I argue it fits into (3) and will not be supported out of the box.

I'm quite certain that I made this same point to Intel folks on the GCC side at least a year ago, possibly two.


r~

