RFC: Should x86-64 support arbitrary calling conventions?

Richard Henderson rth@twiddle.net
Mon Mar 20 22:05:00 GMT 2017


On 03/21/2017 04:30 AM, Carlos O'Donell wrote:
> On 03/17/2017 02:03 PM, Kreitzer, David L wrote:
>> H.J. is correct. The __regcall calling convention may use up to 16 vector
>> registers for passing arguments. And when not used for passing arguments,
>> registers xmm8-xmm15 are callee-save. The convention doesn't pass arguments
>> in mask registers nor treat them as callee-save, but there still might be
>> situations where it would be useful to pass arguments in mask registers for
>> performance reasons.
>>
>> Ideally, _dl_runtime_resolve should preserve any registers that it uses,
>> similar to an interrupt handler.
>>
>> https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5ed3cc7b66af4758f7849ed6f65f4365be8223be
>>
>> It is not strictly necessary to use xsave/xrstor for this purpose, though that
>> is a convenient way to do it. An alternative if xsave/xrstor is deemed too
>> costly is to avoid using vector registers at all within _dl_runtime_resolve.
>>
>> Otherwise, we leave significant performance potential on the table in
>> situations where the "one size fits all" calling convention is inefficient.
>
> David,
>
> Thanks for your input and experience on the matter.
>
> Performance spectrum:
> ---------------------
>
> I absolutely agree that performance is left on the table; how much depends on
> the choices made by the developer and by the runtime and developer tooling.
>
> Trade-offs are made at all levels: performance versus debuggability, and
> the special case versus the general case.
>
> I consider a spectrum of optimizations here that range from:
>
> (1) Static linking.
>
>     - No dynamic loader involved (unless using dlopen)
>     - Developer can use any regparm or __regcall options they want.
>     - There are some natural consequences to not using dynamic loading.
>
> (2) Whole program optimization (in the abstract)
>
>     - Could use special call sequences like those used with -fno-plt to
>       make direct calls to functions and bypass the PLT.
>     - Likely requires the runtime to be exactly that which was used at build time.
>     - Depending on the framework you could have inter-module ABI differences, e.g.
>       the caller might know that a given implementation of a shared library
>       routine doesn't clobber certain registers and optimize for that.
>
> (3) Dynamic linking with special options.
>
>     - Use -fno-plt or -Wl,-z,now
>     - Degraded developer tooling features because of current lack of support for
>       alternate function call ABIs.
>     - Inability to use LD_AUDIT audit framework without PLT entries.
>     - ELF interposition still preserved.
>
> (4) Dynamic linking
>
>     - Following a published ABI.
>     - Intra-module function calls may use non-standard procedure call ABIs:
>       - Kernel syscalls are an example of a special call ABI (intra-module)
>       - Use of regparm and __regcall for certain functions (intra-module)
>       Note: Observable only by a debugger. Not observable by an audit module (LD_AUDIT).
>
> You are positioning ICC's __regcall as something which should fit into (4).
>
> I argue it fits into (3) and will not be supported out of the box.
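
To make the disagreement above concrete, here is a minimal sketch in portable C
of why a lazy-binding resolver matters to a convention like __regcall. It is a
toy model, not glibc code: the register file is an array, and every name in it
(vregs_t, resolver_body, resolve_saving_abi_args, resolve_saving_all,
callee_sum, lazy_call) is hypothetical. The only facts assumed are that the
standard x86-64 psABI passes vector arguments in xmm0-xmm7 and that, per the
quoted text, __regcall may use up to 16 vector registers.

```c
#include <assert.h>
#include <string.h>

#define NUM_VREGS      16  /* xmm0..xmm15 in this toy model */
#define SYSV_ARG_VREGS  8  /* psABI passes vector args in xmm0..xmm7 */

/* Hypothetical model of the vector register file; not real machine state. */
typedef struct { double v[NUM_VREGS]; } vregs_t;

/* Stand-in for the resolver's own work: assume it may scribble on
   every vector register while looking up the symbol. */
static void resolver_body(vregs_t *r) {
    for (int i = 0; i < NUM_VREGS; i++)
        r->v[i] = -1.0;
}

/* Resolver that preserves only the registers the standard ABI uses
   for argument passing. */
void resolve_saving_abi_args(vregs_t *r) {
    double saved[SYSV_ARG_VREGS];
    memcpy(saved, r->v, sizeof saved);
    resolver_body(r);
    memcpy(r->v, saved, sizeof saved);
}

/* Resolver that preserves everything it might touch, in the spirit of
   an interrupt handler. */
void resolve_saving_all(vregs_t *r) {
    vregs_t saved = *r;
    resolver_body(r);
    *r = saved;
}

/* Callee with a __regcall-like convention: it reads its arguments from
   the first nargs vector registers and sums them. */
double callee_sum(const vregs_t *r, int nargs) {
    double sum = 0.0;
    for (int i = 0; i < nargs; i++)
        sum += r->v[i];
    return sum;
}

/* Simulate the first call through a PLT entry: the caller loads nargs
   arguments (each 1.0), the resolver runs, then the callee runs. */
double lazy_call(void (*resolver)(vregs_t *), int nargs) {
    vregs_t r;
    assert(nargs >= 0 && nargs <= NUM_VREGS);
    for (int i = 0; i < NUM_VREGS; i++)
        r.v[i] = (i < nargs) ? 1.0 : 0.0;
    resolver(&r);
    return callee_sum(&r, nargs);
}
```

In this model, lazy_call(resolve_saving_abi_args, 8) sees all eight arguments
intact, while the same resolver with 16 arguments hands the callee clobbered
values in the upper eight registers. lazy_call(resolve_saving_all, 16) is
correct, at the cost of saving and restoring the whole register file on the
first call.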

I'm quite certain that I made this same point to Intel folks on the GCC side at 
least a year ago, possibly two.


r~


