RFC: Should x86-64 support arbitrary calling conventions?
Richard Henderson
rth@twiddle.net
Mon Mar 20 22:05:00 GMT 2017
On 03/21/2017 04:30 AM, Carlos O'Donell wrote:
> On 03/17/2017 02:03 PM, Kreitzer, David L wrote:
>> H.J. is correct. The __regcall calling convention may use up to 16 vector
>> registers for passing arguments. And when not used for passing arguments,
>> registers xmm8-xmm15 are callee-save. The convention doesn't pass arguments
>> in mask registers nor treat them as callee-save, but there still might be
>> situations where it would be useful to pass arguments in mask registers for
>> performance reasons.
>>
>> Ideally, _dl_runtime_resolve should preserve any registers that it uses,
>> similar to an interrupt handler.
>>
>> https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5ed3cc7b66af4758f7849ed6f65f4365be8223be
>>
>> It is not strictly necessary to use xsave/xrstor for this purpose, though that
>> is a convenient way to do it. An alternative if xsave/xrstor is deemed too
>> costly is to avoid using vector registers at all within _dl_runtime_resolve.
>>
>> Otherwise, we leave significant performance potential on the table in
>> situations where the "one size fits all" calling convention is inefficient.
>
> David,
>
> Thanks for your input and experience on the matter.
>
> Performance spectrum:
> ---------------------
>
> I absolutely agree that performance is left on the table, and how much
> depends on the choices made by the developer and by the runtime and
> developer tooling.
>
> Trade-offs are made at all levels to provide performance versus debugging
> or special case versus general case.
>
> I consider a spectrum of optimizations here that range from:
>
> (1) Static linking.
>
> - No dynamic loader involved (unless using dlopen)
> - Developer can use any regparm or __regcall options they want.
> - There are some natural consequences to not using dynamic loading.
>
> (2) Whole program optimization (in the abstract)
>
> - Could use special call sequences like those used with -fno-plt to
> make direct calls to functions and bypass the PLT.
> - Likely requires the runtime to be exactly that which was used at build time.
> - Depending on the framework you could have inter-module ABI differences e.g.
> the caller might know a given implementation of a shared library
> routine doesn't clobber certain registers and optimize for that.
>
> (3) Dynamic linking with special options.
>
> - Use -fno-plt or -Wl,-z,now
> - Degraded developer tooling features because of current lack of support for
> alternate function call ABIs.
> - Inability to use LD_AUDIT audit framework without PLT entries.
> - ELF interposition still preserved.
>
> (4) Dynamic linking
>
> - Following a published ABI.
> - Intra-module function calls may use non-standard procedure call ABIs:
> - Kernel syscalls are an example of a special call ABI (intra-module)
> - Use of regparm and __regcall for certain functions (intra-module)
> Note: Observable only by a debugger. Not observable by an audit module (LD_AUDIT).
>
> You are positioning ICC's __regcall as something which should fit into (4).
>
> I argue it fits into (3) and will not be supported out of the box.
I'm quite certain that I made this same point to Intel folks on the GCC side at
least a year ago, possibly two.
r~
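
For readers unfamiliar with case (4) in the quoted list, here is a minimal
sketch, not taken from the thread, of an intra-module call that uses a
non-default calling convention. It assumes GCC or Clang targeting x86-64 and
uses the ms_abi attribute as a stand-in for __regcall, which not every
compiler implements. The names LOCAL_ABI, scale_add, and module_entry are
hypothetical.

```c
/* Assumption: GCC or Clang on x86-64, where __attribute__((ms_abi))
   selects the Microsoft x64 convention (xmm6-xmm15 callee-saved).
   On any other target the macro expands to nothing and the default
   convention is used, so the code still compiles and behaves the same. */
#if defined(__x86_64__) && defined(__GNUC__)
#define LOCAL_ABI __attribute__((ms_abi))
#else
#define LOCAL_ABI
#endif

/* Hypothetical module-private helper. It is static, so calls to it are
   bound inside the module at build time and never go through the PLT. */
static LOCAL_ABI double scale_add(double a, double b, double k)
{
    return (a + b) * k;
}

/* Hypothetical exported entry point. It keeps the platform's published
   ABI, so other modules and the dynamic loader see only the standard
   convention. */
double module_entry(double a, double b)
{
    return scale_add(a, b, 2.0);
}
```

Because scale_add is static, the call from module_entry is a direct call
within the module and never reaches _dl_runtime_resolve, so the dynamic
loader does not need to know which registers the helper uses. The situation
debated in the thread arises only when a function with such a convention is
exported and called lazily through a PLT entry.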