This is the mail archive of the
binutils@sourceware.org
mailing list for the binutils project.
Re: [PATCH] gold: Add linker relaxation of tail calls on sparc.
- From: David Miller <davem at davemloft dot net>
- To: binutils at sourceware dot org
- Cc: iant at google dot com
- Date: Tue, 24 Apr 2012 17:15:10 -0400 (EDT)
- Subject: Re: [PATCH] gold: Add linker relaxation of tail calls on sparc.
- References: <20120423.030821.684142687786581661.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Mon, 23 Apr 2012 03:08:21 -0400 (EDT)
Ian, could you please review this patch?
Thanks a lot.
> This gets GOLD in line with the relaxations performed by the
> BFD linker on sparc. Basically we can turn a tail call into
> a branch if certain conditions are satisfied, and in particular
> the final call displacement is small enough.
>
> The code is heavily and thoroughly commented.
>
> I've validated that the relaxation triggers during the testsuite on
> both 32-bit and 64-bit sparc, and that the transformations performed
> are correct.
>
> I looked at the relaxation support that exists for the kind of stuff
> ARM does with stubs where section sizes and offsets end up changing,
> and that didn't seem appropriate nor necessary for what I'm doing
> here. So I implemented this similarly to how the TLS code
> optimizations are done.
>
> This is relative to "[PATCH] gold: Maintain sparc ELF header bits
> properly" for which I'm still waiting approval.
>
> After this I only have 3 items on my agenda for GOLD/Sparc:
>
> 1) Implement proper R_SPARC_REGISTER support, this has been bugging
> me for years but last time I tried this it was hard to frob out
> dummy relocations and/or symbols and have GOLD just leave them
> completely alone and let them propagate unimpeded to the target
> relocation function.
>
> My frustration matched that which Roland recently endured with
> those special ARM PLT symbols. :-) Hopefully this kind of thing is
> easier now.
>
> 2) Fix all of the bugs wrt. large PLTs on sparc64. As I mentioned
> in my IFUNC patch posting, in order to get this right we'll need
> to pass PLT indexes around rather than PLT offsets.
>
> The core issue is that PLTs are variably sized on 64-bit sparc, the
> first 32767 PLT slots are one size, and then the format and the
> size of the PLT entries change starting at index 32768. So if you
> have an IFUNC plt slot, you won't know how big it's going to be
> until you know it's final index.
>
> 3) Implement full incremental linking support just like x86-64 does.
> Any known gotchas or implementation issues would be appreciated.
>
> Thanks.
>
> gold/
>
> * sparc.cc (Target_sparc::Relocate::relax_call): New function.
> (Target_sparc::Relocate::relocate): Call it for R_SPARC_WDISP30
> and R_SPARC_WPLT30.
>
> diff --git a/gold/sparc.cc b/gold/sparc.cc
> index 004a265..a9fe9de 100644
> --- a/gold/sparc.cc
> +++ b/gold/sparc.cc
> @@ -333,6 +333,11 @@ class Target_sparc : public Sized_target<size, big_endian>
> typename elfcpp::Elf_types<size>::Elf_Addr,
> section_size_type);
>
> + inline void
> + relax_call(Target_sparc<size, big_endian>* target,
> + unsigned char* view,
> + const elfcpp::Rela<size, big_endian>& rela);
> +
> // Ignore the next relocation which should be R_SPARC_TLS_GD_ADD
> bool ignore_gd_add_;
>
> @@ -3304,6 +3309,8 @@ Target_sparc<size, big_endian>::Relocate::relocate(
> case elfcpp::R_SPARC_WDISP30:
> case elfcpp::R_SPARC_WPLT30:
> Reloc::wdisp30(view, object, psymval, addend, address);
> + if (target->may_relax())
> + relax_call(target, view, rela);
> break;
>
> case elfcpp::R_SPARC_WDISP22:
> @@ -3954,6 +3961,145 @@ Target_sparc<size, big_endian>::Relocate::relocate_tls(
> }
> }
>
> +// Relax a call instruction.
> +
> +template<int size, bool big_endian>
> +inline void
> +Target_sparc<size, big_endian>::Relocate::relax_call(
> + Target_sparc<size, big_endian>* target,
> + unsigned char* view,
> + const elfcpp::Rela<size, big_endian>& rela)
> +{
> + typedef typename elfcpp::Swap<32, true>::Valtype Insntype;
> + Insntype *wv = reinterpret_cast<Insntype*>(view);
> + Insntype call_insn, delay_insn, set_insn;
> + uint32_t op3, reg, off;
> +
> + // This code tries to relax call instructions that meet
> + // certain criteria.
> + //
> + // The first criteria is that the call must be such that the return
> + // address which the call writes into %o7 is unused. Two sequences
> + // meet this criteria, and are used to implement tail calls.
> + //
> + // Leaf function tail call:
> + //
> + // or %o7, %g0, %ANY_REG
> + // call FUNC
> + // or %ANY_REG, %g0, %o7
> + //
> + // Non-leaf function tail call:
> + //
> + // call FUNC
> + // restore
> + //
> + // The second criteria is that the call destination is close. If
> + // the displacement can fit in a signed 22-bit immediate field of a
> + // pre-V9 branch, we can do it. If we are generating a 64-bit
> + // object or a 32-bit object with ELF machine type EF_SPARC32PLUS,
> + // and the displacement fits in a signed 19-bit immediate field,
> + // then we can use a V9 branch.
> +
> + call_insn = elfcpp::Swap<32, true>::readval(wv);
> + delay_insn = elfcpp::Swap<32, true>::readval(wv + 1);
> +
> + // Make sure it is really a call instruction.
> + if (((call_insn >> 30) & 0x3) != 1)
> + return;
> +
> + if (((delay_insn >> 30) & 0x3) != 2)
> + return;
> +
> + // Accept only a restore or an integer arithmetic operation whose
> + // sole side effect is to write the %o7 register (and perhaps set
> + // the condition codes, which are considered clobbered across
> + // function calls).
> + //
> + // For example, we don't want to match a tagged addition or
> + // subtraction. We also don't want to match something like a
> + // divide.
> + //
> + // Specifically we accept add{,cc}, and{,cc}, or{,cc},
> + // xor{,cc}, sub{,cc}, andn{,cc}, orn{,cc}, and xnor{,cc}.
> +
> + op3 = (delay_insn >> 19) & 0x3f;
> + reg = (delay_insn >> 25) & 0x1f;
> + if (op3 != 0x3d
> + && ((op3 & 0x28) != 0 || reg != 15))
> + return;
> +
> + // For non-restore instructions, make sure %o7 isn't
> + // an input.
> + if (op3 != 0x3d)
> + {
> + // First check RS1
> + reg = (delay_insn >> 14) & 0x15;
> + if (reg == 15)
> + return;
> +
> + // And if non-immediate, check RS2
> + if (((delay_insn >> 13) & 1) == 0)
> + {
> + reg = (delay_insn & 0x1f);
> + if (reg == 15)
> + return;
> + }
> + }
> +
> + // Now check the branch distance. We are called after the
> + // call has been relocated, so we just have to peek at the
> + // offset contained in the instruction.
> + off = call_insn & 0x3fffffff;
> + if ((off & 0x3fe00000)
> + && (off & 0x3fe00000) != 0x3fe00000)
> + return;
> +
> + if ((size == 64 || target->elf_machine_ == elfcpp::EM_SPARC32PLUS)
> + && ((off & 0x3c0000) == 0
> + || (off & 0x3c0000) == 0x3c0000))
> + {
> + // ba,pt %xcc, FUNC
> + call_insn = 0x10680000 | (off & 0x07ffff);
> + }
> + else
> + {
> + // ba FUNC
> + call_insn = 0x10800000 | (off & 0x3fffff);
> + }
> + elfcpp::Swap<32, true>::writeval(wv, call_insn);
> +
> + // See if we can NOP out the delay slot instruction. We peek
> + // at the instruction before the call to make sure we're dealing
> + // with exactly the:
> + //
> + // or %o7, %g0, %ANY_REG
> + // call
> + // or %ANY_REG, %g0, %o7
> + //
> + // case. Otherwise this might be a tricky piece of hand written
> + // assembler calculating %o7 in some non-trivial way, and therefore
> + // we can't be sure that NOP'ing out the delay slot is safe.
> + if (op3 == 0x02
> + && rela.get_r_offset() >= 4)
> + {
> + if ((delay_insn & ~(0x1f << 14)) != 0x9e100000)
> + return;
> +
> + set_insn = elfcpp::Swap<32, true>::readval(wv - 1);
> + if ((set_insn & ~(0x1f << 25)) != 0x8013c000)
> + return;
> +
> + reg = (set_insn >> 25) & 0x1f;
> + if (reg == 0 || reg == 15)
> + return;
> + if (reg != ((delay_insn >> 14) & 0x1f))
> + return;
> +
> + // All tests pass, nop it out.
> + elfcpp::Swap<32, true>::writeval(wv + 1, sparc_nop);
> + }
> +}
> +
> // Relocate section data.
>
> template<int size, bool big_endian>