[PATCH] MIPS/binutils: microMIPS linker relaxation fixes
Maciej W. Rozycki
Tue Dec 30 14:40:00 GMT 2014
[Reviving this old pending reply so that it is not lost. The short
conclusion is that all the branch shortening in linker relaxation will
most likely have to be removed, as it is unsafe in the general case. Of course
with GAS now handling most of this stuff there is supposed to be little if
any negative impact from this loss; I think compact branch relaxation
mentioned below should be straightforward to add to GAS, at least in the
minimal arrangement considered (i.e. no special case where a 16-bit branch
can be used). We already do similar stuff for MIPS16 compact jumps, so
it'll just be a matter of a few extra cases to add.]
On Wed, 16 Nov 2011, Richard Sandiford wrote:
> "Maciej W. Rozycki" <firstname.lastname@example.org> writes:
> >> > So I have actually given it some more thought and my understanding of the
> >> > ABI remains that while orphaned R_MIPS_LO16 relocations are indeed
> >> > permitted, they still must be preceded by a corresponding R_MIPS_HI16,
> >> > although that is not required to be adjacent. I believe this is only
> >> > permitted to allow cases like you quoted to avoid unnecessary extra code
> >> > to add missing R_MIPS_HI16 relocations.
> >> There are still potential problems though. We deliberately allow things like:
> >> lui $4,%hi(foo)
> >> lw $6,%lo(foo)($4)
> >> lw $7,%lo(foo+4)($4)
> >> ...
> >> .align 8
> >> foo:
> >> .word X, Y
> >> and foo is allowed to be in a text section. Does your patch ensure that
> >> foo remains 8-byte aligned, even if we relax code earlier in the section?
> > Sigh, you're right -- I wish we realised this earlier on. No, the
> > alignment of foo will get broken of course just as alignment of standard
> > MIPS code would, as noted with the original submission of this update.
> > Of course if you run this under Linux, then you won't notice unless you
> > observe the performance dropping badly.
> Hmm, run the code above you mean? The point is that the code doesn't
> work if foo becomes 0x....7ffc, since foo and foo+4 no longer have
> the same high part.
That's a corner case (not to be ignored of course), but what I have in
mind is actually more serious, i.e. consider foo becoming 2 mod 4.
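
 To make both failure modes concrete, here is a minimal sketch in Python
(not binutils code) of the usual %hi/%lo split, where %lo is a
sign-extended 16-bit value and %hi carries the matching rounding
adjustment. The helper names and the sample addresses are made up for
illustration only.

```python
def hi16(addr):
    """High 16 bits, adjusted so that (hi16 << 16) + lo16 == addr."""
    return ((addr + 0x8000) >> 16) & 0xFFFF


def lo16(addr):
    """Low 16 bits, interpreted as a signed 16-bit offset."""
    low = addr & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


def shared_lui_address(base, offset):
    """Address reached by `lui %hi(base)` plus `%lo(base + offset)`."""
    return ((hi16(base) << 16) + lo16(base + offset)) & 0xFFFFFFFF


aligned = 0x10008000   # hypothetical 8-byte aligned address of foo
shifted = aligned - 4  # foo after earlier code shrank by 4 bytes
odd = aligned - 2      # foo after earlier code shrank by 2 bytes

# With foo aligned, one LUI serves both foo and foo+4.
print(shared_lui_address(aligned, 4) == aligned + 4)
# At 0x....7ffc the high parts differ, so the second load goes astray.
print(shared_lui_address(shifted, 4) == shifted + 4)
# A 2-byte shift leaves foo at 2 mod 4, misaligned for a word load.
print(odd % 4)
```

 The first case only bites at a 64KiB boundary, while the second one
affects every word access to foo.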
> >> > Do you have a better idea?
> >> TBH, my inclination is to remove it from trunk too. I imagine
> >> GCC's LTO will catch many of the interesting cases (because then
> >> we assemble the output object's text section at once).
> > OK, so let's see where we are. We've got three kinds of relaxation
> > actions we make:
> > 1. I think with the changes I made to branch relaxation in GAS we are
> > mostly covered. There's one corner case remaining I reckon (I'd have
> > to go back to the code and/or my earlier notes to track it down), where
> > we fail to convert to a short or compact branch. And branches between
> > separate modules are extremely rare, so I wouldn't bother about them.
> > So all the branch relaxation code here should by now have been mostly
> > redundant. I'll have a look into that corner case yet -- I may not be
> > able to do that immediately though.
> Sounds good. :-)
So I have now recalled what the corner case is -- we don't do compact
branch relaxation. E.g. given this piece:
movep $4, $5, $2, $3
given that MOVEP cannot be reordered into a delay slot, the code should be
movep $4, $5, $2, $3
The same applies to 32-bit conditional branches that can be reduced to
BEQZC/BNEZC. However if owing to the displacement any of these branches
can be relaxed to a 16-bit variation, then I think scheduling a 16-bit NOP
into the delay slot should be preferred -- the code size will be the same,
but on scalar processors the delay slot should prevent the pipeline stall
that compact branches presumably incur (and superscalar processors should
be able to kill the NOP, so no performance hit there).
This is similar to MIPS16 compact jump relaxation, but this case seems
trickier because the branch is already relaxed. I think we should be able
to handle that with another RELAX_MICROMIPS flag that would delay the
emission of the delay slot NOP until md_convert_frag.
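
 The preference order described above can be sketched as follows. This
is Python rather than GAS code, the names are labels invented for this
sketch, and whether the displacement fits a 16-bit branch is passed in as
a flag rather than computed, so no encoding ranges are assumed here.

```python
# (description, size in bytes) for a branch whose delay slot cannot be
# filled with a useful instruction.
BRANCH32_NOP16 = ("32-bit branch + 16-bit NOP", 4 + 2)
COMPACT32 = ("32-bit compact branch, no delay slot", 4)
BRANCH16_NOP16 = ("16-bit branch + 16-bit NOP", 2 + 2)


def pick_branch_form(fits_16bit, has_compact_form):
    """Choose an encoding for a branch with an unfillable delay slot."""
    if fits_16bit:
        # Same size as the compact form, but keeps the delay slot.
        return BRANCH16_NOP16
    if has_compact_form:
        # Saves the 16-bit NOP over the plain 32-bit branch.
        return COMPACT32
    return BRANCH32_NOP16


for fits in (True, False):
    name, size = pick_branch_form(fits, has_compact_form=True)
    print(fits, name, size)
```

 The 16-bit branch and the compact branch cost the same 4 bytes, which
is why the choice between them comes down to the pipeline behaviour
only.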
> > 2. Short delay slot relaxation, i.e. JAL->JALS conversion. We actually
> > should be handling JALR->JALRS and BGEZAL/BLTZAL->BGEZALS/BLTZALS as
> > well, but we don't. These can and actually should be done in GAS.
> > There are two cases to handle:
> > * Instructions swapped into a delay slot. I reckon this is a bit
> > tricky, but I think still doable. The instruction to be swapped is
> > already of the right size, it's just not swapped if it's of the wrong
> > size for the delay slot. We should enable that swapping and flip the
> > delay slot size bit in the respective branch/jump opcode.
> Yeah. This doesn't seem too difficult on face value though.
Quite close to what we do for MIPS16 compact jump relaxation actually.
> > * Instructions manually scheduled in a delay slot ("noreorder" mode).
> > Currently the mnemonic used for the branch/jump determines the size
> > of this instruction. I think we should always treat the long delay
> > slot mnemonics as macros; they will often come from assembly written
> > for the standard MIPS mode the conversion of which to the respective
> > short delay slot mnemonics is IMO infeasible. Not even mentioning
> > that if operands are substituted in any way (e.g. by macro
> > expansion), then the size of the instruction may vary between
> > assembly passes.
> > Again, this may be a bit tricky as it would seem to require looking
> > forwards. But perhaps we can handle this with relaxation, or maybe
> > simpler yet -- by tweaking the previous instruction emitted through
> > the history of instructions we maintain.
> Yeah. We'd still need a variant frag in the latter case, to cope with
> things like ".loc"s between the two instructions. But I agree full
> relaxation isn't needed. We already change variant frags on the fly
> when doing things like nop insertion, so we might be able to do
> something similar here.
About the most complicated case I can think of here is a relaxed
out-of-range BGEZAL/BLTZAL -- we'll have a trailing JAL or JALR (depending
on the ABI) of the preceding variant frag to tweak in addition to the
original BGEZAL/BLTZAL opcode.
> > I think we should have a way to disable this branch/jump conversion,
> > perhaps in the "nomacro" mode or with a new setting (up to debate).
> Not sure if I follow this, probably due to my lack of familiarity with
> microMIPS. If you really want a JALR rather than a JALRS, wouldn't the
> simplest and most explicit way be to add ".32" to the delay slot insn?
> There's then no way the assembler could validly change the JALR.
Correct, good idea, I've forgotten about these size overrides. I think
it makes sense as you can't really make any implicit assumptions about
code offsets in the "macro" mode and once you've set "nomacro", then
you've by definition stopped the guarantee of source code compatibility
for the microMIPS mode. So you have to audit and modify if applicable
> If I'm wrong about that, then I don't think ".set nomacro" is appropriate.
> I think "macro" in that context means "one pseudo-instruction that
> expands to multiple real instructions".
I think you're right after all -- on first thoughts it didn't appear to
me adding another .set knob just for this would be the best idea.
> So I agree it makes sense to treat "JALR" as a macro (in the INSN_MACRO
> sense). For consistency, it seems sensible to allow "JALR ...; FOO.16"
> to be written as a shorthand for "JALRS ...; FOO.16" as well. I.e.
> it seems sensible not to care where the 16-bitness comes from.
Hmm, yes, I think it would make sense to treat JALR and JALRS, etc. as
aliases in the "macro" mode. It would take the unnecessary burden of
matching the jump/branch with the delay slot instruction off the
programmer. Then in the "nomacro" mode the interpretation would be
strict. Is this what you mean?
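
 As a sketch of the aliasing I have in mind (Python, not GAS code; the
function name and the error handling are assumptions made for
illustration), the jump mnemonic would follow the size of the delay slot
instruction in the "macro" mode and be checked against it in the
"nomacro" mode:

```python
SHORT_FORM = {
    "jal": "jals",
    "jalr": "jalrs",
    "bgezal": "bgezals",
    "bltzal": "bltzals",
}
LONG_FORM = {short: long for long, short in SHORT_FORM.items()}


def resolve_link_jump(mnemonic, slot_size, macro_mode=True):
    """Return the mnemonic matching a 2- or 4-byte delay slot insn."""
    if slot_size not in (2, 4):
        raise ValueError("delay slot instruction must be 2 or 4 bytes")
    base = LONG_FORM.get(mnemonic, mnemonic)
    if base not in SHORT_FORM:
        # No short delay slot form exists, e.g. for JALX.
        if slot_size != 4:
            raise ValueError(mnemonic + " needs a 32-bit delay slot insn")
        return mnemonic
    wanted = SHORT_FORM[base] if slot_size == 2 else base
    if not macro_mode and wanted != mnemonic:
        # Strict interpretation: the programmer must match them up.
        raise ValueError(mnemonic + " does not match the delay slot size")
    return wanted


print(resolve_link_jump("jalr", 2))   # 16-bit insn in the slot
print(resolve_link_jump("jalrs", 4))  # 32-bit insn, e.g. forced by ".32"
```

 Forcing the delay slot instruction to 32 bits with ".32" then pins the
long delay slot form without any extra .set knob.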
> > * While at it we might want to think about instruction swapping around
> > JALX -- as noted above we don't do that if the instruction does not
> > satisfy the delay slot size requirement and there's no JALXS
> > instruction. We could convert the instruction to the 32-bit size.
> > But then it may be really tough unless we relax all the 16-bit
> > instructions which, conversely, seems an overkill to me. So I
> > wouldn't put too much effort into it, but still I think it's worth
> > double-checking.
> Agreed on both counts (about it being interesting, but lower priority).
It's interesting to note that JALX is actually mainly produced by the
linker though, so this would address an exotic corner case. I guess let's
Which also brings the topic of JAL vs JALS back -- source code may want
the 32-bit delay slot JAL to be used even if the delay-slot instruction
can be reduced to a 16-bit form for the sake of JAL->JALX linker
conversion. So we're sort of back to the issue of relaxing this
instruction at the link stage only as earlier on we may not know if we'll
end up with a JAL or JALX. This doesn't apply to JALR or BGEZAL/BLTZAL of
> > 3. HI0_LO16 and ADDIUPC relaxation. There's nothing that can be done for
> > the former any earlier than by the linker, period. But do we care? I
> > think the architecture makes this optimisation unlikely to matter.
> > It's really unusual for TLB systems to map these low/high pages. Are
> > they used in BAT systems? I don't know -- can anyone comment? The
> > addresses from 0 up are typically useful in the error exception
> > handlers (where CP0.Status.ERL switches to the identity mapping of the
> > virtual address space), but are they really such a common case as to
> > dedicate a linker optimisation for? I doubt it. So I think we can
> > safely drop this feature and nobody will notice.
> Also sounds good :-)
> > Now as to the ADDIUPC relaxation -- this I think is really worth the
> > trouble as I have seen significant text size reduction as a result of
> > this optimisation. I'll dig out the exact figures I've got with an
> > example app. The problem is again you cannot really make this
> > optimisation any earlier than in the linker. The compiler or assembler
> > do not know what the size of the final executable will be and therefore
> > which references are going to fit in the ADDIUPC's range or not.
> It'd be interesting to see the numbers. The most telling statistic
> would probably be the contribution made by ADDIUPC references to symbols
> that don't live in text sections.
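
 To illustrate why only the linker can decide this, here is a rough
range check in Python. It assumes the microMIPS ADDIUPC encoding takes a
23-bit signed immediate scaled by 4 and added to the PC with its low two
bits cleared; the width is a parameter so that the assumption is easy to
change, and the addresses are invented.

```python
def addiupc_in_range(pc, target, imm_bits=23):
    """Check whether target is reachable from pc with one ADDIUPC.

    imm_bits is the assumed width of the signed, word-scaled immediate.
    """
    base = pc & ~3
    delta = target - base
    limit = 1 << (imm_bits + 1)  # 2**(imm_bits - 1) words, in bytes
    return delta % 4 == 0 and -limit <= delta < limit


pc = 0x00400100          # hypothetical address of the ADDIUPC itself
near = pc + 0x1000       # data placed right after a small text section
far = pc + (1 << 24)     # same data pushed out by a large text section

print(addiupc_in_range(pc, near))
print(addiupc_in_range(pc, far))
```

 The same source-level reference passes or fails the check depending
purely on the final layout, which neither the compiler nor the assembler
gets to see.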
> > Hmm, I wonder if there's anything we could do about this. One thought
> > I've got is to refrain from making this optimisation if there are data
> > symbols in code being processed. But it seems unlikely to me to work
> > reasonably, because (please correct me if I am wrong) at the point
> > relaxation is made all the text sections from all the modules have
> > already been merged into the respective output sections and we cannot
> > only omit the fragments that correspond to modules that had data
> > symbols in text while preserving their alignment too if any of the
> > preceding fragments shrinks. At least without turning half of the BFD
> > linker code upside down, the lone idea of which makes me feel chilly.
> > Correct?
> Taking "fragment" to mean "input section", then I don't think that
> in itself is a problem. Internal alignment within each input section
> can't be larger than the alignment of the input section itself.
Not exactly -- I used "fragments" specifically to refer to the pieces of
the output section that each correspond to one input section. Once
we've merged them into the output section can we map them back at all to
their input sections and preserve the requested alignment of each while
shrinking pieces of the output section?
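
 For reference, this is the distinction as I understand it, sketched in
Python with invented sizes and alignments (nothing here reflects how BFD
actually tracks the pieces). Realigning the start of each piece keeps
the pieces themselves aligned, but it does nothing for a label whose
offset within a piece has changed because code ahead of it in the same
piece shrank.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout(pieces):
    """Place (size, alignment) pieces in order; return start offsets."""
    offsets = []
    cursor = 0
    for size, alignment in pieces:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets


# Two pieces; the second requires 8-byte alignment and holds a label
# at internal offset 0x10 that itself wants 8-byte alignment.
before = layout([(0x30, 4), (0x40, 8)])
after = layout([(0x26, 4), (0x40, 8)])  # first piece shrank by 10 bytes

print(before, after)
print((after[1] + 0x10) % 8)      # label untouched inside its piece
print((after[1] + 0x10 - 2) % 8)  # code ahead of it shrank by 2 bytes
```

 So the question above is only about the first half; the second half is
the original problem with foo.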
> But I'm not sure we can rely on symbols like "foo" in the example
> above having a different type from ".LXXXX"-style branch targets,
> or from labels inserted for exception handling, debug info, etc.
> You would also need to stop references to local data from being
> converted into section-relative form (%lo(foo) becoming
> %lo(.text + const), etc.) We already have legacy objects
> in which that sort of transformation has happened.
> > Any other thoughts? What do the others do -- or are we the only target
> > doing this kind of linker relaxation? What's LTO BTW?
> Link time optimisation. GCC stores IL in each object file, and then
> instead of linking the original assembly from each object together,
> it can merge the IL and recompile it into a single piece of assembly.
Thanks, good to know. I gather IL stands for "intermediate language"