Introducing a nanoMIPS port for Binutils, GDB and GOLD

Tue May 1 14:55:00 GMT 2018

Earlier today MIPS Tech announced the latest generation of the MIPS family of
architectures called nanoMIPS [1]. As part of the development we have been
designing all the open source tools necessary to support the architecture and,
thanks to the speed with which we were able to prototype, we have also been
using these tools to shape the architecture along the way. This has led to some
really interesting improvements in the tools, which MIPS would like to
contribute back to the community. While doing this work many of us have been
unable to contribute to the community as actively as we would have liked, we
are therefore very grateful for the community support given to the MIPS
architecture over the last 18 months. This announcement has a general
introduction at the start, so if you have already read it for one of the other
tools, you can skip down to the information specific to binutils/gdb/gold.

For anyone who knows the MIPS architecture you may well wonder why we are
introducing another major variant and the question is perfectly valid. We do
admittedly have quite a few: MIPS I through MIPS IV, MIPS32 and MIPS64 through
to MIPS32R6 and MIPS64R6, MIPS16e, MIPS16e2, microMIPSR3 and microMIPSR6. Each
of these serves (or served) a purpose and there is a high level of synergy
between all of them. In general, they build upon the previous and there is a
high level of compatibility, even when switching to a new encoding like moving
from MIPS to microMIPS. The switch to MIPS32R6/MIPS64R6 was a major shift in
the way the architecture innovated and drew more on the original theory of the
architecture, where evolution was not expected to be limited by binary
compatibility. MIPS Release 6 removed instructions and did create some very
minor incompatibility but is also much cleaner to implement from a
micro-architecture perspective. We have taken this idea much further with
nanoMIPS and reimagined the instruction set, by drawing on all the experience
gained from previous designs. Hopefully others will find it as interesting as
we do.

The major driving force behind the nanoMIPS architecture was to achieve
outstanding code density, while also balancing out hardware and software design
cost. As background MIPS has two compressed ISA variants: MIPS16e, which cannot
exist without also implementing MIPS32, and microMIPS, which can exist on its
own. Since MIPS16e has specific limits that cannot be engineered around, we
chose to use an approach similar to the microMIPS design.

nanoMIPS has a variable-length compressed instruction set that is completely
standalone from the other MIPS ISAs. It is designed to compress the highest
frequency instructions to 16-bits, and use 48-bit instructions to efficiently
encode 32-bit constants into the instruction stream. There is also a wider
range of 32-bit instructions, which merge carefully chosen high frequency
instruction sequences into single operations creating more flexible addressing
modes such as indexed and scaled indexed addressing, branch compare with
immediate and macro style instructions. The macro like instructions compress
prologue and epilogue sequences, as well as a small number of high frequency
instruction pairs like two move instructions or a move and function call.
nanoMIPS also totally eliminates branch delay slots which follows a precedent
set by microMIPSR6.

To get the best from a new ISA we also re-engineered the ABI and created a new
symbiotic relationship between the ISA and ABI that pushes code density and
performance further still. The ABI creates a fully link time relaxable model,
which enables us to squeeze every last byte out of the code image even when
deferring final addressing mode and layout decisions to link time. We have been
mindful of MIPS heritage and ensured that while open to any possible change, we
also have minimal impact when porting code from MIPS to nanoMIPS, and have
plenty of support to achieve source compatibility between the two.

The net effect of these changes leads to an average code size reduction of 20%
relative to microMIPSR6. This compression could well be one of the best
achieved by GNU tools for any RISC ISA. Comparing the ISA in terms of number of
instructions to issue vs microMIPS we also see a reduction of between 8% and
11% of dynamic instruction count.

Below we dig into some technical specifics for each of the GNU tools; we
welcome any feedback and questions as we start to look at rebasing this work to
the trunk/master and formally submitting it. nanoMIPS pre-built toolchains and
source code tarballs are available at:

http://codescape.mips.com/components/toolchain/nanomips/2018.04-02/

Binutils and GDB specific details
=================================

The nanoMIPS ports of binutils and GDB borrow heavily from the MIPS port.
Because the MIPS port already supports so many variations, we have decided to
replicate the code and strip away non-essential features of MIPS BFD and
assembler in order to have a port that can evolve independently.

nanoMIPS relies on the GOLD linker by default. This port does not include
support for a BFD linker and consequently the BFD backend provided here only
contains sufficient features and hooks to support the remaining components of
binutils and GDB. It is not representative of a complete nanoMIPS BFD backend.

The 64-bit architecture and ABI are not yet defined. The binutils and GDB ports
contain bits and pieces of 64-bit capability to hook into, once the ISA/ABI are
published.

The assembler/disassembler front-end, both in terms of syntax and command-line
options is mostly consistent with MIPS. Many redundant options and directives
have been removed, although some are silently accepted to allow easy porting of
assembly code. Many new features specific to nanoMIPS have been added. The
assembler interface is documented within the info pages, under Machine
Dependent Features, as usual. Key features include:

- Fine-grained directives and syntax to control linker relaxation
- New formats for instruction selection that interact with linker relaxation
- New ABI and code-generation modes (PC-relative/PIC/PID/C memory models)
- Selective finalization of PC-relative and label-difference relocations
- Carried forward support for traditional MIPS macros and instruction aliases
- Named co-processor 1 registers in assembly and disassembly
- Improved code density for frequently occurring function calls using
  trampolines (balc-stubs)

For GDB there is not much to note except that support is in place for bare
metal debug, with user-mode Linux in progress. The most notable difference for
debugging is that the p32 ABI provides efficient stack unwinding by means of
frame chaining.

Binutils and GDB Contributors
=============================

- GAS
  Faraz Shahbazker
- GDB
  Jaydeep Patil, Stefan Markovic, Sara Popadic
- GNUSIM
  Stefan Markovic, Sara Popadic, Hrvoje Varga, Matija Amidzic, Zlatko Buljan

GOLD specific details
=====================

We chose to base the first nanoMIPS linker on gold rather than the BFD linker
as it is significantly simpler to extend, which allowed faster prototyping.
There are two key nanoMIPS features implemented in the GOLD linker:

- Instruction transformations (Refer to the ABI supplement [2])

The nanoMIPS p32 ABI is designed to support full link-time relaxation and
rewriting. Some care is needed to avoid any hard-coded PC-relative offsets at
assembly time and/or use appropriate annotation to ensure the code written in
assembly text is guaranteed to remain the same after linking. An expansion may
happen during linking if, for instance, a symbol is out-of-range. An example of
this is a BALC instruction being rewritten in-place to LAPC+JALRC. This means
that with default options the linker can automatically expand any out-of-range
reference without the need to recompile.

The opposite of expansion is relaxation, in which the linker may transform
an instruction into a more compressed one. For instance, it may replace a
32-bit BC instruction with a 16-bit BC if a symbol is within the range of the
16-bit instruction. Many rules have to be followed in order to allow complex
transformations of multiple instructions, which help compress the code as much
as possible. Each transformation is carefully designed so that each instruction
can be considered in isolation, as would normally happen in a linker.

Any instruction with a relocation record pointing at it can be rewritten at
link time, unless this feature is disabled. This feature is implemented in a
relaxation loop to allow the transformations to stabilise. 

- Sorting output sections by reference:

In order to replace the maximum number of LWGP/SWGP instructions with
LWGP16/SWGP16, the linker needs to place the most commonly referenced input
sections in the first 512 bytes from the global pointer. The linker calculates
the number of references to each 4-byte data section that can be compressed and
uses that information when sorting to place those sections at the beginning.
Sections that don't go into the first 512 bytes are sorted by ascending size
and descending alignment. This sort can only be used with -fdata-sections
option in the compiler and requires relaxation to be enabled in the linker
(--relax).

Beyond the issues specific to nanoMIPS, we have had to overcome the issue of
getting linker scripting support up to a standard that could cope with complex
bare metal layout rules. As a result, a significant set of issues have been
addressed. Some of them had already been posted to the binutils mailing list,
but most of them are new work. Here is an outline of the improvements:

Patches used from the binutils mailing list:

- Extended support of INPUT and GROUP commands to linker scripts added with -T
  (https://sourceware.org/ml/binutils/2016-07/msg00036.html)
- Fixed the assignment of LMA for subsequent sections which do not have their
  own explicit address set
  (https://sourceware.org/ml/binutils/2016-10/msg00161.html)
- Fixed NOLOAD behavior
  (https://sourceware.org/ml/binutils/2015-09/msg00216.html)

Improvements and fixes for the GOLD linker:

- Added support for STARTUP keyword in the linker script
- Improved relocation overflow errors
- Fixed value of the symbol within a SECTIONS clause when orphan sections are
  placed above symbol assignment to the dot
- Fixed issue with trinary expressions in the linker script
- Fixed issue where archive library member is wrongly included in the link
- Changed a NOBITS section to PROGBITS if user links bss with non-bss input
  sections via a linker script
- Fixed placement of a common symbols when using a SECTIONS clause
- Fixed order of some linker created input sections in output section when
  using a SECTIONS clause
- Prevented discarding of FDE if there is a relocation for PC Range field
- Prevented creating a new segment for program headers if we are doing static
  link and there is no room in the first PT_LOAD segment
- Prevented allocating sections that are created from the linker script if dot
  is not advanced during script processing
- Fixed MEMORY region handling in the linker script
- Fixed alignment of virtual and load address in output sections when using a
  SECTIONS clause
- Sorted two PT_LOAD segments by virtual address if the addresses are set and
  they have the same flags and the same load addresses
- Prevented setting minimum p_align for --nmagic and --omagic options
- Fixed filename match for objects that are in archive
- Always override ENTRY in link script with --entry command
- Prevented resolving relocations for output NOBITS sections

GOLD contributors
=================

- GOLD
  Vladimir Radosavljevic

[1] https://www.mips.com/press/new-mips-i7200-processor-core-delivers-unmatched-performance-and-efficiency-for-advanced-lte5g-communications-and-networking-ic-designs/

[2] Codescape GNU tools for nanoMIPS: ELF ABI Supplement,
    https://codescape.mips.com/components/toolchain/nanomips/2018.04-02/docs/MIPS_nanoMIPS_ABI_supplement_01_02_DN00179.pdf