[RFC] SVP64 Cray-style Vectorisation of the OpenPOWER scalar ISA

Mon Mar 15 14:13:27 GMT 2021

https://bugs.libre-soc.org/show_bug.cgi?id=615
https://gcc.gnu.org/pipermail/gcc/2021-March/234992.html

folks, hi, some introduction may be in order: the Libre-SOC Project,
sponsored by NLnet, is extending the OpenPOWER v3.0B ISA to add Cray-style
Variable Length Vectorisation, under the watchful eye of the OpenPOWER
Foundation, for submission to the OPF ISA WG.  Paul Mackerras and others
have been kept apprised of ongoing development for the past year.

after two years of careful planning, the first successful HDL and Simulator
Vectorised loops were executed last week:
http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2021-March/000251.html

also worth stating up-front: we have NLnet funding for review, implementation
and assistance here. if anyone is able to receive Charitable Donations (for
some people working for large Corporations that may be inappropriate) we can
pay our way.

the roadmap involves having binutils and then gcc supporting the
SVP64 "Vector-contextualisation" of OpenPOWER v3.0B scalar instructions.
we are already hitting the need for binutils to support SVP64 augmentation
because we are writing unit tests for the RTL and simulator.

therefore we need a suitable syntax for what is effectively "macro-embedding"
of OpenPOWER v3.0B scalar instructions into an SVP64 Vectorisation "context".
our initial draft is illustrated by the examples at the gcc post above,
copied here for convenience:

     sv.add/pr=r3   r3.v, r4.v, r72.s

here:

* RT as a destination is Vectorised
* RA as a source is Vectorised
* RB as a source is Scalar but *extended to 7 bits not 5*
* r3 is a Vector Mask Predicate.

the SVP64 Augmentation *extends* the OpenPOWER v3.0B scalar registers
to *beyond* r0-r31 and into the range r0-r127 by adding 2 extra bits to each,
and marking them (as far as the hardware Vector loop is concerned) as
scalar or vector operands.
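
to make the arithmetic concrete, here is a minimal python sketch of what
"5 bits plus 2 extra bits" means.  the function names are made up for this
illustration, and in particular the placement of the 2 extra bits (treated
here as the high bits of the register number) is an assumption: the draft
spec, not this sketch, decides the real layout.

```python
SCALAR_FIELD_BITS = 5   # register field width in a v3.0B scalar instruction
EXTRA_BITS = 2          # extra bits supplied by the SVP64 context


def register_count(bits):
    """number of registers addressable with a field of the given width"""
    return 1 << bits


def split_regnum(regnum):
    """split a 7-bit register number into (5-bit field, 2 extra bits).

    ASSUMPTION: the extra bits are treated as the high bits.  this is
    purely illustrative and is not taken from the SVP64 draft.
    """
    total_bits = SCALAR_FIELD_BITS + EXTRA_BITS
    if not 0 <= regnum < register_count(total_bits):
        raise ValueError("register number out of range: %d" % regnum)
    field = regnum & (register_count(SCALAR_FIELD_BITS) - 1)
    extra = regnum >> SCALAR_FIELD_BITS
    return field, extra
```

under that assumption r72 from the example above does not fit in the 5-bit
scalar field on its own: it needs both of the extra bits from the context.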

however - and this is critically important to appreciate and understand -
we *do not* alter *in any way* the actual number of OpenPOWER v3.0B
scalar operation arguments, we do not alter the computation of the
OpenPOWER v3.0B scalar operation [except by way of post-processing
phases in hardware, such as a post-result "clamp" phase].

SVP64 is a form of hardware for-loop that may be viewed strictly
as a Sub-Program-Counter that DOES NOT modify, alter, interfere
or interact with the actual scalar instruction being loop-executed,
in EXACTLY the same way that the PC is expected not to
interfere or interact with the instruction being executed.
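
the for-loop view can be sketched in a few lines of python.  this is a
simplified model and not the simulator: the name "vl" for the loop length,
the mapping of predicate bit i to element i, and the choice to leave
masked-out elements untouched are all assumptions made for illustration.
the point it shows is that scalar_op is called completely unmodified, and
only the register numbers handed to it change per iteration.

```python
def sv_loop(regs, scalar_op, rt, ra, rb, vl, pred_mask):
    """model of a hardware for-loop around one unmodified scalar operation.

    rt, ra, rb are (regnum, is_vector) pairs.  a vector operand steps
    through consecutive registers, a scalar operand stays where it is.
    ASSUMPTIONS: vl is the loop length, and bit i of pred_mask enables
    element i (masked-out elements are simply skipped).
    """
    def index(operand, i):
        regnum, is_vector = operand
        return regnum + i if is_vector else regnum

    for i in range(vl):                      # i plays the Sub-PC role
        if not (pred_mask >> i) & 1:
            continue                         # element masked out
        # the scalar operation itself is not altered in any way
        regs[index(rt, i)] = scalar_op(regs[index(ra, i)],
                                       regs[index(rb, i)])
    return regs
```

for example, with regs = list(range(128)), calling sv_loop with an add as
scalar_op, rt=(8, True), ra=(16, True), rb=(72, False), vl=4 and
pred_mask=0b1011 writes three results into r8, r9 and r11 and skips r10.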

this leaves us in a rather interesting space as far as binutils (and gcc)
are concerned, where even amongst our team over the past 18 months
we have had ideas put forward which alter the number of arguments
of the underlying scalar v3.0B operation, or alter its behaviour and
meaning significantly, and so on.

[caveat: the above is a drastic simplification.  in reality, in some
cases such as LD/ST-with-update, we allow RA-as-source to be augmented
separately from RA-as-dest, thus allowing limited alternative ranges
for the different RAs.  let's make progress in incremental steps]

one perspective is that this requires modification of the syntax
supported by gas/config/tc-ppc.c and that that constitutes a fundamental
modification of the ppc scalar syntax.

to reiterate:
we are not modifying the OpenPOWER v3.0B scalar syntax

we are *embedding* OpenPOWER v3.0B scalar operations into
a *context* (using the v3.1 prefixing system to do so).  SVP64 is
itself a form of hardware-level macro with v3.0B opcodes embedded
in it.

one possible solution here - one that helps explain this rather well -
is to have a completely separate program that is inserted in between
gcc and gas, looking for SVP64 syntax and outputting:

    .long 0xNNNNNN  # EXT01 formatted SVP64 prefix
    asm_v30b M,M,M # OpenPOWER v3.0B scalar suffix

this would work perfectly for us, and is exactly what this prototype
program already does (given that the SVP64 assembly syntax is in
draft):
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/sv/trans/svp64.py;hb=HEAD
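
to illustrate the first half of what such an in-between program has to do,
here is a minimal python sketch that splits one line of the draft syntax
into the unmodified v3.0B mnemonic, the augmentations, and the operands
with their scalar/vector marking.  it is not the prototype linked above:
the names SvInsn and parse_sv are hypothetical, it only accepts the compact
form "sv.op/key=value rN.v, rN.s", and it deliberately stops before
computing the prefix, because the encoding is not covered here.

```python
import re
from typing import NamedTuple


class SvInsn(NamedTuple):
    opcode: str          # unmodified v3.0B mnemonic, e.g. "add"
    augmentations: dict  # context settings, e.g. {"pr": "r3"}
    operands: list       # [(regnum, is_vector), ...] in source order


def parse_sv(line):
    """parse one line of (draft) SVP64 assembly syntax.

    ASSUMPTION: no whitespace around "/" or "=", and every operand is
    written as rN.v or rN.s.
    """
    head, tail = line.strip().split(None, 1)
    if not head.startswith("sv."):
        raise ValueError("not an SVP64 line: %r" % line)
    opcode, *augs = head[len("sv."):].split("/")
    augmentations = dict(aug.split("=", 1) for aug in augs)
    operands = []
    for field in tail.split(","):
        match = re.fullmatch(r"r(\d+)\.([vs])", field.strip())
        if match is None:
            raise ValueError("bad operand: %r" % field)
        operands.append((int(match.group(1)), match.group(2) == "v"))
    return SvInsn(opcode, augmentations, operands)
```

the design choice worth noting is that the mnemonic comes out untouched and
the operand count is exactly that of the scalar instruction: everything
SVP64-specific ends up in the augmentations and the per-operand marking.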

my only concern about that is: is there any macro-substitution used
in interaction between gcc and binutils which could possibly interfere
with a separate (pipelined) program between gcc and gas?

if yes, then we will need to come up with a suitable syntax for SVP64
in binutils.  the only request i have here is that that syntax be *really clear*
that it is the *unmodified* scalar OpenPOWER v3.0B operation that is
embedded into the Vectorisation Context.

some ideas that have been floated already include modifying the number
of arguments.  looking at opcodes/ppc-opc.c this would be a huge amount
of work as it would hit many of the four and a half THOUSAND assembly
instructions in that one file alone.

this is undesirable in many ways: not least it gives the false impression
that SVP64 *is* an alteration of the underlying scalar instruction when
it most certainly is not.

where do we go from here?  i don't know :)  mostly this is an introductory
conversation about a concept that borrows from many historical innovations
to create something entirely new and innovative that has never been seen
in any commercial ISA - ever.  it's going to take some time to be
absorbed, conceptually.

we do however need to start that conversation because there are funding
time limits and commercial pressures to complete this work.

a simple and practical question: what separators would be reasonable
to use for the "Macro-embedding" Contextual Augmentation?

presently we have the following syntax:

    sv.{v3.0Bopcode} / augmentation=x / augmentation=y  r0.v, r30.s, r75.v

with "=" and "/" already being used in gcc/binutils macro expressions,
what could be used instead?

with thanks,

l.