[PATCH] Split load permutation

Richard Biener rguenther@suse.de
Wed Jul 1 08:52:49 GMT 2020


On Tue, 30 Jun 2020, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> >> Another thing we'd like for SVE in general is to allow vectors *with*
> >> gaps throughout the SLP graph, since with predication it's simple to
> >> ignore any inactive element where necessary.  This is often much cheaper
> >> than packing the active elements together to remove gaps.  For example:
> >> 
> >>   a[0] += 1;
> >>   a[2] += 1;
> >>   a[3] += 1;
> >> 
> >> should just be a predicated load, add and store, with no shuffling.
> >> 
> >> Even on targets without predication, it might be better to leave
> >> the gaps in for part of the graph, e.g. if two loads feed a single
> >> permute.  So it would be good if the node representation allowed
> >> arbitrary permutations, including with “dead” elements at any
> >> position.  (I realise that isn't news and that keeping the gap
> >> removal with the load was just “for now” :-))
> >
> > Hmm, OK.  So I'm still experimenting with how to incrementally push
> > these kinds of changes to master.  Unfortunately it resists any
> > "serious" change because we've painted ourselves into a corner with
> > respect to load and store handling ;)  Well, it just means the change
> > will need to be bigger than anticipated.
> >
> > As for inactive lanes, for SVE this means a mask register for each
> > operation, correct?
> 
> Certainly for potentially trapping ops and reductions, yeah.
> For others it really doesn't matter (and those wouldn't require
> a target that supports predication).
> 
> So I don't think having gaps necessarily means we have a mask.
> Being able to attach masks to nodes (perhaps independently of gaps)
> would be useful though.
> 
> > At the moment the SLP graph is a representation
> > of the scalar GIMPLE IL and we cannot really change that until we
> > commit to a vectorization factor and more.  So an inactive lane
> > is simply not there, and a "load" is simply another way of building
> > up a vector from scalar stmt results, much like the "built from
> > scalars" SLP nodes; we only make loads special because of the
> > DR groups that are used.
> >
> > So with the patch as posted the "gaps" are represented in the
> > load permutation of the "loads", from which you could create
> > mask registers and simply push them up to SLP parents (up to
> > the point you reach a parent whose children have differing masks...).
> 
> OK.  But I'd argue that's true of permutations too.  At the point
> that the SLP graph just represents scalar gimple IL, we can simply
> permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
> operations in the graph.

That's true when the only permutes are in leaves.  The
SLP_TREE_SCALAR_STMTS impose an order on the lanes, so once
different parents need a different order we need explicit permute
nodes swapping SLP_TREE_SCALAR_STMTS.  Of course that's only
because we have combined multiple scalar stmts into a single
SLP node, which really is ...

> Permutations and gaps only come into play once we add more
> vector-related information or restrictions.

... what those "vector-related" restrictions are right now.

> > I think we're going to have a set of post-processing steps on the
> > initial SLP graph for both "optimizing" where (and if) to do permutations
> > and whether to elide gap removal in favor of masks (and we'd then
> > annotate the SLP nodes with the active mask).
> 
> So we'd start out without any permutations and gaps, and then add
> them as post-processing step based on what the target can handle?
> If so, sounds good to me FWIW.

If you start with the notion of loads and stores being
compositions/extracts then yes - based on how we want to
vectorize those operations we'd expose any required permutation
explicitly so we can then combine & optimize those.

For the existing patches I tried to relate loads involving the
same data-ref group by introducing a shared SLP node accessing the
whole group, so that does not start from a no-permute state, and
it comes with some issues.  If we don't relate loads from the same
dataref group early (during SLP discovery, that is), then we have
to try (again based on cost) to do that during the post-processing,
which then becomes more complicated.

As said - this is work in progress and I'm experimenting in
multiple directions ...

> > All of this requires pushing back some of the early decisions
> > (mostly the determination of the vectorization factor) but also
> > doing load/store classification early.  For example if we end up
> > with strided loads or stores, such operations can fuse with any
> > permutation at no cost.
> >
> > At the moment I'm continuing to poke the code for its path of
> > least resistance for introducing at least parts of the machinery.
> > I'm targeting post-processing for merging identical
> > permutes across operations as they appear for
> >
> >   a[0] = b[1] + c[1];
> >   a[1] = b[0] + c[0];
> >
> >> I guess this to some extent feeds into a long-standing TODO to allow
> >> “don't care” elements in things like vec_perm_indices.
> >
> > Hmm, so when there's no masking and we don't want to remove gaps
> > we'd replace the permutation removing the gaps with an operation
> > that sets the inactive lanes to zero (bitwise and) or all ones
> > (bitwise or).  Tracking the number of "vector" lanes compared
> > to the number of "real scalar" lanes in an SLP node might be enough
> > to handle this more or less transparently.
> 
> I was also thinking of cases like:
> 
>   a[0] += b[0];
>   a[1] += c[1];
>   a[2] += b[2];
>   a[3] += c[3];
> 
> (for all targets).
> 
> If we can somehow be smart about the handling of “c” and load a vector
> at c[0] rather than c[1], the rhs of the addition could be a blend of:
> 
>   b[0] ____ b[2] ____
>   ____ c[1] ____ c[3]
> 
> Or, even if we can't:
> 
>   b[0] ____ b[2] ____
>   c[1] ____ c[3] ____
> 
> is a natural permute on some targets (e.g. Arm ones).
> 
> This of course requires the usual proof that the other elements of b[]
> and c[] can be safely loaded.

Yeah, that's another good case (we'd currently outright reject
this because the accesses refer to two different dataref groups ...).

Richard.


More information about the Gcc-patches mailing list