[PATCH v2] dse: Remove partial load after full store for high part access[PR71309]
Richard Sandiford
richard.sandiford@arm.com
Wed Jul 22 11:05:03 GMT 2020
luoxhu <luoxhu@linux.ibm.com> writes:
> Hi,
>
> On 2020/7/21 23:30, Richard Sandiford wrote:
>> Xiong Hu Luo <luoxhu@linux.ibm.com> writes:
>>> @@ -1872,9 +1872,27 @@ get_stored_val (store_info *store_info, machine_mode read_mode,
>>>     {
>>>       poly_int64 shift = gap * BITS_PER_UNIT;
>>>       poly_int64 access_size = GET_MODE_SIZE (read_mode) + gap;
>>> -      read_reg = find_shift_sequence (access_size, store_info, read_mode,
>>> -                                      shift, optimize_bb_for_speed_p (bb),
>>> -                                      require_cst);
>>> +      rtx rhs_subreg = NULL;
>>> +
>>> +      if (known_eq (GET_MODE_BITSIZE (store_mode), shift * 2))
>>> +        {
>>> +          scalar_int_mode inner_mode = smallest_int_mode_for_size (shift);
>>> +          poly_uint64 sub_off
>>> +            = ((!BYTES_BIG_ENDIAN)
>>> +               ? GET_MODE_SIZE (store_mode) - GET_MODE_SIZE (inner_mode)
>>> +               : 0);
>>> +
>>> +          rhs_subreg = simplify_gen_subreg (inner_mode, store_info->rhs,
>>> +                                            store_mode, sub_off);
>>> +          if (rhs_subreg)
>>> +            read_reg
>>> +              = extract_low_bits (read_mode, inner_mode, copy_rtx (rhs_subreg));
>>> +        }
>>> +
>>> +      if (read_reg == NULL)
>>> +        read_reg
>>> +          = find_shift_sequence (access_size, store_info, read_mode, shift,
>>> +                                 optimize_bb_for_speed_p (bb), require_cst);
>>
>> Did you consider doing this in find_shift_sequence instead?
>> ISTM that this is really using subregs to optimise:
>>
>>       /* In theory we could also check for an ashr.  Ian Taylor knows
>>          of one dsp where the cost of these two was not the same.  But
>>          this really is a rare case anyway.  */
>>       target = expand_binop (new_mode, lshr_optab, new_reg,
>>                              gen_int_shift_amount (new_mode, shift),
>>                              new_reg, 1, OPTAB_DIRECT);
>>
>> I think everything up to:
>>
>>       /* Also try a wider mode if the necessary punning is either not
>>          desirable or not possible.  */
>>       if (!CONSTANT_P (store_info->rhs)
>>           && !targetm.modes_tieable_p (new_mode, store_mode))
>>         continue;
>>
>> is either neutral or helpful for the subreg case too, so maybe
>> we could just add the optimisation after that. (It probably isn't
>> worth reusing any of the later loop body code though, since the
>> subreg case is much simpler.)
>>
>> I don't think we need to restrict this case to modes of size
>> shift * 2. We can just check whether the shift is a multiple of
>> the new_mode calculated by find_shift_sequence (using multiple_p).
>>
>> An easier way of converting the shift to a subreg byte offset
>> is to use subreg_offset_from_lsb, which also handles
>> BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN.
>>
>
> Thanks, I've updated the patch by moving it into find_shift_sequence.
> Not sure whether it meets your comments precisely, though it still works. :)
> There is a comment mentioning that
>
>   /* Some machines like the x86 have shift insns for each size of
>      operand.  Other machines like the ppc or the ia-64 may only have
>      shift insns that shift values within 32 or 64 bit registers.
>      This loop tries to find the smallest shift insn that will right
>      justify the value we want to read but is available in one insn on
>      the machine.  */
>
> So it will break out early without some additional check, as the new_mode is
> TImode here:
>
>       if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD)
>         break;
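
With the numbers from the example above that's easy to see: access_size is
GET_MODE_SIZE (read_mode) + gap = 8 + 8 bytes, so the first candidate mode
is already TImode (128 bits) and the unmodified loop stops before it tries
anything.  A standalone arithmetic sketch of that (the concrete sizes are
assumptions taken from the example, not from a real dump):

#include <stdio.h>

int
main (void)
{
  const int bits_per_word = 64; /* powerpc64 */
  const int read_bits = 64;     /* DImode read, insn 8 above */
  const int gap_bytes = 8;      /* the read starts 8 bytes into the store */

  /* access_size in bits; smallest_int_mode_for_size of this picks the
     first candidate mode in find_shift_sequence.  */
  int first_candidate_bits = read_bits + gap_bytes * 8; /* 128: TImode */

  if (first_candidate_bits > bits_per_word)
    printf ("first candidate is %d bits > %d bits: the loop breaks "
            "immediately\n", first_candidate_bits, bits_per_word);
  return 0;
}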
>
>
>
> [PATCH v2] dse: Remove partial load after full store for high part access[PR71309]
>
>
> This patch could optimize the following sequence (it works for char/short/int/void*):
>
> 6: r119:TI=[r118:DI+0x10]
> 7: [r118:DI]=r119:TI
> 8: r121:DI=[r118:DI+0x8]
>
> =>
>
> 6: r119:TI=[r118:DI+0x10]
> 18: r122:TI=r119:TI
> 16: r123:TI#0=r122:TI#8 0>>0
> 17: r123:TI#8=0
> 19: r124:DI=r123:TI#0
> 7: [r118:DI]=r119:TI
> 8: r121:DI=r124:DI
>
> The final ASM is as below, with no partial load after the full store (stxv+ld):
> mr 9,3
> ld 3,24(3)
> ld 10,16(3)
> std 3,8(9)
> std 10,0(9)
> blr
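
A minimal C sketch of the kind of source that produces this pattern (the
names and layout below are hypothetical illustrations, not necessarily the
new pr71309.c testcase): a 16-byte aggregate copy provides the full-width
store, and the later read of its second 8-byte half is the partial load
that can now be forwarded from the stored value instead of going back to
memory.

struct pair { void *lo; void *hi; };
struct obj { struct pair a; struct pair b; };

void *
copy_and_read_high (struct obj *o)
{
  o->a = o->b;    /* roughly insns 6 and 7: full load from o+16, full store to o */
  return o->a.hi; /* roughly insn 8: 8-byte load from o+8, now forwarded */
}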
>
> It could achieve a ~25% performance improvement for typical cases on
> Power9.  Bootstrapped and regression tested on Power9-LE.
>
> For AArch64, one ldr is replaced by a mov:
>
> ldp x2, x3, [x0, 16]
> stp x2, x3, [x0]
> ldr x0, [x0, 8]
>
> =>
>
> mov x1, x0
> ldp x2, x0, [x0, 16]
> stp x2, x0, [x1]
>
> gcc/ChangeLog:
>
> 2020-07-22 Xionghu Luo <luoxhu@linux.ibm.com>
>
> PR rtl-optimization/71309
> * dse.c (find_shift_sequence): Use a subreg of the stored value's high
> part instead of reloading it from memory.
>
> gcc/testsuite/ChangeLog:
>
> 2020-07-22 Xionghu Luo <luoxhu@linux.ibm.com>
>
> PR rtl-optimization/71309
> * gcc.target/powerpc/pr71309.c: New test.
> ---
> gcc/dse.c | 15 +++++++++-
> gcc/testsuite/gcc.target/powerpc/pr71309.c | 33 ++++++++++++++++++++++
> 2 files changed, 47 insertions(+), 1 deletion(-)
> create mode 100644 gcc/testsuite/gcc.target/powerpc/pr71309.c
>
> diff --git a/gcc/dse.c b/gcc/dse.c
> index bbe792e48e8..e06a9fbb0cd 100644
> --- a/gcc/dse.c
> +++ b/gcc/dse.c
> @@ -1736,7 +1736,8 @@ find_shift_sequence (poly_int64 access_size,
>        int cost;
> 
>        new_mode = new_mode_iter.require ();
> -      if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD)
> +      if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD
> +          && !multiple_p (GET_MODE_BITSIZE (new_mode), shift))
>          break;
> 
>        /* If a constant was stored into memory, try to simplify it here,
> @@ -1793,6 +1794,18 @@ find_shift_sequence (poly_int64 access_size,
>        shift_seq = get_insns ();
>        end_sequence ();
> 
> +      /* Use subreg shifting from high part to avoid full store followed by
> +         partial load.  See PR71309.  */
> +      if (multiple_p (GET_MODE_BITSIZE (new_mode), shift) && shift_seq)
> +        {
> +          new_lhs = extract_low_bits (new_mode, store_mode,
> +                                      copy_rtx (store_info->rhs));
> +          emit_move_insn (new_reg, new_lhs);
> +          emit_insn (shift_seq);
> +          read_reg = extract_low_bits (read_mode, new_mode, target);
> +          break;
> +        }
> +
This wasn't really what I meant. Using subregs is fine, but I was
thinking of:
      /* Also try a wider mode if the necessary punning is either not
         desirable or not possible.  */
      if (!CONSTANT_P (store_info->rhs)
          && !targetm.modes_tieable_p (new_mode, store_mode))
        continue;

      if (multiple_p (shift, GET_MODE_BITSIZE (new_mode)))
        {
          /* Try to implement the shift using a subreg.  */
          poly_int64 offset = subreg_offset_from_lsb (new_mode, store_mode,
                                                      shift);
          rhs_subreg = simplify_gen_subreg (new_mode, store_info->rhs,
                                            store_mode, offset);
          if (rhs_subreg)
            {
              ...
              break;
            }
        }
where the rhs_subreg is from your original patch.
The multiple_p should be that way round: the shift needs to be a
multiple of the new_mode's bit size for the subreg to be valid.
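
To make that concrete, here is a standalone sketch (plain C, not GCC
internals) of the byte offset that subreg_offset_from_lsb would give for
the TImode -> DImode case above; the sizes are assumptions taken from the
PR71309 example, and unlike the real helper this ignores the
BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN case:

#include <stdio.h>

int
main (void)
{
  const int store_size = 16;      /* bytes in the TImode store */
  const int new_size = 8;         /* bytes in the DImode new_mode */
  const int shift_bits = 64;      /* gap between store and read, in bits */
  const int bytes_big_endian = 0; /* powerpc64le / aarch64 little-endian */

  /* The subreg is only valid if the shift selects a whole new_mode chunk,
     i.e. the shift is a multiple of the new mode's bit size.  */
  if (shift_bits % (new_size * 8) != 0)
    {
      puts ("not a whole chunk: fall back to the shift sequence");
      return 0;
    }

  /* Byte offset counted from the least significant byte, converted to the
     memory-order SUBREG_BYTE that simplify_gen_subreg expects.  */
  int lsb_offset = shift_bits / 8;
  int subreg_byte = (bytes_big_endian
                     ? store_size - new_size - lsb_offset
                     : lsb_offset);

  printf ("SUBREG_BYTE = %d\n", subreg_byte); /* 8 on little-endian */
  return 0;
}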
I think this should also avoid the BITS_PER_WORD problem.  On the
other hand, I agree BITS_PER_WORD isn't a very sensible limit if
we're using subregs, so maybe moving that check to after the
multiple_p if block would still make sense.
Thanks,
Richard