[PATCH v3] vect/rs6000: Support vector with length cost modeling

Fri Jul 24 16:21:49 GMT 2020

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi,
>
> Sorry, please ignore the previously attached file, which isn't the latest one
> although almost are the same.   The latest tested is attached here.  
>
> Sorry for the inconvenience.
>
> BR,
> Kewen
>
> on 2020/7/22 下午11:48, Kewen.Lin via Gcc-patches wrote:
>> 
>> It's a great idea, by following your subsequent suggestion to make the structure
>> like:
>> 
>>   - calculate peel_iters_prologue
>>   - calculate peel_iters_epilogue
>>   - add costs associated with peel_iters_prologue
>>   - add costs associated with peel_iters_epilogue
>>   - add costs related to branch taken/not_taken.
>> 
>> the updated v3 is attached.
>> 
>> Just bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>> param vect-partial-vector-usage=1, I'll test it without partial vectors
>> setting, also on aarch64 later.
>> 
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>> 
>> 	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
>> 	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
>> 	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code
>> 	to determine peel_iters_epilogue to function ...
>> 	(vect_get_peel_iters_epilogue): ... this.  New function.
>> 	(vect_estimate_min_profitable_iters):  Add cost modeling for vector
>> 	with length, refactor cost calculation on peel_iters_prologue and
>> 	peel_iters_epilogue.
>> 

Thanks, the rearrangement of the existing code looks good.  Could you
split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
out into separate patches?

> -/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
> -int
> -vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
> -                             int *peel_iters_epilogue,
> -                             stmt_vector_for_cost *scalar_cost_vec,
> -			     stmt_vector_for_cost *prologue_cost_vec,
> -			     stmt_vector_for_cost *epilogue_cost_vec)
> +/* Calculate how many iterations peeled for epilogue with information LOOP_VINFO
> +   and PEEL_ITERS_PROLOGUE.  */

Maybe:

/* Estimate the number of peeled epilogue iterations for LOOP_VINFO.
   PEEL_ITERS_PROLOGUE is the number of peeled prologue iterations,
   or -1 if not known.  */

> […]
> -      peel_iters_prologue = niters < peel_iters_prologue ?
> -                            niters : peel_iters_prologue;
> -      *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf;
> +      int npeel_prol
> +	= niters < peel_iters_prologue ? niters : peel_iters_prologue;
> +      int npeel_epil = (niters - npeel_prol) % assumed_vf;

Here and in the rest of the patch, I think we should stick with
the “ogue” rather than shorten it this much.

This is a comment about the original code, but it seems simpler to
write as either:

  if (peel_iters_prologue > niters)
    peel_iters_prologue = niters;

or:

  peel_iters_prologue = MIN (niters, peel_iters_prologue);

> […]
> +      /* Refer to the functions vect_set_loop_condition_partial_vectors

s/Refer/Referring/

> +	 and vect_set_loop_controls_directly, we need to generate each
> +	 length in the prologue and in the loop body if required. Although
> +	 there are some possible optimization, we consider the worst case

optimizations

> +	 here.  */
> +
> +      /* For now we only operate length-based partial vectors on Power,
> +	 which has constant VF all the time, we need some tweakings below
> +	 if it doesn't hold in future.  */
> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());

Where do you rely on this?  There didn't seem to be any obvious
to_constant uses.  Since this is “only” a cost calculation, we should
be using assumed_vf.

> +
> +      /* For wrap around checking.  */
> +      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +      unsigned int compare_precision = TYPE_PRECISION (compare_type);
> +      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> +
> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> +      bool need_iterate_p
> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
> +
> +      /* Init min/max, shift and minus cost relative to single
> +	 scalar_stmt. For now we only use length-based partial vectors on
> +	 Power, target specific cost tweaking may be needed for other
> +	 ports in future.  */
> +      unsigned int min_max_cost = 2;
> +      unsigned int shift_cost = 1, minus_cost = 1;
> +
> +      /* Init cost relative to single scalar_stmt.  */
> +      unsigned int prol_cnt = 0;
> +      unsigned int body_cnt = 0;
> +
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  {
> +	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
>  
> -      /* If peeled iterations are unknown, count a taken branch and a not taken
> -         branch per peeled loop. Even if scalar loop iterations are known,
> -         vector iterations are not known since peeled prologue iterations are
> -         not known. Hence guards remain the same.  */
> -      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, 1, cond_branch_not_taken,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
> -			    NULL, NULL_TREE, 0, vect_epilogue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, 1, cond_branch_not_taken,
> -			    NULL, NULL_TREE, 0, vect_epilogue);
> -      stmt_info_for_cost *si;
> -      int j;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
> -	{
> -	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
> -				si->count * peel_iters_prologue,
> -				si->kind, si->stmt_info, si->vectype,
> -				si->misalign,
> -				vect_prologue);
> -	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
> -				si->count * peel_iters_epilogue,
> -				si->kind, si->stmt_info, si->vectype,
> -				si->misalign,
> -				vect_epilogue);
> -	}
> -    }
> -  else
> -    {
> -      stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec;
> -      stmt_info_for_cost *si;
> -      int j;
> -      void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
> +	    /* Need one shift for niters_total computation.  */
> +	    if (!niters_known_p && nitems != 1)
> +	      prol_cnt += shift_cost;
>  
> -      prologue_cost_vec.create (2);
> -      epilogue_cost_vec.create (2);
> -      peel_iters_prologue = npeel;
> +	    /* Need to handle wrap around.  */
> +	    if (iv_limit == -1
> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
> +		    > compare_precision))
> +	      prol_cnt += (min_max_cost + minus_cost);
>  
> -      (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue,
> -					  &peel_iters_epilogue,
> -					  &LOOP_VINFO_SCALAR_ITERATION_COST
> -					    (loop_vinfo),
> -					  &prologue_cost_vec,
> -					  &epilogue_cost_vec);
> +	    /* Need to handle batch limit excepting for the 1st one.  */
> +	    prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
>  
> -      FOR_EACH_VEC_ELT (prologue_cost_vec, j, si)
> -	(void) add_stmt_cost (loop_vinfo,
> -			      data, si->count, si->kind, si->stmt_info,
> -			      si->vectype, si->misalign, vect_prologue);
> +	    unsigned int num_vectors = num_vectors_m1 + 1;
> +	    /* Need to set up lengths in prologue, only one MIN required
> +	       since start index is zero.  */
> +	    prol_cnt += min_max_cost * num_vectors;
>  
> -      FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si)
> -	(void) add_stmt_cost (loop_vinfo,
> -			      data, si->count, si->kind, si->stmt_info,
> -			      si->vectype, si->misalign, vect_epilogue);
> +	    /* Need to update lengths in body for next iteration.  */
> +	    if (need_iterate_p)
> +	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
> +	  }
>  
> -      prologue_cost_vec.release ();
> -      epilogue_cost_vec.release ();
> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_prologue);
> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_body);

IMO this seems to be reproducing too much of the functions that you
referred to.  And the danger with that is that they could easily
get out of sync later.

Thanks,
Richard