Refactor peel_iters_{pro,epi}logue cost model handlings

Tue Jul 28 08:01:00 GMT 2020

Hi Richard,

on 2020/7/27 9:10 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi,
>>
>> As Richard S. suggested in the thread:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550633.html
>>
>> this patch is separated from the one of that thread, mainly to refactor the
>> existing peel_iters_{pro,epi}logue cost model handlings.
>>
>> I've addressed Richard S.'s review comments there, moreover because of one
>> failure of aarch64 testing, I updated it a bit more to keep the logic unchanged
>> as before first (refactor_cost.diff).
> 
> Heh, nice when a clean-up exposes an existing bug. ;-)  I agree the
> updates look correct.  E.g. if vf is 1, we should assume that there
> are no peeled iterations even if the number of iterations is unknown.
> 
> Both patches are OK with some very minor nits (sorry):

Thanks for the comments!  I've addressed them and committed refactor_cost.diff
as r11-2378.

adjust.diff bootstraps and regtests cleanly on powerpc64le-linux-gnu, but
shows one failure on aarch64:

PASS->FAIL: gcc.target/aarch64/sve/cost_model_2.c -march=armv8.2-a+sve  scan-assembler-times \\tst1w\\tz 1

I spent some time investigating it and I think the failure is expected.  As you
mentioned above, the patch fixes the cases with VF 1, and the VF in this case is
exactly 1.  :)
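
To make the intent concrete, here is a minimal sketch in plain C of the rule
the adjustment applies.  It is not the GCC code: peel_branch_taken_cost and its
parameters are hypothetical names used only for illustration, and the cost
passed in stands for whatever the target charges for a taken branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not a GCC function): cost of the taken branch
   charged for one peeled loop (prologue or epilogue).  */
static int
peel_branch_taken_cost (bool niters_known, int peel_iters,
                        int branch_taken_cost)
{
  assert (peel_iters >= 0 && branch_taken_cost >= 0);

  /* This sketch only models the case where the scalar iteration count
     is unknown; with a known count no taken branch is charged here.  */
  if (niters_known)
    return 0;

  /* The adjustment: zero peeled iterations means the peeled loop does
     not exist, so there is no branch to charge for.  */
  if (peel_iters == 0)
    return 0;

  return branch_taken_cost;
}
```

With an unknown iteration count, calling it with peel_iters == 0 yields 0,
while any positive peel_iters yields the branch cost that was passed in.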

Without the proposed adjustment, the cost for V16QI looks like:

  Vector inside of loop cost: 1
  Vector prologue cost: 4
  Vector epilogue cost: 3
  Scalar iteration cost: 4
  Scalar outside cost: 6
  Vector outside cost: 7
  prologue iterations: 0
  epilogue iterations: 0

After the change, the cost becomes:

  Vector inside of loop cost: 1
  Vector prologue cost: 1
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 6
  Vector outside cost: 1
  prologue iterations: 0
  epilogue iterations: 0

This makes V16QI cheaper than VNx16QI, whose cost with partial vectors isn't
affected by this patch.
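
For reference, the "Vector outside cost" line in both dumps equals the sum of
the prologue and epilogue lines.  A small sketch in plain C that reproduces the
arithmetic; vec_outside_cost is a hypothetical name, not a GCC function, and
the constants are copied from the two V16QI dumps above.

```c
#include <assert.h>

/* Hypothetical helper (not a GCC function): outside-of-loop vector cost
   as the sum of the prologue and epilogue costs.  */
static int
vec_outside_cost (int prologue_cost, int epilogue_cost)
{
  assert (prologue_cost >= 0 && epilogue_cost >= 0);
  return prologue_cost + epilogue_cost;
}

/* Values copied from the two V16QI dumps.  */
enum
{
  BEFORE_PROLOGUE = 4, BEFORE_EPILOGUE = 3,
  AFTER_PROLOGUE = 1, AFTER_EPILOGUE = 0
};
```

vec_outside_cost (BEFORE_PROLOGUE, BEFORE_EPILOGUE) gives 7 and
vec_outside_cost (AFTER_PROLOGUE, AFTER_EPILOGUE) gives 1.  The prologue and
epilogue costs each drop by 3, which is consistent with one taken-branch cost
being removed from each, although the dumps don't state the per-branch cost
explicitly.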

The test case expects VNx16QI to be used, with asm like:

        cmp     w2, 0
        ble     .L1
        mov     x1, 0
        sbfiz   x2, x2, 2, 32
        mov     z0.s, #1
        whilelo p0.s, xzr, x2
        .p2align 3,,7
.L3:
        st1w    z0.s, p0, [x0, x1, lsl 2]
        add     x1, x1, 4
        whilelo p0.s, x1, x2
        b.any   .L3

But with this patch, it ends up with:

.LFB0:
        .cfi_startproc
        cmp     w2, 0
        ble     .L1
        movi    v0.4s, 0x1
        sub     w1, w2, #1
        add     x2, x0, 16
        add     x1, x2, x1, uxtw 4
        .p2align 3,,7
.L3:
        str     q0, [x0], 16
        cmp     x0, x1
        bne     .L3

I updated the expected asm to match the currently generated asm, but I'm not
sure whether that is a good enough fix.

Bootstrapped/regtested on powerpc64le-linux-gnu and aarch64-linux-gnu.

BR,
Kewen
-----
gcc/ChangeLog:

	* tree-vect-loop.c (vect_get_known_peeling_cost): Don't consider branch
	taken costs for prologue and epilogue if they don't exist.
	(vect_estimate_min_profitable_iters): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/cost_model_2.c: Adjust due to cost model
	change.
-------------- next part --------------

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_2.c
index d9d707893d8..a7d748397d6 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_2.c
@@ -9,4 +9,4 @@ vset (int *restrict dst, int *restrict src, int count)
       *dst++ = 1;
 }
 
-/* { dg-final { scan-assembler-times {\tst1w\tz} 1 } } */
+/* { dg-final { scan-assembler-times {\tstr\tq} 1 } } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 185019c3305..43ec4fb04d5 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3520,10 +3520,12 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
     {
       /* If peeled iterations are known but number of scalar loop
 	 iterations are unknown, count a taken branch per peeled loop.  */
-      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, NULL,
-				 NULL_TREE, 0, vect_prologue);
-      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, NULL,
-				  NULL_TREE, 0, vect_epilogue);
+      if (peel_iters_prologue > 0)
+	retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
+				   NULL, NULL_TREE, 0, vect_prologue);
+      if (*peel_iters_epilogue > 0)
+	retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken,
+				    NULL, NULL_TREE, 0, vect_epilogue);
     }
 
   stmt_info_for_cost *si;
@@ -3670,7 +3672,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
   bool prologue_need_br_not_taken_cost = false;
 
   /* Calculate peel_iters_prologue.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
     peel_iters_prologue = 0;
   else if (npeel < 0)
     {
@@ -3689,7 +3691,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
   else
     {
       peel_iters_prologue = npeel;
-      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && peel_iters_prologue > 0)
 	/* If peeled iterations are known but number of scalar loop
 	   iterations are unknown, count a taken branch per peeled loop.  */
 	prologue_need_br_taken_cost = true;
@@ -3719,7 +3721,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
   else
     {
       peel_iters_epilogue = vect_get_peel_iters_epilogue (loop_vinfo, npeel);
-      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && peel_iters_epilogue > 0)
 	/* If peeled iterations are known but number of scalar loop
 	   iterations are unknown, count a taken branch per peeled loop.  */
 	epilogue_need_br_taken_cost = true;