This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB



On 30-06-2015 00:14, OndÅej BÃlka wrote:
> On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 29-06-2015 18:18, OndÅej BÃlka wrote:
>>> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>>>> Lets look at a real customer example. The customer wants to use the P8
>>>> 128-bit add/sub but also wants to be able to unit test code on existing
>>>> P7 machines. Which results in something like this:
>>>>
>>>> static inline vui32_t
>>>> vec_addcuq (vui32_t a, vui32_t b)
>>>> {
>>>>         vui32_t t;
>>>>
>>>>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSXâ))
>>>>                 {
>>>>                 
>>>>                         __asm__(
>>>>                             "vaddcuq %0,%1,%2;"
>>>>                             : "=v" (t)
>>>>                             : "v" (a),
>>>>                               "v" (b)
>>>>                             : );
>>>>                 }
>>>>                 else
>>>>                         vui32_t c, c2, co;
>>>>                         vui32_t z= {0,0,0,0};
>>>>                         __asm__(
>>>>                             "vaddcuw %3,%4,%5;\n"
>>>>                             "\tvadduwm %0,%4,%5;\n"
>>>>                             "\tvsldoi %1,%3,%6,4;\n"
>>>>                             "\tvaddcuw %2,%0,%1;\n"
>>>>                             "\tvadduwm %0,%0,%1;\n"
>>>>                             "\tvor %3,%3,%2;\n"
>>>>                             "\tvsldoi %1,%2,%6,4;\n"
>>>>                             "\tvaddcuw %2,%0,%1;\n"
>>>>                             "\tvadduwm %0,%0,%1;\n"
>>>>                             "\tvor %3,%3,%2;\n"
>>>>                             "\tvsldoi %1,%2,%6,4;\n"
>>>>                             "\tvadduwm %0,%0,%1;\n"
>>>>                             : "=&v" (t), /* 0 */
>>>>                               "=&v" (c), /* 1 */
>>>>                               "=&v" (c2), /* 2 */
>>>>                               "=&v" (co) /* 3 */
>>>>                             : "v" (a), /* 4 */
>>>>                               "v" (b), /* 5 */
>>>>                               "v" (z)  /* 6 */
>>>>                             : );
>>>>                         t = co;
>>>>                 }
>>>>         return (t);
>>>> }
>>>>
>>>> So it is clear to me that executing 14+ instruction to decide if I can
>>>> optimize to use new single instruction optimization is not a good deal.
>>>>
>>> No, this is prime example that average programmers shouldn't use hwcap
>>> as that results in moronic code like this.
>>>
>>> When you poorly reinvent wheel you will get terrible performance like
>>> fallback here. Gcc already has 128 ints so tell average programmers to
>>> use them instead and don't touch features that they don't understand.
>>
>> Again your patronizing tone only shows your lack of knowledge in this
>> subject: the above code aims to use ISA 2.07
> 
> Sorry, but could you explain how did you come to conclusion to use ISA
> 2.07? Only check done is

Because 'vaddcuq' is ISA 2.07 *only* and I think Steve has made a mistake
here, the test should be __builtin_cpu_supports("PPC_FEATURE2_ARCH_2_07").

> 
>>>>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSXâ))
> 
> And vsx is part of ISA 2.06
> 
>> *vector* instructions to
>> multiply 128-bits integer in vector *registers*.
> 
> Your sentence has three problems.
> 1. From start of mail.
>>> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>>>> Lets look at a real customer example. The customer wants to use the P8
>>>> 128-bit add/sub
> 2. Function is named vec_addcuq so accoring to name it does...? I guess
> division.
> 
> 3. Power isa describes these instructions pretty clearly:
> 
> Vector Add and Write Carry-Out Unsigned
> Word                                        VX-form
> vaddcuw      VRT,VRA,VRB
>     4        VRT      VRA      VRB          384
> 0         6        11       16       21                31
> do i=0 to 127 by 32
>    aop      EXTZ((VRA)i:i+31)
>    bop      EXTZ((VRB)i:i+31)
>    VRTi:i+31    Chop( ( aop +int bop ) >>ui 32,1)
> For each vector element i from 0 to 3, do the following.
>     Unsigned-integer word element i in VRA is added
>     to unsigned-integer word element i in VRB. The
>     carry out of the 32-bit sum is zero-extended to 32
>     bits and placed into word element i of VRT.
> Special Registers Altered:
>      None
> 
> If you still believe that it somehow does multiplication just try this
> and see that result is all zeroes.
> 
>    __vector uint32_t x={3,2,0,3},y={0,0,0,0};
>    y = vec_addcuq(x,x);
>    printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]);
> 
> Again your patronizing tone only shows your lack of knowledge of powerpc
> assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/

Seriously, you need to start admitting your lack of knowledge in PowerISA
(I am meant addition instead of multiplication, my mistake).  And repeating
myself to prove a point only makes you childish, I am not competing with
you.

> 
> 
> I did mistake that I read to bit fast and seen only add instead of
> instruction to get carry. Still thats with gpr two additions with carry,
> then add zero with carry to set desired bit. 
> 
>> It has nothing to do
>> with uint128_t support on GCC and only recently GCC added support to 
>> such builtins [1]. And although there is plan to add support to use 
>> vector instruction for uint128_t, right now they are done in GRP register
>> in powerpc.
>>
> Customer just wants to do 128 additions. If a fastest way
> is with GPR registers then he should use gpr registers.
> 
> My claim was that this leads to slow code on power7. Fallback above
> takes 14 cycles on power8 and 128bit addition is similarly slow.
> 
> Yes you could craft expressions that exploit vectors by doing ands/ors
> with 128bit constants but if you mostly need to sum integers and use 128
> bits to prevent overflows then gpr is correct choice due to transfer
> cost.

Again this is something, as Steve has pointed out, you only assume without
knowing the subject in depth: it is operating on *vector* registers and
thus it will be more costly to move to and back GRP than just do in
VSX registers.  And as Steven has pointed out, the idea is to *validate*
on POWER7.

> 
>> Also, it is up to developers to select the best way to use the CPU
>> features.  Although I am not very found of providing the hwcap in TCB
>> (my suggestion was to use local __thread in libgcc instead), the idea
>> here is to provide *tools*.
>>
> If you want to provide tools then you should try to make best tool
> possible instead of being satisfied with tool that poorly fits job and
> is dangerous to use.
> 
> I am telling all time that there are better alternatives where this
> doesn't matter.
> 
> One example would be write gcc pass that runs after early inlining to
> find all functions containing __builtin_cpu_supports, cloning them to
> replace it by constant and adding ifunc to automatically select variant.

Using internal PLT calls to such mechanism is really not the way to handle
performance for powerpc.  

> 
> You would also need to keep list of existing processor features to
> remove nonexisting combinations. That easiest way to avoid combinatorial
> explosion.
> 
> 
> 
>  
>> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
>>
>>>
>>> As gcc compiles addition into pair of addc, adde instructions a
>>> performance gain is minimal while code is harder to maintain. Due to
>>> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
>>> on following example on power8.
>>>
>>>
>>> int main()
>>> {
>>>   unsigned long i;
>>>   __int128 u = 0;
>>> //long u = 0;
>>>   for (i = 0; i < 1000000000; i++)
>>>     u += i * i;
>>>   return u >> 35;
>>> }
>>>
>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
>>> [neleai@gcc2-power8 ~]$ time ./a.out 
>>>
>>> real	0m0.957s
>>> user	0m0.956s
>>> sys	0m0.001s
>>>
>>> [neleai@gcc2-power8 ~]$ vim uu.c 
>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
>>> [neleai@gcc2-power8 ~]$ time ./a.out 
>>>
>>> real	0m1.040s
>>> user	0m1.039s
>>> sys	0m0.001s
>>
>> This is due the code is not using any vector instruction, which is the aim of the
>> code snippet Steven has posted.
> 
> Wait do you want to have fast code or just show off your elite skills
> with vector registers?

What does it have to do with vectors? I just saying that in split-core mode
the CPU group dispatches are statically allocated for the eight threads
and thus pipeline gain are lower.  And indeed it was not the case for the
example (I rushed without doing the math, my mistake again).

> 
> A vector 128bit addition is on power7 lot slower than 128bit addition in
> gpr. This is valid use case when I produce 64bit integers and want to
> compute their sum in 128bit variable. You could construct lot of use
> cases where gpr wins, for example summing an array(possibly with applied
> arithmetic expression).
> 
> Unless you show real world examples how could you prove that vector
> registers are better choice?

How said they are better? As Steve has pointed out, *you* assume it, the
idea afaik is only to be able to *validate* the code on a POWER7 machine.

Anyway, I will conclude again because I am not in the mood to get back
at this subject (you can be the big boy and have the final line).
I tend to see the TCB is not the way to accomplish it, but not for
performance reasons.  My main issue is tie compiler code generation ABI
with runtime in a way it should be avoided (for instance implementing it
on libgcc).  And your performance analysis mostly do not hold true for
powerpc.

> 
> 
> 
>>  Also, it really depends in which mode the CPU is
>> set, on a POWER split-core mode, where the CPU dispatch groups are shared among
>> threads in an non-dynamic way the difference is bigger:
>>
>> [[fedora@glibc-ppc64le ~]$ time ./test
>>
>> real    0m1.730s
>> user    0m1.726s
>> sys     0m0.003s
>> [fedora@glibc-ppc64le ~]$ time ./test-long
>>
>> real    0m1.593s
>> user    0m1.591s
>> sys     0m0.002s
>>
> Difference? What difference? Only ratio matters to remove things like
> different frequency of processors and that thread sharing slows you down
> by constant. When I do math difference between these two ratios is 0.06%
> 
> 1.593/1.730 = 0.9208092485549133
> 0.957/1.040 = 0.9201923076923076
> 
> 
> 
>>>
>>>
>>>  
>>>> One instruction (plus the __builtin_cpu_supports which should be and
>>>> immediate,  branch conditional) is a better deal. Inlining so the
>>>> compiler can do common sub-expression about larger blocks is an even
>>>> better deal.
>>>>
>>> That doesn't change fact that its mistake. A code above was bad as it
>>> added check for single instruction that takes a cycle. When difference
>>> between implementations is few cycles then each cycle matter (otherwise
>>> you should just stick to generic one). Then a hwcap check itself causes
>>> slowdown that matters and you should use ifunc to eliminate.
>>>
>>> Or hope that its moved out of loop, but when its loop with 100
>>> iterations a __builtin_cpu_supports time becomes imaterial.
>>>
>>>
>>>> I just do not understand why there is so much resistance to this simple
>>>> platform ABI specific request.
>>>
>>>


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]