This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB



On 29-06-2015 18:18, OndÅej BÃlka wrote:
> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>> Lets look at a real customer example. The customer wants to use the P8
>> 128-bit add/sub but also wants to be able to unit test code on existing
>> P7 machines. Which results in something like this:
>>
>> static inline vui32_t
>> vec_addcuq (vui32_t a, vui32_t b)
>> {
>>         vui32_t t;
>>
>>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSXâ))
>>                 {
>>                 
>>                         __asm__(
>>                             "vaddcuq %0,%1,%2;"
>>                             : "=v" (t)
>>                             : "v" (a),
>>                               "v" (b)
>>                             : );
>>                 }
>>                 else
>>                         vui32_t c, c2, co;
>>                         vui32_t z= {0,0,0,0};
>>                         __asm__(
>>                             "vaddcuw %3,%4,%5;\n"
>>                             "\tvadduwm %0,%4,%5;\n"
>>                             "\tvsldoi %1,%3,%6,4;\n"
>>                             "\tvaddcuw %2,%0,%1;\n"
>>                             "\tvadduwm %0,%0,%1;\n"
>>                             "\tvor %3,%3,%2;\n"
>>                             "\tvsldoi %1,%2,%6,4;\n"
>>                             "\tvaddcuw %2,%0,%1;\n"
>>                             "\tvadduwm %0,%0,%1;\n"
>>                             "\tvor %3,%3,%2;\n"
>>                             "\tvsldoi %1,%2,%6,4;\n"
>>                             "\tvadduwm %0,%0,%1;\n"
>>                             : "=&v" (t), /* 0 */
>>                               "=&v" (c), /* 1 */
>>                               "=&v" (c2), /* 2 */
>>                               "=&v" (co) /* 3 */
>>                             : "v" (a), /* 4 */
>>                               "v" (b), /* 5 */
>>                               "v" (z)  /* 6 */
>>                             : );
>>                         t = co;
>>                 }
>>         return (t);
>> }
>>
>> So it is clear to me that executing 14+ instruction to decide if I can
>> optimize to use new single instruction optimization is not a good deal.
>>
> No, this is prime example that average programmers shouldn't use hwcap
> as that results in moronic code like this.
> 
> When you poorly reinvent wheel you will get terrible performance like
> fallback here. Gcc already has 128 ints so tell average programmers to
> use them instead and don't touch features that they don't understand.

Again your patronizing tone only shows your lack of knowledge in this
subject: the above code aims to use ISA 2.07 *vector* instructions to
multiply 128-bits integer in vector *registers*. It has nothing to do
with uint128_t support on GCC and only recently GCC added support to 
such builtins [1]. And although there is plan to add support to use 
vector instruction for uint128_t, right now they are done in GRP register
in powerpc.

Also, it is up to developers to select the best way to use the CPU
features.  Although I am not very found of providing the hwcap in TCB
(my suggestion was to use local __thread in libgcc instead), the idea
here is to provide *tools*.

[1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html

> 
> As gcc compiles addition into pair of addc, adde instructions a
> performance gain is minimal while code is harder to maintain. Due to
> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
> on following example on power8.
> 
> 
> int main()
> {
>   unsigned long i;
>   __int128 u = 0;
> //long u = 0;
>   for (i = 0; i < 1000000000; i++)
>     u += i * i;
>   return u >> 35;
> }
> 
> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> [neleai@gcc2-power8 ~]$ time ./a.out 
> 
> real	0m0.957s
> user	0m0.956s
> sys	0m0.001s
> 
> [neleai@gcc2-power8 ~]$ vim uu.c 
> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> [neleai@gcc2-power8 ~]$ time ./a.out 
> 
> real	0m1.040s
> user	0m1.039s
> sys	0m0.001s

This is due the code is not using any vector instruction, which is the aim of the
code snippet Steven has posted.  Also, it really depends in which mode the CPU is
set, on a POWER split-core mode, where the CPU dispatch groups are shared among
threads in an non-dynamic way the difference is bigger:

[[fedora@glibc-ppc64le ~]$ time ./test

real    0m1.730s
user    0m1.726s
sys     0m0.003s
[fedora@glibc-ppc64le ~]$ time ./test-long

real    0m1.593s
user    0m1.591s
sys     0m0.002s

> 
> 
>  
>> One instruction (plus the __builtin_cpu_supports which should be and
>> immediate,  branch conditional) is a better deal. Inlining so the
>> compiler can do common sub-expression about larger blocks is an even
>> better deal.
>>
> That doesn't change fact that its mistake. A code above was bad as it
> added check for single instruction that takes a cycle. When difference
> between implementations is few cycles then each cycle matter (otherwise
> you should just stick to generic one). Then a hwcap check itself causes
> slowdown that matters and you should use ifunc to eliminate.
> 
> Or hope that its moved out of loop, but when its loop with 100
> iterations a __builtin_cpu_supports time becomes imaterial.
> 
> 
>> I just do not understand why there is so much resistance to this simple
>> platform ABI specific request.
> 
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]