This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [COMMITTED] PowerPC: Remove 64 bits instructions in PPC32 code

On 26-05-2014 18:04, Segher Boessenkool wrote:
>>>> This patch replaces the insrdi by insrwi in powerpc32 assembly.  Although they
>>>> are not wrong, since all POWER chips supported in 32-bits are 64-bits and the chips
>>>> do not thrown an illegal exception when running these instructions, valgrind
>>>> fails accusing an invalid one.
>>> This code is CPU-specific; as you say, those CPUs can use rldimi just
>>> fine.  The reason the code uses rldimi instead of rlwimi is because
>>> it is faster (at least on power4, power5).  Fix valgrind instead?
>>> Segher
>> Well, using on a POWER5 (1.9 GHz) I get:
>>> ./test 
>> rldimi: min: 7 | max: 9
>> rlwimi: min: 7 | max: 10
>> And by issuing 16 instruction per test function I get:
>>> ./test 
>> rldimi: min: 7 | max: 9
>> rlwimi: min: 7 | max: 13
>> Newer processor (POWER7) also shows the same behavior.
> On a POWER7 I get that rlwimi is almost twice as slow as rldimi,
> just as expected.  The way you constructed your test with a blr
> immediately after a single rl*imi you get only one per group no
> matter what.

That's why I also constructed a example with 16 instructions (which I didn't included
in the link).  A more comprehensible example: and running
it on POWER7 (3.5GHz):

$ taskset -c 24 ./test
rldimi1:  min: 12 | max: 19
rldimi4:  min: 12 | max: 20
rldimi8:  min: 12 | max: 20
rldimi16: min: 12 | max: 23
rlwimi1:  min: 12 | max: 23
rlwimi4:  min: 12 | max: 23
rlwimi8:  min: 12 | max: 24
rlwimi16: min: 12 | max: 29

>> And the instructions
>> and not in hot path in the code (it is only called once), so I hardly consider
>> this a performance regression. 
> That might well be.  But see
> where (part of) this code was added.

Yeap I aware, I recalled this thread.  Anyway, in attachments I'm sending you the strchr
benchtest output in a POWER7 with and without the modification.  Since the code patch is
taken only once I would expect some latency difference with short length, however the
results does not really shown any noticeable slowdown.  I will check later on a POWER5
machine, but I also don't really expect much difference.

>> Anyway, I would prefer to keep consistent and using only 32-bits in 32-bits 
>> assembly code to avoid such issues with external tools (valgrind is only an
>> example) and to allow possible future implementation in different chips that
>> do not implement the 64-bits instructions to use powerN code.
> I find this not a convincing argument at all.  But it's not my call ;-)

POWER4 ifunc call selection, for instance, use the hwcap bits flags to select the best
implementation.  If a future chips would like to use the same code to add ifunc as well,
it will require a new way to differ from PPC_FEATURE_POWER4 (which is bit 0x00080000).
It will add more complexity on this code.  It is the for external tools: it will need
to add more logic to handle different chips.  Also, AFAIK GCC does not generate 64-bits
instructions for -m32 (I might be wrong on this one).

Anyway, as I said I just change this because 1. it fixes valgrind checks, 2. I didn't 
see performance compelling reasons, and 3. it do see it more consistent to use 32 bits
instructions on 32 bit code.  However, 1. can be fixed if 2. is false (and 3. just can
be ignored).

> Segher

Attachment: bench-strchr.out
Description: Text document

Attachment: bench-strchr-patch.out
Description: Text document

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]