This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On Tue, Jul 30, 2013 at 01:35:49PM +0800, Ling Ma wrote:
> >> >> +L(less_128bytes):
> >> >> +	xor	%esi, %esi
> >> >> +	mov	%ecx, %esi
> >> > And this? A C equivalent of this is
> >> > x = 0;
> >> > x = y;
> >> Ling: we used mov %sil, %cl in the above code. Now %esi becomes the
> >> destination register (mov %ecx, %esi), so there is a false dependence
> >> hazard. We use xor r1, r1 to ask the decode stage to break the
> >> dependence, and inside the pipeline xor r1, r1 is removed before
> >> entering the execution stage.
> >>
> > That is pointless, as mov breaks false dependencies.
> >
> > Anyway, the code you use is redundant. You already have that computed,
> > so a simple mov %xmm0, %rcx will do the job.
>
> Ling: Usually the rename stage can help us resolve most WAR and WAW
> hazards, but we use %sil instead of %esi, which involves partial
> register access.

It does not matter. Also, you have plenty of free registers available.

> I remember mov xmm0, r32/64 will cause a cross-domain operation, which
> is not good on Nehalem. I may test whether it exists on Haswell.

Wrong again, even on Nehalem. I tested your code and the xmm code, and
yours is 50% slower (and I am not counting that computing pshufb is free
in our case).

Time in seconds to calculate the broadcast 1000000000 times:

your  sse
0.37  0.23
0.36  0.23
0.36  0.23
0.36  0.24
0.36  0.23
0.36  0.23
0.37  0.24
0.36  0.23
0.36  0.23
0.36  0.23

The benchmark is attached.
Attachment:
test_broadcast.tar.bz2
Description: Binary data
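
For readers without the attachment, the following is a minimal sketch of two general ways to replicate one byte across a 16-byte vector: one goes through a general-purpose register and an integer multiply, the other stays in vector registers. It is not the content of test_broadcast.tar.bz2, and it is only an assumption that the "your" and "sse" columns above correspond to sequences like these. All function names are hypothetical. The vector route uses SSE2 unpack and shuffle instructions rather than pshufb so that it builds without extra compiler flags, and the compiler is free to emit different instructions than the comments suggest.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* GPR route: replicate the byte into all 8 bytes of a 64-bit integer
   with a single multiply. */
static inline uint64_t broadcast_gpr(uint8_t c)
{
    return (uint64_t)c * 0x0101010101010101ULL;
}

/* Fill 16 bytes from the GPR result, as if it were then moved into
   both halves of a vector register. */
static inline void broadcast16_via_gpr(uint8_t c, uint8_t out[16])
{
    uint64_t v = broadcast_gpr(c);
    memcpy(out, &v, 8);
    memcpy(out + 8, &v, 8);
}

/* Vector route: never leaves the xmm registers once the byte is loaded.
   Falls back to memset on targets without SSE2. */
static inline void broadcast16_via_sse2(uint8_t c, uint8_t out[16])
{
#if defined(__SSE2__)
    __m128i v = _mm_cvtsi32_si128(c);   /* movd: byte 0 = c, rest zero */
    v = _mm_unpacklo_epi8(v, v);        /* punpcklbw: bytes 0-1 = c    */
    v = _mm_unpacklo_epi16(v, v);       /* punpcklwd: bytes 0-3 = c    */
    v = _mm_shuffle_epi32(v, 0);        /* pshufd $0: copy dword 0     */
    _mm_storeu_si128((__m128i *)out, v);
#else
    memset(out, c, 16);
#endif
}

/* Returns nonzero when both routes produce the same 16 bytes. */
static inline int broadcast_methods_agree(uint8_t c)
{
    uint8_t a[16], b[16];
    broadcast16_via_gpr(c, a);
    broadcast16_via_sse2(c, b);
    return memcmp(a, b, 16) == 0;
}

/* Exhaustive check over every possible byte value. */
static inline void self_check(void)
{
    for (int c = 0; c < 256; c++)
        assert(broadcast_methods_agree((uint8_t)c));
}
```

Calling self_check() from a test program confirms that the two routes agree for all 256 byte values. It says nothing about their relative speed, which is what the timings above measure.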