This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Corei7 with unaligned AVX instructions


>> > +L(256bytesormore):
>> > +
>> > +#ifdef USE_AS_MEMMOVE
>> > +       cmp     %rsi, %rdi
>> > +       jae     L(copy_backward)
>> > +#endif
>
> Test with the following condition:
>   (uint64_t)((src - dest) - n) < 2*n
> It makes the branch predictable instead of two unpredictable branches.
>
> Also alias memmove_avx to memcpy_avx. They differ only when you copy 256+
> bytes, so the performance penalty of this check can be paid for by halving
> memcpy icache usage alone.
Ling: OK, I will try this in the new version.
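
For reference, a minimal C sketch of the idea of deciding the memmove copy
direction with a single unsigned comparison. It uses the common
(uintptr_t)(dest - src) < n test rather than the exact expression quoted
above, and the copy helpers are trivial stand-ins for the tuned
forward/backward loops of the real implementation:

    #include <stdint.h>
    #include <stddef.h>

    /* Trivial stand-ins for the tuned forward/backward copy loops.  */
    static void copy_forward(char *d, const char *s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }

    static void copy_backward(char *d, const char *s, size_t n)
    {
        while (n--)
            d[n] = s[n];
    }

    void *sketch_memmove(void *dest, const void *src, size_t n)
    {
        char *d = dest;
        const char *s = src;

        /* (uintptr_t)d - (uintptr_t)s < n is true exactly when dest falls
           inside [src, src + n); there a forward copy could overwrite
           source bytes before they are read, so we copy backward.  Every
           other case (dest below src, or disjoint buffers) is safe to
           copy forward, and the whole decision is one well-predicted
           branch instead of two.  */
        if (((uintptr_t)d - (uintptr_t)s) < n)
            copy_backward(d, s, n);
        else
            copy_forward(d, s, n);
        return dest;
    }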

>> > +       mov     %rdx, %rcx
>> > +       rep     movsb
>> > +       ret
>> > +
> Does Haswell have an optimized movsb? If so, over which size range does it work well?

Ling: rep movsb is good for most cases; Haswell enhanced it so that it can
combine data into 32 bytes per cycle. Because memcpy may know the loop count
before copying data, rep movsb seems to use a similar loop-counter concept to
avoid branch prediction misses, and it adaptively prefetches the next loop's
data if the current loop's data is not in the L1 cache.
However, it needs a long time to warm up, so when the data is less than
2048 bytes we choose AVX instructions according to our experiments, and when
the data is larger than the L3 cache it does not give a better result than
non-temporal instructions.
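
A rough C sketch of the three-regime dispatch described above. The 2048-byte
and L3 thresholds come from the text; the L3 size constant, the function
names, and the memcpy-based helper bodies are only placeholders standing in
for the real AVX, rep movsb, and non-temporal store paths:

    #include <stddef.h>
    #include <string.h>

    /* Placeholder: the real code compares against the detected
       shared-cache (L3) size rather than a hard-coded constant.  */
    #define L3_CACHE_SIZE   (8UL * 1024 * 1024)

    /* Stand-ins for the three copy strategies discussed; the real code
       would use AVX unaligned loads/stores, rep movsb, and non-temporal
       (movnt*) stores respectively.  */
    static void copy_avx_unaligned(void *d, const void *s, size_t n) { memcpy(d, s, n); }
    static void copy_rep_movsb(void *d, const void *s, size_t n)     { memcpy(d, s, n); }
    static void copy_non_temporal(void *d, const void *s, size_t n)  { memcpy(d, s, n); }

    void *sketch_memcpy(void *dest, const void *src, size_t n)
    {
        if (n < 2048)
            /* Small copies: the rep movsb warm-up cost dominates,
               so AVX loads/stores win here.  */
            copy_avx_unaligned(dest, src, n);
        else if (n <= L3_CACHE_SIZE)
            /* Medium copies: enhanced rep movsb moves data in large
               chunks and avoids per-iteration branch mispredictions.  */
            copy_rep_movsb(dest, src, n);
        else
            /* Copies larger than the last-level cache: non-temporal
               stores avoid polluting the cache.  */
            copy_non_temporal(dest, src, n);
        return dest;
    }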

Thanks
Ling

