This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Imporve 64bit memcpy performance for Haswell CPU with AVX instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: "H.J. Lu" <hjl dot tools at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>, Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>, yumkam at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Fri, 11 Jul 2014 11:54:04 +0200
- Subject: Re: [PATCH RFC] Imporve 64bit memcpy performance for Haswell CPU with AVX instruction
On Fri, Jul 11, 2014 at 09:20:58AM +0800, Ling Ma wrote:
> Yes, so I refined the code and sent the latest version according to
> your comments.
>
> The new memmove code is below, and is also sent as a gzipped attachment:
>
> +#ifdef USE_AS_MEMMOVE
> +L(gobble_mem_fwd_llc_start):
> +#endif
> + mov %rdx, %rcx
> + mov %rdx, %rcx
> + rep movsb
> + ret
> +
> + .p2align 4
> +L(gobble_big_data_fwd):
> +#ifdef USE_AS_MEMMOVE
> + mov %rsi, %r10
> + sub %rdi, %r10
> + cmp %rcx, %r10
> + jb L(gobble_mem_fwd_llc_start)
>
> Ling: if the code gets here, rdx > rcx. If the distance between rsi
> and rdi is also smaller than rcx, then dst and src must overlap. Because
> that distance fits in the LLC, the loads from src help dst get LLC
> hits. So we jump back instead of using non-temporal stores.
>
And do you have an application where this actually happens? You lose
performance every time it does not happen, and given how rare large
inputs are, I doubt this will pay for itself.
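
For readers following the quoted mov/sub/cmp/jb sequence, the check under discussion can be sketched in C roughly as below. This is only an illustration, not code from the patch: the function name and the threshold parameter are hypothetical, with the threshold standing in for whatever value %rcx holds at that point. The subtraction is unsigned, so when src lies below dst the difference wraps to a large value and the branch is not taken.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: mirrors
 *     mov %rsi, %r10 ; sub %rdi, %r10 ; cmp %rcx, %r10 ; jb ...
 * It returns true when src starts less than `threshold` bytes above dst.
 * `threshold` is an assumed stand-in for the value in %rcx. */
bool src_is_close_ahead(const void *dst, const void *src, size_t threshold)
{
    /* Unsigned distance, as in the assembly: wraps around if src < dst. */
    uintptr_t distance = (uintptr_t)src - (uintptr_t)dst;

    /* `jb` is an unsigned "below" comparison. */
    return distance < threshold;
}
```

The extra compare and branch run on every call that reaches this path, whether or not the buffers are close, which is the cost being questioned here.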