This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Compile AVX libm functions with -mavx
On Tue, Oct 02, 2012 at 10:41:35PM -0700, Matt Turner wrote:
> On Tue, Oct 2, 2012 at 4:45 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Tue, Oct 2, 2012 at 4:07 PM, Matt Turner <mattst88@gmail.com> wrote:
> >> On Tue, Oct 2, 2012 at 1:19 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >>> On Tue, Oct 2, 2012 at 12:47 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> >>>>
> >>>> Could it be a 60-cycle penalty when switching between legacy SSE and AVX
> >>>> state?
> >>>
> >>> This is true. We can use -mprefer-avx128 to make sure that only 128-bit AVX
> >>> instructions are used.
> >>>
> >>> --
> >>> H.J.
> >>
> >> The latency for switching between old SSE and new (AVX-style
> >
> > Latency comes from switching between the 128-bit SSE context and
> > the 256-bit AVX context. If we only use the lower 128-bit AVX context,
> > there is no latency.
>
> I'm having a hard time confirming that.
>
> From pages 53/54 of the PDF -- http://software.intel.com/file/36945 :
>
> > However, there is a performance impact with intermixing VEX-encoded SIMD
> > instructions (AVX, FMA) and legacy SSE instructions that only operate on
> > the XMM register state.
>
> And more to the point:
>
> > Intermixed 256-bit, 128-bit or scalar SIMD instructions that are encoded
> > with VEX prefixes have no transition delay due to internal state management.
>
> >> 3-operand) form is what causes the penalty. What is the purpose of
> >> -mprefer-avx128? I can't find a description of it online.
> >
> > I just fixed it:
> >
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54785
> >
> > -mprefer-avx128 will avoid 256-bit AVX instructions. Only 128-bit
> > AVX instructions are generated. It has the same effect on context
> > switch as -msse2avx.
>
> I think that your claim is that legacy 128-bit SSE + 256-bit AVX
> produces stalls, but I believe the documentation to say that it's
> VEX-prefixed instructions in general (256-bit or otherwise) plus
> legacy SSE instructions that lead to stalls.
For Intel, a detailed description is in
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
chapter 11-3.
They mention the alternative of adding vzeroupper at the end of each AVX function.
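
A rough sketch of what that could look like with compiler intrinsics.
sum_avx is a made-up helper for illustration, not anything in glibc, and
the target attribute is only there so that the file builds without -mavx
(GCC/clang on x86 assumed). _mm256_zeroupper() emits vzeroupper, so the
upper halves of the YMM registers are clean before control returns to a
caller that may run legacy SSE code:

```c
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical example function: sums n floats using 256-bit AVX adds. */
__attribute__((target("avx")))
float sum_avx(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 floats per iteration in a 256-bit register. */
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    /* Spill the 8 partial sums to memory. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    /* Done with 256-bit work: clear the upper YMM halves (vzeroupper). */
    _mm256_zeroupper();

    /* Reduce the partial sums and handle the remaining tail elements. */
    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += lanes[k];
    for (; i < n; i++)
        total += x[i];
    return total;
}
```

Callers should check for AVX at run time (for example with
__builtin_cpu_supports("avx")) before calling it. The compiler may
already insert vzeroupper on its own, so the explicit call may well be
redundant in practice.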
--
Your modem doesn't speak English.