FW: [PATCH 1/2] [PATCH 1/2] Enable Intel AVX512_FP16 instructions

Tue Jul 20 11:13:51 GMT 2021


> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, July 20, 2021 4:46 PM
> To: Cui, Lili <lili.cui@intel.com>
> Cc: hjl.tools@gmail.com; binutils@sourceware.org
> Subject: Re: FW: [PATCH 1/2] [PATCH 1/2] Enable Intel AVX512_FP16
> instructions
> 
> On 20.07.2021 09:08, Cui, Lili wrote:
> >
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Wednesday, July 14, 2021 11:21 PM
> >> To: Cui, Lili <lili.cui@intel.com>
> >> Cc: hjl.tools@gmail.com; binutils@sourceware.org
> >> Subject: Re: [PATCH 1/2] [PATCH 1/2] Enable Intel AVX512_FP16
> >> instructions
> >>
> >> On 13.07.2021 08:58, Cui, Lili wrote:
> >>
> >> Disassembler:
> >>
> >> d_scalar_mode looks to be unused.
> >>
> >> This
> >>
> >>   /* EVEX_W_MAP5_2A_P_1 */
> >>   {
> >>     { "vcvtsi2sh{%LQ|}",	{ XMScalar, VexScalar, EXxEVexR, Ed }, 0 },
> >>     { "vcvtsi2sh{%LQ|}",	{ XMScalar, VexScalar, EXxEVexR, Eq }, 0 },
> >>   },
> >>
> >> can imo be expressed without decoding EVEX.W, by using Edq instead of
> >> (separately) Ed and Eq. There's at least one similar case elsewhere.
> >> Interestingly in the 2si/2usi conversions you do use Gdq already,
> >> which I think handles the EVEX.W=1 case correctly outside of 64-bit
> >> mode (unlike Eq, which will unconditionally produce 64-bit register names
> >> afaict).
> >>
> >> As to a broader question on decoding EVEX.W: Did you consider
> >> introducing e.g. %XH (paralleling %XW, just that EVEX.W=1 is not a
> >> valid encoding), to avoid this decode step for perhaps almost all
> >> entries? And if that's not an option, decoding EVEX.W first for all
> >> the opcodes which previously had no meaning at all would, in some
> >> cases, reduce the overall number of table entries (and in all other
> >> cases this would then merely be for consistency, as it also wouldn't
> >> increase the number of table entries). To give an example:
> >>
> >>     { PREFIX_TABLE (PREFIX_EVEX_0F3AC2) },
> >>
> >> =>
> >>
> >>   /* PREFIX_EVEX_0F3AC2 */
> >>   {
> >>     { VEX_W_TABLE (EVEX_W_0F3AC2_P_0) },
> >>     { VEX_W_TABLE (EVEX_W_0F3AC2_P_1) },
> >>   },
> >>
> >> =>
> >>
> >>   /* EVEX_W_0F3AC2_P_0 */
> >>   {
> >>     { "vcmpph",	{ XMask, Vex, EXxh, EXxEVexS, Ib }, 0 },
> >>   },
> >>   /* EVEX_W_0F3AC2_P_1 */
> >>   {
> >>     { "vcmpsh",	{ XMask, VexScalar, EXxmm_mw, EXxEVexS, Ib }, 0 },
> >>   },
> >>
> >> i.e. a total of 1 + 4 + 2 * 2 entries, whereas decoding W first would
> >> yield 1 (evex) + 2 (evex_w) + 4 (prefix) entries.
> >
> > Hi Jan,
> >
> > Do you want me to change it like this?
> >      { PREFIX_TABLE (PREFIX_EVEX_0F3AC2) },
> >
> >  =>
> >
> >    /* PREFIX_EVEX_0F3AC2 */
> >    {
> >      { "vcmp%XH",	{ XMask, Vex, EXxh, EXxEVexS, Ib }, 0 },
> >      { "vcmp%XH",	{ XMask, VexScalar, EXxmm_mw, EXxEVexS, Ib }, 0 },
> >    },
> >
> > "XH" => print 'ph' or 'sh' depending on the EVEX.LL bits; if EVEX.W == W1,
> > report bad code.
> > if (EVEX.LL == EVEX.LLIG)
> >       print 'sh'
> > else
> >       print 'ph'
> 
> Not exactly, no. %XH was meant to parallel %XW, which prints 's' or 'd'
> depending on VEX.W. %XH would print 'h' if EVEX.W is clear and produce an
> appropriate indication of the encoding being bad if EVEX.W is set.
> IOW something like
> 
>    /* PREFIX_EVEX_0F3AC2 */
>    {
>      { "vcmpp%XH",	{ XMask, Vex, EXxh, EXxEVexS, Ib }, 0 },
>      { "vcmps%XH",	{ XMask, VexScalar, EXxmm_mw, EXxEVexS, Ib }, 0 },
>    },
> 
> >> The delta is even larger for something like MAP5_7D: 1 + 4 + 4 * 2
> >> vs. 1 + 2 + 4. This also results in more related entries ending up
> >> closer to one another.
> >>
> > I don't quite understand here. Should I let all FP16 disassembler entries go
> > through W_TABLE first, or just add something like %XH instead of going
> > through W_TABLE? Thanks.
> 
> Where beneficial you will want to decode EVEX.W first, yes. Unless, as per
> above, you can avoid that decoding step altogether by using %XH.
> 
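For readers following the thread, here is a minimal sketch of the %XH behaviour described above: 'h' when EVEX.W is clear, a bad-encoding indication when it is set, next to %XW printing 's' or 'd'. It is not the actual i386-dis.c code. The names `struct evex_state` and `expand_template`, and the use of the string "(bad)" as the indication, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the decoder state; not the real i386-dis.c layout. */
struct evex_state {
  bool w; /* EVEX.W bit of the instruction being decoded */
};

/* Expand a mnemonic template such as "vcmpp%XH" into OUT.
   %XW becomes 's' (W clear) or 'd' (W set).
   %XH becomes 'h' when W is clear; when W is set the encoding is treated
   as invalid, OUT is replaced by "(bad)" and false is returned.  */
bool
expand_template (const char *tmpl, const struct evex_state *st,
                 char *out, size_t outsz)
{
  size_t n = 0;
  bool ok = true;

  assert (outsz > 0);

  for (const char *p = tmpl; *p != '\0'; p++)
    {
      char c = *p;

      if (c == '%' && p[1] == 'X' && (p[2] == 'W' || p[2] == 'H'))
        {
          if (p[2] == 'W')
            c = st->w ? 'd' : 's';
          else
            {
              c = 'h';
              if (st->w)
                ok = false; /* EVEX.W=1 is not a valid FP16 encoding here */
            }
          p += 2; /* skip "XW" / "XH"; the loop increment skips the rest */
        }

      if (n + 1 < outsz)
        out[n++] = c;
    }
  out[n] = '\0';

  if (!ok)
    snprintf (out, outsz, "(bad)");
  return ok;
}
```

With W clear, "vcmpp%XH" and "vcmps%XH" expand to "vcmpph" and "vcmpsh", so the two prefix-table entries need no separate EVEX.W decode step; with W set, both produce the bad-encoding marker.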
Okay, it is clear to me, many thanks!

Lili
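
The entry counts quoted in the thread (1 + 4 + 2 * 2 for 0F3AC2 and 1 + 4 + 4 * 2 for MAP5_7D, versus 1 + 2 + 4 when EVEX.W is decoded first) can be reproduced with a small helper. This is only a sketch of the arithmetic. It assumes that a prefix table always has 4 slots, that a W table has 2, and that with W decoded first only the W=0 side needs a prefix table. The function names are hypothetical.

```c
#include <assert.h>

enum { PREFIX_SLOTS = 4, W_SLOTS = 2 };

/* Prefix decoded first: one opcode-table entry, one 4-slot prefix table,
   and one 2-slot W table for every prefix slot actually in use.  */
unsigned
prefix_first_entries (unsigned used_prefixes)
{
  assert (used_prefixes <= PREFIX_SLOTS);
  return 1 + PREFIX_SLOTS + used_prefixes * W_SLOTS;
}

/* EVEX.W decoded first: one opcode-table entry, one 2-slot W table, and
   a single 4-slot prefix table hanging off the W=0 slot (assumption:
   W=1 is invalid and needs no table of its own).  */
unsigned
w_first_entries (void)
{
  return 1 + W_SLOTS + PREFIX_SLOTS;
}
```

Under these assumptions, `prefix_first_entries (2)` gives 9 and `prefix_first_entries (4)` gives 13, while `w_first_entries ()` stays at 7 in both cases, which matches the deltas mentioned in the review.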