x86: further optimization opportunities

Jan Beulich jbeulich@suse.com
Fri Aug 26 12:12:24 GMT 2022


H.J.,

over time I've accumulated a list of possible transformations we could
do in addition to what we do already. Some are a little exotic, so may
not be worth it. Hence I'd like to ask for your view on things, if you
don't mind.

1) {,X}OR r<N>,0 and AND/TEST r<N>,~0  -->  TEST r<N>,r<N>

Except for 32-bit forms in 64-bit mode, where the register write done
by the original zeroes the upper 32 bits. Note that ADD/CMP/SUB can't
be replaced this way, because TEST leaves AF undefined. But perhaps
IMUL r<N>,1 can be, unless we fear people depend on a particular
implementation's setting of PF, SF, and ZF.

2) AND r<N>,0 and perhaps IMUL r<N>,r<M>,0  -->  XOR r<N>,r<N>

3) {,V}PCMPEQQ  -->  e.g. {,V}PCMPEQD
   {,V}PCMPGTQ  -->  {,V}PXOR

when both source operands match, as the encoding is 1 byte shorter.
Some of the respective AVX512 forms can be transformed into KX{,N}OR*.

4) P{AND{,N},{,X}OR} and {AND{,N},{,X}OR}PD  -->  {AND{,N},{,X}OR}PS
   MOVDQ{A,U} and MOV{A,U}PD  -->  MOV{A,U}PS

for saving the prefix byte. Perhaps only when -Os.

5) PSHUFD  --> SHUFPS

with identical register operands, and again perhaps only when -Os.

6) VPCMP{,U}{B,W,D,Q} and VPCOM{,U}{B,W,D,Q}  -->  VPCMP{EQ,GT}{B,W,D,Q}

where suitable, saving the immediate byte and in the latter case
also possibly allowing for the shorter VEX2 encoding.

7) VPSUB{,U}S{B,W,D,Q}  -->  VPXOR
   VPCMPGT{B,W,D,Q} (pre-AVX512)  -->  VPXOR

when both source operands are identical.

8) VFMADD{P,S}{S,D} et al  -->  VFMADD{132,231,213}{P,S}{S,D}

when one operand is suitably repeated. (This requires CpuFMA to be
explicitly enabled, as that's not a prereq to CpuFMA4.)

9) MOVZX

with 64-bit destination to drop the REX64 prefix.

10) RET/RETF/LRET

with immediate of zero to immediate-less form.

11) 32-bit TEST

with {8..15}-bit immediate in 16-bit mode.

12) MOVABS

displacement optimization with -Os, using 32-bit addressing mode as
applicable.

13) BT{,C,R,S}

with in-range immediate to operand-size-prefix-less forms. For register
operands going from 16- to 32-bit operand size is okay. For memory
operands this is possible only by reducing the nominal operand size,
with an adjustment to the displacement as necessary (perhaps leaving
alone ones with a LOCK prefix).

14) BT{,C,R,S}

with memory operand and out-of-range immediate, transforming the upper
immediate bits into an adjustment to the displacement. Accompanied by
a warning, as the upper bits would no longer end up being ignored. The
SDM in fact suggests this as a model assemblers might follow.
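
To make the arithmetic in 14 (and the displacement adjustment in 13)
concrete, here is a rough Python sketch. The function names are made up
for illustration, and adjusting the displacement in multiples of the
operand size is just one possible choice, not something gas does today.
The immediate is treated as an unsigned 8-bit value.

```python
def split_bt_offset(imm, op_bytes):
    """Split a BT-style bit offset for a memory operand.

    imm      -- unsigned immediate bit offset (0..255)
    op_bytes -- operand size to encode, in bytes (2, 4, or 8)

    Returns (disp_adjust, new_imm): the number of bytes to add to the
    displacement and the in-range immediate to encode. Hypothetical
    helper; it assumes the adjustment is kept a multiple of the
    operand size.
    """
    if op_bytes not in (2, 4, 8):
        raise ValueError("operand size must be 2, 4, or 8 bytes")
    if not 0 <= imm <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    bits = 8 * op_bytes
    return (imm // bits) * op_bytes, imm % bits


def absolute_bit(disp, imm):
    """Bit position addressed, counted from the operand's base address."""
    return 8 * disp + imm


# A 32-bit BT with immediate 35 becomes displacement +4, immediate 3.
adjust, new_imm = split_bt_offset(35, 4)
print(adjust, new_imm)
```

The same helper covers 13 when the nominal operand size is reduced:
an immediate that is in range for the 64-bit form may need a non-zero
adjustment once the form is encoded with 32-bit operand size.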

Note that examples of 4 and 5 can actually be found in Linux's crypto
code.
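
As a sanity check on 5, the following Python sketch models the dword
selection of PSHUFD and SHUFPS as I understand it from the SDM, and
confirms that the two agree for every immediate when SHUFPS is given
the same register for both operands. The list-of-four-dwords model and
the function names are purely illustrative.

```python
def pshufd(src, imm):
    """PSHUFD dst, src, imm8: every result dword is selected from src."""
    return [src[(imm >> (2 * i)) & 3] for i in range(4)]


def shufps(dst, src, imm):
    """SHUFPS dst, src, imm8: low two dwords from dst, high two from src."""
    return [
        dst[imm & 3],
        dst[(imm >> 2) & 3],
        src[(imm >> 4) & 3],
        src[(imm >> 6) & 3],
    ]


def same_for_all_immediates(reg):
    """True if PSHUFD reg,reg and SHUFPS reg,reg agree for all imm8 values."""
    return all(pshufd(reg, imm) == shufps(reg, reg, imm)
               for imm in range(256))


print(same_for_all_immediates([10, 20, 30, 40]))
```

With distinct registers the two instructions differ, which is why the
transformation is limited to identical register operands.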

