This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: V3 [PATCH 2/4] i386: Align branches within a fixed boundary


On Mon, Dec 9, 2019 at 6:54 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 06.12.2019 20:10,  H.J. Lu  wrote:
> > Add 3 command-line options to align branches within a fixed boundary
> > with segment prefixes or NOPs:
> >
> > 1. -malign-branch-boundary=NUM aligns branches within NUM byte boundary.
> > 2. -malign-branch=TYPE[+TYPE...] specifies types of branches to align.
> > The supported branches are:
> >   a. Conditional jump.
> >   b. Fused conditional jump.
> >   c. Unconditional jump.
> >   d. Call.
> >   e. Ret.
> >   f. Indirect jump and call.
> > 3. -malign-branch-prefix-size=NUM aligns branches with NUM segment
> > prefixes per instruction.
> >
> > 3 new rs_machine_dependent frag types are added:
> >
> > 1. BRANCH_PADDING.  The variable size frag to insert NOP before branch.
> > 2. BRANCH_PREFIX.  The variable size frag to insert segment prefixes to
> > an instruction.  The choices of prefixes are:
> >    a. Use the existing segment prefix if there is one.
> >    b. Use CS segment prefix in 64-bit mode.
> >    c. In 32-bit mode, use SS segment prefix with ESP/EBP base register
> >    and use DS segment prefix without ESP/EBP base register.
> > 3. FUSED_JCC_PADDING.  The variable size frag to insert NOP before fused
> > conditional jump.
> >
> > The new rs_machine_dependent frags aren't inserted if the previous item
> > is a prefix or a constant directive, which may be used to hardcode an
> > instruction, since there is no clear instruction boundary.  Segment
> > prefixes and NOP padding are disabled before relaxable TLS relocations
> > and tls_get_addr calls to keep TLS instruction sequence unchanged.
> >
> > md_estimate_size_before_relax() and i386_generic_table_relax_frag() are
> > used to handled BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING frags.
> > i386_generic_table_relax_frag() grows or shrinks sizes of segment prefix
> > and NOP to align the next branch frag:
> >
> > 1. First try to add segment prefixes to instructions before a branch.
> > 2. If there is no sufficient room to add segment prefixes, NOP will be
> > inserted before a branch.
> >
> >       * config/tc-i386.c (_i386_insn): Add has_gotpc_tls_reloc.
> >       (tls_get_addr): New.
> >       (last_insn): New.
> >       (align_branch_power): New.
> >       (align_branch_kind): New.
> >       (align_branch_bit): New.
> >       (align_branch): New.
> >       (MAX_FUSED_JCC_PADDING_SIZE): New.
> >       (align_branch_prefix_size): New.
> >       (BRANCH_PADDING): New.
> >       (BRANCH_PREFIX): New.
> >       (FUSED_JCC_PADDING): New.
> >       (i386_generate_nops): Support BRANCH_PADDING and FUSED_JCC_PADDING.
> >       (md_begin): Abort if align_branch_prefix_size <
> >       MAX_FUSED_JCC_PADDING_SIZE.
> >       (md_assemble): Set last_insn.
> >       (maybe_fused_with_jcc_p): New.
> >       (add_fused_jcc_padding_frag_p): New.
> >       (add_branch_prefix_frag_p): New.
> >       (add_branch_padding_frag_p): New.
> >       (output_insn): Generate a BRANCH_PADDING, FUSED_JCC_PADDING or
> >       BRANCH_PREFIX frag and terminate each frag to align branches.
> >       (output_disp): Set i.has_gotpc_tls_reloc to TRUE for GOTPC and
> >       relaxable TLS relocations.
> >       (output_imm): Likewise.
> >       (i386_next_non_empty_frag): New.
> >       (i386_next_jcc_frag): New.
> >       (i386_classify_machine_dependent_frag): New.
> >       (i386_branch_padding_size): New.
> >       (i386_generic_table_relax_frag): New.
> >       (md_estimate_size_before_relax): Handle COND_JUMP_PADDING,
> >       FUSED_JCC_PADDING and COND_JUMP_PREFIX frags.
> >       (md_convert_frag): Handle BRANCH_PADDING, BRANCH_PREFIX and
> >       FUSED_JCC_PADDING frags.
> >       (OPTION_MALIGN_BRANCH_BOUNDARY): New.
> >       (OPTION_MALIGN_BRANCH_PREFIX_SIZE): New.
> >       (OPTION_MALIGN_BRANCH): New.
> >       (md_longopts): Add -malign-branch-boundary=,
> >       -malign-branch-prefix-size= and -malign-branch=.
> >       (md_parse_option): Handle -malign-branch-boundary=,
> >       -malign-branch-prefix-size= and -malign-branch=.
> >       (md_show_usage): Display -malign-branch-boundary=,
> >       -malign-branch-prefix-size= and -malign-branch=.
> >       (i386_target_format): Set tls_get_addr.
> >       (i386_cons_align): New.
> >       * config/tc-i386.h (i386_cons_align): New.
> >       (md_cons_align): New.
> >       (i386_generic_table_relax_frag): New.
> >       (md_generic_table_relax_frag): New.
> >       (i386_tc_frag_data): Add u, padding_address, length,
> >       max_prefix_length, prefix_length, default_prefix, cmp_size,
> >       classified and branch_type.
> >       (TC_FRAG_INIT): Initialize u, padding_address, length,
> >       max_prefix_length, prefix_length, default_prefix, cmp_size,
> >       classified and branch_type.
> >       * doc/c-i386.texi: Document -malign-branch-boundary=,
> >       -malign-branch= and -malign-branch-prefix-size=.
> > ---
> >  gas/config/tc-i386.c | 1046 +++++++++++++++++++++++++++++++++++++++++-
> >  gas/config/tc-i386.h |   31 ++
> >  gas/doc/c-i386.texi  |   26 ++
> >  3 files changed, 1100 insertions(+), 3 deletions(-)
> >
> > diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
> > index b62af34268..0ab6651f24 100644
> > --- a/gas/config/tc-i386.c
> > +++ b/gas/config/tc-i386.c
> > @@ -368,6 +368,9 @@ struct _i386_insn
> >      /* Has ZMM register operands.  */
> >      bfd_boolean has_regzmm;
> >
> > +    /* Has GOTPC or TLS relocation.  */
> > +    bfd_boolean has_gotpc_tls_reloc;
> > +
> >      /* RM and SIB are the modrm byte and the sib byte where the
> >         addressing modes of this insn are encoded.  */
> >      modrm_byte rm;
> > @@ -562,6 +565,8 @@ static enum flag_code flag_code;
> >  static unsigned int object_64bit;
> >  static unsigned int disallow_64bit_reloc;
> >  static int use_rela_relocations = 0;
> > +/* __tls_get_addr/___tls_get_addr symbol for TLS.  */
> > +static const char *tls_get_addr;
> >
> >  #if ((defined (OBJ_MAYBE_COFF) && defined (OBJ_MAYBE_AOUT)) \
> >       || defined (OBJ_ELF) || defined (OBJ_MAYBE_ELF) \
> > @@ -622,6 +627,21 @@ static int omit_lock_prefix = 0;
> >     "lock addl $0, (%{re}sp)".  */
> >  static int avoid_fence = 0;
> >
> > +/* Type of the previous instruction.  */
> > +static struct
> > +  {
> > +    segT seg;
> > +    const char *file;
> > +    const char *name;
> > +    unsigned int line;
> > +    enum last_insn_kind
> > +      {
> > +     last_insn_other = 0,
> > +     last_insn_directive,
> > +     last_insn_prefix
> > +      } kind;
> > +  } last_insn;
> > +
> >  /* 1 if the assembler should generate relax relocations.  */
> >
> >  static int generate_relax_relocations
> > @@ -635,6 +655,44 @@ static enum check_kind
> >    }
> >  sse_check, operand_check = check_warning;
> >
> > +/* Non-zero if branches should be aligned within power of 2 boundary.  */
> > +static int align_branch_power = 0;
> > +
> > +/* Types of branches to align.  */
> > +enum align_branch_kind
> > +  {
> > +    align_branch_none = 0,
> > +    align_branch_jcc = 1,
> > +    align_branch_fused = 2,
> > +    align_branch_jmp = 3,
> > +    align_branch_call = 4,
> > +    align_branch_indirect = 5,
> > +    align_branch_ret = 6
> > +  };
> > +
> > +/* Type bits of branches to align.  */
> > +enum align_branch_bit
> > +  {
> > +    align_branch_jcc_bit = 1 << align_branch_jcc,
> > +    align_branch_fused_bit = 1 << align_branch_fused,
> > +    align_branch_jmp_bit = 1 << align_branch_jmp,
> > +    align_branch_call_bit = 1 << align_branch_call,
> > +    align_branch_indirect_bit = 1 << align_branch_indirect,
> > +    align_branch_ret_bit = 1 << align_branch_ret
> > +  };
> > +
> > +static unsigned int align_branch = (align_branch_jcc_bit
> > +                                 | align_branch_fused_bit
> > +                                 | align_branch_jmp_bit);
> > +
> > +/* The maximum padding size for fused jcc.  CMP like instruction can
> > +   be 9 bytes and jcc can be 6 bytes.  Leave room just in case for
> > +   prefixes.   */
> > +#define MAX_FUSED_JCC_PADDING_SIZE 20
> > +
> > +/* The maximum number of prefixes added for an instruction.  */
> > +static unsigned int align_branch_prefix_size = 5;
> > +
> >  /* Optimization:
> >     1. Clear the REX_W bit with register operand if possible.
> >     2. Above plus use 128bit vector instruction to clear the full vector
> > @@ -738,12 +796,19 @@ int x86_cie_data_alignment;
> >  /* Interface to relax_segment.
> >     There are 3 major relax states for 386 jump insns because the
> >     different types of jumps add different sizes to frags when we're
> > -   figuring out what sort of jump to choose to reach a given label.  */
> > +   figuring out what sort of jump to choose to reach a given label.
> > +
> > +   BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING are used to align
> > +   branches which are handled by md_estimate_size_before_relax() and
> > +   i386_generic_table_relax_frag().  */
> >
> >  /* Types.  */
> >  #define UNCOND_JUMP 0
> >  #define COND_JUMP 1
> >  #define COND_JUMP86 2
> > +#define BRANCH_PADDING 3
> > +#define BRANCH_PREFIX 4
> > +#define FUSED_JCC_PADDING 5
> >
> >  /* Sizes.  */
> >  #define CODE16       1
> > @@ -1384,6 +1449,12 @@ i386_generate_nops (fragS *fragP, char *where, offsetT count, int limit)
> >      case rs_fill_nop:
> >      case rs_align_code:
> >        break;
> > +    case rs_machine_dependent:
> > +      /* Allow NOP padding for jumps and calls.  */
> > +      if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PADDING
> > +       || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == FUSED_JCC_PADDING)
> > +     break;
> > +      /* Fall through.  */
> >      default:
> >        return;
> >      }
> > @@ -1528,7 +1599,7 @@ i386_generate_nops (fragS *fragP, char *where, offsetT count, int limit)
> >         return;
> >       }
> >      }
> > -  else
> > +  else if (fragP->fr_type != rs_machine_dependent)
> >      fragP->fr_var = count;
> >
> >    if ((count / max_single_nop_size) > max_number_of_nops)
> > @@ -3011,6 +3082,11 @@ md_begin (void)
> >        x86_dwarf2_return_column = 8;
> >        x86_cie_data_alignment = -4;
> >      }
> > +
> > +  /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it
> > +     can be turned into BRANCH_PREFIX frag.  */
> > +  if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)
> > +    abort ();
> >  }
> >
> >  void
> > @@ -4536,6 +4612,17 @@ md_assemble (char *line)
> >
> >    /* We are ready to output the insn.  */
> >    output_insn ();
> > +
> > +  last_insn.seg = now_seg;
> > +
> > +  if (i.tm.opcode_modifier.isprefix)
> > +    {
> > +      last_insn.kind = last_insn_prefix;
> > +      last_insn.name = i.tm.name;
> > +      last_insn.file = as_where (&last_insn.line);
> > +    }
> > +  else
> > +    last_insn.kind = last_insn_other;
> >  }
> >
> >  static char *
> > @@ -8193,11 +8280,206 @@ encoding_length (const fragS *start_frag, offsetT start_off,
> >    return len - start_off + (frag_now_ptr - frag_now->fr_literal);
> >  }
> >
> > +/* Return 1 for test, and, cmp, add, sub, inc and dec which may
> > +   be macro-fused with conditional jumps.  */
> > +
> > +static int
> > +maybe_fused_with_jcc_p (void)
> > +{
> > +  /* No RIP address.  */
> > +  if (i.base_reg && i.base_reg->reg_num == RegIP)
> > +    return 0;
> > +
> > +  /* No VEX/EVEX encoding.  */
> > +  if (is_any_vex_encoding (&i.tm))
> > +    return 0;
> > +
> > +  /* and, add, sub with destination register.  */
> > +  if ((i.tm.base_opcode >= 0x20 && i.tm.base_opcode <= 0x25)
> > +      || i.tm.base_opcode <= 5
> > +      || (i.tm.base_opcode >= 0x28 && i.tm.base_opcode <= 0x2d)
> > +      || ((i.tm.base_opcode | 3) == 0x83
> > +       && ((i.tm.extension_opcode | 1) == 0x5
>
> One more minor suggestion to reduce the number of branches to
> result from this code: Just like the "| 1" here, couldn't you
>
>   if (((i.tm.base_opcode | 8) >= 0x28 && (i.tm.base_opcode | 8) <= 0x2d)

We are working on a followup patch to separate and from add/sub.

> > +           || i.tm.extension_opcode == 0x0)))
> > +    return (i.types[1].bitfield.class == Reg
> > +         || i.types[1].bitfield.instance == Accum);
> > +
> > +  /* test, cmp with any register.  */
> > +  if ((i.tm.base_opcode | 1) == 0x85
> > +      || (i.tm.base_opcode | 1) == 0xa9
> > +      || ((i.tm.base_opcode | 1) == 0xf7
> > +       && i.tm.extension_opcode == 0)
> > +      || (i.tm.base_opcode >= 0x38 && i.tm.base_opcode <= 0x3d)
> > +      || ((i.tm.base_opcode | 3) == 0x83
> > +       && (i.tm.extension_opcode == 0x7)))
> > +    return (i.types[0].bitfield.class == Reg
> > +         || i.types[0].bitfield.instance == Accum
> > +         || i.types[1].bitfield.class == Reg
> > +         || i.types[1].bitfield.instance == Accum);
> > +
> > +  /* inc, dec with any register.   */
> > +  if ((i.tm.cpu_flags.bitfield.cpuno64
> > +       && (i.tm.base_opcode | 0xf) == 0x4f)
> > +      || ((i.tm.base_opcode | 1) == 0xff
> > +       && (i.tm.extension_opcode | 1) == 0x1))
>
> Maybe
>
>           && i.tm.extension_opcode <= 0x1)
>
> ?

Will change it.

> > +/* Return 1 if a BRANCH_PREFIX frag should be generated.  */
> > +
> > +static int
> > +add_branch_prefix_frag_p (void)
> > +{
> > +  /* NB: Don't work with COND_JUMP86 without i386.  Don't add prefix
> > +     to PadLock instructions since they include prefixes in opcode.  */
> > +  if (!align_branch_power
> > +      || !align_branch_prefix_size
> > +      || now_seg == absolute_section
> > +      || i.tm.cpu_flags.bitfield.cpupadlock
>
> Didn't you confirm you'd take care of here of other insns than just
> the PadLock ones including prefixes in their base_opcode? (I still

I added the check for i.tm.opcode_length.   PadLock is the only
exception.

> don't see what's wrong in this case, as you only mean to add
> segment overrides, but that's a separate aspect.)
>
> > @@ -8473,9 +8815,105 @@ output_insn (void)
> >         if (j > 15)
> >           as_warn (_("instruction length of %u bytes exceeds the limit of 15"),
> >                    j);
> > +       else if (fragP)
> > +         {
> > +           /* NB: Don't add prefix with GOTPC relocation since
> > +              output_disp() above depends on the fixed encoding
> > +              length.  Can't add prefix with TLS relocation since
> > +              it breaks TLS linker optimization.  */
> > +           unsigned int max = i.has_gotpc_tls_reloc ? 0 : 15 - j;
> > +           /* Prefix count on the current instruction.  */
> > +           unsigned int count = i.vex.length;
> > +           unsigned int k;
> > +           for (k = 0; k < ARRAY_SIZE (i.prefix); k++)
> > +             /* REX byte is encoded in VEX/EVEX prefix.  */
> > +             if (i.prefix[k] && (k != REX_PREFIX || !i.vex.length))
> > +               count++;
> > +
> > +           /* Count SSE prefix.  */
> > +           if (!i.vex.length)
> > +             switch (i.tm.opcode_length)
> > +               {
> > +               case 3:
> > +                 if (((i.tm.base_opcode >> 16) & 0xff) == 0xf)
> > +                   {
> > +                     count++;
> > +                     switch ((i.tm.base_opcode >> 8) & 0xff)
> > +                       {
> > +                       case 0x38:
> > +                       case 0x3a:
> > +                         count++;
> > +                         break;
> > +                       default:
> > +                         break;
> > +                       }
> > +                   }
> > +                 break;
> > +               case 2:
> > +                 if (((i.tm.base_opcode >> 8) & 0xff) == 0xf)
> > +                   count++;
>
> In particular (but not only) for this case the "SSE" in the comment
> looks sufficiently misleading. How about "extended opcode maps"?
>

Will make the change.



--
H.J.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]