This is the mail archive of the
binutils@sourceware.org
mailing list for the binutils project.
Re: V3 [PATCH 2/4] i386: Align branches within a fixed boundary
On Mon, Dec 9, 2019 at 6:54 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 06.12.2019 20:10, H.J. Lu wrote:
> > Add 3 command-line options to align branches within a fixed boundary
> > with segment prefixes or NOPs:
> >
> > 1. -malign-branch-boundary=NUM aligns branches within NUM byte boundary.
> > 2. -malign-branch=TYPE[+TYPE...] specifies types of branches to align.
> > The supported branches are:
> > a. Conditional jump.
> > b. Fused conditional jump.
> > c. Unconditional jump.
> > d. Call.
> > e. Ret.
> > f. Indirect jump and call.
> > 3. -malign-branch-prefix-size=NUM aligns branches with NUM segment
> > prefixes per instruction.
> >
> > 3 new rs_machine_dependent frag types are added:
> >
> > 1. BRANCH_PADDING. The variable size frag to insert NOP before branch.
> > 2. BRANCH_PREFIX. The variable size frag to insert segment prefixes to
> > an instruction. The choices of prefixes are:
> > a. Use the existing segment prefix if there is one.
> > b. Use CS segment prefix in 64-bit mode.
> > c. In 32-bit mode, use SS segment prefix with ESP/EBP base register
> > and use DS segment prefix without ESP/EBP base register.
> > 3. FUSED_JCC_PADDING. The variable size frag to insert NOP before fused
> > conditional jump.
> >
> > The new rs_machine_dependent frags aren't inserted if the previous item
> > is a prefix or a constant directive, which may be used to hardcode an
> > instruction, since there is no clear instruction boundary. Segment
> > prefixes and NOP padding are disabled before relaxable TLS relocations
> > and tls_get_addr calls to keep TLS instruction sequence unchanged.
> >
> > md_estimate_size_before_relax() and i386_generic_table_relax_frag() are
> > used to handled BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING frags.
> > i386_generic_table_relax_frag() grows or shrinks sizes of segment prefix
> > and NOP to align the next branch frag:
> >
> > 1. First try to add segment prefixes to instructions before a branch.
> > 2. If there is no sufficient room to add segment prefixes, NOP will be
> > inserted before a branch.
> >
> > * config/tc-i386.c (_i386_insn): Add has_gotpc_tls_reloc.
> > (tls_get_addr): New.
> > (last_insn): New.
> > (align_branch_power): New.
> > (align_branch_kind): New.
> > (align_branch_bit): New.
> > (align_branch): New.
> > (MAX_FUSED_JCC_PADDING_SIZE): New.
> > (align_branch_prefix_size): New.
> > (BRANCH_PADDING): New.
> > (BRANCH_PREFIX): New.
> > (FUSED_JCC_PADDING): New.
> > (i386_generate_nops): Support BRANCH_PADDING and FUSED_JCC_PADDING.
> > (md_begin): Abort if align_branch_prefix_size <
> > MAX_FUSED_JCC_PADDING_SIZE.
> > (md_assemble): Set last_insn.
> > (maybe_fused_with_jcc_p): New.
> > (add_fused_jcc_padding_frag_p): New.
> > (add_branch_prefix_frag_p): New.
> > (add_branch_padding_frag_p): New.
> > (output_insn): Generate a BRANCH_PADDING, FUSED_JCC_PADDING or
> > BRANCH_PREFIX frag and terminate each frag to align branches.
> > (output_disp): Set i.has_gotpc_tls_reloc to TRUE for GOTPC and
> > relaxable TLS relocations.
> > (output_imm): Likewise.
> > (i386_next_non_empty_frag): New.
> > (i386_next_jcc_frag): New.
> > (i386_classify_machine_dependent_frag): New.
> > (i386_branch_padding_size): New.
> > (i386_generic_table_relax_frag): New.
> > (md_estimate_size_before_relax): Handle COND_JUMP_PADDING,
> > FUSED_JCC_PADDING and COND_JUMP_PREFIX frags.
> > (md_convert_frag): Handle BRANCH_PADDING, BRANCH_PREFIX and
> > FUSED_JCC_PADDING frags.
> > (OPTION_MALIGN_BRANCH_BOUNDARY): New.
> > (OPTION_MALIGN_BRANCH_PREFIX_SIZE): New.
> > (OPTION_MALIGN_BRANCH): New.
> > (md_longopts): Add -malign-branch-boundary=,
> > -malign-branch-prefix-size= and -malign-branch=.
> > (md_parse_option): Handle -malign-branch-boundary=,
> > -malign-branch-prefix-size= and -malign-branch=.
> > (md_show_usage): Display -malign-branch-boundary=,
> > -malign-branch-prefix-size= and -malign-branch=.
> > (i386_target_format): Set tls_get_addr.
> > (i386_cons_align): New.
> > * config/tc-i386.h (i386_cons_align): New.
> > (md_cons_align): New.
> > (i386_generic_table_relax_frag): New.
> > (md_generic_table_relax_frag): New.
> > (i386_tc_frag_data): Add u, padding_address, length,
> > max_prefix_length, prefix_length, default_prefix, cmp_size,
> > classified and branch_type.
> > (TC_FRAG_INIT): Initialize u, padding_address, length,
> > max_prefix_length, prefix_length, default_prefix, cmp_size,
> > classified and branch_type.
> > * doc/c-i386.texi: Document -malign-branch-boundary=,
> > -malign-branch= and -malign-branch-prefix-size=.
> > ---
> > gas/config/tc-i386.c | 1046 +++++++++++++++++++++++++++++++++++++++++-
> > gas/config/tc-i386.h | 31 ++
> > gas/doc/c-i386.texi | 26 ++
> > 3 files changed, 1100 insertions(+), 3 deletions(-)
> >
> > diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
> > index b62af34268..0ab6651f24 100644
> > --- a/gas/config/tc-i386.c
> > +++ b/gas/config/tc-i386.c
> > @@ -368,6 +368,9 @@ struct _i386_insn
> > /* Has ZMM register operands. */
> > bfd_boolean has_regzmm;
> >
> > + /* Has GOTPC or TLS relocation. */
> > + bfd_boolean has_gotpc_tls_reloc;
> > +
> > /* RM and SIB are the modrm byte and the sib byte where the
> > addressing modes of this insn are encoded. */
> > modrm_byte rm;
> > @@ -562,6 +565,8 @@ static enum flag_code flag_code;
> > static unsigned int object_64bit;
> > static unsigned int disallow_64bit_reloc;
> > static int use_rela_relocations = 0;
> > +/* __tls_get_addr/___tls_get_addr symbol for TLS. */
> > +static const char *tls_get_addr;
> >
> > #if ((defined (OBJ_MAYBE_COFF) && defined (OBJ_MAYBE_AOUT)) \
> > || defined (OBJ_ELF) || defined (OBJ_MAYBE_ELF) \
> > @@ -622,6 +627,21 @@ static int omit_lock_prefix = 0;
> > "lock addl $0, (%{re}sp)". */
> > static int avoid_fence = 0;
> >
> > +/* Type of the previous instruction. */
> > +static struct
> > + {
> > + segT seg;
> > + const char *file;
> > + const char *name;
> > + unsigned int line;
> > + enum last_insn_kind
> > + {
> > + last_insn_other = 0,
> > + last_insn_directive,
> > + last_insn_prefix
> > + } kind;
> > + } last_insn;
> > +
> > /* 1 if the assembler should generate relax relocations. */
> >
> > static int generate_relax_relocations
> > @@ -635,6 +655,44 @@ static enum check_kind
> > }
> > sse_check, operand_check = check_warning;
> >
> > +/* Non-zero if branches should be aligned within power of 2 boundary. */
> > +static int align_branch_power = 0;
> > +
> > +/* Types of branches to align. */
> > +enum align_branch_kind
> > + {
> > + align_branch_none = 0,
> > + align_branch_jcc = 1,
> > + align_branch_fused = 2,
> > + align_branch_jmp = 3,
> > + align_branch_call = 4,
> > + align_branch_indirect = 5,
> > + align_branch_ret = 6
> > + };
> > +
> > +/* Type bits of branches to align. */
> > +enum align_branch_bit
> > + {
> > + align_branch_jcc_bit = 1 << align_branch_jcc,
> > + align_branch_fused_bit = 1 << align_branch_fused,
> > + align_branch_jmp_bit = 1 << align_branch_jmp,
> > + align_branch_call_bit = 1 << align_branch_call,
> > + align_branch_indirect_bit = 1 << align_branch_indirect,
> > + align_branch_ret_bit = 1 << align_branch_ret
> > + };
> > +
> > +static unsigned int align_branch = (align_branch_jcc_bit
> > + | align_branch_fused_bit
> > + | align_branch_jmp_bit);
> > +
> > +/* The maximum padding size for fused jcc. CMP like instruction can
> > + be 9 bytes and jcc can be 6 bytes. Leave room just in case for
> > + prefixes. */
> > +#define MAX_FUSED_JCC_PADDING_SIZE 20
> > +
> > +/* The maximum number of prefixes added for an instruction. */
> > +static unsigned int align_branch_prefix_size = 5;
> > +
> > /* Optimization:
> > 1. Clear the REX_W bit with register operand if possible.
> > 2. Above plus use 128bit vector instruction to clear the full vector
> > @@ -738,12 +796,19 @@ int x86_cie_data_alignment;
> > /* Interface to relax_segment.
> > There are 3 major relax states for 386 jump insns because the
> > different types of jumps add different sizes to frags when we're
> > - figuring out what sort of jump to choose to reach a given label. */
> > + figuring out what sort of jump to choose to reach a given label.
> > +
> > + BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING are used to align
> > + branches which are handled by md_estimate_size_before_relax() and
> > + i386_generic_table_relax_frag(). */
> >
> > /* Types. */
> > #define UNCOND_JUMP 0
> > #define COND_JUMP 1
> > #define COND_JUMP86 2
> > +#define BRANCH_PADDING 3
> > +#define BRANCH_PREFIX 4
> > +#define FUSED_JCC_PADDING 5
> >
> > /* Sizes. */
> > #define CODE16 1
> > @@ -1384,6 +1449,12 @@ i386_generate_nops (fragS *fragP, char *where, offsetT count, int limit)
> > case rs_fill_nop:
> > case rs_align_code:
> > break;
> > + case rs_machine_dependent:
> > + /* Allow NOP padding for jumps and calls. */
> > + if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PADDING
> > + || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == FUSED_JCC_PADDING)
> > + break;
> > + /* Fall through. */
> > default:
> > return;
> > }
> > @@ -1528,7 +1599,7 @@ i386_generate_nops (fragS *fragP, char *where, offsetT count, int limit)
> > return;
> > }
> > }
> > - else
> > + else if (fragP->fr_type != rs_machine_dependent)
> > fragP->fr_var = count;
> >
> > if ((count / max_single_nop_size) > max_number_of_nops)
> > @@ -3011,6 +3082,11 @@ md_begin (void)
> > x86_dwarf2_return_column = 8;
> > x86_cie_data_alignment = -4;
> > }
> > +
> > + /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it
> > + can be turned into BRANCH_PREFIX frag. */
> > + if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)
> > + abort ();
> > }
> >
> > void
> > @@ -4536,6 +4612,17 @@ md_assemble (char *line)
> >
> > /* We are ready to output the insn. */
> > output_insn ();
> > +
> > + last_insn.seg = now_seg;
> > +
> > + if (i.tm.opcode_modifier.isprefix)
> > + {
> > + last_insn.kind = last_insn_prefix;
> > + last_insn.name = i.tm.name;
> > + last_insn.file = as_where (&last_insn.line);
> > + }
> > + else
> > + last_insn.kind = last_insn_other;
> > }
> >
> > static char *
> > @@ -8193,11 +8280,206 @@ encoding_length (const fragS *start_frag, offsetT start_off,
> > return len - start_off + (frag_now_ptr - frag_now->fr_literal);
> > }
> >
> > +/* Return 1 for test, and, cmp, add, sub, inc and dec which may
> > + be macro-fused with conditional jumps. */
> > +
> > +static int
> > +maybe_fused_with_jcc_p (void)
> > +{
> > + /* No RIP address. */
> > + if (i.base_reg && i.base_reg->reg_num == RegIP)
> > + return 0;
> > +
> > + /* No VEX/EVEX encoding. */
> > + if (is_any_vex_encoding (&i.tm))
> > + return 0;
> > +
> > + /* and, add, sub with destination register. */
> > + if ((i.tm.base_opcode >= 0x20 && i.tm.base_opcode <= 0x25)
> > + || i.tm.base_opcode <= 5
> > + || (i.tm.base_opcode >= 0x28 && i.tm.base_opcode <= 0x2d)
> > + || ((i.tm.base_opcode | 3) == 0x83
> > + && ((i.tm.extension_opcode | 1) == 0x5
>
> One more minor suggestion to reduce the number of branches to
> result from this code: Just like the "| 1" here, couldn't you
>
> if (((i.tm.base_opcode | 8) >= 0x28 && (i.tm.base_opcode | 8) <= 0x2d)
We are working on a followup patch to separate and from add/sub.
> > + || i.tm.extension_opcode == 0x0)))
> > + return (i.types[1].bitfield.class == Reg
> > + || i.types[1].bitfield.instance == Accum);
> > +
> > + /* test, cmp with any register. */
> > + if ((i.tm.base_opcode | 1) == 0x85
> > + || (i.tm.base_opcode | 1) == 0xa9
> > + || ((i.tm.base_opcode | 1) == 0xf7
> > + && i.tm.extension_opcode == 0)
> > + || (i.tm.base_opcode >= 0x38 && i.tm.base_opcode <= 0x3d)
> > + || ((i.tm.base_opcode | 3) == 0x83
> > + && (i.tm.extension_opcode == 0x7)))
> > + return (i.types[0].bitfield.class == Reg
> > + || i.types[0].bitfield.instance == Accum
> > + || i.types[1].bitfield.class == Reg
> > + || i.types[1].bitfield.instance == Accum);
> > +
> > + /* inc, dec with any register. */
> > + if ((i.tm.cpu_flags.bitfield.cpuno64
> > + && (i.tm.base_opcode | 0xf) == 0x4f)
> > + || ((i.tm.base_opcode | 1) == 0xff
> > + && (i.tm.extension_opcode | 1) == 0x1))
>
> Maybe
>
> && i.tm.extension_opcode <= 0x1)
>
> ?
Will change it.
> > +/* Return 1 if a BRANCH_PREFIX frag should be generated. */
> > +
> > +static int
> > +add_branch_prefix_frag_p (void)
> > +{
> > + /* NB: Don't work with COND_JUMP86 without i386. Don't add prefix
> > + to PadLock instructions since they include prefixes in opcode. */
> > + if (!align_branch_power
> > + || !align_branch_prefix_size
> > + || now_seg == absolute_section
> > + || i.tm.cpu_flags.bitfield.cpupadlock
>
> Didn't you confirm you'd take care of here of other insns than just
> the PadLock ones including prefixes in their base_opcode? (I still
I added the check for i.tm.opcode_length. PadLock is the only
exception.
> don't see what's wrong in this case, as you only mean to add
> segment overrides, but that's a separate aspect.)
>
> > @@ -8473,9 +8815,105 @@ output_insn (void)
> > if (j > 15)
> > as_warn (_("instruction length of %u bytes exceeds the limit of 15"),
> > j);
> > + else if (fragP)
> > + {
> > + /* NB: Don't add prefix with GOTPC relocation since
> > + output_disp() above depends on the fixed encoding
> > + length. Can't add prefix with TLS relocation since
> > + it breaks TLS linker optimization. */
> > + unsigned int max = i.has_gotpc_tls_reloc ? 0 : 15 - j;
> > + /* Prefix count on the current instruction. */
> > + unsigned int count = i.vex.length;
> > + unsigned int k;
> > + for (k = 0; k < ARRAY_SIZE (i.prefix); k++)
> > + /* REX byte is encoded in VEX/EVEX prefix. */
> > + if (i.prefix[k] && (k != REX_PREFIX || !i.vex.length))
> > + count++;
> > +
> > + /* Count SSE prefix. */
> > + if (!i.vex.length)
> > + switch (i.tm.opcode_length)
> > + {
> > + case 3:
> > + if (((i.tm.base_opcode >> 16) & 0xff) == 0xf)
> > + {
> > + count++;
> > + switch ((i.tm.base_opcode >> 8) & 0xff)
> > + {
> > + case 0x38:
> > + case 0x3a:
> > + count++;
> > + break;
> > + default:
> > + break;
> > + }
> > + }
> > + break;
> > + case 2:
> > + if (((i.tm.base_opcode >> 8) & 0xff) == 0xf)
> > + count++;
>
> In particular (but not only) for this case the "SSE" in the comment
> looks sufficiently misleading. How about "extended opcode maps"?
>
Will make the change.
--
H.J.