This is the mail archive of the
binutils@sourceware.org
mailing list for the binutils project.
Re: V3 [PATCH 2/4] i386: Align branches within a fixed boundary
On 06.12.2019 20:10, H.J. Lu wrote:
> Add 3 command-line options to align branches within a fixed boundary
> with segment prefixes or NOPs:
>
> 1. -malign-branch-boundary=NUM aligns branches within NUM byte boundary.
> 2. -malign-branch=TYPE[+TYPE...] specifies types of branches to align.
> The supported branches are:
> a. Conditional jump.
> b. Fused conditional jump.
> c. Unconditional jump.
> d. Call.
> e. Ret.
> f. Indirect jump and call.
> 3. -malign-branch-prefix-size=NUM aligns branches with NUM segment
> prefixes per instruction.
>
> 3 new rs_machine_dependent frag types are added:
>
> 1. BRANCH_PADDING. The variable size frag to insert NOP before branch.
> 2. BRANCH_PREFIX. The variable size frag to insert segment prefixes to
> an instruction. The choices of prefixes are:
> a. Use the existing segment prefix if there is one.
> b. Use CS segment prefix in 64-bit mode.
> c. In 32-bit mode, use SS segment prefix with ESP/EBP base register
> and use DS segment prefix without ESP/EBP base register.
> 3. FUSED_JCC_PADDING. The variable size frag to insert NOP before fused
> conditional jump.
>
> The new rs_machine_dependent frags aren't inserted if the previous item
> is a prefix or a constant directive, which may be used to hardcode an
> instruction, since there is no clear instruction boundary. Segment
> prefixes and NOP padding are disabled before relaxable TLS relocations
> and tls_get_addr calls to keep TLS instruction sequence unchanged.
>
> md_estimate_size_before_relax() and i386_generic_table_relax_frag() are
> used to handled BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING frags.
> i386_generic_table_relax_frag() grows or shrinks sizes of segment prefix
> and NOP to align the next branch frag:
>
> 1. First try to add segment prefixes to instructions before a branch.
> 2. If there is no sufficient room to add segment prefixes, NOP will be
> inserted before a branch.
>
> * config/tc-i386.c (_i386_insn): Add has_gotpc_tls_reloc.
> (tls_get_addr): New.
> (last_insn): New.
> (align_branch_power): New.
> (align_branch_kind): New.
> (align_branch_bit): New.
> (align_branch): New.
> (MAX_FUSED_JCC_PADDING_SIZE): New.
> (align_branch_prefix_size): New.
> (BRANCH_PADDING): New.
> (BRANCH_PREFIX): New.
> (FUSED_JCC_PADDING): New.
> (i386_generate_nops): Support BRANCH_PADDING and FUSED_JCC_PADDING.
> (md_begin): Abort if align_branch_prefix_size <
> MAX_FUSED_JCC_PADDING_SIZE.
> (md_assemble): Set last_insn.
> (maybe_fused_with_jcc_p): New.
> (add_fused_jcc_padding_frag_p): New.
> (add_branch_prefix_frag_p): New.
> (add_branch_padding_frag_p): New.
> (output_insn): Generate a BRANCH_PADDING, FUSED_JCC_PADDING or
> BRANCH_PREFIX frag and terminate each frag to align branches.
> (output_disp): Set i.has_gotpc_tls_reloc to TRUE for GOTPC and
> relaxable TLS relocations.
> (output_imm): Likewise.
> (i386_next_non_empty_frag): New.
> (i386_next_jcc_frag): New.
> (i386_classify_machine_dependent_frag): New.
> (i386_branch_padding_size): New.
> (i386_generic_table_relax_frag): New.
> (md_estimate_size_before_relax): Handle COND_JUMP_PADDING,
> FUSED_JCC_PADDING and COND_JUMP_PREFIX frags.
> (md_convert_frag): Handle BRANCH_PADDING, BRANCH_PREFIX and
> FUSED_JCC_PADDING frags.
> (OPTION_MALIGN_BRANCH_BOUNDARY): New.
> (OPTION_MALIGN_BRANCH_PREFIX_SIZE): New.
> (OPTION_MALIGN_BRANCH): New.
> (md_longopts): Add -malign-branch-boundary=,
> -malign-branch-prefix-size= and -malign-branch=.
> (md_parse_option): Handle -malign-branch-boundary=,
> -malign-branch-prefix-size= and -malign-branch=.
> (md_show_usage): Display -malign-branch-boundary=,
> -malign-branch-prefix-size= and -malign-branch=.
> (i386_target_format): Set tls_get_addr.
> (i386_cons_align): New.
> * config/tc-i386.h (i386_cons_align): New.
> (md_cons_align): New.
> (i386_generic_table_relax_frag): New.
> (md_generic_table_relax_frag): New.
> (i386_tc_frag_data): Add u, padding_address, length,
> max_prefix_length, prefix_length, default_prefix, cmp_size,
> classified and branch_type.
> (TC_FRAG_INIT): Initialize u, padding_address, length,
> max_prefix_length, prefix_length, default_prefix, cmp_size,
> classified and branch_type.
> * doc/c-i386.texi: Document -malign-branch-boundary=,
> -malign-branch= and -malign-branch-prefix-size=.
> ---
> gas/config/tc-i386.c | 1046 +++++++++++++++++++++++++++++++++++++++++-
> gas/config/tc-i386.h | 31 ++
> gas/doc/c-i386.texi | 26 ++
> 3 files changed, 1100 insertions(+), 3 deletions(-)
>
> diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
> index b62af34268..0ab6651f24 100644
> --- a/gas/config/tc-i386.c
> +++ b/gas/config/tc-i386.c
> @@ -368,6 +368,9 @@ struct _i386_insn
> /* Has ZMM register operands. */
> bfd_boolean has_regzmm;
>
> + /* Has GOTPC or TLS relocation. */
> + bfd_boolean has_gotpc_tls_reloc;
> +
> /* RM and SIB are the modrm byte and the sib byte where the
> addressing modes of this insn are encoded. */
> modrm_byte rm;
> @@ -562,6 +565,8 @@ static enum flag_code flag_code;
> static unsigned int object_64bit;
> static unsigned int disallow_64bit_reloc;
> static int use_rela_relocations = 0;
> +/* __tls_get_addr/___tls_get_addr symbol for TLS. */
> +static const char *tls_get_addr;
>
> #if ((defined (OBJ_MAYBE_COFF) && defined (OBJ_MAYBE_AOUT)) \
> || defined (OBJ_ELF) || defined (OBJ_MAYBE_ELF) \
> @@ -622,6 +627,21 @@ static int omit_lock_prefix = 0;
> "lock addl $0, (%{re}sp)". */
> static int avoid_fence = 0;
>
> +/* Type of the previous instruction. */
> +static struct
> + {
> + segT seg;
> + const char *file;
> + const char *name;
> + unsigned int line;
> + enum last_insn_kind
> + {
> + last_insn_other = 0,
> + last_insn_directive,
> + last_insn_prefix
> + } kind;
> + } last_insn;
> +
> /* 1 if the assembler should generate relax relocations. */
>
> static int generate_relax_relocations
> @@ -635,6 +655,44 @@ static enum check_kind
> }
> sse_check, operand_check = check_warning;
>
> +/* Non-zero if branches should be aligned within power of 2 boundary. */
> +static int align_branch_power = 0;
> +
> +/* Types of branches to align. */
> +enum align_branch_kind
> + {
> + align_branch_none = 0,
> + align_branch_jcc = 1,
> + align_branch_fused = 2,
> + align_branch_jmp = 3,
> + align_branch_call = 4,
> + align_branch_indirect = 5,
> + align_branch_ret = 6
> + };
> +
> +/* Type bits of branches to align. */
> +enum align_branch_bit
> + {
> + align_branch_jcc_bit = 1 << align_branch_jcc,
> + align_branch_fused_bit = 1 << align_branch_fused,
> + align_branch_jmp_bit = 1 << align_branch_jmp,
> + align_branch_call_bit = 1 << align_branch_call,
> + align_branch_indirect_bit = 1 << align_branch_indirect,
> + align_branch_ret_bit = 1 << align_branch_ret
> + };
> +
> +static unsigned int align_branch = (align_branch_jcc_bit
> + | align_branch_fused_bit
> + | align_branch_jmp_bit);
> +
> +/* The maximum padding size for fused jcc. CMP like instruction can
> + be 9 bytes and jcc can be 6 bytes. Leave room just in case for
> + prefixes. */
> +#define MAX_FUSED_JCC_PADDING_SIZE 20
> +
> +/* The maximum number of prefixes added for an instruction. */
> +static unsigned int align_branch_prefix_size = 5;
> +
> /* Optimization:
> 1. Clear the REX_W bit with register operand if possible.
> 2. Above plus use 128bit vector instruction to clear the full vector
> @@ -738,12 +796,19 @@ int x86_cie_data_alignment;
> /* Interface to relax_segment.
> There are 3 major relax states for 386 jump insns because the
> different types of jumps add different sizes to frags when we're
> - figuring out what sort of jump to choose to reach a given label. */
> + figuring out what sort of jump to choose to reach a given label.
> +
> + BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING are used to align
> + branches which are handled by md_estimate_size_before_relax() and
> + i386_generic_table_relax_frag(). */
>
> /* Types. */
> #define UNCOND_JUMP 0
> #define COND_JUMP 1
> #define COND_JUMP86 2
> +#define BRANCH_PADDING 3
> +#define BRANCH_PREFIX 4
> +#define FUSED_JCC_PADDING 5
>
> /* Sizes. */
> #define CODE16 1
> @@ -1384,6 +1449,12 @@ i386_generate_nops (fragS *fragP, char *where, offsetT count, int limit)
> case rs_fill_nop:
> case rs_align_code:
> break;
> + case rs_machine_dependent:
> + /* Allow NOP padding for jumps and calls. */
> + if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PADDING
> + || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == FUSED_JCC_PADDING)
> + break;
> + /* Fall through. */
> default:
> return;
> }
> @@ -1528,7 +1599,7 @@ i386_generate_nops (fragS *fragP, char *where, offsetT count, int limit)
> return;
> }
> }
> - else
> + else if (fragP->fr_type != rs_machine_dependent)
> fragP->fr_var = count;
>
> if ((count / max_single_nop_size) > max_number_of_nops)
> @@ -3011,6 +3082,11 @@ md_begin (void)
> x86_dwarf2_return_column = 8;
> x86_cie_data_alignment = -4;
> }
> +
> + /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it
> + can be turned into BRANCH_PREFIX frag. */
> + if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)
> + abort ();
> }
>
> void
> @@ -4536,6 +4612,17 @@ md_assemble (char *line)
>
> /* We are ready to output the insn. */
> output_insn ();
> +
> + last_insn.seg = now_seg;
> +
> + if (i.tm.opcode_modifier.isprefix)
> + {
> + last_insn.kind = last_insn_prefix;
> + last_insn.name = i.tm.name;
> + last_insn.file = as_where (&last_insn.line);
> + }
> + else
> + last_insn.kind = last_insn_other;
> }
>
> static char *
> @@ -8193,11 +8280,206 @@ encoding_length (const fragS *start_frag, offsetT start_off,
> return len - start_off + (frag_now_ptr - frag_now->fr_literal);
> }
>
> +/* Return 1 for test, and, cmp, add, sub, inc and dec which may
> + be macro-fused with conditional jumps. */
> +
> +static int
> +maybe_fused_with_jcc_p (void)
> +{
> + /* No RIP address. */
> + if (i.base_reg && i.base_reg->reg_num == RegIP)
> + return 0;
> +
> + /* No VEX/EVEX encoding. */
> + if (is_any_vex_encoding (&i.tm))
> + return 0;
> +
> + /* and, add, sub with destination register. */
> + if ((i.tm.base_opcode >= 0x20 && i.tm.base_opcode <= 0x25)
> + || i.tm.base_opcode <= 5
> + || (i.tm.base_opcode >= 0x28 && i.tm.base_opcode <= 0x2d)
> + || ((i.tm.base_opcode | 3) == 0x83
> + && ((i.tm.extension_opcode | 1) == 0x5
One more minor suggestion to reduce the number of branches to
result from this code: Just like the "| 1" here, couldn't you
if (((i.tm.base_opcode | 8) >= 0x28 && (i.tm.base_opcode | 8) <= 0x2d)
> + || i.tm.extension_opcode == 0x0)))
> + return (i.types[1].bitfield.class == Reg
> + || i.types[1].bitfield.instance == Accum);
> +
> + /* test, cmp with any register. */
> + if ((i.tm.base_opcode | 1) == 0x85
> + || (i.tm.base_opcode | 1) == 0xa9
> + || ((i.tm.base_opcode | 1) == 0xf7
> + && i.tm.extension_opcode == 0)
> + || (i.tm.base_opcode >= 0x38 && i.tm.base_opcode <= 0x3d)
> + || ((i.tm.base_opcode | 3) == 0x83
> + && (i.tm.extension_opcode == 0x7)))
> + return (i.types[0].bitfield.class == Reg
> + || i.types[0].bitfield.instance == Accum
> + || i.types[1].bitfield.class == Reg
> + || i.types[1].bitfield.instance == Accum);
> +
> + /* inc, dec with any register. */
> + if ((i.tm.cpu_flags.bitfield.cpuno64
> + && (i.tm.base_opcode | 0xf) == 0x4f)
> + || ((i.tm.base_opcode | 1) == 0xff
> + && (i.tm.extension_opcode | 1) == 0x1))
Maybe
&& i.tm.extension_opcode <= 0x1)
?
> +/* Return 1 if a BRANCH_PREFIX frag should be generated. */
> +
> +static int
> +add_branch_prefix_frag_p (void)
> +{
> + /* NB: Don't work with COND_JUMP86 without i386. Don't add prefix
> + to PadLock instructions since they include prefixes in opcode. */
> + if (!align_branch_power
> + || !align_branch_prefix_size
> + || now_seg == absolute_section
> + || i.tm.cpu_flags.bitfield.cpupadlock
Didn't you confirm you'd take care of here of other insns than just
the PadLock ones including prefixes in their base_opcode? (I still
don't see what's wrong in this case, as you only mean to add
segment overrides, but that's a separate aspect.)
> @@ -8473,9 +8815,105 @@ output_insn (void)
> if (j > 15)
> as_warn (_("instruction length of %u bytes exceeds the limit of 15"),
> j);
> + else if (fragP)
> + {
> + /* NB: Don't add prefix with GOTPC relocation since
> + output_disp() above depends on the fixed encoding
> + length. Can't add prefix with TLS relocation since
> + it breaks TLS linker optimization. */
> + unsigned int max = i.has_gotpc_tls_reloc ? 0 : 15 - j;
> + /* Prefix count on the current instruction. */
> + unsigned int count = i.vex.length;
> + unsigned int k;
> + for (k = 0; k < ARRAY_SIZE (i.prefix); k++)
> + /* REX byte is encoded in VEX/EVEX prefix. */
> + if (i.prefix[k] && (k != REX_PREFIX || !i.vex.length))
> + count++;
> +
> + /* Count SSE prefix. */
> + if (!i.vex.length)
> + switch (i.tm.opcode_length)
> + {
> + case 3:
> + if (((i.tm.base_opcode >> 16) & 0xff) == 0xf)
> + {
> + count++;
> + switch ((i.tm.base_opcode >> 8) & 0xff)
> + {
> + case 0x38:
> + case 0x3a:
> + count++;
> + break;
> + default:
> + break;
> + }
> + }
> + break;
> + case 2:
> + if (((i.tm.base_opcode >> 8) & 0xff) == 0xf)
> + count++;
In particular (but not only) for this case the "SSE" in the comment
looks sufficiently misleading. How about "extended opcode maps"?
Jan