This is the mail archive of the
binutils@sourceware.org
mailing list for the binutils project.
Re: [PATCH 3/5] i386: Align branches within a fixed boundary
On Fri, Nov 15, 2019 at 1:16 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 15.11.2019 00:21, H.J. Lu wrote:
> > On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 12.11.2019 17:19, H.J. Lu wrote:
> >>> @@ -632,6 +652,31 @@ static enum check_kind
> >>> }
> >>> sse_check, operand_check = check_warning;
> >>>
> >>> +/* Non-zero if branches should be aligned within power of 2 boundary. */
> >>> +static int align_branch_power = 0;
> >>> +
> >>> +/* Types of branches to align. */
> >>> +enum align_branch_kind
> >>> + {
> >>> + align_branch_none = 0,
> >>> + align_branch_jcc = 1 << 0,
> >>> + align_branch_fused = 1 << 1,
> >>> + align_branch_jmp = 1 << 2,
> >>> + align_branch_call = 1 << 3,
> >>> + align_branch_indirect = 1 << 4,
> >>> + align_branch_ret = 1 << 5
> >>> + };
> >>> +
> >>> +static unsigned int align_branch = (align_branch_jcc
> >>> + | align_branch_fused
> >>> + | align_branch_jmp);
> >>> +
> >>> +/* The maximum padding size for fused jcc. */
> >>> +#define MAX_FUSED_JCC_PADDING_SIZE 20
> >>
> >> What is this number derived from?
> >
> > CMP can be 9 bytes and jcc can be 6 bytes. I leave some room
> > for prefixes.
>
> Can you extend the comment to this effect please?
I did:
/* The maximum padding size for fused jcc. CMP can be 9 bytes and jcc
can be 6 bytes. Leave room just in case for prefixes. */
#define MAX_FUSED_JCC_PADDING_SIZE 20
> >>> @@ -3012,6 +3070,11 @@ md_begin (void)
> >>> x86_dwarf2_return_column = 8;
> >>> x86_cie_data_alignment = -4;
> >>> }
> >>> +
> >>> + /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it
> >>> + can be turned into BRANCH_PREFIX frag. */
> >>> + if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)
> >>> + abort ();
> >>
> >> Wouldn't such a value better be rejected at command line option
> >> handling time?
> >
> > There is
> >
> > /* The maximum number of prefixes added for an instruction. */
> > static unsigned int align_branch_prefix_size = 5;
> >
> > I'd like to check it even if it isn't changed.
>
> Ok, okay.
>
> >> Like elsewhere I think you want to use i.tm checks here. This
> >> would also eliminate my worry of the code not catching cases
> >> where optimization changes the opcode in i.tm without also
> >> changing the name.
> >
> > It don't think CMP instructions can be replaced by totally different
> > instructions.
>
> But there's also e.g. AND in the set, which I would expect
> optimization to be able to switch _to_ (not from, as you were
> implying).
What other instructions can be encoded as AND?
> >>> +static int
> >>> +add_fused_jcc_padding_frag_p (void)
> >>> +{
> >>> + if (!align_branch_power
> >>> + || now_seg == absolute_section
> >>> + || !cpu_arch_flags.bitfield.cpui386
> >>
> >> Why the i386 dependency?
> >
> > It doesn't work with COND_JUMP86 without i386.
>
> Why would it not work? People using the option may need to
> re-code some Jcc to keep the targets reachable despite the
> inserted padding, but that's not a reason to suppress the
> adjustments altogether.
This option isn't for processors without i386.
> >>> +static int
> >>> +add_branch_prefix_frag_p (void)
> >>> +{
> >>> + if (!align_branch_power
> >>> + || now_seg == absolute_section
> >>> + || i.tm.cpu_flags.bitfield.cpupadlock
> >>> + || !cpu_arch_flags.bitfield.cpui386)
> >>
> >> Why the PadLock and i386 dependencies?
> >
> > Since PadLock instructions include prefixes in opcode, I don't
> > want to mess with them.
>
> How are PadLock insns different in this regard from e.g. SSE insns?
There are
if (i.tm.base_opcode & 0xff000000)
{
prefix = (i.tm.base_opcode >> 24) & 0xff;
if (!i.tm.cpu_flags.bitfield.cpupadlock
|| prefix != REPE_PREFIX_OPCODE
|| (i.prefix[REP_PREFIX] != REPE_PREFIX_OPCODE))
add_prefix (prefix);
I don't want to change it for PadLock.
> >>> + else if (i.tm.base_opcode == 0xc2
> >>> + || i.tm.base_opcode == 0xc3
> >>> + || i.tm.base_opcode == 0xca
> >>> + || i.tm.base_opcode == 0xcb)
> >>
> >> These will also match various VEX/XOP/EVEX encoded templates. Perhaps
> >> exclude any such right away in the first if()?
> >>
> >> Also you handle near and far returns here, but ...
> >
> > I will change it to
> >
> > else if (is_any_vex_encoding (&i.tm))
> > return 0;
>
> Why not move this even earlier in the function?
I'd like to delay is_any_vex_encoding after
if (i.tm.opcode_modifier.jump)
> > else if (i.tm.base_opcode == 0xc2 || i.tm.base_opcode == 0xc3)
>
> Maybe (i.tm.base_opcode | 1) == 0xc3?
I put it in.
> >>> @@ -8279,6 +8523,30 @@ output_insn (void)
> >>> insn_start_frag = frag_now;
> >>> insn_start_off = frag_now_fix ();
> >>>
> >>> + if (add_branch_padding_frag_p (&branch))
> >>> + {
> >>> + char *p;
> >>> + unsigned int max_branch_padding_size = 14;
> >>
> >> Where is this number coming from? Like above (and I think also in
> >> one more case below) any such seemingly arbitrary numbers could do
> >> with a comment.
> >
> > Branch can be 8 bytes. I leave some room for prefixes.
>
> Again please explain this in a comment.
I did:
/* Branch can be 8 bytes. Leave some room for prefixes. */
unsigned int max_branch_padding_size = 14;
> >>> + /* Use existing segment prefix if possible. Use CS
> >>> + segment prefix in 64-bit mode. In 32-bit mode, use SS
> >>> + segment prefix with ESP/EBP base register and use DS
> >>> + segment prefix without ESP/EBP base register. */
> >>
> >> Is there a reason for the preference of CS: in 64-bit mode?
> >
> > CS is ignored in 64-bit mode.
>
> As are DS, ES, and SS. I.e. my question could also be put as "Why
We pick CS.
> different rules for 64- and 32-bit code?"
Since CS isn't ignored in 32-bit mode, we can't use CS. We pick the
default segment register in 32-bit mode.
> >>> +static void
> >>> +i386_classify_machine_dependent_frag (fragS *fragP)
> >>> +{
> >>> + fragS *cmp_fragP;
> >>> + fragS *pad_fragP;
> >>> + fragS *branch_fragP;
> >>> + fragS *next_fragP;
> >>> + unsigned int max_prefix_length;
> >>> +
> >>> + if (fragP->tc_frag_data.classified)
> >>> + return;
> >>> +
> >>> + /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING. Convert
> >>> + FUSED_JCC_PADDING and merge BRANCH_PADDING. */
> >>> + for (next_fragP = fragP;
> >>> + next_fragP != NULL;
> >>> + next_fragP = next_fragP->fr_next)
> >>> + {
> >>> + next_fragP->tc_frag_data.classified = 1;
> >>> + if (next_fragP->fr_type == rs_machine_dependent)
> >>> + switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))
> >>> + {
> >>> + case BRANCH_PADDING:
> >>> + /* The BRANCH_PADDING frag must be followed by a branch
> >>> + frag. */
> >>> + branch_fragP = i386_next_non_empty_frag (next_fragP);
> >>> + next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;
> >>> + break;
> >>> + case FUSED_JCC_PADDING:
> >>> + /* Check if this is a fused jcc:
> >>> + FUSED_JCC_PADDING
> >>> + CMP
> >>
> >> Here and in a number of other places I wonder why you say CMP when
> >> the set of macro-fusion candidate insns is quite a bit larger.
> >
> > I use CMP to refer all these instructions.
>
> May I ask that at least in comments you clarify it's not just CMP.
> This doesn't mean to re-enumerate all the insns, but at least say
> something like "et al".
I replaced CMP with CMP like instruction.
> >>> + if (!fits_in_signed_byte (padding_size))
> >>> + abort ();
> >>
> >> Why? The function's return type is int (should probably be
> >> unsigned int).
> >
> > It must fit tc_frag_data.length.
>
> And how would one recognize the connection when e.g. widening the
> field?
I did
/* The return value may be saved in tc_frag_data.length which is
unsigned byte. */
if (!fits_in_unsigned_byte (padding_size))
abort ();
> >>> @@ -11483,6 +12323,78 @@ md_parse_option (int c, const char *arg)
> >>> as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);
> >>> break;
> >>>
> >>> + case OPTION_MALIGN_BRANCH_BOUNDARY:
> >>> + {
> >>> + char *end;
> >>> + int align = strtoul (arg, &end, 0);
> >>> + if (*end == '\0')
> >>> + {
> >>> + if (align == 0)
> >>> + {
> >>> + align_branch_power = 0;
> >>> + break;
> >>> + }
> >>> + else if (align >= 32)
> >>
> >> Why the enforcement of 32 as the lower boundary? I.e. why would 16
> >> not be an option as well?
> >
> > CMP can be 9 bytes and jcc can be 6 bytes. 15 bytes is too close to
> > 16 bytes.
>
> I don't understand: Making such a pair fit within a single aligned
> 16-byte block doesn't seem impossible at all.
It can be 16 bytes:
[hjl@gnu-cfl-1 jcc-1]$ cat f.s
cmpw -12000(%eax,%ebx,8), %r8w
je foo
[hjl@gnu-cfl-1 jcc-1]$ gcc -c f.s
[hjl@gnu-cfl-1 jcc-1]$ objdump -dw f.o
f.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <.text>:
0: 67 66 44 3b 84 d8 20 d1 ff ff cmp -0x2ee0(%eax,%ebx,8),%r8w
a: 0f 84 00 00 00 00 je 0x10
[hjl@gnu-cfl-1 jcc-1]$
I prefer not to lower it to 16 bytes unless there is a good reason.
> >>> + {
> >>> + int align_power;
> >>> + for (align_power = 0;
> >>> + (align & 1) == 0;
> >>> + align >>= 1, align_power++)
> >>> + continue;
> >>> + if (align == 1)
> >>> + {
> >>> + align_branch_power = align_power;
> >>> + break;
> >>> + }
> >>> + }
> >>> + }
> >>> + as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);
> >>> + }
> >>> + break;
> >>> +
> >>> + case OPTION_MALIGN_BRANCH_PREFIX_SIZE:
> >>> + {
> >>> + char *end;
> >>> + int align = strtoul (arg, &end, 0);
> >>> + if (*end == '\0' && align >= 0 && align < 6)
> >>
> >> Similarly here - why is 5 the upper boundary on the accepted values?
> >
> > I am told that some processors only support 5 prefixes.
>
> Now that very certainly requires a comment to explain. Furthermore -
> what would happen on such processors if there was a 6th one? Is this
> merely a performance concern, or worse?
I don't know exactly will will happen. I just cap it to 5.
> >>> + unsigned int classified : 1;
> >>> + unsigned int branch_type : 7;
> >>
> >> This doesn't require 7 bits, does it?
> >>
> >
> > I will change it to
> >
> > unsigned int branch_type : 6;
>
> Isn't 6 still too high? I realize the enumerators stored into here
> aren't truly enumerators, but add_branch_padding_frag_p() only ever
> stores one of the values (i.e. they don't get OR-ed together).
>
I changed it to 3 bits with
/* Types of branches to align. */
enum align_branch_kind
{
align_branch_none = 0,
align_branch_jcc = 1,
align_branch_fused = 2,
align_branch_jmp = 3,
align_branch_call = 4,
align_branch_indirect = 5,
align_branch_ret = 6
};
/* Type bits of branches to align. */
enum align_branch_bit
{
align_branch_jcc_bit = 1 << align_branch_jcc,
align_branch_fused_bit = 1 << align_branch_fused,
align_branch_jmp_bit = 1 << align_branch_jmp,
align_branch_call_bit = 1 << align_branch_call,
align_branch_indirect_bit = 1 << align_branch_indirect,
align_branch_ret_bit = 1 << align_branch_ret
};
Thanks.
--
H.J.