[PATCH 3/5] i386: Align branches within a fixed boundary

Fri Nov 15 23:12:00 GMT 2019

On Fri, Nov 15, 2019 at 1:16 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 15.11.2019 00:21,  H.J. Lu  wrote:
> > On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 12.11.2019 17:19,  H.J. Lu  wrote:
> >>> @@ -632,6 +652,31 @@ static enum check_kind
> >>>    }
> >>>  sse_check, operand_check = check_warning;
> >>>
> >>> +/* Non-zero if branches should be aligned within power of 2 boundary.  */
> >>> +static int align_branch_power = 0;
> >>> +
> >>> +/* Types of branches to align.  */
> >>> +enum align_branch_kind
> >>> +  {
> >>> +    align_branch_none = 0,
> >>> +    align_branch_jcc = 1 << 0,
> >>> +    align_branch_fused = 1 << 1,
> >>> +    align_branch_jmp = 1 << 2,
> >>> +    align_branch_call = 1 << 3,
> >>> +    align_branch_indirect = 1 << 4,
> >>> +    align_branch_ret = 1 << 5
> >>> +  };
> >>> +
> >>> +static unsigned int align_branch = (align_branch_jcc
> >>> +                                 | align_branch_fused
> >>> +                                 | align_branch_jmp);
> >>> +
> >>> +/* The maximum padding size for fused jcc.  */
> >>> +#define MAX_FUSED_JCC_PADDING_SIZE 20
> >>
> >> What is this number derived from?
> >
> > CMP can be 9 bytes and jcc can be 6 bytes.   I leave some room
> > for prefixes.
>
> Can you extend the comment to this effect please?

I did:

/* The maximum padding size for fused jcc.  CMP can be 9 bytes and jcc
   can be 6 bytes.  Leave room just in case for prefixes.   */
#define MAX_FUSED_JCC_PADDING_SIZE 20

> >>> @@ -3012,6 +3070,11 @@ md_begin (void)
> >>>        x86_dwarf2_return_column = 8;
> >>>        x86_cie_data_alignment = -4;
> >>>      }
> >>> +
> >>> +  /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it
> >>> +     can be turned into BRANCH_PREFIX frag.  */
> >>> +  if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)
> >>> +    abort ();
> >>
> >> Wouldn't such a value better be rejected at command line option
> >> handling time?
> >
> > There is
> >
> > /* The maximum number of prefixes added for an instruction.  */
> > static unsigned int align_branch_prefix_size = 5;
> >
> > I'd like to check it even if it isn't changed.
>
> Ok, okay.
>
> >> Like elsewhere I think you want to use i.tm checks here. This
> >> would also eliminate my worry of the code not catching cases
> >> where optimization changes the opcode in i.tm without also
> >> changing the name.
> >
> > It don't think CMP instructions can be replaced by totally different
> > instructions.
>
> But there's also e.g. AND in the set, which I would expect
> optimization to be able to switch _to_ (not from, as you were
> implying).

What other instructions can be encoded as AND?

> >>> +static int
> >>> +add_fused_jcc_padding_frag_p (void)
> >>> +{
> >>> +  if (!align_branch_power
> >>> +      || now_seg == absolute_section
> >>> +      || !cpu_arch_flags.bitfield.cpui386
> >>
> >> Why the i386 dependency?
> >
> > It doesn't work with COND_JUMP86 without i386.
>
> Why would it not work? People using the option may need to
> re-code some Jcc to keep the targets reachable despite the
> inserted padding, but that's not a reason to suppress the
> adjustments altogether.

This option isn't for processors without i386.

> >>> +static int
> >>> +add_branch_prefix_frag_p (void)
> >>> +{
> >>> +  if (!align_branch_power
> >>> +      || now_seg == absolute_section
> >>> +      || i.tm.cpu_flags.bitfield.cpupadlock
> >>> +      || !cpu_arch_flags.bitfield.cpui386)
> >>
> >> Why the PadLock and i386 dependencies?
> >
> > Since PadLock instructions include prefixes in opcode, I don't
> > want to mess with them.
>
> How are PadLock insns different in this regard from e.g. SSE insns?

There are

             if (i.tm.base_opcode & 0xff000000)
                {
                  prefix = (i.tm.base_opcode >> 24) & 0xff;
                  if (!i.tm.cpu_flags.bitfield.cpupadlock
                      || prefix != REPE_PREFIX_OPCODE
                      || (i.prefix[REP_PREFIX] != REPE_PREFIX_OPCODE))
                    add_prefix (prefix);

I don't want to change it for PadLock.

> >>> +  else if (i.tm.base_opcode == 0xc2
> >>> +        || i.tm.base_opcode == 0xc3
> >>> +        || i.tm.base_opcode == 0xca
> >>> +        || i.tm.base_opcode == 0xcb)
> >>
> >> These will also match various VEX/XOP/EVEX encoded templates. Perhaps
> >> exclude any such right away in the first if()?
> >>
> >> Also you handle near and far returns here, but ...
> >
> > I will change it to
> >
> >   else if (is_any_vex_encoding (&i.tm))
> >     return 0;
>
> Why not move this even earlier in the function?

I'd like to delay is_any_vex_encoding after

if (i.tm.opcode_modifier.jump)

> >   else if (i.tm.base_opcode == 0xc2 || i.tm.base_opcode == 0xc3)
>
> Maybe (i.tm.base_opcode | 1) == 0xc3?

I put it in.

> >>> @@ -8279,6 +8523,30 @@ output_insn (void)
> >>>    insn_start_frag = frag_now;
> >>>    insn_start_off = frag_now_fix ();
> >>>
> >>> +  if (add_branch_padding_frag_p (&branch))
> >>> +    {
> >>> +      char *p;
> >>> +      unsigned int max_branch_padding_size = 14;
> >>
> >> Where is this number coming from? Like above (and I think also in
> >> one more case below) any such seemingly arbitrary numbers could do
> >> with a comment.
> >
> > Branch can be 8 bytes.  I leave some room for prefixes.
>
> Again please explain this in a comment.

I did:

     /* Branch can be 8 bytes.  Leave some room for prefixes.  */
      unsigned int max_branch_padding_size = 14;

> >>> +           /* Use existing segment prefix if possible.  Use CS
> >>> +              segment prefix in 64-bit mode.  In 32-bit mode, use SS
> >>> +              segment prefix with ESP/EBP base register and use DS
> >>> +              segment prefix without ESP/EBP base register.  */
> >>
> >> Is there a reason for the preference of CS: in 64-bit mode?
> >
> > CS is ignored in 64-bit mode.
>
> As are DS, ES, and SS. I.e. my question could also be put as "Why

We pick CS.

> different rules for 64- and 32-bit code?"

Since CS isn't ignored in 32-bit mode, we can't use CS.  We pick the
default segment register in 32-bit mode.

> >>> +static void
> >>> +i386_classify_machine_dependent_frag (fragS *fragP)
> >>> +{
> >>> +  fragS *cmp_fragP;
> >>> +  fragS *pad_fragP;
> >>> +  fragS *branch_fragP;
> >>> +  fragS *next_fragP;
> >>> +  unsigned int max_prefix_length;
> >>> +
> >>> +  if (fragP->tc_frag_data.classified)
> >>> +    return;
> >>> +
> >>> +  /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING.  Convert
> >>> +     FUSED_JCC_PADDING and merge BRANCH_PADDING.  */
> >>> +  for (next_fragP = fragP;
> >>> +       next_fragP != NULL;
> >>> +       next_fragP = next_fragP->fr_next)
> >>> +    {
> >>> +      next_fragP->tc_frag_data.classified = 1;
> >>> +      if (next_fragP->fr_type == rs_machine_dependent)
> >>> +     switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))
> >>> +       {
> >>> +       case BRANCH_PADDING:
> >>> +         /* The BRANCH_PADDING frag must be followed by a branch
> >>> +            frag.  */
> >>> +         branch_fragP = i386_next_non_empty_frag (next_fragP);
> >>> +         next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;
> >>> +         break;
> >>> +       case FUSED_JCC_PADDING:
> >>> +         /* Check if this is a fused jcc:
> >>> +            FUSED_JCC_PADDING
> >>> +            CMP
> >>
> >> Here and in a number of other places I wonder why you say CMP when
> >> the set of macro-fusion candidate insns is quite a bit larger.
> >
> > I use CMP to refer all these instructions.
>
> May I ask that at least in comments you clarify it's not just CMP.
> This doesn't mean to re-enumerate all the insns, but at least say
> something like "et al".

I replaced CMP with CMP like instruction.

> >>> +  if (!fits_in_signed_byte (padding_size))
> >>> +    abort ();
> >>
> >> Why? The function's return type is int (should probably be
> >> unsigned int).
> >
> > It must fit tc_frag_data.length.
>
> And how would one recognize the connection when e.g. widening the
> field?

I did

  /* The return value may be saved in tc_frag_data.length which is
     unsigned byte.  */
  if (!fits_in_unsigned_byte (padding_size))
    abort ();

> >>> @@ -11483,6 +12323,78 @@ md_parse_option (int c, const char *arg)
> >>>          as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);
> >>>        break;
> >>>
> >>> +    case OPTION_MALIGN_BRANCH_BOUNDARY:
> >>> +      {
> >>> +     char *end;
> >>> +     int align = strtoul (arg, &end, 0);
> >>> +     if (*end == '\0')
> >>> +       {
> >>> +         if (align == 0)
> >>> +           {
> >>> +             align_branch_power = 0;
> >>> +             break;
> >>> +           }
> >>> +         else if (align >= 32)
> >>
> >> Why the enforcement of 32 as the lower boundary? I.e. why would 16
> >> not be an option as well?
> >
> > CMP can be 9 bytes and jcc can be 6 bytes.  15 bytes is too close to
> > 16 bytes.
>
> I don't understand: Making such a pair fit within a single aligned
> 16-byte block doesn't seem impossible at all.

It can be 16 bytes:

[hjl@gnu-cfl-1 jcc-1]$ cat f.s
cmpw  -12000(%eax,%ebx,8), %r8w
je foo
[hjl@gnu-cfl-1 jcc-1]$ gcc -c f.s
[hjl@gnu-cfl-1 jcc-1]$ objdump -dw f.o

f.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0: 67 66 44 3b 84 d8 20 d1 ff ff cmp    -0x2ee0(%eax,%ebx,8),%r8w
   a: 0f 84 00 00 00 00    je     0x10
[hjl@gnu-cfl-1 jcc-1]$

I prefer not to lower it to 16 bytes unless there is a good reason.

> >>> +           {
> >>> +             int align_power;
> >>> +             for (align_power = 0;
> >>> +                  (align & 1) == 0;
> >>> +                  align >>= 1, align_power++)
> >>> +               continue;
> >>> +             if (align == 1)
> >>> +               {
> >>> +                 align_branch_power = align_power;
> >>> +                 break;
> >>> +               }
> >>> +           }
> >>> +       }
> >>> +     as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);
> >>> +      }
> >>> +      break;
> >>> +
> >>> +    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:
> >>> +      {
> >>> +     char *end;
> >>> +     int align = strtoul (arg, &end, 0);
> >>> +     if (*end == '\0' && align >= 0 && align < 6)
> >>
> >> Similarly here - why is 5 the upper boundary on the accepted values?
> >
> > I am told that some processors only support 5 prefixes.
>
> Now that very certainly requires a comment to explain. Furthermore -
> what would happen on such processors if there was a 6th one? Is this
> merely a performance concern, or worse?

I don't know exactly will will happen.  I just cap it to 5.

> >>> +  unsigned int classified : 1;
> >>> +  unsigned int branch_type : 7;
> >>
> >> This doesn't require 7 bits, does it?
> >>
> >
> > I will change it to
> >
> > unsigned int branch_type : 6;
>
> Isn't 6 still too high? I realize the enumerators stored into here
> aren't truly enumerators, but add_branch_padding_frag_p() only ever
> stores one of the values (i.e. they don't get OR-ed together).
>

I changed it to 3 bits with

/* Types of branches to align.  */
enum align_branch_kind
  {
    align_branch_none = 0,
    align_branch_jcc = 1,
    align_branch_fused = 2,
    align_branch_jmp = 3,
    align_branch_call = 4,
    align_branch_indirect = 5,
    align_branch_ret = 6
  };

/* Type bits of branches to align.  */
enum align_branch_bit
  {
    align_branch_jcc_bit = 1 << align_branch_jcc,
    align_branch_fused_bit = 1 << align_branch_fused,
    align_branch_jmp_bit = 1 << align_branch_jmp,
    align_branch_call_bit = 1 << align_branch_call,
    align_branch_indirect_bit = 1 << align_branch_indirect,
    align_branch_ret_bit = 1 << align_branch_ret
  };

Thanks.

-- 
H.J.