This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH 3/5] i386: Align branches within a fixed boundary

From: Jan Beulich <jbeulich at suse dot com>
To: "H.J. Lu" <hjl dot tools at gmail dot com>
Cc: Binutils <binutils at sourceware dot org>
Date: Fri, 15 Nov 2019 10:16:49 +0100
Subject: Re: [PATCH 3/5] i386: Align branches within a fixed boundary
References: <20191112161905.10048-1-hjl.tools@gmail.com> <20191112161905.10048-4-hjl.tools@gmail.com> <d053d48c-7726-c177-c4f2-f91a4a64160d@suse.com> <CAMe9rOqru4pY+peF6dJFQJG8=b+yY455Y81YUoNgb0LzyTXQ=Q@mail.gmail.com>

On 15.11.2019 00:21,  H.J. Lu  wrote:
> On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 12.11.2019 17:19,  H.J. Lu  wrote:
>>> @@ -632,6 +652,31 @@ static enum check_kind
>>>    }
>>>  sse_check, operand_check = check_warning;
>>>
>>> +/* Non-zero if branches should be aligned within power of 2 boundary.  */
>>> +static int align_branch_power = 0;
>>> +
>>> +/* Types of branches to align.  */
>>> +enum align_branch_kind
>>> +  {
>>> +    align_branch_none = 0,
>>> +    align_branch_jcc = 1 << 0,
>>> +    align_branch_fused = 1 << 1,
>>> +    align_branch_jmp = 1 << 2,
>>> +    align_branch_call = 1 << 3,
>>> +    align_branch_indirect = 1 << 4,
>>> +    align_branch_ret = 1 << 5
>>> +  };
>>> +
>>> +static unsigned int align_branch = (align_branch_jcc
>>> +                                 | align_branch_fused
>>> +                                 | align_branch_jmp);
>>> +
>>> +/* The maximum padding size for fused jcc.  */
>>> +#define MAX_FUSED_JCC_PADDING_SIZE 20
>>
>> What is this number derived from?
> 
> CMP can be 9 bytes and jcc can be 6 bytes.   I leave some room
> for prefixes.

Can you extend the comment to this effect please?

>>> @@ -3012,6 +3070,11 @@ md_begin (void)
>>>        x86_dwarf2_return_column = 8;
>>>        x86_cie_data_alignment = -4;
>>>      }
>>> +
>>> +  /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it
>>> +     can be turned into BRANCH_PREFIX frag.  */
>>> +  if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)
>>> +    abort ();
>>
>> Wouldn't such a value better be rejected at command line option
>> handling time?
> 
> There is
> 
> /* The maximum number of prefixes added for an instruction.  */
> static unsigned int align_branch_prefix_size = 5;
> 
> I'd like to check it even if it isn't changed.

Ok, okay.

>> Like elsewhere I think you want to use i.tm checks here. This
>> would also eliminate my worry of the code not catching cases
>> where optimization changes the opcode in i.tm without also
>> changing the name.
> 
> It don't think CMP instructions can be replaced by totally different
> instructions.

But there's also e.g. AND in the set, which I would expect
optimization to be able to switch _to_ (not from, as you were
implying).

>>> +static int
>>> +add_fused_jcc_padding_frag_p (void)
>>> +{
>>> +  if (!align_branch_power
>>> +      || now_seg == absolute_section
>>> +      || !cpu_arch_flags.bitfield.cpui386
>>
>> Why the i386 dependency?
> 
> It doesn't work with COND_JUMP86 without i386.

Why would it not work? People using the option may need to
re-code some Jcc to keep the targets reachable despite the
inserted padding, but that's not a reason to suppress the
adjustments altogether.

>>> +static int
>>> +add_branch_prefix_frag_p (void)
>>> +{
>>> +  if (!align_branch_power
>>> +      || now_seg == absolute_section
>>> +      || i.tm.cpu_flags.bitfield.cpupadlock
>>> +      || !cpu_arch_flags.bitfield.cpui386)
>>
>> Why the PadLock and i386 dependencies?
> 
> Since PadLock instructions include prefixes in opcode, I don't
> want to mess with them.

How are PadLock insns different in this regard from e.g. SSE insns?

>>> +  else if (i.tm.base_opcode == 0xc2
>>> +        || i.tm.base_opcode == 0xc3
>>> +        || i.tm.base_opcode == 0xca
>>> +        || i.tm.base_opcode == 0xcb)
>>
>> These will also match various VEX/XOP/EVEX encoded templates. Perhaps
>> exclude any such right away in the first if()?
>>
>> Also you handle near and far returns here, but ...
> 
> I will change it to
> 
>   else if (is_any_vex_encoding (&i.tm))
>     return 0;

Why not move this even earlier in the function?

>   else if (i.tm.base_opcode == 0xc2 || i.tm.base_opcode == 0xc3)

Maybe (i.tm.base_opcode | 1) == 0xc3?

>>> @@ -8279,6 +8523,30 @@ output_insn (void)
>>>    insn_start_frag = frag_now;
>>>    insn_start_off = frag_now_fix ();
>>>
>>> +  if (add_branch_padding_frag_p (&branch))
>>> +    {
>>> +      char *p;
>>> +      unsigned int max_branch_padding_size = 14;
>>
>> Where is this number coming from? Like above (and I think also in
>> one more case below) any such seemingly arbitrary numbers could do
>> with a comment.
> 
> Branch can be 8 bytes.  I leave some room for prefixes.

Again please explain this in a comment.

>>> +           /* Use existing segment prefix if possible.  Use CS
>>> +              segment prefix in 64-bit mode.  In 32-bit mode, use SS
>>> +              segment prefix with ESP/EBP base register and use DS
>>> +              segment prefix without ESP/EBP base register.  */
>>
>> Is there a reason for the preference of CS: in 64-bit mode?
> 
> CS is ignored in 64-bit mode.

As are DS, ES, and SS. I.e. my question could also be put as "Why
different rules for 64- and 32-bit code?"

>>> +static void
>>> +i386_classify_machine_dependent_frag (fragS *fragP)
>>> +{
>>> +  fragS *cmp_fragP;
>>> +  fragS *pad_fragP;
>>> +  fragS *branch_fragP;
>>> +  fragS *next_fragP;
>>> +  unsigned int max_prefix_length;
>>> +
>>> +  if (fragP->tc_frag_data.classified)
>>> +    return;
>>> +
>>> +  /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING.  Convert
>>> +     FUSED_JCC_PADDING and merge BRANCH_PADDING.  */
>>> +  for (next_fragP = fragP;
>>> +       next_fragP != NULL;
>>> +       next_fragP = next_fragP->fr_next)
>>> +    {
>>> +      next_fragP->tc_frag_data.classified = 1;
>>> +      if (next_fragP->fr_type == rs_machine_dependent)
>>> +     switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))
>>> +       {
>>> +       case BRANCH_PADDING:
>>> +         /* The BRANCH_PADDING frag must be followed by a branch
>>> +            frag.  */
>>> +         branch_fragP = i386_next_non_empty_frag (next_fragP);
>>> +         next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;
>>> +         break;
>>> +       case FUSED_JCC_PADDING:
>>> +         /* Check if this is a fused jcc:
>>> +            FUSED_JCC_PADDING
>>> +            CMP
>>
>> Here and in a number of other places I wonder why you say CMP when
>> the set of macro-fusion candidate insns is quite a bit larger.
> 
> I use CMP to refer all these instructions.

May I ask that at least in comments you clarify it's not just CMP.
This doesn't mean to re-enumerate all the insns, but at least say
something like "et al".

>>> +  if (!fits_in_signed_byte (padding_size))
>>> +    abort ();
>>
>> Why? The function's return type is int (should probably be
>> unsigned int).
> 
> It must fit tc_frag_data.length.

And how would one recognize the connection when e.g. widening the
field?

>>> @@ -11483,6 +12323,78 @@ md_parse_option (int c, const char *arg)
>>>          as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);
>>>        break;
>>>
>>> +    case OPTION_MALIGN_BRANCH_BOUNDARY:
>>> +      {
>>> +     char *end;
>>> +     int align = strtoul (arg, &end, 0);
>>> +     if (*end == '\0')
>>> +       {
>>> +         if (align == 0)
>>> +           {
>>> +             align_branch_power = 0;
>>> +             break;
>>> +           }
>>> +         else if (align >= 32)
>>
>> Why the enforcement of 32 as the lower boundary? I.e. why would 16
>> not be an option as well?
> 
> CMP can be 9 bytes and jcc can be 6 bytes.  15 bytes is too close to
> 16 bytes.

I don't understand: Making such a pair fit within a single aligned
16-byte block doesn't seem impossible at all.

>>> +           {
>>> +             int align_power;
>>> +             for (align_power = 0;
>>> +                  (align & 1) == 0;
>>> +                  align >>= 1, align_power++)
>>> +               continue;
>>> +             if (align == 1)
>>> +               {
>>> +                 align_branch_power = align_power;
>>> +                 break;
>>> +               }
>>> +           }
>>> +       }
>>> +     as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);
>>> +      }
>>> +      break;
>>> +
>>> +    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:
>>> +      {
>>> +     char *end;
>>> +     int align = strtoul (arg, &end, 0);
>>> +     if (*end == '\0' && align >= 0 && align < 6)
>>
>> Similarly here - why is 5 the upper boundary on the accepted values?
> 
> I am told that some processors only support 5 prefixes.

Now that very certainly requires a comment to explain. Furthermore -
what would happen on such processors if there was a 6th one? Is this
merely a performance concern, or worse?

>>> +  unsigned int classified : 1;
>>> +  unsigned int branch_type : 7;
>>
>> This doesn't require 7 bits, does it?
>>
> 
> I will change it to
> 
> unsigned int branch_type : 6;

Isn't 6 still too high? I realize the enumerators stored into here
aren't truly enumerators, but add_branch_padding_frag_p() only ever
stores one of the values (i.e. they don't get OR-ed together).

Jan

Follow-Ups:
- Re: [PATCH 3/5] i386: Align branches within a fixed boundary
  - From: H.J. Lu

References:
- [PATCH 0/5] i386: Optimize for Jump Conditional Code Erratum
  - From: H.J. Lu
- [PATCH 3/5] i386: Align branches within a fixed boundary
  - From: H.J. Lu
- Re: [PATCH 3/5] i386: Align branches within a fixed boundary
  - From: Jan Beulich
- Re: [PATCH 3/5] i386: Align branches within a fixed boundary
  - From: H.J. Lu

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]