This is the mail archive of the
binutils@sourceware.org
mailing list for the binutils project.
Re: [PATCH 3/5] i386: Align branches within a fixed boundary
On 15.11.2019 00:21, H.J. Lu wrote:
> On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 12.11.2019 17:19, H.J. Lu wrote:
>>> @@ -632,6 +652,31 @@ static enum check_kind
>>> }
>>> sse_check, operand_check = check_warning;
>>>
>>> +/* Non-zero if branches should be aligned within power of 2 boundary. */
>>> +static int align_branch_power = 0;
>>> +
>>> +/* Types of branches to align. */
>>> +enum align_branch_kind
>>> + {
>>> + align_branch_none = 0,
>>> + align_branch_jcc = 1 << 0,
>>> + align_branch_fused = 1 << 1,
>>> + align_branch_jmp = 1 << 2,
>>> + align_branch_call = 1 << 3,
>>> + align_branch_indirect = 1 << 4,
>>> + align_branch_ret = 1 << 5
>>> + };
>>> +
>>> +static unsigned int align_branch = (align_branch_jcc
>>> + | align_branch_fused
>>> + | align_branch_jmp);
>>> +
>>> +/* The maximum padding size for fused jcc. */
>>> +#define MAX_FUSED_JCC_PADDING_SIZE 20
>>
>> What is this number derived from?
>
> CMP can be 9 bytes and jcc can be 6 bytes. I leave some room
> for prefixes.
Can you extend the comment to this effect please?
>>> @@ -3012,6 +3070,11 @@ md_begin (void)
>>> x86_dwarf2_return_column = 8;
>>> x86_cie_data_alignment = -4;
>>> }
>>> +
>>> + /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it
>>> + can be turned into BRANCH_PREFIX frag. */
>>> + if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)
>>> + abort ();
>>
>> Wouldn't such a value better be rejected at command line option
>> handling time?
>
> There is
>
> /* The maximum number of prefixes added for an instruction. */
> static unsigned int align_branch_prefix_size = 5;
>
> I'd like to check it even if it isn't changed.
Ok, okay.
>> Like elsewhere I think you want to use i.tm checks here. This
>> would also eliminate my worry of the code not catching cases
>> where optimization changes the opcode in i.tm without also
>> changing the name.
>
> It don't think CMP instructions can be replaced by totally different
> instructions.
But there's also e.g. AND in the set, which I would expect
optimization to be able to switch _to_ (not from, as you were
implying).
>>> +static int
>>> +add_fused_jcc_padding_frag_p (void)
>>> +{
>>> + if (!align_branch_power
>>> + || now_seg == absolute_section
>>> + || !cpu_arch_flags.bitfield.cpui386
>>
>> Why the i386 dependency?
>
> It doesn't work with COND_JUMP86 without i386.
Why would it not work? People using the option may need to
re-code some Jcc to keep the targets reachable despite the
inserted padding, but that's not a reason to suppress the
adjustments altogether.
>>> +static int
>>> +add_branch_prefix_frag_p (void)
>>> +{
>>> + if (!align_branch_power
>>> + || now_seg == absolute_section
>>> + || i.tm.cpu_flags.bitfield.cpupadlock
>>> + || !cpu_arch_flags.bitfield.cpui386)
>>
>> Why the PadLock and i386 dependencies?
>
> Since PadLock instructions include prefixes in opcode, I don't
> want to mess with them.
How are PadLock insns different in this regard from e.g. SSE insns?
>>> + else if (i.tm.base_opcode == 0xc2
>>> + || i.tm.base_opcode == 0xc3
>>> + || i.tm.base_opcode == 0xca
>>> + || i.tm.base_opcode == 0xcb)
>>
>> These will also match various VEX/XOP/EVEX encoded templates. Perhaps
>> exclude any such right away in the first if()?
>>
>> Also you handle near and far returns here, but ...
>
> I will change it to
>
> else if (is_any_vex_encoding (&i.tm))
> return 0;
Why not move this even earlier in the function?
> else if (i.tm.base_opcode == 0xc2 || i.tm.base_opcode == 0xc3)
Maybe (i.tm.base_opcode | 1) == 0xc3?
>>> @@ -8279,6 +8523,30 @@ output_insn (void)
>>> insn_start_frag = frag_now;
>>> insn_start_off = frag_now_fix ();
>>>
>>> + if (add_branch_padding_frag_p (&branch))
>>> + {
>>> + char *p;
>>> + unsigned int max_branch_padding_size = 14;
>>
>> Where is this number coming from? Like above (and I think also in
>> one more case below) any such seemingly arbitrary numbers could do
>> with a comment.
>
> Branch can be 8 bytes. I leave some room for prefixes.
Again please explain this in a comment.
>>> + /* Use existing segment prefix if possible. Use CS
>>> + segment prefix in 64-bit mode. In 32-bit mode, use SS
>>> + segment prefix with ESP/EBP base register and use DS
>>> + segment prefix without ESP/EBP base register. */
>>
>> Is there a reason for the preference of CS: in 64-bit mode?
>
> CS is ignored in 64-bit mode.
As are DS, ES, and SS. I.e. my question could also be put as "Why
different rules for 64- and 32-bit code?"
>>> +static void
>>> +i386_classify_machine_dependent_frag (fragS *fragP)
>>> +{
>>> + fragS *cmp_fragP;
>>> + fragS *pad_fragP;
>>> + fragS *branch_fragP;
>>> + fragS *next_fragP;
>>> + unsigned int max_prefix_length;
>>> +
>>> + if (fragP->tc_frag_data.classified)
>>> + return;
>>> +
>>> + /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING. Convert
>>> + FUSED_JCC_PADDING and merge BRANCH_PADDING. */
>>> + for (next_fragP = fragP;
>>> + next_fragP != NULL;
>>> + next_fragP = next_fragP->fr_next)
>>> + {
>>> + next_fragP->tc_frag_data.classified = 1;
>>> + if (next_fragP->fr_type == rs_machine_dependent)
>>> + switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))
>>> + {
>>> + case BRANCH_PADDING:
>>> + /* The BRANCH_PADDING frag must be followed by a branch
>>> + frag. */
>>> + branch_fragP = i386_next_non_empty_frag (next_fragP);
>>> + next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;
>>> + break;
>>> + case FUSED_JCC_PADDING:
>>> + /* Check if this is a fused jcc:
>>> + FUSED_JCC_PADDING
>>> + CMP
>>
>> Here and in a number of other places I wonder why you say CMP when
>> the set of macro-fusion candidate insns is quite a bit larger.
>
> I use CMP to refer all these instructions.
May I ask that at least in comments you clarify it's not just CMP.
This doesn't mean to re-enumerate all the insns, but at least say
something like "et al".
>>> + if (!fits_in_signed_byte (padding_size))
>>> + abort ();
>>
>> Why? The function's return type is int (should probably be
>> unsigned int).
>
> It must fit tc_frag_data.length.
And how would one recognize the connection when e.g. widening the
field?
>>> @@ -11483,6 +12323,78 @@ md_parse_option (int c, const char *arg)
>>> as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);
>>> break;
>>>
>>> + case OPTION_MALIGN_BRANCH_BOUNDARY:
>>> + {
>>> + char *end;
>>> + int align = strtoul (arg, &end, 0);
>>> + if (*end == '\0')
>>> + {
>>> + if (align == 0)
>>> + {
>>> + align_branch_power = 0;
>>> + break;
>>> + }
>>> + else if (align >= 32)
>>
>> Why the enforcement of 32 as the lower boundary? I.e. why would 16
>> not be an option as well?
>
> CMP can be 9 bytes and jcc can be 6 bytes. 15 bytes is too close to
> 16 bytes.
I don't understand: Making such a pair fit within a single aligned
16-byte block doesn't seem impossible at all.
>>> + {
>>> + int align_power;
>>> + for (align_power = 0;
>>> + (align & 1) == 0;
>>> + align >>= 1, align_power++)
>>> + continue;
>>> + if (align == 1)
>>> + {
>>> + align_branch_power = align_power;
>>> + break;
>>> + }
>>> + }
>>> + }
>>> + as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);
>>> + }
>>> + break;
>>> +
>>> + case OPTION_MALIGN_BRANCH_PREFIX_SIZE:
>>> + {
>>> + char *end;
>>> + int align = strtoul (arg, &end, 0);
>>> + if (*end == '\0' && align >= 0 && align < 6)
>>
>> Similarly here - why is 5 the upper boundary on the accepted values?
>
> I am told that some processors only support 5 prefixes.
Now that very certainly requires a comment to explain. Furthermore -
what would happen on such processors if there was a 6th one? Is this
merely a performance concern, or worse?
>>> + unsigned int classified : 1;
>>> + unsigned int branch_type : 7;
>>
>> This doesn't require 7 bits, does it?
>>
>
> I will change it to
>
> unsigned int branch_type : 6;
Isn't 6 still too high? I realize the enumerators stored into here
aren't truly enumerators, but add_branch_padding_frag_p() only ever
stores one of the values (i.e. they don't get OR-ed together).
Jan