[PATCH 3/5] i386: Align branches within a fixed boundary

Mon Nov 18 10:50:00 GMT 2019

On 16.11.2019 00:11, H.J. Lu wrote:
> On Fri, Nov 15, 2019 at 1:16 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 15.11.2019 00:21,  H.J. Lu  wrote:
>>> On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 12.11.2019 17:19,  H.J. Lu  wrote:
>>>> Like elsewhere I think you want to use i.tm checks here. This
>>>> would also eliminate my worry of the code not catching cases
>>>> where optimization changes the opcode in i.tm without also
>>>> changing the name.
>>>
>>> It don't think CMP instructions can be replaced by totally different
>>> instructions.
>>
>> But there's also e.g. AND in the set, which I would expect
>> optimization to be able to switch _to_ (not from, as you were
>> implying).
> 
> What other instructions can be encoded as AND?

I didn't say there _are_ such optimizations, but I wouldn't exclude
there might be in the future. Other aspects of the patch already
suggested that the combination of these new options and the
optimization ones may not have been fully taken into consideration
on this first version of the patch. I'd simply be afraid of such
aspect to be overlooked later on (when adding further optimizations)
as well. Yet the changes here are useful only if they indeed manage
to cover _all_ problematic code sequences, not just most of them.

>>>>> +static int
>>>>> +add_fused_jcc_padding_frag_p (void)
>>>>> +{
>>>>> +  if (!align_branch_power
>>>>> +      || now_seg == absolute_section
>>>>> +      || !cpu_arch_flags.bitfield.cpui386
>>>>
>>>> Why the i386 dependency?
>>>
>>> It doesn't work with COND_JUMP86 without i386.
>>
>> Why would it not work? People using the option may need to
>> re-code some Jcc to keep the targets reachable despite the
>> inserted padding, but that's not a reason to suppress the
>> adjustments altogether.
> 
> This option isn't for processors without i386.

Can this be stated then in the documentation you add?

>>>>> +static int
>>>>> +add_branch_prefix_frag_p (void)
>>>>> +{
>>>>> +  if (!align_branch_power
>>>>> +      || now_seg == absolute_section
>>>>> +      || i.tm.cpu_flags.bitfield.cpupadlock
>>>>> +      || !cpu_arch_flags.bitfield.cpui386)
>>>>
>>>> Why the PadLock and i386 dependencies?
>>>
>>> Since PadLock instructions include prefixes in opcode, I don't
>>> want to mess with them.
>>
>> How are PadLock insns different in this regard from e.g. SSE insns?
> 
> There are
> 
>              if (i.tm.base_opcode & 0xff000000)
>                 {
>                   prefix = (i.tm.base_opcode >> 24) & 0xff;
>                   if (!i.tm.cpu_flags.bitfield.cpupadlock
>                       || prefix != REPE_PREFIX_OPCODE
>                       || (i.prefix[REP_PREFIX] != REPE_PREFIX_OPCODE))
>                     add_prefix (prefix);
> 
> I don't want to change it for PadLock.

I understand that you don't want to do this, but you didn't
answer my question. Next to the code fragment you quote there
is similar code dealing with prefixes on (as said) e.g. SSE
insns. Why is it okay to leave SSE insns as they are, but
treat PadLock ones specially?

>>>>> +           /* Use existing segment prefix if possible.  Use CS
>>>>> +              segment prefix in 64-bit mode.  In 32-bit mode, use SS
>>>>> +              segment prefix with ESP/EBP base register and use DS
>>>>> +              segment prefix without ESP/EBP base register.  */
>>>>
>>>> Is there a reason for the preference of CS: in 64-bit mode?
>>>
>>> CS is ignored in 64-bit mode.
>>
>> As are DS, ES, and SS. I.e. my question could also be put as "Why
> 
> We pick CS.
> 
>> different rules for 64- and 32-bit code?"
> 
> Since CS isn't ignored in 32-bit mode, we can't use CS.  We pick the
> default segment register in 32-bit mode.

Which again doesn't answer my question: Code would be simpler if
there wouldn't need to be a distinction between 32- and 64-bit
modes. Hence the more tight constraints of 32-bit mode could be
applied to 64-bit mode as well. I certainly understand that the
more relaxed constraints of 64-bit mode can't be applied to
32-bit mode handling.

>>>>> +           {
>>>>> +             int align_power;
>>>>> +             for (align_power = 0;
>>>>> +                  (align & 1) == 0;
>>>>> +                  align >>= 1, align_power++)
>>>>> +               continue;
>>>>> +             if (align == 1)
>>>>> +               {
>>>>> +                 align_branch_power = align_power;
>>>>> +                 break;
>>>>> +               }
>>>>> +           }
>>>>> +       }
>>>>> +     as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);
>>>>> +      }
>>>>> +      break;
>>>>> +
>>>>> +    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:
>>>>> +      {
>>>>> +     char *end;
>>>>> +     int align = strtoul (arg, &end, 0);
>>>>> +     if (*end == '\0' && align >= 0 && align < 6)
>>>>
>>>> Similarly here - why is 5 the upper boundary on the accepted values?
>>>
>>> I am told that some processors only support 5 prefixes.
>>
>> Now that very certainly requires a comment to explain. Furthermore -
>> what would happen on such processors if there was a 6th one? Is this
>> merely a performance concern, or worse?
> 
> I don't know exactly will will happen.  I just cap it to 5.

That's unfortunate then. Five years from now, how will anyone
understand why this seemingly artificial limit is the way it is
(without digging out this mail thread)?

Thanks for all the other adjustments you did.

Jan