New .nops directive, to aid Linux alternatives patching?

Sat Feb 10 17:22:00 GMT 2018

On 10/02/18 15:44, H.J. Lu wrote:
> On Fri, Feb 9, 2018 at 5:29 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 09/02/18 11:55, H.J. Lu wrote:
>>> On Fri, Feb 9, 2018 at 3:35 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> On 09/02/18 02:22, H.J. Lu wrote:
>>>>> On Thu, Feb 8, 2018 at 5:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>> On Thu, Feb 8, 2018 at 4:45 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>>>>> On 09/02/2018 00:24, H.J. Lu wrote:
>>>>>>>> On Thu, Feb 8, 2018 at 3:47 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>>>>>>> On 08/02/2018 20:36, H.J. Lu wrote:
>>>>>>>>>> On Thu, Feb 8, 2018 at 12:33 PM, Andrew Cooper
>>>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>>>> On 08/02/2018 20:28, H.J. Lu wrote:
>>>>>>>>>>>> On Thu, Feb 8, 2018 at 12:27 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>>>>>>>> On Thu, Feb 8, 2018 at 12:18 PM, Andrew Cooper
>>>>>>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>>>>>>> On 08/02/2018 20:10, H.J. Lu wrote:
>>>>>>>>>>>>>>> On Thu, Feb 8, 2018 at 11:26 AM, Andrew Cooper
>>>>>>>>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I realise this is a little bit niche, but how feasible would it be to
>>>>>>>>>>>>>>>> introduce a new .nops directive which takes a size parameter, and
>>>>>>>>>>>>>>>> outputs long nops covering the number of specified bytes?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sounds to me you want a pseudo NOP instruction:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pseudo-NOP N
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> which generates a long NOP with N bytes.  Is that correct?  If yes,
>>>>>>>>>>>>>>> what is the range of N?
>>>>>>>>>>>>>> Currently 255 based on other implementation limits, and I expect that
>>>>>>>>>>>>>> ought to be long enough for anyone.  There is one existing user for
>>>>>>>>>>>>>> N=43, and I expect that to grow a bit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The real answer properly depends at what point it is more efficient to
>>>>>>>>>>>>>> jmp rather than wasting decode bandwidth decoding nops, and I don't know
>>>>>>>>>>>>>> the answer, but expect that it isn't larger than 255.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> How about
>>>>>>>>>>>>>
>>>>>>>>>>>>> {nop} N
>>>>>>>>>>>>>
>>>>>>>>>>>>> If N is less than 15 bytes, it generates a long nop.   Otherwise, we use a jump
>>>>>>>>>>>>> instruction over nops.  Does it work for you?
>>>>>>>>>>>> N will be limited to 255.
>>>>>>>>>>> Do you mean up to 255 bytes of adjacent long nops, or still a jump if
>>>>>>>>>>> over 15 bytes?  For alternatives in the range of 15-30, a jmp is almost
>>>>>>>>>>> certainly slower than executing through the nops.  The ORM isn't clear
>>>>>>>>>>> where the split lies, and I expect it is very uarch specific.
>>>>>>>>>> How about this
>>>>>>>>>>
>>>>>>>>>> {nop} N, L
>>>>>>>>>> {nop} N
>>>>>>>>>>
>>>>>>>>>> N is <= 255. If L is missing, L is 15.
>>>>>>>>>>
>>>>>>>>>> If N < L then
>>>>>>>>>>   Long NOPs up to N bytes
>>>>>>>>>> else
>>>>>>>>>>   jmp + long nops up to N bytes.
>>>>>>>>>> fi
>>>>>>>>> I'm afraid that I don't think that will be very helpful in that form.
>>>>>>>>> Are there technical reasons why you don't want to emit more than a
>>>>>>>>> single 15byte long nop?
>>>>>>>>>
>>>>>>>> Doesn't
>>>>>>>>
>>>>>>>> {nop} 28, 40
>>>>>>>>
>>>>>>>> generate 2 x 14-byte nops?
>>>>>>> By the above logic, yes.  I still don't see the value in the L
>>>>>>> parameter, because I don't expect an average programmer to know how to
>>>>>>> choose it sensibly.  Then again, a compiler generating code for a
>>>>>>> specified uarch probably could have some idea of what value to feed in.
>>>>>>>
>>>>>>> If the semantics were a little more like:
>>>>>>>
>>>>>>> {nop} N => N bytes of nops with no jumps
>>>>>>> {nop} N, L => as above
>>>>>>>
>>>>>>> Then this might be more useful.
>>>>>>>
>>>>>>> I expect N will typically be an expression rather than an absolute
>>>>>>> number, because the usecase I've proposed is for filling in a specific,
>>>>>>> calculated number of bytes.  (In particular, what commonly happens is
>>>>>>> that memory references in alternatives are the thing which cause the
>>>>>>> exact length to fluctuate.)  When there is a sensible uarch value for L,
>>>>>>> that can be fed in, but shouldn't be mandatory.  In particular, if it
>>>>>>> is unknown, 15 is almost certainly the wrong default for it.
>>>>>> So, you want
>>>>>>
>>>>>> .nop SIZE
>>>>>>
>>>>>> and
>>>>>>
>>>>>> .jump SIZE
>>>>>>
>>>>>> which are similar to '.skip SIZE , FILL'.  But they fill SIZE with nops or
>>>>>> jmp + nops.
>>>>>>
>>>>> Or
>>>>>
>>>>> .nop SIZE, JUMP_SIZE
>>>>>
>>>>> If SIZE < JUMP_SIZE then
>>>>>   SIZE of nops.
>>>>> else
>>>>>   SIZE of jmp + nops.
>>>>> fi
>>>> I'm still not sure why you want the jump functionality in the first
>>>> place, but yes - this latest option would work.
>>>>
>>>> FWIW, jumping over code with alternatives is typically done like:
>>>>
>>>> ALTERNATIVE "jmp .L\@_skip", "", FEATURE_X
>>>> ...
>>>> .L\@_skip:
>>>>
>>>> At which point it is only the two or 5 byte jmp which is being
>>>> dynamically modified.  The converse case is where we begin with 2 or 5
>>>> bytes of nops, and dynamically insert the jmp.
>>>>
>>>> If we're in the line for other related feature requests, how about being
>>>> able to optionally specify the maximum length of individual nops?  e.g.
>>>>
>>>> .nop SIZE [, MAX_NOP = 9 [, JUMP_SIZE = -1]]
>>> OK, let's go with
>>>
>>>  .nop SIZE [, MAX_NOP = 9]
>>>
>>> It is easier to implement with 2 arguments.   MAX_NOP must be a constant.
>> Sounds good to me.
> Please try users/hjl/nop branch:
>
> https://github.com/hjl-tools/binutils-gdb/tree/users/hjl/nop

Oh - thank you!  I was about to ask if there were any pointers to get
started hacking on binutils.

As for the functionality, there are unfortunately some issues.  Given
this source:

        .text
single:
        nop

pseudo_1:
        .nop 1

pseudo_8:
        .nop 8

pseudo_8_4:
        .nop 8, 4

pseudo_20:
        .nop 20

I get the following disassembly:

0000000000000000 <single>:
   0:   90                      nop

0000000000000001 <pseudo_1>:
   1:   66 90                   xchg   %ax,%ax

0000000000000003 <pseudo_8>:
   3:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   a:   00 00

000000000000000c <pseudo_8_4>:
   c:   90                      nop
   d:   0f 1f 40 00             nopl   0x0(%rax)
  11:   0f 1f 40 00             nopl   0x0(%rax)

0000000000000015 <pseudo_20>:
  15:   90                      nop
  16:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  1d:   00 00 00
  20:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  27:   00 00 00

The MAX_NOP part looks to be working as intended (including reducing
below the default of 10), but there appears to be an off-by-one
somewhere, as each block comes out one byte longer than requested.
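
The off-by-one can be checked mechanically from the symbol addresses in
the disassembly above.  The following is a minimal Python sketch, not
part of the assembler: emitted_sizes() and split_nops() are hypothetical
helper names, END is derived from the last nop (0x20 plus 10 bytes), and
the "short nop first" ordering in split_nops() is only an assumption
based on the output shown.

```python
# Symbol addresses copied from the disassembly; END is the first byte
# after the final 10-byte nop at 0x20.
SYMBOLS = [("pseudo_1", 0x01), ("pseudo_8", 0x03),
           ("pseudo_8_4", 0x0c), ("pseudo_20", 0x15)]
END = 0x2a
REQUESTED = {"pseudo_1": 1, "pseudo_8": 8, "pseudo_8_4": 8, "pseudo_20": 20}


def emitted_sizes(symbols, end):
    """Return {name: bytes emitted}, measured as the gap to the next symbol."""
    addrs = [addr for _, addr in symbols] + [end]
    return {name: addrs[i + 1] - addrs[i]
            for i, (name, _) in enumerate(symbols)}


def split_nops(size, max_nop=10):
    """Hypothetical expected layout: nop lengths that sum to exactly `size`.

    The remainder is placed first, mirroring the ordering seen in the
    disassembly; that ordering is an assumption, not a documented rule.
    """
    full, rest = divmod(size, max_nop)
    return ([rest] if rest else []) + [max_nop] * full


for name, got in emitted_sizes(SYMBOLS, END).items():
    want = REQUESTED[name]
    print(f"{name}: requested {want}, emitted {got}, "
          f"expected split {split_nops(want)}")
```

Under these assumptions every block measures exactly one byte more than
the SIZE argument, whether or not MAX_NOP is given.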

Furthermore, attempting to use .nop 30 yields:

/tmp/ccI2Eakp.s: Assembler messages:
/tmp/ccI2Eakp.s: Fatal error: can't write 145268933551616 bytes to
section .text of nops.o: 'Bad value'

I can't obviously tie the reported number to anything, but it does appear to
depend on the current position in the section.  Inserting more regular
instructions ahead of the .nop 30 causes the reported number to get
larger until it overflows.
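
One thing that may help when chasing this: viewed as raw bytes, the
number in the error message is not arbitrary.  The short Python sketch
below only does the arithmetic; the reading of it (a size field picking
up nop opcode bytes) is a guess, not a diagnosis of the branch.

```python
# Size quoted in the "can't write ... bytes" message.
reported = 145268933551616

# Ten-byte nop encoding taken from the pseudo_20 disassembly.
NOPW_CS = bytes.fromhex("66 2e 0f 1f 84 00 00 00 00 00")

# View the number as an 8-byte little-endian value, as it would sit in
# memory on x86-64.
raw = reported.to_bytes(8, "little")

print(hex(reported))
print(raw.hex(" "))
# Compare everything after the first byte with the start of the nop.
print(raw[1:] == NOPW_CS[:7])
```

The last comparison holds: after a leading zero byte, the value is the
first seven bytes of the 10-byte nop pattern, which might point at a
length being read from, or overwritten by, the nop pattern itself.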

~Andrew