Bug 25750 - GNU as has inconsistent behavior when expanding .macro that takes as input directives with arguments
Summary: GNU as has inconsistent behavior when expanding .macro that takes as input di...
Status: RESOLVED WONTFIX
Alias: None
Product: binutils
Classification: Unclassified
Component: binutils (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-30 20:25 UTC by Jian Cai
Modified: 2020-04-14 18:29 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2020-04-09 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jian Cai 2020-03-30 20:25:58 UTC
Handling of spaces between directives and their arguments when expanding macros is not consistent in GNU as -- it only supports selective directives. For example, the following code assembled with aarch64-linux-gnu-gcc,

.macro foo insn
\insn
.endm

foo .inst (XXX) 

However, if we replace the argument of the macro expansion with another directive, such as "foo .section .sec1", it will fail. This is the case on x86 as well. Removing the space and surround the section name with parenthesis will work, i.e. "foo .section(.sec1)". But the section name is (.sec1) instead of .sec1 as intended. Not sure if this is intended but it introduces some confusion IMO.
Comment 1 Nick Clifton 2020-04-01 13:50:21 UTC
Hi caij,

  Which version of the binutils are you using for these tests ?

  When I tried to reproduce the problem using the current development sources I received the message "Error: too many positional arguments" for each invocation of foo.

  I did however find a variation that worked and did not have any behavioural inconsistencies:

.macro foo insn:vararg
\insn
.endm

	foo .word (XXX) 
	foo .word 0x1234
	foo .section sec1

Using the :vararg suffix to tell the macro processor to use all of the text as a single argument (inside the macro) preserves spaces and results in the output that you would expect.

Cheers
  Nick
Comment 2 Jian Cai 2020-04-01 21:19:10 UTC
Hi Nick,

Thank you for the information. I used gas 2.33.1

$ as -v
GNU assembler version 2.33.1 (aarch64-linux-gnu) using BFD version (GNU Binutils for Debian) 2.33.1

Does :vararg work for multiple arguments? Thanks.
Comment 3 Nick Clifton 2020-04-02 07:52:31 UTC
Hi Jian,

(In reply to Jian Cai from comment #2)
> Does :vararg work for multiple arguments? Thanks.

Yes it does:

  % cat fred.s

  .macro foo insn:vararg
  \insn
  .endm
	foo .section sec1 , "ax" , @progbits
	foo .section sec2 , "aw" , @note

  % as fred.s -o fred.o
  % readelf --wide --sections fred.o | grep sec

  [ 4] sec1   PROGBITS   0000000000000000 000040 000000 00  AX  0   0  1
  [ 5] sec2   NOTE       0000000000000000 000040 000000 00  WA  0   0  1
 
Cheers
  Nick
Comment 4 Jian Cai 2020-04-02 18:58:54 UTC
Sorry I just realized my question was misleading. What I really meant was if vararg would work on a macro taking as input multiple directives with arguments, such as the one below. vararg seemed to not work with such case in my experiment. 

.macro foo insn1, insn2
\insn1
\insn2
.endm

foo .section .sec1
foo .section .sec2
Comment 5 Nick Clifton 2020-04-03 09:36:35 UTC
Hi Jian,

> Sorry I just realized my question was misleading. What I really meant was if
> vararg would work on a macro taking as input multiple directives with
> arguments, such as the one below. vararg seemed to not work with such case
> in my experiment. 
> 
> .macro foo insn1, insn2
> \insn1
> \insn2
> .endm
> 
> foo .section .sec1
> foo .section .sec2

Well now that particular example would not work[1], but then it is not using the :vararg suffix either.  You are correct however in assuming that you cannot have two :vararg suffixes in a macro definition.  So for example this does not work:

  .macro foo insn1:vararg insn2:vararg
  \insn1
  \insn2
  .endm

  foo .section .sec1 .section .sec2

In fact only the last macro argument can have the :vararg suffix, which is why it can only be used once.

Cheers
  Nick

[1]  Just to be pedantic, your example would work if you put the two insns on the same line.  So this:

  .macro foo insn1 insn2
  \insn1 \insn2
  .endm

      foo .section .sec1

Does work.
Comment 6 Jian Cai 2020-04-06 19:18:08 UTC
> So this:
> 
>   .macro foo insn1 insn2
>   \insn1 \insn2
>   .endm
> 
>       foo .section .sec1
> 
> Does work.

Thanks for the suggestion. I found this issue while working on https://github.com/ClangBuiltLinux/linux/issues/939. Essentially, the following case assembled with arm-linux-gnueabihf-gcc but not clang.

$ cat bad.s
.macro alternative_insn insn1
.endm

alternative_insn .inst (0x0)

There were several ways to make it work with clang, including removing the space between .inst and its argument, quoting the directive and its arguments, and now using vararg as you suggested. 

I made an LLVM patch for (https://reviews.llvm.org/D76962), which will make clang's integrated assembler to treat .inst and its arguments as one macro argument in the above example. It did not go though upstream review as it would create mismatch between clang and gcc, so I am curious if gcc is interested to have more general support, so that code like the example in Comment 4 will work without being rewritten.
Comment 7 Nick Clifton 2020-04-07 10:25:34 UTC
(In reply to Jian Cai from comment #6)
Hi Jian,

> I made an LLVM patch for (https://reviews.llvm.org/D76962), which will make
> clang's integrated assembler to treat .inst and its arguments as one macro
> argument in the above example.

I am not clear on this.  Do you mean that if a macro's argument is the
string ".inst" then it will treat any text that follows on the same line
as being part of the same argument ?  If so, then is this special treatment
reserved only for macro arguments that start with ".inst" ?

> It did not go though upstream review as it
> would create mismatch between clang and gcc, 

[Just to be pedantic, it would be a mismatch between llvm's integrated assembler and the binutils's stand alone assembler.  Gcc is a separate project to the binutils].

> so I am curious if gcc is
> interested to have more general support, so that code like the example in
> Comment 4 will work without being rewritten.

I think not.  I am definitely against having macros change their behaviour based upon the text inside their arguments.  Plus there is a hard built-in requirement that assembler directives operate on a per-line basis.  If there is a need to spread the arguments to a directive across multiple lines in the input text, then the newline should be escaped.  Eg:

  .section sec1 \
     , "ax" , @progbits

Cheers
  Nick
Comment 8 Jian Cai 2020-04-08 21:31:00 UTC
Hi Nick,

Thanks for the feedback. First all of, apology and correction to the code example I came up with in https://sourceware.org/bugzilla/show_bug.cgi?id=25750#c4. It should be

.macro foo insn1 insn2
\insn1
\insn2
.endm

foo .section .sec1, .section .sec2

> Do you mean that if a macro's argument is the
string ".inst" then it will treat any text that follows on the same line
as being part of the same argument ?  If so, then is this special treatment
reserved only for macro arguments that start with ".inst" ?

What I meant was for gcc to treat the macro expansion in this example as foo ".section .sec1", ".section .sec2", albeit without the quotes, i.e. considering .section .sec1 as the argument for insn1, and .section .sec2 as the argument for insn2. Currently, as treats

foo .inst (XXX), .inst (YYY)

as 

foo ".inst (XXX)", ".inst (YYY)"

but not some others, such as foo .section .sec1, .section .sec2. All in all I don't think this is a major issue and we have different ways to work around it, so it's completely understandable if this remains what it is on gcc. I just figured it may be worth reporting this issue to upstream in case there is any interest to fix it.
Comment 9 Nick Clifton 2020-04-09 14:56:14 UTC
(In reply to Jian Cai from comment #8)
Hi Jian,

> .macro foo insn1 insn2
> \insn1
> \insn2
> .endm
> 
> foo .section .sec1, .section .sec2
 
It occurs to me that we might be able to achieve something like this,
if we extend the macro argument syntax to include a definition of the
argument separator.  For example:

  .macro foo insn1:sep, insn2:vararg
  \insn1
  \insn2
  .endm

This would tell the macro processor that insn1 takes all characters up to 
(but excluding) the next comma character and that insn2 takes the rest of
the line (starting at the first non-whitespace character after the comma).

The new syntax "<arg>:sep<char>" tells the macro processor that argument
<arg> is separated from the next argument by the <char> character, and only
by the <char> character, so that any other character (apart from line ending
characters) forms part of <arg>.

This syntax does not exist *yet*.  I am going to have a look at implementing
it, but if it is too difficult I will give up.

What do you think - would this change provide the functionality that you
desire ?

Cheers
  Nick
Comment 10 Fangrui Song 2020-04-13 00:12:55 UTC
(In reply to Jian Cai from comment #8)
Hi Jian,

> I did however find a variation that worked and did not have any behavioural inconsistencies:

For

  .macro memorize a
  \a
  .endm
  memorize .long (19371226)

The x86 backend rejects it while most other backends accept it. This is actually unintentional.

gas/config/tc-i386.c defines

  /* List of chars besides those in app.c:symbol_chars that can start an
     operand.  Used to prevent the scrubber eating vital white-space.  */
  const char extra_symbol_chars[] = "*%-([{}"

The type of these symbols are considered LEX_IS_SYMBOL_COMPONENT.

Before the input is fed to the parser, gas runs do_scrub_chars() to perform some preprocessing.

* For other backends, lex['('] is 0, even though the state is 10, no space is kept
  macro_expand() will see `.long(19371226)` and the string will be parsed as an argument.
* For the x86 backend, lex['('] is LEX_IS_SYMBOL_COMPONENT, the space before '(' is kept (state is 10).
  macro_expand() will see `.long (19371226)` and the string will be parsed as two arguments.
  If the argument does not have the `:vararg` qualifier, gas will error
  `Error: too many positional arguments`


There is one feature not documented in https://sourceware.org/binutils/docs/as/Macro.html
For a string literal argument, gas will remove the quotes when evaluation the argument.
This makes the following example work:

  .macro foo a
  \a
  .endm
  foo ".section sec"

For the Linux kernel, __emit_insn may be used as either a directive or an operand.
Its reliance on `.inst (x)` being parsed as a whole is brittle.

  arch/arm64/include/asm/sysreg.h
  52:#define __emit_inst(x)                       .inst (x)

  arch/arm64/include/asm/sysreg.h
  93:#define SET_PSTATE_PAN(x)            __emit_inst(0xd500401f | PSTATE_PAN | ((!!x) << PSTATE_Imm_shift))

  arch/arm64/kvm/hyp/entry.S
  112:     ALTERNATIVE(nop, SET_PSTATE_PAN(1), ARM64_HAS_PAN, CONFIG_ARM64_PAN)

#define __emit_inst(x) .inst(x)
may be the most suitable workaround for clang integrated assembler.
Comment 11 Fangrui Song 2020-04-13 00:19:43 UTC
(In reply to Nick Clifton from comment #9)
> (In reply to Jian Cai from comment #8)
> Hi Jian,
> 
> > .macro foo insn1 insn2
> > \insn1
> > \insn2
> > .endm
> > 
> > foo .section .sec1, .section .sec2
>  
> It occurs to me that we might be able to achieve something like this,
> if we extend the macro argument syntax to include a definition of the
> argument separator.  For example:
> 
>   .macro foo insn1:sep, insn2:vararg
>   \insn1
>   \insn2
>   .endm
> 
> This would tell the macro processor that insn1 takes all characters up to 
> (but excluding) the next comma character and that insn2 takes the rest of
> the line (starting at the first non-whitespace character after the comma).
> 
> The new syntax "<arg>:sep<char>" tells the macro processor that argument
> <arg> is separated from the next argument by the <char> character, and only
> by the <char> character, so that any other character (apart from line ending
> characters) forms part of <arg>.
> 
> This syntax does not exist *yet*.  I am going to have a look at implementing
> it, but if it is too difficult I will give up.
> 
> What do you think - would this change provide the functionality that you
> desire ?
> 
> Cheers
>   Nick

Hi Nick, for Jian's request, we can probably just quote the arguments

 .macro foo insn1 insn2
 \insn1
 \insn2
 .endm

- foo .section .sec1, .section .sec2
+ foo ".section .sec1", ".section .sec2"

We need to document it, though.

> Comment 10
> There is one feature not documented in https://sourceware.org/binutils/docs/as/Macro.html For a string literal argument, gas will remove the quotes when evaluation the argument.

I am a bit concerned that the :sep qualifier will not be accurate, either.
do_scrub_chars() can squeeze a run of spaces or delete some spaces arbitrarily. The :sep argument will not be a sole separator as spaces/commas can be separators as well.


% x86_64-linux-gnu-as a.s && objdump -s 
 0000 612c6220 28632064 292c65             a,b (c d),e
% aarch64-linux-gnu-as a.s && objdump -s
...
 0000 612c6228 63206429 2c65               a,b(c d),e
Comment 12 Fangrui Song 2020-04-13 00:20:53 UTC
Forgot to upload the assembly file I used for the `x86_64-linux-gnu-as a.s && objdump -s` command in Comment 11

% cat a.s
.macro foo a:vararg
.ascii "\a"
.endm
foo a, b  (c d), e
Comment 13 Nick Clifton 2020-04-14 14:36:28 UTC
Yes - I have had second thoughts about my ":sep" idea, and decided
that it is the wrong approach.  Using double quotes is right solution
and no extra hacking should be needed.

Changing the lexical behaviour of the x86 backend would probably be a 
bad idea.  I am sure that it would break something somewhere.  I suspect
that, broken though it may be, we will be best off if we accept the
status quo.

Cheers
  Nick
Comment 14 Fangrui Song 2020-04-14 15:55:51 UTC
Yes. I would hope do_scrub_chars() did not squeeze a sequence of spaces or dropped spaces before '(', but making such a change would just be disruptive now, see https://lore.kernel.org/linux-arm-kernel/20200414154307.2cke3x5ocz3q2as4@google.com/

Linux kernel between 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bca8f17f57bd76ddf2bbd2527eb890d6f588853e (2016) til now relies on `.inst (...)` scrubbed by do_scrub_chars() as `.inst(...)`

Not making the change will keep the nuance regarding spaces but that may be bearable.

% cat a.s
.macro foo a:vararg
.ascii "\a"
.endm
foo a  b  (c  d)
% aarch64-linux-gnu-as a.s && objdump -s
...
 0000 61206228 63206429                    a b(c d)

% as a.s && objdump -s                                                                                                              
...
 0000 61206220 28632064 29                 a b (c d)
Comment 15 Jian Cai 2020-04-14 18:28:59 UTC
Thanks Nick and Fangrui. Going through all the comments, it seems keeping the status quo is a better choice to keep existing (especially legacy) code work. Thank you all for the discussion and will close this issue now.