[PATCH 0/2] RISC-V: Better support for long instructions (64 < x <= 176 [bits])

Tue Nov 22 00:43:15 GMT 2022

On Sat, Nov 19, 2022 at 3:11 PM Tsukasa OI <research_trasio@irq.a4lg.com> wrote:
>
> Hello,
>
> Commit bb996692bd96 ("RISC-V/gas: allow generating up to 176-bit
> instructions with .insn") was suddenly merged into master by Jan Beulich.
> Although this encoding is not considered frozen (see the section "Expanded
> Instruction-Length Encoding" in the ISA Manual), such attempt makes sense.
>
> However, it was insufficient in some ways.  I quickly found a stack-based
> buffer overflow and fixed that.  Besides that, I found following issues
> related to long (assembler: ".insn"-based, disassembler: unrecognized)
> instructions:
>
>
>
> [Assembler: False "value conflicts with instruction length" error]
>
>     .insn 22, 0xfedcba98765432100123456789abcdef55aa33cc607f
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 44 hexadecimal digits (22 bytes)
>
> GAS successfully compiles this (0x607f is one of the 2-byte prefixes of
> 22-byte instructions).  However, it fails on following examples.
>
>     .insn 22, 0xfedcba98765432100123456789ab607f
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 32 hexadecimal digits (16 bytes)
>
> with following error:
>
>     a.s: Assembler messages:
>     a.s:1: Error: value conflicts with instruction length `22,0xfedcba98765432100123456789ab607f'
>
> This is not good.  Yes, the value is 16 bytes but upper 6-bytes of a 22-byte
> instruction could be just zeroed out.  It should be valid just like:
>
>     .insn 22, 0x607f
>
> This is a testcase from insn.s and current GAS can handle this.  The only
> reason which the example above fails is, the 16-byte constant is handled as
> O_big (a big number).  The root cause of this is:
>
>     if (bytes != imm_expr->X_add_number * CHARS_PER_LITTLENUM)
>       return _("value conflicts with instruction length");
>
> where:
>
>     bytes                  : Expected instruction length
>     imm_expr->X_add_number : MINIMUM Number of entries (of generic_bignum)
>                              required to represent a big number.
>                              Non-zero if O_big.
>     CHARS_PER_LITTLENUM    : A big number is represented as an array of
>                              little numbers (current GAS: unsigned short).
>                              This is the number of chars of a little number.
>
> That means, if a big number can be represented with different bytes as
> expected, ".insn" fails with "value conflicts with instruction length".
>
> Replacing this with:
>
>     if (bytes < imm_expr->X_add_number * CHARS_PER_LITTLENUM) /* == -> <  */
>       return _("value conflicts with instruction length");
>
> would resolve this problem on the current GAS but this depends on how
> the big numbers are represented.  If big number changes its base type
> (such as 2^32), it will stop working.
>
> My solution (on PATCH 2/2) is much complex.  It computes exact bytes
> required to repsent a big number and fails only if number of represented
> bytes exceeds expected number of bytes.
>
>     unsigned int llen = 0;
>     for (LITTLENUM_TYPE lval = generic_bignum[imm_expr->X_add_number - 1]; lval != 0; llen++)
>       lval >>= BITS_PER_CHAR;
>     unsigned int repr_bytes = (imm_expr->X_add_number - 1) * CHARS_PER_LITTLENUM + llen;
>     if (bytes < repr_bytes)
>       return _("value conflicts with instruction length");
>
> Although it may not work on all possible bigint encoding configurations, my
> implementation is flexible enough that any realistic bigint configurations
> would be compatible.
>
> This is the result after this fix (and the disassembler fix):
>
>        0:   607f 89ab 4567 0123     .byte   0x7f, 0x60, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01, 0x10, 0x32, 0x54, 0x76, 0x98, 0xba, 0xdc, 0xfe, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
>        8:   3210 7654 ba98 fedc
>       10:   0000 0000 0000
>
>
>
> [Assembler: DWARF line number information is missing with big numbers]
>
> We can compile this on the current GAS (note: line number is prefixed):
>
> 1:  .insn 22, 0x607f
> 2:  .insn 22, 0xfedcba98765432100123456789abcdef55aa33cc607f
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 44 hexadecimal digits (22 bytes)
>
> with "as -gdwarf-2 -o a.o -c a.s"
>   or "as -g -o a.o -c a.s".
>
> However, if we extract the debug information, the result will be rather
> strange.  Here's the result of "objdump -WL a.o":
>
>     a.o:     file format elf64-littleriscv
>
>     Contents of the .debug_line section:
>
>     CU: a.s:
>     File name                            Line number    Starting address    View    Stmt
>     a.s                                            1                   0               x
>     a.s                                            -                0x2c
>
> A 22-byte instruction on the address 0 (line 1) is fine.  But wait, where
> the another 22-byte instruction on the address 0x16 (line 2) go?  Before
> describing, we'll see what's going on in the line 1.
>
> 1.  s_riscv_insn (".insn" directive) is called
>     a.  riscv_ip (".insn" with known instruction encoding) is called
>         but fails with "unrecognized opcode".
>     b.  riscv_ip_hardcode (".insn" with raw values) is called
>         and succeeds.
>     c.  append_insn is called **when imm_expr.X_op is not O_big**)
>         -> Go to 2
> 2.  append_insn (output an instruction)
>     a.  dwarf2_emit_insn is called (DWARF debug information is appended)
>     b.  relocation / fixups
>         (if relaxed instructions are emitted, return early)
>     c.  add_fixed_insn (add the instruction to the end of the output)
>
> The reason why it fails on line 2 (O_big) is 1.c.  Because DWARF debug
> information is appended on 2.a., this part is not called because append_insn
> is not called when imm_expr.X_op is O_big.  Instead, it does completely
> separate handling on the riscv_ip_hardcode function (1.b.).
>
> My patchset (PATCH 2/2) unifies this separate handling with regular
> instructions.  That means, 1.c. is changed so that append_insn is called
> even if imm_expr.X_op is O_big.  To unify handling, it adds insn_long_opcode
> (long opcode encoding up to 22-bytes) to struct riscv_cl_insn and the
> install_insn function (called by add_fixed_insn function) is modified to
> handle long instructions.  riscv_ip_hardcode function is modified to store
> long instruction encodings to new insn_long_opcode field.
>
> As a result, ".insn" directive with a big number emits correct
> debug information like this:
>
>     a.o:     file format elf64-littleriscv
>
>     Contents of the .debug_line section:
>
>     CU: a.s:
>     File name                            Line number    Starting address    View    Stmt
>     a.s                                            1                   0               x
>     a.s                                            2                0x16       1       x
>     a.s                                            -                0x2c
>
>
>
> [Disassembler: Instruction is trimmed with 64-bits]
>
> In this section, we reuse the object file generated by the section above
> (two 22-byte instructions).
>
>     0000000000000000 <.text>:
>        0:   607f 0000 0000 0000     .byte   0x7f, 0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
>        8:   0000 0000 0000 0000
>       10:   0000 0000 0000
>       16:   607f 33cc 55aa cdef     .byte   0x7f, 0x60, 0xcc, 0x33, 0xaa, 0x55, 0xef, 0xcd, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
>       1e:   89ab 4567 0123 3210                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       26:   7654 ba98 fedc                                                                  zeroed out after first 64-bits
>
> See ".byte" at the address 0x16.  It's trimmed at 64-bits.
> The resolution is simple.  If we simply add a char* argument (containing all
> instruction bits), we can easily resolve this.
>
>     0000000000000000 <.text>:
>        0:   607f 0000 0000 0000     .byte   0x7f, 0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
>        8:   0000 0000 0000 0000
>       10:   0000 0000 0000
>       16:   607f 33cc 55aa cdef     .byte   0x7f, 0x60, 0xcc, 0x33, 0xaa, 0x55, 0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01, 0x10, 0x32, 0x54, 0x76, 0x98, 0xba, 0xdc, 0xfe
>       1e:   89ab 4567 0123 3210                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       26:   7654 ba98 fedc                                                                  all instruction bits are correct
>
>
>
> With this patchset, although we don't support any "valid" long instructions
> yet, we can support long "invalid" instructions (only emitted through
> ".insn" and the encoded bits do not match to any valid instructions and
> printed as ".byte").
>
> This patchset is also important because it makes the assembler behavior
> consistent (does not depend whether the raw value can be represented with
> a small constant).
>
>
>
> Thanks,
> Tsukasa
>
>
>
>
> Tsukasa OI (2):
>   RISC-V: Make .insn tests stricter
>   RISC-V: Better support for long instructions

Thanks for Jan's feedback and it makes sense.  I have no further
opinion here, so I will totally respect and follow Jan judgment for
this series.  Anyway, thanks for spending time to fix this.

Nelson

>  gas/config/tc-riscv.c                | 50 +++++++++++++++++++++++-----
>  gas/testsuite/gas/riscv/insn-dwarf.d | 10 +++++-
>  gas/testsuite/gas/riscv/insn-na.d    | 28 ++++++++++------
>  gas/testsuite/gas/riscv/insn.d       | 32 +++++++++++++++---
>  gas/testsuite/gas/riscv/insn.s       |  9 +++++
>  opcodes/riscv-dis.c                  | 13 +++++---
>  6 files changed, 113 insertions(+), 29 deletions(-)
>
>
> base-commit: 15253318be0995200cc59929ca32eedbfd041e45
> --
> 2.38.1
>