[RFC] libopcodes based disassembler syntax highlighting

Jan Beulich jbeulich@suse.com
Mon Dec 6 09:02:21 GMT 2021


On 03.12.2021 17:03, Andrew Burgess via Binutils wrote:
> ***** WARNING *****
> 
> This commit is not in anyway supposed to be merge ready code.  This
> commit is being shared just to get feedback on the acceptability of
> the general approach.  If the feedback is positive I'll completely
> rework this commit and repost a more polished set of commits.
> 
> ***** END OF WARNING *****
> 
> This commit is a very rough, first draft, of adding disassembly syntax
> highlighting via libopcodes.
> 
> The idea is to add a new function to the disassemble_info structure
> for performing styled disassembler output.  There's a new enum defined
> which lists the various stylistic parts of the disassembler output,
> some that I currently have are 'text' (used as the default for any
> unstyled text), then we have 'mnemonic', 'register', 'immediate', and
> 'symbol' for all the different parts of a disassembled instruction.
> 
> We then update the users of the disassembler, objdump and gdb, to
> provide both the old fprintf_func and the new fprintf_styled_func.
> 
> Now we can update the disassemblers, one at a time, to switch over to
> using fprintf_styled_func.  This switch over can be done target at a
> time, and I'm certainly not promising to convert all targets.
> 
> As an example, in this commit, I've converted the risc-v disassembler,
> which is pretty easy to change, and the x86 disassembler which is much
> harder.  In fact, as it currently stands, I would not be happy to
> upstream the x86 disassembler parts, I've had to hack in a function
> that applies the styling by analysing the text of the instruction
> operands, which isn't really the idea at all.  That said, it works
> well enough, and I figured most folk will want to see what the results
> look like on x86.
> 
> I've also updated GDB to handle this new scheme, so we get syntax
> highlighting there too for x86 and risc-v.
> 
> Right now the styling is always on in objdump, which is probably not
> what we'd want, so that would need to change.

Personally I find such coloring more distracting than helpful. Hence yes,
I would much appreciate if this mode wouldn't become the default (let
alone being enabled unconditionally). I also think that in assembly
dialects like x86'es AT&T syntax, register and immediate prefixes are
already sufficient to identify what is what.

> @@ -49,6 +47,18 @@ enum dis_insn_type
>    dis_dref2			/* Two data references in instruction.  */
>  };
>  
> +enum disassembler_style
> +{
> +  dis_style_text,
> +  dis_style_mnemonic,
> +  dis_style_register,
> +  dis_style_immediate,
> +  dis_style_symbol,
> +};

With x86 not using dis_style_symbol at all, what is this supposed to
cover? How would an addend count? As part of the symbol, as an immediate,
or as text?

> --- a/opcodes/i386-dis.c
> +++ b/opcodes/i386-dis.c
> @@ -9445,6 +9445,38 @@ get_sib (disassemble_info *info, int sizeflag)
>      }
>  }
>  
> +/* This is a total hack for now, just to get something working on x86-64.
> +   The arguments are currently built (in some cases at least) into a single
> +   string, and then printed.  If we want to support styled printing then
> +   this would need to change, but that doesn't feel like an impossible
> +   task.
> +
> +   And it doesn't have to be all done in one go, we can always just fall
> +   back to printing some text with the "default" styling (i.e. no styling)
> +   if a particular bit of the disassembler is too hard to update in the
> +   short term.  */
> +
> +static void
> +print_operand_with_styling (disassemble_info *info, const char *txt)
> +{
> +  const char *tmp;
> +  enum disassembler_style style = dis_style_text;
> +
> +  for (tmp = txt; *tmp != '\0'; ++tmp)
> +    {
> +      if (*tmp == '*' || *tmp == '(' || *tmp == ')' || *tmp == ',')
> +        style = dis_style_text;
> +      else if (tmp[0] == '0' && tmp[1] == 'x')
> +        style = dis_style_immediate;
> +      else if (tmp[0] == '%')
> +        style = dis_style_register;
> +      else if (style == dis_style_text && ISDIGIT(tmp[0]))
> +        style = dis_style_immediate;
> +
> +      (*info->fprintf_styled_func) (info->stream, style, "%c", tmp[0]);
> +    }
> +}

While you say it's a hack and you don't want to upstream it in this
shape, I'd still like to point out (see above) that in AT&T syntax
immediates can be recognized from a leading $. There's an ambiguity
of course with symbols whose names start with $. (And obviously
there's a similar ambiguity between registers and symbols whose
names start with %.) Like you do for %, I think that leading $ ought
to be part of what gets emitted with dis_style_immediate.

Also, as to the distinction between immediate and symbol: How would
an immediate referencing a symbol count?

Jan



More information about the Binutils mailing list