[RFC] libopcodes based disassembler syntax highlighting
Jan Beulich
jbeulich@suse.com
Mon Dec 6 09:02:21 GMT 2021
On 03.12.2021 17:03, Andrew Burgess via Binutils wrote:
> ***** WARNING *****
>
> This commit is not in anyway supposed to be merge ready code. This
> commit is being shared just to get feedback on the acceptability of
> the general approach. If the feedback is positive I'll completely
> rework this commit and repost a more polished set of commits.
>
> ***** END OF WARNING *****
>
> This commit is a very rough, first draft, of adding disassembly syntax
> highlighting via libopcodes.
>
> The idea is to add a new function to the disassemble_info structure
> for performing styled disassembler output. There's a new enum defined
> which lists the various stylistic parts of the disassembler output,
> some that I currently have are 'text' (used as the default for any
> unstyled text), then we have 'mnemonic', 'register', 'immediate', and
> 'symbol' for all the different parts of a disassembled instruction.
>
> We then update the users of the disassembler, objdump and gdb, to
> provide both the old fprintf_func and the new fprintf_styled_func.
>
> Now we can update the disassemblers, one at a time, to switch over to
> using fprintf_styled_func. This switch over can be done target at a
> time, and I'm certainly not promising to convert all targets.
>
> As an example, in this commit, I've converted the risc-v disassembler,
> which is pretty easy to change, and the x86 disassembler which is much
> harder. In fact, as it currently stands, I would not be happy to
> upstream the x86 disassembler parts, I've had to hack in a function
> that applies the styling by analysing the text of the instruction
> operands, which isn't really the idea at all. That said, it works
> well enough, and I figured most folk will want to see what the results
> look like on x86.
>
> I've also updated GDB to handle this new scheme, so we get syntax
> highlighting there too for x86 and risc-v.
>
> Right now the styling is always on in objdump, which is probably not
> what we'd want, so that would need to change.
Personally I find such coloring more distracting than helpful. Hence yes,
I would much appreciate if this mode wouldn't become the default (let
alone being enabled unconditionally). I also think that in assembly
dialects like x86'es AT&T syntax, register and immediate prefixes are
already sufficient to identify what is what.
> @@ -49,6 +47,18 @@ enum dis_insn_type
> dis_dref2 /* Two data references in instruction. */
> };
>
> +enum disassembler_style
> +{
> + dis_style_text,
> + dis_style_mnemonic,
> + dis_style_register,
> + dis_style_immediate,
> + dis_style_symbol,
> +};
With x86 not using dis_style_symbol at all, what is this supposed to
cover? How would an addend count? As part of the symbol, as an immediate,
or as text?
> --- a/opcodes/i386-dis.c
> +++ b/opcodes/i386-dis.c
> @@ -9445,6 +9445,38 @@ get_sib (disassemble_info *info, int sizeflag)
> }
> }
>
> +/* This is a total hack for now, just to get something working on x86-64.
> + The arguments are currently built (in some cases at least) into a single
> + string, and then printed. If we want to support styled printing then
> + this would need to change, but that doesn't feel like an impossible
> + task.
> +
> + And it doesn't have to be all done in one go, we can always just fall
> + back to printing some text with the "default" styling (i.e. no styling)
> + if a particular bit of the disassembler is too hard to update in the
> + short term. */
> +
> +static void
> +print_operand_with_styling (disassemble_info *info, const char *txt)
> +{
> + const char *tmp;
> + enum disassembler_style style = dis_style_text;
> +
> + for (tmp = txt; *tmp != '\0'; ++tmp)
> + {
> + if (*tmp == '*' || *tmp == '(' || *tmp == ')' || *tmp == ',')
> + style = dis_style_text;
> + else if (tmp[0] == '0' && tmp[1] == 'x')
> + style = dis_style_immediate;
> + else if (tmp[0] == '%')
> + style = dis_style_register;
> + else if (style == dis_style_text && ISDIGIT(tmp[0]))
> + style = dis_style_immediate;
> +
> + (*info->fprintf_styled_func) (info->stream, style, "%c", tmp[0]);
> + }
> +}
While you say it's a hack and you don't want to upstream it in this
shape, I'd still like to point out (see above) that in AT&T syntax
immediates can be recognized from a leading $. There's an ambiguity
of course with symbols whose names start with $. (And obviously
there's a similar ambiguity between registers and symbols whose
names start with %.) Like you do for %, I think that leading $ ought
to be part of what gets emitted with dis_style_immediate.
Also, as to the distinction between immediate and symbol: How would
an immediate referencing a symbol count?
Jan
More information about the Binutils
mailing list