Machine Readable Format options? Try 2 without autoattachment.

Fri Apr 10 23:29:22 GMT 2020

I have been using binutils for various datamining tasks for decades now....

I build multiple products for multiple tool chains and multiple
platforms from the same code base.... carved up by #if spaghetti and
cunning linking.

The one and only thing that truly knows what ends up being called by
what, and what ends up where, is the preprocessor / compiler / linker.

Binutils, via tools like nm and objdump, can tell me what the compiler /
linker did....

But the output formats are designed for human consumption and have,
ahhh, misfeatures that make automated parsing and querying hard.

For example, it's quite common in the output formats to omit a field,
forcing you to use heuristics to navigate past the corresponding
spaces to pick up the next column.
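
To make the omitted-field problem concrete, here is a minimal Python
sketch of the sort of heuristic parser this forces on you. It assumes
the default (BSD-style) nm layout of "address type name", with the
address column left blank for undefined symbols. The function name
parse_nm_line and the sample lines are made up for illustration.

```python
import json


def parse_nm_line(line):
    """Parse one line of default (BSD-style) nm output into a dict.

    Returns None for lines that do not look like symbol entries.
    """
    fields = line.split(None, 2)
    if len(fields) < 2:
        return None
    # Heuristic: addresses are zero-padded hex, so a one-character
    # first field must be the symbol type, i.e. the address column
    # was left blank (as nm does for undefined symbols).
    if len(fields[0]) == 1:
        sym_type, name = line.split(None, 1)
        return {"address": None, "type": sym_type, "name": name.strip()}
    if len(fields) < 3:
        return None
    address, sym_type, name = fields
    return {"address": int(address, 16), "type": sym_type, "name": name.strip()}


# Hypothetical sample in the shape nm prints for a 64-bit object.
sample = """\
0000000000001139 T main
                 U puts
0000000000004010 B counter
"""

records = [r for r in map(parse_nm_line, sample.splitlines()) if r]
print(json.dumps(records, indent=2))
```

Note that the parser has to guess which column is missing from the
shape of the first token; a structured output format would make that
guess unnecessary.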

Also, the formats are informally specified... not something you really
want to rely on.

However, the world is awash with well-defined, human-readable machine
formats that could trivially be used (e.g. JSON, YAML, ...).

Some of the binutils tools have --format=XXX settings (typically used
for compatibility with legacy standards).

It would be trivial to add a --format=json option.

Ideally, every bit of information that can be obtained from the
binutils tools should be available in a machine-readable form... (a
well factored / designed sqlite db would be a dream...)

...but in practice I'm usually parsing objdump --syms, or nm --extern
or even occasionally objdump -d to pull call graph and definition -
reference graph information or objdump --dwarf=info to pull macro
definitions info.

It's always a sore point with me that the first step in datamining elf
data is always to write a one-off custom somewhat kludgy parser to
read the output format.

However I'm clearly not alone in this desire, as witnessed by this
llvm-dev mail thread...

https://groups.google.com/d/topic/llvm-dev/U-sTsZB-6ls/discussion

Is there any initiative afoot to produce machine-readable output formats
such as csv or json or yaml or...?

My typical destination for these activities is either a ruby script or
sqlite.
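
As a rough sketch of the sqlite destination, assuming symbol records
have already been extracted as dicts with name / type / address keys:
the table layout and the load_symbols helper below are illustrative
assumptions, not an existing schema or binutils feature.

```python
import sqlite3


def load_symbols(conn, object_file, records):
    """Insert symbol records for one object file into a 'symbols' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS symbols ("
        " object_file TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " type TEXT NOT NULL,"
        " address INTEGER)"  # NULL for undefined symbols
    )
    conn.executemany(
        "INSERT INTO symbols (object_file, name, type, address)"
        " VALUES (?, ?, ?, ?)",
        [(object_file, r["name"], r["type"], r["address"]) for r in records],
    )
    conn.commit()


# Hypothetical records, in memory only, for demonstration.
conn = sqlite3.connect(":memory:")
load_symbols(conn, "main.o", [
    {"name": "main", "type": "T", "address": 0},
    {"name": "puts", "type": "U", "address": None},
])

# Once the data is relational, questions become one-line queries.
undefined = [row[0] for row in conn.execute(
    "SELECT name FROM symbols WHERE type = 'U' ORDER BY name")]
print(undefined)
```

The interesting part is the last query: all the effort today goes into
getting to the point where it can be asked at all.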

Of course, a standalone tool could do this, but for most tasks it's
just adding to the preexisting list of formatters (eg. bsd / posix /
sysv / and now add json)

I did look at elfutils (I wasn't aware of them until Nick mentioned
them), but they seem way behind in all aspects, including output
formats.

Would the maintainers object to a pull request that added such a feature
(probably easier than doing half baked parsers)?

What would the preferred output format be?

What would the preferred command line interface be (e.g. instead of the
usual bsd / posix / sysv ... options on --format=, add --format=csv or
something)?

In my ideal dream universe there would be a "convert everything elf and
dwarf knows about this large collection of files into a well designed
relational data model in a sqlite db" switch.

But that is probably a large step too far.

Arguably this exists, it's called libbfd, libdwarf and libelf....but
those have to cope with the many tentacled horrors of decades of
legacy systems and cpus, which is why everyone uses the binutils tools
not the raw naked bfd.

Example use case just for inspiration:

Suppose you have a large body (>2000) of object files * 4 tool chains *
20 products * 1000 unit tests, all compiled with -ffunction-sections
-fdata-sections and linked with --gc-sections...

Which functions / macros / etc. are in the source but never used anywhere?

Believe me, whenever I do this class of analysis on a mature system,
you'll be surprised at how much code I delete, and how much easier it
is to refactor the remaining code...
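
One way to sketch that query in Python, assuming the symbol names have
already been extracted: compare what each object file defines against
what survives --gc-sections in any linked image. The never_linked
function, the file names and the symbol names are all hypothetical.

```python
def never_linked(object_symbols, image_symbols):
    """Find symbols defined in object files that survive in no linked image.

    object_symbols: dict mapping object file -> set of defined symbol names
    image_symbols:  dict mapping linked image -> set of symbol names present
    Returns a dict mapping object file -> sorted list of dead symbol names.
    """
    # Anything present in at least one product or test image is "used".
    survivors = set().union(*image_symbols.values())
    dead = {}
    for obj, names in object_symbols.items():
        unused = names - survivors
        if unused:
            dead[obj] = sorted(unused)
    return dead


# Made-up data standing in for parsed nm / objdump output.
object_symbols = {
    "uart.o": {"uart_init", "uart_send", "uart_debug_dump"},
    "crc.o": {"crc16", "crc32"},
}
image_symbols = {
    "product_a.elf": {"main", "uart_init", "uart_send", "crc16"},
    "product_b.elf": {"main", "uart_init", "crc16"},
    "unit_tests.elf": {"main", "crc16", "uart_send"},
}

dead = never_linked(object_symbols, image_symbols)
print(dead)
```

This only covers functions and data that have symbols; macros would
need the DWARF macro information as a separate input.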

Thanks!

John