This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.


Token-level mapping of coverage information and generated code

From: Simon Richter <Simon dot Richter at hogyros dot de>
To: binutils at sourceware dot org
Date: Tue, 3 Mar 2020 19:39:28 +0100
Subject: Token-level mapping of coverage information and generated code

Hi,

I'd like to get finer-than-line-level information for code coverage and
optimized-out code.

Consider:

    extern void foo(void);                              // 1
    int test()                                          // 2
    {                                                   // 3
        int a = 0, b = 0, c = 1, d = 0;                 // 4
        if( a == b && a == c && b == c) { d = a; }      // 5
        foo();                                          // 6
        return d;                                       // 7
    }                                                   // 8

Compiling with gcc -c -O0 and mapping back to source, I get

    int test()
    {
       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   48 83 ec 10             sub    $0x10,%rsp
        int a = 0, b = 0, c = 1, d = 0;
       8:   c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
       f:   c7 45 f4 00 00 00 00    movl   $0x0,-0xc(%rbp)
      16:   c7 45 f0 01 00 00 00    movl   $0x1,-0x10(%rbp)
      1d:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
    
        if( a == b && a == c && b == c) { d = a; }
      24:   8b 45 f8                mov    -0x8(%rbp),%eax
      27:   3b 45 f4                cmp    -0xc(%rbp),%eax
      2a:   75 16                   jne    42 <test+0x42>
      2c:   8b 45 f8                mov    -0x8(%rbp),%eax
      2f:   3b 45 f0                cmp    -0x10(%rbp),%eax
      32:   75 0e                   jne    42 <test+0x42>
      34:   8b 45 f4                mov    -0xc(%rbp),%eax
      37:   3b 45 f0                cmp    -0x10(%rbp),%eax
      3a:   75 06                   jne    42 <test+0x42>
      3c:   8b 45 f8                mov    -0x8(%rbp),%eax
      3f:   89 45 fc                mov    %eax,-0x4(%rbp)
    
        foo();
      42:   e8 00 00 00 00          callq  47 <test+0x47>
                            43: R_X86_64_PLT32      foo-0x4
    
        return d;
      47:   8b 45 fc                mov    -0x4(%rbp),%eax
    }
      4a:   c9                      leaveq 
      4b:   c3                      retq   

The finest resolution I can get here is a single line; addr2line reports
exactly the same instruction-to-source-line mapping.

Instrumenting for code coverage and running, I get
    
            1:    2:int test()
            -:    3:{
            1:    4:    int a = 0, b = 0, c = 1, d = 0;
           1*:    5:    if( a == b && a == c && b == c) { d = a; }
            1:    5-block  0
            1:    5-block  1
        %%%%%:    5-block  2
        %%%%%:    5-block  3
            1:    6:    foo();
            1:    6-block  0
            1:    7:    return d;
            -:    8:}

As expected, the condition is resolved into four basic blocks,
corresponding to the three tests and the conditional body. Can I somehow
map these basic blocks back to the tokens in the source file?

Similarly, if I compile with optimization enabled, mapping back to source
code gives me

    int test()
    {
       0:   48 83 ec 08             sub    $0x8,%rsp
        int a = 0, b = 0, c = 1, d = 0;
        if( a == b && a == c && b == c) { d = a; }
        foo();
       4:   e8 00 00 00 00          callq  9 <test+0x9>
                            5: R_X86_64_PLT32       foo-0x4
        return d;
    }
       9:   31 c0                   xor    %eax,%eax
       b:   48 83 c4 08             add    $0x8,%rsp
       f:   c3                      retq   

I can get a bit better mapping information by interrogating addr2line to
see what source code lines actually contributed to the output:

    $ python -c 'for x in range(0, 16): print(hex(x))' | \
        addr2line -e test.o | \
        cut -d: -f2 | \
        uniq
    3
    6
    8
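
The same de-duplication can be done in a script, which makes it easier to
combine with other data later. The following is only a minimal sketch: the
function name and the sample paths are made up for illustration, and it
assumes the plain "file:line" output format, with "??:0" or "??:?" for
unknown addresses and an optional " (discriminator N)" suffix. Unlike
uniq, it removes duplicates globally rather than only adjacent ones.

```python
import re

# "path:line", optionally followed by " (discriminator N)".
_LOC = re.compile(r"^(?P<file>.*):(?P<line>\d+|\?)(?: \(discriminator \d+\))?$")


def contributing_lines(addr2line_output):
    """Return the sorted source line numbers named in addr2line output."""
    lines = set()
    for raw in addr2line_output.splitlines():
        m = _LOC.match(raw.strip())
        if not m or m.group("line") in ("?", "0"):
            continue  # unknown location, e.g. "??:0" or "??:?"
        lines.add(int(m.group("line")))
    return sorted(lines)


# Hypothetical sample input; real input would be captured from addr2line.
sample = """\
/tmp/test.c:3
/tmp/test.c:3
/tmp/test.c:6
/tmp/test.c:8
??:0
"""
print(contributing_lines(sample))
```

Real input could be captured by running addr2line with the addresses as
arguments, for example through subprocess.run, and passing its standard
output to the function.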

This omits the initialization of d, but I guess that can't be helped: the
value is propagated into the return statement as a constant. That is
probably not much of a problem in practice.

Again, I'd like to get a finer-grained mapping than lines here, so I can
highlight in the source code which code actually got used in the final
output.

As a nasty hack, I can run the source code through "tr ' ' '\n'" before
compiling. That gives me rather good resolution for the coverage test, but
the mapping to subexpressions is somewhat arbitrary, because counters are
associated with control flow inside the expression:

            1:   28:if(
            -:   29:a
            -:   30:==
            -:   31:b
            1:   32:&&
            -:   33:a
            -:   34:==
            -:   35:c
        #####:   36:&&
            -:   37:b
            -:   38:==
            -:   39:c)
            -:   40:{
            -:   41:d
        #####:   42:=
            -:   43:a;
            -:   44:}
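
To relate the line numbers of such a split file back to the original
source, the split can record where each token came from. This is a rough
sketch, not what produced the output above: split_tokens is a
hypothetical helper, it splits on all whitespace and drops empty lines
(so its line numbering differs from plain tr), and like tr it will break
string literals, comments and preprocessor directives that contain
whitespace.

```python
def split_tokens(source):
    """Split source on whitespace, one token per output line.

    Returns (new_source, origin), where origin[i] is the 1-based
    (line, column) in the original source of the token that ends up on
    output line i + 1.
    """
    tokens, origin = [], []
    for lineno, text in enumerate(source.splitlines(), start=1):
        col = 0
        for tok in text.split():
            # Only whitespace lies between col and the token, so the
            # first match at or after col is the token itself.
            col = text.index(tok, col)
            tokens.append(tok)
            origin.append((lineno, col + 1))
            col += len(tok)
    return "\n".join(tokens) + "\n", origin


src = "int a = 0;\n  if( a == b) { d = a; }\n"
new_src, origin = split_tokens(src)
for n, (tok, where) in enumerate(zip(new_src.splitlines(), origin), start=1):
    print(n, tok, where)
```

A line number N in the coverage report for the split file then
corresponds to origin[N - 1] in the original file.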

Is there some way I could accurately extract information from a run that
allows me to highlight which subexpressions have been evaluated?

From the run above, I can possibly get

        if( a == b && a == c && b == c) { d = a; }
        ~~~~~~~~~~ ~~~~~~~~~ ~~~~~~~~~~~~~~ ~~~~~~
        1          1         -              -
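
The grouping behind this picture can be sketched as follows, under
stated assumptions: a span starts at every token that carries a counter
and extends to the last token before the next counted one, and an
unexecuted counter ("#####") is shown as "-". The annotate function and
the way tokens are paired with counts are illustrative only; the counts
are copied from the per-token run above.

```python
def annotate(line, tokens):
    """Render underline and count rows for one source line.

    tokens is a list of (column, text, count) with 0-based columns;
    count is None where the report printed '-', otherwise the count
    string such as '1' or '#####'.
    """
    spans = []
    for col, text, count in tokens:
        end = col + len(text)
        if count is not None or not spans:
            spans.append([col, end, count])  # a counted token opens a span
        else:
            spans[-1][1] = end               # uncounted token extends it
    underline = [" "] * len(line)
    labels = [" "] * len(line)
    for start, end, count in spans:
        underline[start:end] = "~" * (end - start)
        label = "-" if count in (None, "#####") else count
        labels[start:start + len(label)] = label
    return line, "".join(underline).rstrip(), "".join(labels).rstrip()


line = "if( a == b && a == c && b == c) { d = a; }"
counts = ["1", None, None, None, "1", None, None, None,
          "#####", None, None, None, None, None, "#####", None, None]
tokens, col = [], 0
for text, count in zip(line.split(), counts):
    col = line.index(text, col)
    tokens.append((col, text, count))
    col += len(text)

for row in annotate(line, tokens):
    print(row)
```

The span boundaries come purely from where counters happen to land, so
this inherits the arbitrariness of the token-per-line hack.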

which isn't bad, but it could probably be improved. The end goal is to
build reports

    "this condition has not been touched by a testcase"
and
    "this code is unused and the compiler can prove it"

   Simon
