Bug 14026

Summary:	inode-uprobes should compute proper SET_REG_IP before probe invocation
Product:	systemtap	Reporter:	Negreanu Adrian <adrian.m.negreanu>
Component:	runtime	Assignee:	Unassigned <systemtap>
Status:	RESOLVED FIXED
Severity:	normal	CC:	dsmith, fche, jistone, mark
Priority:	P2
Version:	unspecified
Target Milestone:	---
Host:		Target:
Build:		Last reconfirmed:
Attachments:	readelf -a on the debug file readelf -a on the target executable output from stap and staprun output from stap and staprun when UNW_PC_FROM_RA=1 readelf --debug-dump=frames target_executable gdb output when the same breakpoint is added dataAlign patch (bad) readelf --debug-dump=frames gcc-4.6-target_executable patch that sets IP before calling probe handler

Description Negreanu Adrian 2012-04-27 10:00:31 UTC

Created attachment 6372 [details]
readelf -a on the debug file

On an android x86 target, I'm trying this script:

probe process("/system/bin/mksh").function("chvt_reinit") {
        printf("%s: pid=%d\n", module_name(), pid());
        printf("%s: ", module_name());
        print_regs();
        print_ubacktrace();
        printf("%s: DONE\n",module_name());
}

The backtrace consists of only addresses.
I attached the output of stap + staprun,
readelf -a on the target executable and
 of its .gnu_debuglink
and also the "readelf -debug-dump=frames target_executable".

One thing I noticed while trying to get more insight of what's happening,
was that if I define UNW_PC_FROM_RA to be 1, in -debug-dump=frames
the symbol is resolved but unwind() will only unwind the topmost frame.

Comment 1 Negreanu Adrian 2012-04-27 10:01:35 UTC

Created attachment 6373 [details]
readelf -a on the target executable

Comment 2 Negreanu Adrian 2012-04-27 10:03:18 UTC

Created attachment 6374 [details]
output from stap and staprun

Comment 3 Negreanu Adrian 2012-04-27 13:51:04 UTC

Created attachment 6375 [details]
output from stap and staprun when UNW_PC_FROM_RA=1

Comment 4 Negreanu Adrian 2012-04-27 15:22:46 UTC

Created attachment 6376 [details]
readelf --debug-dump=frames target_executable

Comment 5 Negreanu Adrian 2012-04-27 15:39:29 UTC

Created attachment 6377 [details]
gdb output when the same breakpoint is added

Comment 6 Negreanu Adrian 2012-05-02 13:21:55 UTC

The patch below made it work for me:


diff --git a/runtime/unwind.c b/runtime/unwind.c
index e440177..3ac7f8d 100644
--- a/runtime/unwind.c
+++ b/runtime/unwind.c
@@ -610,7 +610,7 @@ static int processCFI(const u8 *start, const u8 *end, unsigned long targetLoc,
                        break;
                }
                dbug_unwind(1, "targetLoc=%lx state->loc=%lx\n", targetLoc, state->loc);
-               if (ptr.p8 > end)
+               if (ptr.p8 >= end)
                        result = 0;
                if (result && targetLoc != 0 && targetLoc < state->loc)
                        return 1;

Comment 7 Mark Wielaard 2012-05-02 14:05:25 UTC

(In reply to comment #6)
> The patch below made it work for me:
> 
> 
> diff --git a/runtime/unwind.c b/runtime/unwind.c
> index e440177..3ac7f8d 100644
> --- a/runtime/unwind.c
> +++ b/runtime/unwind.c
> @@ -610,7 +610,7 @@ static int processCFI(const u8 *start, const u8 *end,
> unsigned long targetLoc,
>                         break;
>                 }
>                 dbug_unwind(1, "targetLoc=%lx state->loc=%lx\n", targetLoc,
> state->loc);
> -               if (ptr.p8 > end)
> +               if (ptr.p8 >= end)
>                         result = 0;
>                 if (result && targetLoc != 0 && targetLoc < state->loc)
>                         return 1;

I am not sure that is correct. In principle ptr being equal to end should be fine (see also the last return condition in ProcessCFI). It means we read everything, the last instruction and arguments (and nothing more).

If you hit this case then processCFI () will fail, which will make unwind_frame () also fail. So I am somewhat surprised this seems to work for you. Do you have debug output for before/after this patch?

Comment 8 Negreanu Adrian 2012-05-02 14:29:53 UTC

00000040 00000014 00000000 CIE
  Version:               1
  Augmentation:          "zR"
  Code alignment factor: 1
  Data alignment factor: -4
  Return address column: 8
  Augmentation data:     1b

  DW_CFA_def_cfa: r4 (esp) ofs 4
  DW_CFA_offset: r8 (eip) at cfa-4
  DW_CFA_nop
  DW_CFA_nop

Comment 9 Negreanu Adrian 2012-05-02 14:32:43 UTC

(gdb) disassemble 0x8074810
Dump of assembler code for function chvt_reinit:
   0x08074810 <+0>:     push   %ebp
   0x08074811 <+1>:     mov    %esp,%ebp
   0x08074813 <+3>:     lea    -0x8(%esp),%esp
   0x08074817 <+7>:     mov    %ebx,(%esp)
   0x0807481a <+10>:    mov    %esi,0x4(%esp)
   0x0807481e <+14>:    call   0x8049fc5 <__x86.get_pc_thunk.bx>
   0x08074823 <+19>:    add    $0x1a559,%ebx
   0x08074829 <+25>:    call   0x8049ac0 <getpid@plt>
   0x0807482e <+30>:    mov    0x1e0(%ebx),%edx
   0x08074834 <+36>:    mov    0x14c(%ebx),%esi
   0x0807483a <+42>:    mov    %eax,0x14(%edx)
   0x0807483d <+45>:    mov    %eax,0x4(%esi)
   0x08074840 <+48>:    call   0x8049a70 <geteuid@plt>
   0x08074845 <+53>:    mov    %eax,0xc(%esi)
   0x08074848 <+56>:    call   0x8049bf0 <getpgrp@plt>
   0x0807484d <+61>:    mov    %eax,0x8(%esi)
   0x08074850 <+64>:    call   0x8049d80 <getppid@plt>
   0x08074855 <+69>:    mov    %eax,0x10(%esi)
   0x08074858 <+72>:    mov    (%esp),%ebx
   0x0807485b <+75>:    mov    0x4(%esp),%esi
   0x0807485f <+79>:    mov    %ebp,%esp
   0x08074861 <+81>:    pop    %ebp
   0x08074862 <+82>:    ret    
End of assembler dump.

Comment 10 Negreanu Adrian 2012-05-02 14:44:02 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > The patch below made it work for me:
> > 
> > 
> > diff --git a/runtime/unwind.c b/runtime/unwind.c
> > index e440177..3ac7f8d 100644
> > --- a/runtime/unwind.c
> > +++ b/runtime/unwind.c
> > @@ -610,7 +610,7 @@ static int processCFI(const u8 *start, const u8 *end,
> > unsigned long targetLoc,
> >                         break;
> >                 }
> >                 dbug_unwind(1, "targetLoc=%lx state->loc=%lx\n", targetLoc,
> > state->loc);
> > -               if (ptr.p8 > end)
> > +               if (ptr.p8 >= end)
> >                         result = 0;
> >                 if (result && targetLoc != 0 && targetLoc < state->loc)
> >                         return 1;
> 
> I am not sure that is correct. In principle ptr being equal to end should be
> fine (see also the last return condition in ProcessCFI). It means we read
> everything, the last instruction and arguments (and nothing more).
> 
> If you hit this case then processCFI () will fail, which will make unwind_frame
> () also fail. So I am somewhat surprised this seems to work for you. Do you
> have debug output for before/after this patch?


You're right.
I used the wrong binary. Sorry for that !
The change that was included was:

                 if (result && targetLoc != 0 && targetLoc < state->loc)
-                         return 1;
+                         //return 1;

Comment 11 Mark Wielaard 2012-05-02 14:49:11 UTC

And the FDE as discussed on irc:

<groleo>  8302 00003dd4 00000024 00003d98 FDE cie=00000040 pc=08074810..08074863
<groleo>  8303   DW_CFA_advance_loc: 1 to 08074811
<groleo>  8304   DW_CFA_def_cfa_offset: 8
<groleo>  8305   DW_CFA_advance_loc: 2 to 08074813
<groleo>  8306   DW_CFA_offset: r5 (ebp) at cfa-8
<groleo>  8307   DW_CFA_def_cfa_register: r5 (ebp)
<groleo>  8308   DW_CFA_advance_loc: 22 to 08074829
<groleo>  8309   DW_CFA_offset: r6 (esi) at cfa-12

Comment 12 Mark Wielaard 2012-05-02 15:34:20 UTC

I am wondering whether we are using the data_alignment correctly.

Currently we are adjusting the offset in unwind_frame. Assuming state->dataAlign always has to be used. But the DWARF spec says that some DW_CFA operations (DW_CFA_def_cfa and DW_CFA_def_cfa_offset) take a non-factored offset. So we should factor the offset in processCFI() instead only when needed and not adjust in unwind_frame.

Comment 13 Mark Wielaard 2012-05-02 15:37:30 UTC

Created attachment 6385 [details]
dataAlign patch (bad)

Bad patch implementing the above idea. It makes lots of exelib.exp tests fail, so it is not right.

Comment 14 Mark Wielaard 2012-05-02 19:55:56 UTC

(In reply to comment #13)
> Created attachment 6385 [details]
> dataAlign patch (bad)
> 
> Bad patch implementing the above idea. It makes lots of exelib.exp tests fail,
> so it is not right.

Actually, not such a bad patch after all. It was something completely unrelated that made exelib.exp fail (fixed by commit fb6cf5 Fix PR13992 cont. Do iterate_over_modules, just not over libraries.).

The patch is still somewhat ugly (will need some extra checks to make sure offsets are actually read correctly before applying adjustment), but it actually seems to produce the correct results on x86_64. Could you try it out for your testcase?

Comment 15 Mark Wielaard 2012-05-03 09:55:17 UTC

I did try the patch out on a i686 setup and it seems to work OK there. But I don't have a reproducer yet for the original issue.

Comment 16 Negreanu Adrian 2012-05-04 13:10:59 UTC

I upgraded from gcc 4.4 to gcc 4.6 and it works as expected:
 0x8075411 : chvt_reinit+0x1/0x60 [/system/bin/mksh]
 0x8076f3b : mksh_init+0x1b/0xa80 [/system/bin/mksh]
 0x8049f16 : main+0x36/0x70 [/system/bin/mksh]
 0xb7724eae
sh: DONE

Comment 17 Negreanu Adrian 2012-05-04 13:51:38 UTC

Created attachment 6390 [details]
readelf --debug-dump=frames  gcc-4.6-target_executable

Comment 18 Mark Wielaard 2012-05-06 13:09:17 UTC

It is strange that the same binary compiled with gcc-4.6 does produce a good backtrace, while one compiled with gcc-.4.4 doesn't In both cases the FDE associated with the pc probe point points to a similar CIE.

GCC 4.4 binary:

00003dd4 00000024 00003d98 FDE cie=00000040 pc=08074810..08074863
  DW_CFA_advance_loc: 1 to 08074811
[...]

GCC 4.6 binary:

00003d78 0000001c 00003d3c FDE cie=00000040 pc=08075410..08075463
  DW_CFA_advance_loc: 4 to 08075414
[...]

In both cases the CIE looks like:

00000040 00000014 00000000 CIE
  Version:               1
  Augmentation:          "zR"
  Code alignment factor: 1
  Data alignment factor: -4
  Return address column: 8
  Augmentation data:     1b

  DW_CFA_def_cfa: r4 (esp) ofs 4
  DW_CFA_offset: r8 (eip) at cfa-4
  DW_CFA_nop
  DW_CFA_nop

So the result should be the same since the first operand of the FDE is an DW_CFA_advance_loc that gets you past the probe point pc address (target loc).

Comment 19 Mark Wielaard 2012-05-06 19:38:42 UTC

I finally understand what is happening, though I still don't understand how it can happen.

The real problem is that we enter the unwinder with PC 8074811. While the probe point actually is 8074810. This is normally caused because when a breakpoint is taken the x86 increases the PC by one before invoking the interrupt handler. But in the uprobe handler we already take care of this:

  // Make it look like the IP is set as it would in the actual user
  // task when calling real probe handler. Reset IP regs on return, so
  // we don't confuse uprobes. PR10458
  s.op->newline() << "{";
  s.op->indent(1);
  s.op->newline() << "unsigned long uprobes_ip = REG_IP(c->uregs);";
  s.op->newline() << "SET_REG_IP(regs, inst->vaddr);";
  s.op->newline() << "(*sups->probe->ph) (c);";
  s.op->newline() << "SET_REG_IP(regs, uprobes_ip);";
  s.op->newline(-1) << "}";

In the case of GCC-4.4 the FDE has its first DW_CFA_advance_loc 1. So we think we need to process the instructions between 8074810 and 8074811. While with GCC-4.6 the FDE has its first DW_CFA_advance_loc 4, so the one-off in the PC address doesn't matter and we don't process extra FDE instructions.

The question is why doesn't the above trick of adjusting the IP to compensate for the breakpoint PC increase work?

To check whether this is the real issue, you can try this patch:

diff --git a/runtime/stack-dwarf.c b/runtime/stack-dwarf.c
index 9c55997..fbea35c 100644
--- a/runtime/stack-dwarf.c
+++ b/runtime/stack-dwarf.c
@@ -69,6 +69,7 @@ static void __stp_dwarf_stack_user_print(struct pt_regs *regs, int verbose,
 {
        struct unwind_frame_info *info = &uwcontext->info;
        arch_unw_init_frame_info(info, regs, ! uregs_valid);
+       info->call_frame = 1; /* XXX Always assume PC is off by one. */
 
        while (levels) {
                int ret = unwind(uwcontext, 1);

Comment 20 Frank Ch. Eigler 2012-05-06 21:15:41 UTC

(In reply to comment #19)
> I finally understand what is happening, though I still don't understand how it
> can happen.
> 
> The real problem is that we enter the unwinder with PC 8074811. While the probe
> point actually is 8074810.

The code you quote refers to the struct uprobe->vaddr as the intended original address of the breakpoint.  Perhaps that address is being munged.

Have we seen yet a version of the run with -DDEBUG_UPROBES and stap -p3 output?  Which uprobes/utrace version was being used?  (stap-report?)

Comment 21 Frank Ch. Eigler 2012-05-06 21:30:56 UTC

By the way, if inst->vaddr is incorrect for some reason, there is always sups->address to compare.

Comment 22 Mark Wielaard 2012-05-07 08:42:40 UTC

(In reply to comment #20)
> The code you quote refers to the struct uprobe->vaddr as the intended original
> address of the breakpoint.  Perhaps that address is being munged.

I guess that could be happening, but how?

This is done this way to have an architecture neutral way of setting REG_IP to the actual probe address. But if it is unreliable then we might just have to make this architecture specific and have some ADJUST_REG_IP_BREAKPOINT macro that does the right thing per architecture. This might be needed anyway for the inode based uprobe handler which currently has:

  // XXX: Can't set SET_REG_IP; we don't actually know the relocated address.

> Have we seen yet a version of the run with -DDEBUG_UPROBES and stap -p3 output?
>  Which uprobes/utrace version was being used?  (stap-report?)

No, that would be useful.

BTW. I just checked in some cleanup of the DWARF unwinder code:

commit 44c8a286464a6196bbf44fdeaea9ccee23fa4b8c
Author: Mark Wielaard <mjw@redhat.com>
Date:   Mon May 7 10:20:21 2012 +0200

    Cleanups and fixes for DWARF unwinder.
    
    Add explict comments and set_*_rule functions for different states.
    Do data alignment in processCFI, not afterwards in unwind_frame.
    Remove unnecessary UNW_DEFAULT_RA. Make unwind_item state union
    explicit about (unsigned) reg versus (signed) offset. Same for
    unwind_reg_state cfa.

It does fix some corner cases related to calculated offset rules, but that isn't the issue here. But it would be nice if future tests could be done against a tree that has this commit in, to make the results a little easier to interpret.

Comment 23 Mark Wielaard 2012-05-07 13:44:33 UTC

(In reply to comment #22)
> I guess that could be happening, but how?
> 
> This is done this way to have an architecture neutral way of setting REG_IP to
> the actual probe address. But if it is unreliable then we might just have to
> make this architecture specific and have some ADJUST_REG_IP_BREAKPOINT macro
> that does the right thing per architecture. This might be needed anyway for the
> inode based uprobe handler which currently has:
> 
>   // XXX: Can't set SET_REG_IP; we don't actually know the relocated address.

This is indeed "it"!
<groleo> the latest k3.0 and the uprobes patch is the one here http://sourceware.org/ml/systemtap/2011-q3/msg00157.html

Comment 24 Frank Ch. Eigler 2012-07-19 14:53:23 UTC

This inode-uprobes shortcoming is biting other users too.  Perhaps we can work around it in stap land by using the mmap callbacks that check build-ids to also go and calculate run-time virtual addresses of the inode-uprobes, so they can be filled in around probe execution time.

Comment 25 Frank Ch. Eigler 2012-07-26 22:08:35 UTC

Srikar advises that the uprobes_get_bkpt_addr() function may be just the
ticket to compute the real probe virtual address at run time.

Comment 26 David Smith 2012-08-14 21:55:04 UTC

Created attachment 6578 [details]
patch that sets IP before calling probe handler

(In reply to comment #25)
> Srikar advises that the uprobes_get_bkpt_addr() function may be just the
> ticket to compute the real probe virtual address at run time.

It looks like uprobes_get_bkpt_addr() has been renamed uprobe_get_swbp_addr().

Here's a patch, that I've done a quick test on, that seems to work.  Anyone got any ideas on the best way to test this?

Comment 27 David Smith 2012-08-14 22:10:16 UTC

(In reply to comment #26)
> Created attachment 6578 [details]
> patch that sets IP before calling probe handler
> 
> (In reply to comment #25)
> > Srikar advises that the uprobes_get_bkpt_addr() function may be just the
> > ticket to compute the real probe virtual address at run time.
> 
> It looks like uprobes_get_bkpt_addr() has been renamed uprobe_get_swbp_addr().
> 
> Here's a patch, that I've done a quick test on, that seems to work.  Anyone got
> any ideas on the best way to test this?

When exelib.exp is run, I see no difference with or without the patch.

====
Running ../../src/testsuite/systemtap.exelib/exelib.exp ...
FAIL: uprobeslibgcc-O3-m64-prelink-debug prelink ./libuprobeslibgcc-O3-m64-prelink-debug.so
FAIL: uprobeslibgcc-O3-m64-prelink-sep-debug prelink ./libuprobeslibgcc-O3-m64-prelink-sep-debug.so
FAIL: uprobeslibgcc-O3-m64-prelink-sep-debug-after prelink ./libuprobeslibgcc-O3-m64-prelink-sep-debug-after.so

		=== systemtap Summary ===

# of expected passes		185
# of unexpected failures	3
====

The 3 failures are all prelink failures, of the form:

====
prelink failed: /usr/sbin/prelink: Could not set security context for ./libuprobeslibgcc-O3-m64-prelink-sep-debug-after.so: Operation not supported
====

Comment 28 Mark Wielaard 2012-08-14 22:20:13 UTC

(In reply to comment #27) 
> The 3 failures are all prelink failures, of the form:
> 
> ====
> prelink failed: /usr/sbin/prelink: Could not set security context for
> ./libuprobeslibgcc-O3-m64-prelink-sep-debug-after.so: Operation not supported
> ====

Is that on NFS? I believe that case has a bug (either kernel/selinux or prelink).

You might want to try testsuite/systemtap.context/uprobe_uaddr.exp which should test this case more precisely.

Comment 29 Mark Wielaard 2012-08-14 22:28:21 UTC

Another testcase to try might be uprobe_stmt_num.exp

Comment 30 David Smith 2012-08-15 14:08:57 UTC

(In reply to comment #28)
> (In reply to comment #27) 
> > The 3 failures are all prelink failures, of the form:
> > 
> > ====
> > prelink failed: /usr/sbin/prelink: Could not set security context for
> > ./libuprobeslibgcc-O3-m64-prelink-sep-debug-after.so: Operation not supported
> > ====
> 
> Is that on NFS? I believe that case has a bug (either kernel/selinux or
> prelink).

Yes, this was on NFS.

> You might want to try testsuite/systemtap.context/uprobe_uaddr.exp which should
> test this case more precisely.

I get the same results with and without the patch.  Looking at the test, it uses uretprobes, which of course don't work with inode uprobes.

====
Running ../../src/testsuite/systemtap.context/uprobe_uaddr.exp ...
FAIL: uprobe_uaddr (0)
FAIL: uprobe_uaddr match line 0 main+* 
FAIL: uprobe_uaddr match line 1 func+* 
FAIL: uprobe_uaddr match line 2 func2+* 
FAIL: uprobe_uaddr match line 3 func+* 
FAIL: uprobe_uaddr match line 4 main+* 

		=== systemtap Summary ===

# of expected passes		1
# of unexpected failures	6
====

Comment 31 David Smith 2012-08-15 14:14:22 UTC

(In reply to comment #29)
> Another testcase to try might be uprobe_stmt_num.exp

Without the patch, I get:

====
Running ../../src/testsuite/systemtap.context/uprobe_stmt_num.exp ...           
FAIL: uprobe_stmt_num-run-statements                                            
FAIL: uprobe_stmt_num-run-statements-nolines                                    
                                                                                
        === systemtap Summary ===                                               
                                                                                
# of expected passes          2                                                 
# of unexpected failures      2                                                 
# of known failures             1                                               
====

With the patch, I get:

====
Running ../../src/testsuite/systemtap.context/uprobe_stmt_num.exp ...

		=== systemtap Summary ===

# of expected passes		4
# of known failures		1
====

Comment 32 Mark Wielaard 2012-08-15 14:25:42 UTC

(In reply to comment #30)
> > Is that on NFS? I believe that case has a bug (either kernel/selinux or
> > prelink).
> 
> Yes, this was on NFS.

That is https://bugzilla.redhat.com/show_bug.cgi?id=625544
"Cannot restore selinux context on nfs mounted filesystem"

(In reply to comment #30)
> I get the same results with and without the patch.  Looking at the test, it
> uses uretprobes, which of course don't work with inode uprobes.

Boo!

(In reply to comment #31)
> With the patch, I get:
> 
> Running ../../src/testsuite/systemtap.context/uprobe_stmt_num.exp ...
> 
>         === systemtap Summary ===
> 
> # of expected passes        4
> # of known failures        1

Hurray!

Comment 33 David Smith 2012-08-15 15:12:23 UTC

Patch checked in as commit 39e63ba.