Bug 18649 - int_arg() misbehaves on x86[_64] for 32-bit uprobe in binary having debuginfo
Summary: int_arg() misbehaves on x86[_64] for 32-bit uprobe in binary having debuginfo
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: tapsets (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-07-09 09:28 UTC by Martin Cermak
Modified: 2015-09-04 07:47 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
proposed patch (1.43 KB, patch)
2015-07-29 16:29 UTC, Martin Cermak
Details | Diff
updated patch (2.73 KB, patch)
2015-08-21 07:49 UTC, Martin Cermak
Details | Diff
updated patch (2.67 KB, patch)
2015-08-31 08:22 UTC, Martin Cermak
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Cermak 2015-07-09 09:28:56 UTC
The int_arg() function doesn't work correctly on i[36]86 and x86_64 when probing 32-bit userspace application having debuginfo compiled in. Let's have following program:

=======
int                                                                             
testfc(int arg)                                                                 
{                                                                               
    return arg;                                                                 
}                                                                               
                                                                                   
int                                                                             
main()                                                                          
{                                                                               
    testfc(32767);                                                              
    return 0;                                                                   
}
=======

and probe it using following stap command:

=======
stap -e 'probe process("a.out").function("testfc") {printf("%x\n", int_arg(1))}' -c ./a.out
=======

Following example demonstrates int_arg() returning nonsense value when 32-bit binary having debuginfo is being traced:

=======
 7.2 S x86_64 # cat test.c
int
testfc(int arg)
{
    return arg;
}

int
main()
{
    testfc(32767);
    return 0;
}


 7.2 S x86_64 # gcc test.c 
 7.2 S x86_64 # stap -e 'probe process("a.out").function("testfc") {printf("%x\n", int_arg(1))}' -c ./a.out
7fff
 7.2 S x86_64 # gcc -g test.c 
 7.2 S x86_64 # stap -e 'probe process("a.out").function("testfc") {printf("%x\n", int_arg(1))}' -c ./a.out
7fff
 7.2 S x86_64 # gcc -m32 test.c 
 7.2 S x86_64 # stap -e 'probe process("a.out").function("testfc") {printf("%x\n", int_arg(1))}' -c ./a.out
7fff
 7.2 S x86_64 # gcc -g -m32 test.c 
 7.2 S x86_64 # stap -e 'probe process("a.out").function("testfc") {printf("%x\n", int_arg(1))}' -c ./a.out
804840a
 7.2 S x86_64 # 
=======
Comment 1 Martin Cermak 2015-07-09 09:32:51 UTC
Other arches seem to work fine in this sense except that s390 has following limitation in tapset/s390/registers.stp:

=======
        if (%{ CONTEXT->user_mode_p %}) {                                       
                error("Cannot access function args in user context");           
                return 0;                                                       
        } 
======= 

Not sure why this limitation is there. After commenting it out, int_arg() seems to return correct values.
Comment 2 Josh Stone 2015-07-09 15:49:57 UTC
(In reply to Martin Cermak from comment #0)
>  7.2 S x86_64 # gcc -m32 test.c 
>  7.2 S x86_64 # stap -e 'probe process("a.out").function("testfc")
> {printf("%x\n", int_arg(1))}' -c ./a.out
> 7fff
>  7.2 S x86_64 # gcc -g -m32 test.c 
>  7.2 S x86_64 # stap -e 'probe process("a.out").function("testfc")
> {printf("%x\n", int_arg(1))}' -c ./a.out
> 804840a
>  7.2 S x86_64 # 

My guess is that debuginfo is allowing our prologue analysis, which means the probe will actually be placed a few instructions in.  Since the arguments are on the stack, and the prologue will have manipulated the stack, we're no longer looking at the right place for calling conventions.  This might fail on x86_64 too for arguments >= 7.

I don't have a suggested fix off hand.  A script writer can get around this by preferring the debuginfo access anyway, like @choose_defined($arg, int_arg(1)).
Comment 3 Martin Cermak 2015-07-10 20:45:36 UTC
Maybe it'd be useful to have an option to foce stap not to use debuginfo even if it's available. It'd be easier to use and less invasive than strip. I've seen more contraindications between present debuginfo and dwarfless probes today, e.g when testing longlong_arg() in 32-on-64 environment on x86_64. I see e.g. -P in the man page, but not resp. opposite switch. Dunno. Just saying.

This is just a sidenote, since it has nothing to do with actual fix for this bug.
Comment 4 Martin Cermak 2015-07-10 20:57:34 UTC
(In reply to Josh Stone from comment #2)
> A script writer can get around this by preferring the debuginfo access
> anyway, like @choose_defined($arg, int_arg(1)).

Sigh, I overlooked this.
Comment 5 Josh Stone 2015-07-10 21:47:26 UTC
(In reply to Martin Cermak from comment #3)
> Maybe it'd be useful to have an option to foce stap not to use debuginfo
> even if it's available. It'd be easier to use and less invasive than strip.
> [...] I see e.g. -P in the man page, but not resp. opposite switch.

I could see this as a tri-state long option, something like:

 --prologue-searching=always (same as -P)
 --prologue-searching=auto (current default, only userspace w/ has_valid_locs())
 --prologue-searching=never (completely disabled)

Ignoring debuginfo for "all" purposes could be a separate option, only tangentially related to prologues.  We'd still need some debuginfo use to leak through though -- e.g. we generate debuginfo objects to gather tracepoints.  Also @cast requires debuginfo, and generated @cast-with-headers is particularly useful to otherwise "dwarfless" probing.
Comment 6 David Smith 2015-07-13 14:38:27 UTC
(In reply to Martin Cermak from comment #3)
> Maybe it'd be useful to have an option to foce stap not to use debuginfo
> even if it's available. It'd be easier to use and less invasive than strip.
> I've seen more contraindications between present debuginfo and dwarfless
> probes today, e.g when testing longlong_arg() in 32-on-64 environment on
> x86_64. I see e.g. -P in the man page, but not resp. opposite switch. Dunno.
> Just saying.
> 
> This is just a sidenote, since it has nothing to do with actual fix for this
> bug.

There sort of is an option to ignore all debuginfo, by setting the SYSTEMTAP_DEBUGINFO environment variable to the empty string. This is what the nd_syscall.exp test case does, to make sure we're using dwarfless accesses for everything. Here's the section from the test case:

    # Override SYSTEMTAP_DEBUGINFO_PATH to ensure no debuginfo could be used    
    set env(SYSTEMTAP_DEBUGINFO_PATH) ""                                        

See stappaths.7 for more info on SYSTEMTAP_DEBUGINFO_PATH if you are interested.

I made this change to nd_syscall.exp after running into copy-and-paste errors when adding probes to the nd_syscall tapsets - I'd accidentally leave a '$foo' reference in the nd_syscall probe.

I know this works to ignore kernel debuginfo, I haven't tested to make sure it works to ignore user program debuginfo.
Comment 7 Josh Stone 2015-07-13 16:52:07 UTC
(In reply to David Smith from comment #6)
> There sort of is an option to ignore all debuginfo, by setting the
> SYSTEMTAP_DEBUGINFO environment variable to the empty string.

That only affects the search path for split debuginfo, as we have for binaries installed from rpms.  I don't think it will have any affect on those that still have their debuginfo built-in, like a test program we just built with "gcc -g".
Comment 8 Martin Cermak 2015-07-21 14:10:23 UTC
So, SYSTEMTAP_DEBUGINFO doesn't seem to work for built-in debuginfo. So I attepmted to "trace" how -P works and disabled prologue searching in the source:

=======
diff --git a/tapsets.cxx b/tapsets.cxx
index 2a50fbd..463dd09 100644
--- a/tapsets.cxx
+++ b/tapsets.cxx
@@ -1719,7 +1719,7 @@ query_addr(Dwarf_Addr addr, dwarf_query *q)
     {
       if (!dw.die_entrypc(fnscope, &addr))
         return;
-      if (dwarf_tag(fnscope) == DW_TAG_subprogram &&
+      if (false && dwarf_tag(fnscope) == DW_TAG_subprogram && // XXXYYY
           (q->sess.prologue_searching ||
            (q->has_process && !q->dw.has_valid_locs()))) // PR 6871 && PR 6941
         {
@@ -2210,7 +2210,7 @@ query_cu (Dwarf_Die * cudie, dwarf_query * q)
       if (rc != DWARF_CB_OK)
         q->query_done = true;
 
-      if (!q->filtered_functions.empty() &&
+      if (false && !q->filtered_functions.empty() && // XXXYYY
           !q->has_statement_str && // PR 2608
            (q->sess.prologue_searching ||
             (q->has_process && !q->dw.has_valid_locs()))) // PR 6871 && PR 6941
=======

This seems to help. So would it make sense if I'd try to add that tri-state long option as suggested in Comment #5, that would drive prologue searching at these two places?
Comment 9 Martin Cermak 2015-07-29 16:29:27 UTC
Created attachment 8459 [details]
proposed patch

Attached patch seems to work for me. It doesn't contain test coverage and documentation yet. Does it look good so far?
Comment 10 David Smith 2015-08-14 19:07:15 UTC
(In reply to Martin Cermak from comment #9)
> Created attachment 8459 [details]
> proposed patch
> 
> Attached patch seems to work for me. It doesn't contain test coverage and
> documentation yet. Does it look good so far?

I see one code omission - the changes to session.cxx don't look complete. You updated the regular constructor, but not the copy constructor for the systemtap_session class.

In general it seems odd to have both session->prologue_searching and session->prologue_searching_mode. I'd think you'd get rid of session->prologue_searching and make "-P" a synonym for "--prologue-searching=always". Also, you'll have to be careful with ENABLE_PROLOGUES.
Comment 11 Martin Cermak 2015-08-21 07:49:15 UTC
Created attachment 8539 [details]
updated patch

Tries to address the above comments and adds a stap.1 man entry.
Comment 12 David Smith 2015-08-21 13:49:56 UTC
(In reply to Martin Cermak from comment #11)
> Created attachment 8539 [details]
> updated patch
> 
> Tries to address the above comments and adds a stap.1 man entry.

The patch is looking good. At some point we'll need to add some tests for the new switch, but you can probably go ahead and check the patch in now.
Comment 13 Josh Stone 2015-08-21 17:42:23 UTC
ENABLE_PROLOGUES is not quite right.  The former true/false values should now be always/auto to keep the same behavior -- there was no never.

It appears you removed prologues from usage() entirely, instead of adding your new switch to the output.  Why?

The manpage explanation of "auto" is odd: "enable only if possible"?  As if "always" will instead do the impossible? :)  The real logic behind auto is a heuristic for when we think it will be useful, in userspace with low quality location info.  Not sure of a short blurb  -- maybe just "enabled by heuristic".

The staprun.h change doesn't belong at all.  I assume that was copying color_mode, but prologues are only a concern to the translator.


Have you confirmed that --prologue-searching=never solves your original problem?
Comment 14 Martin Cermak 2015-08-31 08:22:31 UTC
Created attachment 8565 [details]
updated patch

(In reply to Josh Stone from comment #13)
> ENABLE_PROLOGUES is not quite right.  The former true/false values should
> now be always/auto to keep the same behavior -- there was no never.

Ouch, you are right.  Fixed.

> It appears you removed prologues from usage() entirely, instead of adding
> your new switch to the output.  Why?

Fixed.

> The manpage explanation of "auto" is odd: "enable only if possible"?  As if
> "always" will instead do the impossible? :)  The real logic behind auto is a
> heuristic for when we think it will be useful, in userspace with low quality
> location info.  Not sure of a short blurb  -- maybe just "enabled by
> heuristic".

:) Yup, updated. 

> The staprun.h change doesn't belong at all.  I assume that was copying
> color_mode, but prologues are only a concern to the translator.

Yes, I've been following Jonathan's color_mode implementation.  When looking over the patch before attaching it, I felt this is somewhat weird, but unfortunately left it there.  Fixed.

> Have you confirmed that --prologue-searching=never solves your original
> problem?

Sure.  I did that before and re-tested now again with attached updated patch.
Comment 15 Martin Cermak 2015-09-01 07:36:47 UTC
Fixed in commit 8c1cf686937c8dd59baafbd1c97b1b76a5a9830d
Comment 16 Martin Cermak 2015-09-04 07:47:24 UTC
(In reply to David Smith from comment #12)
> At some point we'll need to add some tests for the new switch

Added in commit 30833b3210eba4e8752ca493d0b7780e6a50d2d8