Bug 14431 - char * always being printed as, possibly INVALID/NULL "<unknown>", string without giving actual address
Summary: char * always being printed as, possibly INVALID/NULL "<unknown>", string wit...
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: translator (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-03 10:30 UTC by Mark Wielaard
Modified: 2015-10-21 18:50 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Wielaard 2012-08-03 10:30:22 UTC
If you have a char or unsigned char pointer and pretty print it, but it doesn't actually point to a valid string, then the result is not very useful. char * and unsigned char * are often used as arbitrary pointers/markers in code, not just for strings.

Contrived example:

#include <malloc.h>
#include <stddef.h>

struct foobar
{
  char *foo;
  size_t foo_size;
  unsigned char *bar;
  size_t bar_size;
};

static struct foobar *fb;

unsigned char
use ()
{
  if (fb->foo) *fb->foo = 'a';
  return fb->bar_size > 0 ? *fb->bar : 0;
}

int
main (int argc, char **argv)
{
  struct foobar stack;
  fb = &stack;
  fb->foo_size = 42;
  fb->bar_size = 47;
  fb->foo = (char *) calloc (fb->foo_size, 1);
  fb->bar = (unsigned char *) calloc (fb->bar_size, 1);
  fb->bar_size = use (&fb);
  fb->foo = NULL;
  fb->bar = fb->foo + fb->foo_size;
  fb->bar_size = 0;
  return use (&fb);
}

$ gcc -g -o foobar foobar.c 
$ stap -e 'probe process.function("use") { log($fb$) }' -c ./foobar
{.foo="", .foo_size=42, .bar="", .bar_size=47}
{.foo="<unknown>", .foo_size=42, .bar="<unknown>", .bar_size=0}

It would have been convenient if those last two foo/bar pointers were represented by something different than just "<unknown>".

Maybe "<unknown@0x0>" and "<unknown@0x42>"?
Comment 1 Mark Wielaard 2012-08-04 22:57:54 UTC
What about we always also add the address of a char array?

diff --git a/tapsets.cxx b/tapsets.cxx
index 1068a42..edeb30c 100644
--- a/tapsets.cxx
+++ b/tapsets.cxx
@@ -2893,18 +2893,20 @@ dwarf_pretty_print::print_chars (Dwarf_Die* start_type, target_symbol* e,
   const char *name = dwarf_diename (&type);
   if (name && (name == string("char") || name == string("unsigned char")))
     {
-      if (push_deref (pf, "\"%s\"", e))
+      if (push_deref (pf, "\"%s\"@%p", e))
         {
           // steal the last arg for a string access
           assert (!pf->args.empty());
+          expression* expr = pf->args.back();
           functioncall* fcall = new functioncall;
           fcall->tok = e->tok;
           fcall->function = userspace_p ? "user_string2" : "kernel_string2";
-          fcall->args.push_back (pf->args.back());
+          fcall->args.push_back (expr);
           expression *err_msg = new literal_string ("<unknown>");
           err_msg->tok = e->tok;
           fcall->args.push_back (err_msg);
           pf->args.back() = fcall;
+         pf->args.push_back (expr);
         }
       return true;
     }

For our contrived example above that would give:

{.foo=""@0xcb4010, .foo_size=42, .bar=""@0xcb4050, .bar_size=47}
{.foo="<unknown>"@0x0, .foo_size=42, .bar="<unknown>"@0x2a, .bar_size=0}

Does that look reasonable or silly?
Comment 2 Josh Stone 2012-08-05 20:29:44 UTC
(In reply to comment #1)
> What about we always also add the address of a char array?
[...]
> For our contrived example above that would give:
> 
> {.foo=""@0xcb4010, .foo_size=42, .bar=""@0xcb4050, .bar_size=47}
> {.foo="<unknown>"@0x0, .foo_size=42, .bar="<unknown>"@0x2a, .bar_size=0}
> 
> Does that look reasonable or silly?

It's tough to automagically do The Right Thing without knowing what the user is looking for.  And in some cases, especially with kernel pointers, the %p may overshadow the actual string, something like:

> {.foo=""@0xffffffffffcb4010, .foo_size=42, .bar=""@0xffffffffffcb4050, .bar_size=47}

Rather than trying both string value and address every time, maybe we should make new kernel/user_string variants that return either a quoted string *or* a pointer value, e.g. "\"foo\"" or "0x2a" depending on whether it's readable.  Maybe "?@0x2a" to indicate it was tried but couldn't be read.  Your examples would be:

> {.foo="", .foo_size=42, .bar="", .bar_size=47}
> {.foo=?@0x0, .foo_size=42, .bar=?@0x2a, .bar_size=0}

We can only try to be just so smart... for people who know they want something different, like all pointer addresses, they can always roll their own pretty-print.
Comment 3 Mark Wielaard 2012-08-06 15:32:38 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > What about we always also add the address of a char array?
> [...]
> > For our contrived example above that would give:
> > 
> > {.foo=""@0xcb4010, .foo_size=42, .bar=""@0xcb4050, .bar_size=47}
> > {.foo="<unknown>"@0x0, .foo_size=42, .bar="<unknown>"@0x2a, .bar_size=0}
> > 
> > Does that look reasonable or silly?
> 
> It's tough to automagically do The Right Thing without knowing what the user is
> looking for.

Yes, that is why we should just give all information instead of guessing what the user wants.

>  And in some cases, especially with kernel pointers, the %p may
> overshadow the actual string, something like:
> 
> > {.foo=""@0xffffffffffcb4010, .foo_size=42, .bar=""@0xffffffffffcb4050, .bar_size=47}

What do you mean by overshadow?
I think this example shows nicely why you always want the address, since the strings look the same (are empty - by accident?), but clearly they are not the same since they are at different addresses.

> Rather than trying both string value and address every time, maybe we should
> make new kernel/user_string variants that return either a quoted string *or* a
> pointer value, e.g. "\"foo\"" or "0x2a" depending on whether it's readable. 
> Maybe "?@0x2a" to indicate it was tried but couldn't be read.  Your examples
> would be:
> 
> > {.foo="", .foo_size=42, .bar="", .bar_size=47}
> > {.foo=?@0x0, .foo_size=42, .bar=?@0x2a, .bar_size=0}
> 
> We can only try to be just so smart...

Yes, that was what I was going for at first. But that seems "too smart".
We might think we are seeing valid strings because there is some zero character nearby, but that might not be at all what the user was going for (they just use char * as generic pointer, not as string pointer).

I do like the usage of plain ? instead of "<unknown>". Updated patch below.

> for people who know they want something
> different, like all pointer addresses, they can always roll their own
> pretty-print.

Is that really that simple? I wasn't able to do that easily. Except by printing the fields individually with my own identifiers.

Updated patch:

diff --git a/tapsets.cxx b/tapsets.cxx
index 1068a42..f3ffaa8 100644
--- a/tapsets.cxx
+++ b/tapsets.cxx
@@ -2893,18 +2893,20 @@ dwarf_pretty_print::print_chars (Dwarf_Die* start_type, target_symbol* e,
   const char *name = dwarf_diename (&type);
   if (name && (name == string("char") || name == string("unsigned char")))
     {
-      if (push_deref (pf, "\"%s\"", e))
+      if (push_deref (pf, "\"%s\"@%p", e))
         {
           // steal the last arg for a string access
           assert (!pf->args.empty());
+          expression* expr = pf->args.back();
           functioncall* fcall = new functioncall;
           fcall->tok = e->tok;
           fcall->function = userspace_p ? "user_string2" : "kernel_string2";
-          fcall->args.push_back (pf->args.back());
-          expression *err_msg = new literal_string ("<unknown>");
+          fcall->args.push_back (expr);
+          expression *err_msg = new literal_string ("?");
           err_msg->tok = e->tok;
           fcall->args.push_back (err_msg);
           pf->args.back() = fcall;
+          pf->args.push_back (expr);
         }
       return true;
     }
Comment 4 Josh Stone 2012-08-06 17:24:42 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > 
> > It's tough to automagically do The Right Thing without knowing what the user is
> > looking for.
> 
> Yes, that is why we should just give all information instead of guessing what
> the user wants.

But too much information will make it difficult to read.

> >  And in some cases, especially with kernel pointers, the %p may
> > overshadow the actual string, something like:
> > 
> > > {.foo=""@0xffffffffffcb4010, .foo_size=42, .bar=""@0xffffffffffcb4050, .bar_size=47}
> 
> What do you mean by overshadow?

I meant that since the value of kernel pointers on 64-bit will always take 16 characters, it clutters the output quite a bit.  That may be a matter of opinion, but this is supposed to be "pretty" printing.

> I think this example shows nicely why you always want the address, since the
> strings look the same (are empty - by accident?), but clearly they are not the
> same since they are at different addresses.

If you really are looking just for string values, then the memory addresses of those strings are irrelevant.

> I do like the usage of plain ? instead of "<unknown>".

Note the difference that I proposed it *without* quotes, so it's clear that the string is not really "?".

> > for people who know they want something
> > different, like all pointer addresses, they can always roll their own
> > pretty-print.
> 
> Is that really that simple? I wasn't able to do that easily. Except by printing
> the fields individually with my own identifiers.

Yes, you'd print fields yourself individually to get the formatting you really want.

We can't be all things to all people with a heuristic like pretty printing.  I tried to make it as simple-yet-useful as possible, to hopefully capture most cases.  Another example is chasing & expanding struct pointers, which we decided was too difficult to get "right" without cluttering the general case.

So here you have "possibly invalid char*", or worse "valid char* to a non-string" - I'm not sure we can fully address this without making the "char* pointing to a string" case worse.
Comment 5 Mark Wielaard 2012-08-06 18:59:22 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> > > 
> > > It's tough to automagically do The Right Thing without knowing what the user is
> > > looking for.
> > 
> > Yes, that is why we should just give all information instead of guessing what
> > the user wants.
> 
> But too much information will make it difficult to read.

Of course. But having the address there isn't too much, is it?

> I meant that since the value of kernel pointers on 64-bit will always take 16
> characters, it clutters the output quite a bit.  That may be a matter of
> opinion, but this is supposed to be "pretty" printing.

Aha. I think this is "pretty", and not cluttered. But can easily be convinced of any other formatting. So you have a different suggestion for adding the address?

> > I do like the usage of plain ? instead of "<unknown>".
> 
> Note the difference that I proposed it *without* quotes, so it's clear that the
> string is not really "?".

ah, yes, but that is somewhat more difficult since then we have to be
more fancy about the results of [kernel|user]_string2(). How about using
"<?>".

> > Is that really that simple? I wasn't able to do that easily. Except by printing
> > the fields individually with my own identifiers.
> 
> Yes, you'd print fields yourself individually to get the formatting you really
> want.

Which isn't really practical for larger/deeper structs.
And [unsigned] char * is kind of special since it is often not used as
string pointer, so warrants having the address around always.
Comment 6 Frank Ch. Eigler 2015-10-21 18:50:35 UTC
commit 9fcf867f09d0