Bug 12121 - Apparent non-determinism in DWARF symbol resolution
Summary: Apparent non-determinism in DWARF symbol resolution
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: translator (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Josh Stone
URL:
Keywords:
: 12120 (view as bug list)
Depends on:
Blocks:
 
Reported: 2010-10-14 03:37 UTC by Andrew Sutherland
Modified: 2010-10-21 18:14 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
full output from that stap run (2.03 KB, text/plain)
2010-10-14 03:37 UTC, Andrew Sutherland
Details
the systemtap script passed to stap (post-transformation by my preprocessing logic) (12.14 KB, text/plain)
2010-10-14 03:38 UTC, Andrew Sutherland
Details
good run's stap -v -v output (note: now with latencytap included) (4.59 KB, text/plain)
2010-10-14 12:22 UTC, Andrew Sutherland
Details
bad run's stap -v -v output (note: now with latencytap included) (2.13 KB, text/plain)
2010-10-14 12:22 UTC, Andrew Sutherland
Details
systemtap script from runs (note: now with latencytap included) (15.97 KB, text/plain)
2010-10-14 12:24 UTC, Andrew Sutherland
Details
first fix attempt (1.65 KB, patch)
2010-10-19 01:54 UTC, Josh Stone
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Andrew Sutherland 2010-10-14 03:37:19 UTC
Created attachment 5057 [details]
full output from that stap run

As initially mentioned on my blog[1] post without useful substantiation, sometimes my systemtap scripts will fail to translate.

1: http://www.visophyte.org/blog/?p=586

Here is a link to a bzipped version of the mammoth 280M libxul.so file (only 90M compressed!):
http://clicky.visophyte.org/scratch/libxul.so.bz2

Perhaps important version info:
$ rpmquery --info systemtap
Name        : systemtap                    Relocations: (not relocatable)
Version     : 1.3                               Vendor: Fedora Project
Release     : 2.fc13                        Build Date: Tue 24 Aug 2010 11:45:23 AM PDT
Install Date: Wed 15 Sep 2010 09:47:01 PM PDT      Build Host: x86-18.phx2.fedoraproject.org
Group       : Development/System            Source RPM: systemtap-1.3-2.fc13.src.rpm
Size        : 7917275                          License: GPLv2+
Signature   : RSA/SHA256, Wed 25 Aug 2010 03:07:38 PM PDT, Key ID 7edc6ad6e8e40fde

$ gcc --version
gcc (GCC) 4.4.4 20100630 (Red Hat 4.4.4-10)

$ rpmquery --info gcc
Name        : gcc                          Relocations: (not relocatable)
Version     : 4.4.4                             Vendor: Fedora Project
Release     : 10.fc13                       Build Date: Wed 30 Jun 2010 02:54:10 AM PDT
Install Date: Mon 19 Jul 2010 03:20:53 PM PDT      Build Host: xb-01.phx2.fedoraproject.org
Group       : Development/Languages         Source RPM: gcc-4.4.4-10.fc13.src.rpm
Size        : 19365082                         License: GPLv3+ and GPLv3+ with exceptions and GPLv2+ with exceptions
Signature   : RSA/SHA256, Wed 30 Jun 2010 03:23:31 AM PDT, Key ID 7edc6ad6e8e40fde


Here is an example of the output with just the warnings/errors bits.  I will attach a full output log including the earlier parts in case they prove interesting:

WARNING: eliding unused function 'jsobj_class_name': identifier 'jsobj_class_name' at /home/visbrero/rev_control/hg/comm-central/obj-thunderbird-debug/mozperfish.stp:1143:10
 source: function jsobj_class_name(jsobj) {
                  ^
WARNING: eliding unused function 'jsval_to_string': identifier 'jsval_to_string' at :1178:10
 source: function jsval_to_string(jsval) {
                  ^
semantic error: unable to find member 'regs' for JSContext (alternatives: Error: JSContext isn't a struct/class/union): operator '->' at :1095:36
        source:   regs = @cast(cx, "JSContext", @2)->regs;
                                                   ^
semantic error: unable to find member 'mLengthAndFlags' for JSString (alternatives: Error: JSString isn't a struct/class/union): operator '->' at :1155:45
        source:   lenAndFlags = @cast(jsstr, "JSString", @2)->mLengthAndFlags;
                                                            ^
semantic error: unable to find member 'mChars' for JSString (alternatives: Error: JSString isn't a struct/class/union): operator '->' at :1174:36
        source:   cp = @cast(jsstr, "JSString", @2)->mChars;
                                                   ^
semantic error: unable to find member 'fslots' for JSObject (alternatives: Error: JSObject isn't a struct/class/union): operator '->' at :1243:36
        source:     @cast(jsfunobj, "JSObject", @2)->fslots[0]->data->asPtr);
                                                   ^
Pass 2: analyzed script: 41 probe(s), 84 function(s), 2 embed(s), 20 global(s) using 577944virt/369712res/3960shr kb, in 5880usr/390sys/6295real ms.
Pass 2: analysis failed.  Try again with another '--vp 01' option.
Running rm -rf /tmp/staps7pWmL
Comment 1 Andrew Sutherland 2010-10-14 03:38:51 UTC
Created attachment 5058 [details]
the systemtap script passed to stap (post-transformation by my preprocessing logic)
Comment 2 Andrew Sutherland 2010-10-14 03:48:23 UTC
Oh, and if you would like to reproduce using your own build, Thunderbird build instructions can be found here:
https://developer.mozilla.org/en/Simple_Thunderbird_build

I am building with the following .mozconfig file in use:
ac_add_options --enable-application=mail

ac_add_options --enable-debug
ac_add_options --enable-debugger-info-modules=yes
ac_add_options --enable-extensions=default,inspector
ac_add_options --disable-optimize

ac_add_options --enable-chrome-format=symlink

# BAD BAD BAD, but I don't like crashes :(
ac_add_options --disable-accessibility

mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/obj-thunderbird-debug
mk_add_options MOZ_MAKE_FLAGS="-s -j6"


If you want to recreate the full experience:
hg clone http://hg.mozilla.org/users/bugmail_asutherland.org/tb-test-help/

The log file I previously attached already tells you what the actual stap run arguments were.  Unfortunately at least the -x part is somewhat hard to create because my wrapper script does its own variation on '-c' by forking a thread.  (I think staprun was destroying my environment so xpcshell failed to run and that was my workaround... although I think there were other pipe benefits to that strategy.)

Once a build completes, I...
a) change to an output directory where I can run my unit test
cd obj-thunderbird-debug/mail/base/test

b) run my unit test with my test framework thing:
make SOLO_FILE=test_viewWrapper_virtualFolder.js EXTRA_TEST_ARGS="--debugger /home/visbrero/rev_control/hg/tb-test-help/systemtap/chewchewwoowoo.py --debugger-args '/home/visbrero/rev_control/hg/tb-test-help/systemtap/mozperfish/mozperfish.stp --'" check-one
Comment 3 Andrew Sutherland 2010-10-14 03:53:14 UTC
*** Bug 12120 has been marked as a duplicate of this bug. ***
Comment 4 Frank Ch. Eigler 2010-10-14 12:12:27 UTC
Andrew, do you happen to have corresponding output from another run that does resolve properly?
Comment 5 Andrew Sutherland 2010-10-14 12:22:14 UTC
Created attachment 5060 [details]
good run's stap -v -v output (note: now with latencytap included)
Comment 6 Andrew Sutherland 2010-10-14 12:22:51 UTC
Created attachment 5061 [details]
bad run's stap -v -v output (note: now with latencytap included)
Comment 7 Andrew Sutherland 2010-10-14 12:24:23 UTC
Created attachment 5062 [details]
systemtap script from runs (note: now with latencytap included)

Sorry if the latencytap kernel probes are cluttering things... I can revert if needed...
Comment 8 Josh Stone 2010-10-14 18:12:59 UTC
I managed to reproduce the non-determinism with a much smaller test:

$ cat xulcast.stp 
probe begin {
  println(@cast(0, "JSContext", @1)->regs)
  println(@cast(0, "JSObject", @1)->fslots)
}

$ stap -vvv -p2 xulcast.stp $PWD/libxul.so

The success of this is extremely variable.  The only consistency is that the two @casts always succeed or fail together, never one without the other.

The difference I found with the extra verbosity is in these lines:

literal_stmt_for_pointer: finding value for JSContext(SRCFILE)
literal_stmt_for_pointer: finding value for JSObject(SRCFILE)

The SRCFILE varies from run to run, with only some apparently having good type information.  I still don't know why our choice of SRCFILE would vary, as I would expect the iteration order to be consistent, but this seems to be the key.
Comment 9 Josh Stone 2010-10-14 18:49:12 UTC
It appears that this libxul.so may have some corruption:

$ dwarflint libxul.so
error: .debug_abbrev: abbr. attribute 0x4657e: invalid name 0x2107.

With bad abbrevs, it may be no surprise that stap/elfutils are behaving strangely.  I'll try to grab some compiler experts to help troubleshoot this further...
Comment 10 Jan Kratochvil 2010-10-14 19:23:22 UTC
(In reply to comment #9)
> $ dwarflint libxul.so
> error: .debug_abbrev: abbr. attribute 0x4657e: invalid name 0x2107.

0x2107 is DW_AT_GNU_vector, I guess dwarflint should be just updated.
Comment 11 Josh Stone 2010-10-14 22:19:30 UTC
Wow, we actually *do* have indeterminism.  We have a CU cache which is a hashmap keyed on CU->addr, and since that addr depends on where the inspected module gets mmap'ed, the hashmap iteration order will be different from run to run.  But while this is unfortunate for reproducibility, it's not really the root cause.

What's tripping us up is a certain pattern of typedef, e.g.:

  typedef struct JSObject JSObject;

We're currently keeping a map of all the types just by name.  So for a @cast to "JSObject", we end up picking whichever of the typedef or struct comes first in the debuginfo.  That's kind of OK, because we already dig into typedefs to get their underlying struct.  We run into trouble though when some CUs have just a declaration of that struct, not a definition.  So then we go looking for the definition, but only by the name "JSObject", and we may end up back on the typedef.  Hilarity ensues...
Comment 12 Josh Stone 2010-10-19 01:54:00 UTC
Created attachment 5067 [details]
first fix attempt

This seems to resolve it completely for me.  I still need to run our regression tests before I commit it, and figure out a smaller testcase than that monster libxul, but please let me know how it works for you.
Comment 13 Josh Stone 2010-10-19 22:15:49 UTC
I've pushed commit a44a7cb, and I believe this is now fixed.  Please report if you find otherwise...
Comment 14 Andrew Sutherland 2010-10-21 18:14:35 UTC
Apologies for the feedback delay; had to put out a bunch of tinderbox fires.

Yes, things are much happier with the landed changes.  Thank you ever so much!