Created attachment 5057 [details] full output from that stap run As initially mentioned on my blog[1] post without useful substantiation, sometimes my systemtap scripts will fail to translate. 1: http://www.visophyte.org/blog/?p=586 Here is a link to a bzipped version of the mammoth 280M libxul.so file (only 90M compressed!): http://clicky.visophyte.org/scratch/libxul.so.bz2 Perhaps important version info: $ rpmquery --info systemtap Name : systemtap Relocations: (not relocatable) Version : 1.3 Vendor: Fedora Project Release : 2.fc13 Build Date: Tue 24 Aug 2010 11:45:23 AM PDT Install Date: Wed 15 Sep 2010 09:47:01 PM PDT Build Host: x86-18.phx2.fedoraproject.org Group : Development/System Source RPM: systemtap-1.3-2.fc13.src.rpm Size : 7917275 License: GPLv2+ Signature : RSA/SHA256, Wed 25 Aug 2010 03:07:38 PM PDT, Key ID 7edc6ad6e8e40fde $ gcc --version gcc (GCC) 4.4.4 20100630 (Red Hat 4.4.4-10) $ rpmquery --info gcc Name : gcc Relocations: (not relocatable) Version : 4.4.4 Vendor: Fedora Project Release : 10.fc13 Build Date: Wed 30 Jun 2010 02:54:10 AM PDT Install Date: Mon 19 Jul 2010 03:20:53 PM PDT Build Host: xb-01.phx2.fedoraproject.org Group : Development/Languages Source RPM: gcc-4.4.4-10.fc13.src.rpm Size : 19365082 License: GPLv3+ and GPLv3+ with exceptions and GPLv2+ with exceptions Signature : RSA/SHA256, Wed 30 Jun 2010 03:23:31 AM PDT, Key ID 7edc6ad6e8e40fde Here is an example of the output with just the warnings/errors bits. I will attach a full output log including the earlier parts in case they prove interesting: WARNING: eliding unused function 'jsobj_class_name': identifier 'jsobj_class_name' at /home/visbrero/rev_control/hg/comm-central/obj-thunderbird-debug/mozperfish.stp:1143:10 source: function jsobj_class_name(jsobj) { ^ WARNING: eliding unused function 'jsval_to_string': identifier 'jsval_to_string' at :1178:10 source: function jsval_to_string(jsval) { ^ semantic error: unable to find member 'regs' for JSContext (alternatives: Error: JSContext isn't a struct/class/union): operator '->' at :1095:36 source: regs = @cast(cx, "JSContext", @2)->regs; ^ semantic error: unable to find member 'mLengthAndFlags' for JSString (alternatives: Error: JSString isn't a struct/class/union): operator '->' at :1155:45 source: lenAndFlags = @cast(jsstr, "JSString", @2)->mLengthAndFlags; ^ semantic error: unable to find member 'mChars' for JSString (alternatives: Error: JSString isn't a struct/class/union): operator '->' at :1174:36 source: cp = @cast(jsstr, "JSString", @2)->mChars; ^ semantic error: unable to find member 'fslots' for JSObject (alternatives: Error: JSObject isn't a struct/class/union): operator '->' at :1243:36 source: @cast(jsfunobj, "JSObject", @2)->fslots[0]->data->asPtr); ^ Pass 2: analyzed script: 41 probe(s), 84 function(s), 2 embed(s), 20 global(s) using 577944virt/369712res/3960shr kb, in 5880usr/390sys/6295real ms. Pass 2: analysis failed. Try again with another '--vp 01' option. Running rm -rf /tmp/staps7pWmL
Created attachment 5058 [details] the systemtap script passed to stap (post-transformation by my preprocessing logic)
Oh, and if you would like to reproduce using your own build, Thunderbird build instructions can be found here: https://developer.mozilla.org/en/Simple_Thunderbird_build I am building with the following .mozconfig file in use: ac_add_options --enable-application=mail ac_add_options --enable-debug ac_add_options --enable-debugger-info-modules=yes ac_add_options --enable-extensions=default,inspector ac_add_options --disable-optimize ac_add_options --enable-chrome-format=symlink # BAD BAD BAD, but I don't like crashes :( ac_add_options --disable-accessibility mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/obj-thunderbird-debug mk_add_options MOZ_MAKE_FLAGS="-s -j6" If you want to recreate the full experience: hg clone http://hg.mozilla.org/users/bugmail_asutherland.org/tb-test-help/ The log file I previously attached already tells you what the actual stap run arguments were. Unfortunately at least the -x part is somewhat hard to create because my wrapper script does its own variation on '-c' by forking a thread. (I think staprun was destroying my environment so xpcshell failed to run and that was my workaround... although I think there were other pipe benefits to that strategy.) Once a build completes, I... a) change to an output directory where I can run my unit test cd obj-thunderbird-debug/mail/base/test b) run my unit test with my test framework thing: make SOLO_FILE=test_viewWrapper_virtualFolder.js EXTRA_TEST_ARGS="--debugger /home/visbrero/rev_control/hg/tb-test-help/systemtap/chewchewwoowoo.py --debugger-args '/home/visbrero/rev_control/hg/tb-test-help/systemtap/mozperfish/mozperfish.stp --'" check-one
*** Bug 12120 has been marked as a duplicate of this bug. ***
Andrew, do you happen to have corresponding output from another run that does resolve properly?
Created attachment 5060 [details] good run's stap -v -v output (note: now with latencytap included)
Created attachment 5061 [details] bad run's stap -v -v output (note: now with latencytap included)
Created attachment 5062 [details] systemtap script from runs (note: now with latencytap included) Sorry if the latencytap kernel probes are cluttering things... I can revert if needed...
I managed to reproduce the non-determinism with a much smaller test: $ cat xulcast.stp probe begin { println(@cast(0, "JSContext", @1)->regs) println(@cast(0, "JSObject", @1)->fslots) } $ stap -vvv -p2 xulcast.stp $PWD/libxul.so The success of this is extremely variable. The only consistency is that the two @casts always succeed or fail together, never one without the other. The difference I found with the extra verbosity is in these lines: literal_stmt_for_pointer: finding value for JSContext(SRCFILE) literal_stmt_for_pointer: finding value for JSObject(SRCFILE) The SRCFILE varies from run to run, with only some apparently having good type information. I still don't know why our choice of SRCFILE would vary, as I would expect the iteration order to be consistent, but this seems to be the key.
It appears that this libxul.so may have some corruption: $ dwarflint libxul.so error: .debug_abbrev: abbr. attribute 0x4657e: invalid name 0x2107. With bad abbrevs, it may be no surprise that stap/elfutils are behaving strangely. I'll try to grab some compiler experts to help troubleshoot this further...
(In reply to comment #9) > $ dwarflint libxul.so > error: .debug_abbrev: abbr. attribute 0x4657e: invalid name 0x2107. 0x2107 is DW_AT_GNU_vector, I guess dwarflint should be just updated.
Wow, we actually *do* have indeterminism. We have a CU cache which is a hashmap keyed on CU->addr, and since that addr depends on where the inspected module gets mmap'ed, the hashmap iteration order will be different from run to run. But while this is unfortunate for reproducibility, it's not really the root cause. What's tripping us up is a certain pattern of typedef, e.g.: typedef struct JSObject JSObject; We're currently keeping a map of all the types just by name. So for a @cast to "JSObject", we end up picking whichever of the typedef or struct comes first in the debuginfo. That's kind of OK, because we already dig into typedefs to get their underlying struct. We run into trouble though when some CUs have just a declaration of that struct, not a definition. So then we go looking for the definition, but only by the name "JSObject", and we may end up back on the typedef. Hilarity ensues...
Created attachment 5067 [details] first fix attempt This seems to resolve it completely for me. I still need to run our regression tests before I commit it, and figure out a smaller testcase than that monster libxul, but please let me know how it works for you.
I've pushed commit a44a7cb, and I believe this is now fixed. Please report if you find otherwise...
Apologies for the feedback delay; had to put out a bunch of tinderbox fires. Yes, things are much happier with the landed changes. Thank you ever so much!