cagney@localhost$ ../frysk-core/frysk/bindir/fcatch ../frysk-core/frysk/bindir/fstack core.18692 ./TestRunner Entered parseCommand fcatch: from PID 18955 TID 18955: SIGSEGV(11) detected - dumping stack trace for TID 18955 #0 0x082cf3b8 in read_proc_memory(void*, void*, unsigned long long, unsigned int, unsigned int) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #1 0x0855348b in elf_from_remote_memory () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #2 0x082cf35f in dwfl_frysk_proc_find_elf(Dwfl_Module*, void**, char const*, unsigned long long, char**, Elf**) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #3 0x08551898 in find_file () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #4 0x08551ae3 in find_symtab () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #5 0x08551c6a in dwfl_module_getsymtab () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #6 0x085540dc in dwfl_module_addrsym () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #7 0x082cfd98 in hidden alias for void lib::dwfl::DwflModule::getSymbol(long long, lib::dwfl::SymbolBuilder*) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #8 0x08218e7f in frysk::symtab::Symbol* frysk::symtab::SymbolFactory::getSymbol(frysk::proc::Task*, long long) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #9 0x082091a2 in frysk::symtab::Symbol* frysk::stack::LibunwindFrame::getSymbol() () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #10 0x082072ea in void frysk::stack::Frame::toPrint(java::io::PrintWriter*, bool, bool) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #11 0x0820b720 in void frysk::stack::StackFactory::printTaskStackTrace(java::io::PrintWriter*, frysk::proc::Task*, bool, bool, int) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #12 0x08230901 in void frysk::util::StacktraceAction::printTasks() () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #13 0x08230a52 in void frysk::util::StacktraceAction::allExistingTasksCompleted() () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #14 0x081e8dc7 in frysk::proc::ProcCoreAction::ProcCoreAction(frysk::proc::Proc*, frysk::proc::ProcObserver$ProcAction*) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #15 0x080fa5d3 in void frysk::bindir::fstack::stackCore(frysk::util::CoreExePair*) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #16 0x080fa98d in void frysk::bindir::fstack::access$1(frysk::util::CoreExePair*) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #17 0x080f9f15 in void frysk::bindir::fstack$2::parseCores(JArray<frysk::util::CoreExePair*>*) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #18 0x0822d158 in JArray<java::lang::String*>* frysk::util::CommandlineParser::doParse(JArray<java::lang::String*>*) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #19 0x0822ccd5 in JArray<java::lang::String*>* frysk::util::CommandlineParser::parse(JArray<java::lang::String*>*) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #20 0x080fa92d in void frysk::bindir::fstack::main(JArray<java::lang::String*>*) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #21 0x02ddd3d1 in void gnu::java::lang::MainThread::call_main() () from /usr/lib/libgcj.so.8rh.0.0 #22 0x02e56987 in void gnu::java::lang::MainThread::run() () from /usr/lib/libgcj.so.8rh.0.0 #23 0x02defb63 in _Jv_ThreadRun(java::lang::Thread*) () from /usr/lib/libgcj.so.8rh.0.0 #24 0x02d9b8ea in _Jv_RunMain(_Jv_VMInitArgs*, java::lang::Class*, char const*, int, char const**, bool) () from /usr/lib/libgcj.so.8rh.0.0 #25 0x02d9baa4 in _Jv_RunMain(java::lang::Class*, char const*, int, char const**, bool) () from /usr/lib/libgcj.so.8rh.0.0 #26 0x02d9bb2b in JvRunMain () from /usr/lib/libgcj.so.8rh.0.0 #27 0x080f66c1 in main () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack #28 0x00808390 in __libc_start_main () from /lib/libc-2.7.so #29 0x080f65c1 in void java::awt::Component::transferFocusUpCycle() () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack SIGSEGV(11) detected - dumping stack trace for TID 18956 #0 0x00110402 in __kernel_vsyscall () from /home/scratch/frysk/pid/native/frysk-sys/[vdso] #1 0x000518e0 in [unknown] from /home/scratch/frysk/pid/native/frysk-sys/Unknown #2 0xf13c8001 in [unknown] from /home/scratch/frysk/pid/native/frysk-sys/Unknown
That looks familiar, but I lost the core file that produced a similar issues. Can you make the core file available somewhere?
Pretty certain that Java does not glob ./Testrunner as 'pwd'/TestRunner. It just sees the filename as ./Testrunner. Can you rerun the test with physical location of TestRunner provided?
Looking a little closer, it seems LinuxCoreHost was not using the Canonical File name, so the ../ and the ./ were not being correctly resolved in the path: 2008-02-08 Phil Muldoon <pmuldoon@redhat.com> * LinuxCoreHost.java: Use CanonicalFile() througout.
There's a more fundamental issue here; this should not have lead to an _abort_ Since we clearly don't know what to do with the IO exception, getAbsolutePath() is also a better option.
Feel free to split it into two bugs. One where the the process run aground in the dwfl call, and another where ./ and ../ are not being dealt with other than in a canonical file case. They need to be addressed differently. As for the IOException, the converstion of a IOException to a RuntimeException is just conversion to an unchecked versus checked exception. I'm quite open to throwing either, but in keeping with the rest of frysk's usage on unchecked exception. I think think the case of using a canonical file is correct in LinuxCoreHost as it is correctly constructing the location of the file. getAbsolutePath wasn't. It was using the file as passed by the user, with the expectation that ./ and ../ would be correctly parsed. I'll say usability trumps pedantic do as I mean, now what I say (type) here. I won't address the dwfl issue as I have no expertise there.
Attempting to replicate the issue with a simple process produces the following sacktrace below. I added the debug statements for emphasis. The ideal solution here would be for LinuxCoreHost not to deal with paths at all, and have Elf just provide the File based constructor. Also, a change to how StatelessFile itself constructs the filename would be required: public StatelessFile(File file) { byte[] path = file.getAbsolutePath().getBytes(); // NUL terminate it - an array start with NUL unixPath = new byte[path.length + 1]; System.arraycopy (path, 0, unixPath, 0, path.length); } This also uses AbsolutePath, and does not account for encoding. There is another improvement in that the metadata instead of storing strings extracted from the linkmap for library names, store File objects instead. This once again shunts the native -> java path fixing off to one central location. Anyway, the bug I tried to replicate showed as: [pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.2157 ./sleep Exe file backend is print as System.out.println(File): ./sleep Exe file backend is print as System.out.println(File.getAbsolutePath()) /home/pmuldoon/frysk_bin/frysk-core/./sleep Exe file backend is print as System.out.println(File.getCanoncialPath()) /home/pmuldoon/frysk_bin/frysk-core/sleep Exception in thread "main" java.lang.NullPointerException at lib.dwfl.DwflModule.getSymbol(fstack) at frysk.symtab.SymbolFactory.getSymbol(fstack) at frysk.stack.LibunwindFrame.getSymbol(fstack) at frysk.stack.Frame.toPrint(fstack) at frysk.stack.StackFactory.printTaskStackTrace(fstack) at frysk.util.StacktraceAction.printTasks(fstack) at frysk.util.StacktraceAction.allExistingTasksCompleted(fstack) at frysk.proc.ProcCoreAction.<init>(fstack) at frysk.bindir.fstack.stackCore(fstack) at frysk.bindir.fstack.access$1(fstack) at frysk.bindir.fstack$2.parseCores(fstack) at frysk.util.CommandlineParser.doParse(fstack) at frysk.util.CommandlineParser.parse(fstack) at frysk.bindir.fstack.main(fstack)
Some additional root cause analysis. I've spent the last few days working down through the code to the dwlf native calls. Given this command reproducer: [pmuldoon@localhost frysk-core]$ cp /bin/sleep . [pmuldoon@localhost frysk-core]$ ./sleep 5000 & [3] 8966 [pmuldoon@localhost frysk-core]$ ./frysk/bindir/fcore 8966 [pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.8966 ./sleep Produces: Exception in thread "main" java.lang.NullPointerException at lib.dwfl.DwflModule.getSymbol(DwflModule.cxx:126) at frysk.symtab.SymbolFactory.getSymbol(SymbolFactory.java:86) at frysk.stack.LibunwindFrame.getSymbol(LibunwindFrame.java:185) at frysk.stack.Frame.toPrint(Frame.java:163) at frysk.stack.StackFactory.printTaskStackTrace(StackFactory.java:105) at frysk.util.StacktraceAction.printTasks(StacktraceAction.java:156) at frysk.util.StacktraceAction.allExistingTasksCompleted(StacktraceAction.java:209) at frysk.proc.ProcCoreAction.<init>(ProcCoreAction.java:60) at frysk.bindir.fstack.stackCore(fstack.java:145) at frysk.bindir.fstack.access$1(fstack.java:140) at frysk.bindir.fstack$2.parseCores(fstack.java:166) at frysk.util.CommandlineParser.doParse(CommandlineParser.java:169) at frysk.util.CommandlineParser.parse(CommandlineParser.java:109) at frysk.bindir.fstack.main(fstack.java:278) If we add in the following debug statements to: frysk-sys/lib/dwfl/cni/Dwfl.cxx in function: lib::dwfl::Dwfl::dwfl_report_module void lib::dwfl::Dwfl::dwfl_report_module(jstring moduleName, jlong low, jlong high) { jsize len = JvGetStringUTFLength(moduleName); char modName[len+1]; JvGetStringUTFRegion(moduleName, 0, len, modName); modName[len] = '\0'; printf("Reporting module: %s\n at address 0x%lx-0x%lx\n",modName,low,high); ::dwfl_report_module(DWFL_POINTER, modName, (::Dwarf_Addr) low, (::Dwarf_Addr) high); } The output is produced: Reporting module: ./sleep at address 0x400000-0x626000 Reporting module: /lib64/librt.so.1 at address 0x3235400000-0x3235609000 Reporting module: /lib64/ld-linux-x86-64.so.2 at address 0x38ca800000-0x38caa1c000 Reporting module: /lib64/libc.so.6 at address 0x38cba00000-0x38cbd57000 Reporting module: /lib64/libpthread.so.0 at address 0x38cc600000-0x38cc81b000 Reporting module: [vdso] at address 0x7fff997fe000-0x7fff99800000 Note the the module is reported as provided by the user: ./sleep The backtrace eventually grounds out at GetSymbol in DwflModule.cxx. Adding some debug statements here: void lib::dwfl::DwflModule::getSymbol(jlong address, lib::dwfl::SymbolBuilder* symbolBuilder) { Dwarf_Addr addr = (Dwarf_Addr) address; GElf_Sym closest_sym; printf("Testing error no with dwfl pointer 0x%lx address 0x%lx\n",(long) DWFL_MODULE_POINTER, (long) address); const char* methName = dwfl_module_addrsym(DWFL_MODULE_POINTER, addr, &closest_sym, NULL); if (errno != 0) printf("Erro no is %d. Message: %s\n", errno, strerror(errno)); else printf("Method name is: %s\n", methName); jstring jMethodName; if (methName == NULL) jMethodName = NULL; else jMethodName = JvNewStringUTF(methName); symbolBuilder->symbol(jMethodName, closest_sym.st_value, closest_sym.st_size, ELF64_ST_TYPE(closest_sym.st_info), ELF64_ST_BIND(closest_sym.st_info), closest_sym.st_other); printf("Returning\n"); } Produces this output: Testing error no with dwfl pointer 0x107b880 address 0x38cba9ac30 Method name is: __nanosleep_nocancel Returning Testing error no with dwfl pointer 0x107b500 address 0x402f1a Exception in thread "main" java.lang.NullPointerException at lib.dwfl.DwflModule.getSymbol(DwflModule.cxx:126) Note that the first symbol lookup is fine. When a second symbol lookup occurs at the memory address represented by the ./bash named module, the function: dwfl_module_addrsym never returns, or sets an errno. It seems to die in the elfutils lib. Subsequently, if you follow the exception path back up to Frame.java toPrint() method, the symbol is null: public void toPrint (PrintWriter writer, boolean printSource, boolean fullpath) { // the address, padded with 0s based on the task's word size, ... writer.write("0x"); String addr = Long.toHexString(getAddress()); int padding = 2 * getTask().getISA().wordSize() - addr.length(); for (int i = 0; i < padding; ++i) writer.write('0'); writer.write(addr); // the symbol, if known append (), .. Symbol symbol = getSymbol(); writer.write(" in "); writer.write(symbol.getDemangledName()); and the call to getDemangledName() is null. The backtrace is produced.
I wanted to be sure that the dwfl_report_module was not setting errno, so I checked it after the call with the code below. For all modules including: [pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.8966 ./sleep Reporting module: ./sleep at address 0x400000-0x626000 Erro no is 0, message is Success I get success. Code: void lib::dwfl::Dwfl::dwfl_report_module(jstring moduleName, jlong low, jlong high) { jsize len = JvGetStringUTFLength(moduleName); char modName[len+1]; JvGetStringUTFRegion(moduleName, 0, len, modName); modName[len] = '\0'; printf("Reporting module: %s\n at address 0x%lx-0x%lx\n",modName,low,high); ::dwfl_report_module(DWFL_POINTER, modName, (::Dwarf_Addr) low, (::Dwarf_Addr) high); printf("Erro no is %d, message is %s\n", errno, strerror(errno)); }
Upstream already has a LinuxCoreHost.java patch that converts ./sleep to the canonical name `pwd`/sleep, so this error will not occur in Frysk at present. This is neither the right place to do the converstion, or the right fix. IMO the dwfl issue should be further investigated into the dwfl lib code as this code masks the problem, rather than solving it. I'm removing my assignment from this bug.