| Summary: | fstack cores | ||
|---|---|---|---|
| Product: | frysk | Reporter: | Andrew Cagney <cagney> |
| Component: | general | Assignee: | Unassigned <frysk-bugzilla> |
| Status: | ASSIGNED --- | ||
| Severity: | normal | CC: | mark |
| Priority: | P2 | ||
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Bug Depends on: | |||
| Bug Blocks: | 2244 | ||
Description
Andrew Cagney
2008-02-07 19:01:25 UTC
That looks familiar, but I lost the core file that produced a similar issue. Can you make the core file available somewhere? I'm pretty certain that Java does not glob ./Testrunner as 'pwd'/TestRunner; it just sees the filename as ./Testrunner. Can you rerun the test with the physical location of TestRunner provided?

Looking a little closer, it seems LinuxCoreHost was not using the canonical file name, so the ../ and the ./ were not being correctly resolved in the path:

2008-02-08 Phil Muldoon <pmuldoon@redhat.com>
* LinuxCoreHost.java: Use CanonicalFile() throughout.

There's a more fundamental issue here; this should not have led to an _abort_. Since we clearly don't know what to do with the IO exception, getAbsolutePath() is also a better option.

Feel free to split it into two bugs: one where the process runs aground in the dwfl call, and another where ./ and ../ are not being dealt with other than in a canonical file case. They need to be addressed differently. As for the IOException, the conversion of an IOException to a RuntimeException is just a conversion from a checked to an unchecked exception. I'm quite open to throwing either, but this is in keeping with the rest of frysk's usage of unchecked exceptions. I think using a canonical file is correct in LinuxCoreHost, as it correctly constructs the location of the file. getAbsolutePath wasn't. It was using the file as passed by the user, with the expectation that ./ and ../ would be correctly parsed. I'll say usability trumps the pedantic "do as I mean, not what I say (type)" here. I won't address the dwfl issue as I have no expertise there.

Attempting to replicate the issue with a simple process produces the following
stacktrace below. I added the debug statements for emphasis. The ideal solution
here would be for LinuxCoreHost not to deal with paths at all, and to have Elf just
provide the File-based constructor. Also, a change to how StatelessFile itself
constructs the filename would be required:
public StatelessFile(File file)
{
    byte[] path = file.getAbsolutePath().getBytes();
    // NUL terminate it - a new array starts filled with NUL
    unixPath = new byte[path.length + 1];
    System.arraycopy (path, 0, unixPath, 0, path.length);
}
This also uses AbsolutePath, and does not account for encoding.
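As a rough illustration of the two concerns just mentioned (absolute versus canonical path, and the implicit platform encoding), the conversion could be sketched as below. This is not frysk code: the class name `UnixPathBytes`, the method `toUnixPath`, and the choice of UTF-8 are all assumptions made for the example, and it uses `StandardCharsets`, which is newer than the Java that frysk targeted.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of frysk.
public class UnixPathBytes {
    // Build a NUL-terminated byte[] from the canonical path, using an
    // explicit charset instead of the platform default.
    static byte[] toUnixPath(File file, Charset charset) {
        try {
            byte[] path = file.getCanonicalPath().getBytes(charset);
            // Arrays.copyOf pads with zero bytes, so the one extra slot
            // serves as the NUL terminator.
            return Arrays.copyOf(path, path.length + 1);
        } catch (IOException e) {
            // Checked -> unchecked, so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 is an assumption; the encoding the native side expects
        // is not stated in this report.
        byte[] unixPath = toUnixPath(new File("./sleep"), StandardCharsets.UTF_8);
        System.out.println("bytes including terminator: " + unixPath.length);
    }
}
```

Whether canonicalisation belongs in this constructor or in one central place elsewhere is exactly the design question raised in this comment; the sketch only shows that both steps fit in a single conversion routine.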
Another improvement would be for the metadata to store File objects for library
names, instead of the strings extracted from the linkmap. This
once again shunts the native -> java path fixing off to one central location.
Anyway, the bug I tried to replicate showed as:
[pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.2157 ./sleep
Exe file backend is print as System.out.println(File): ./sleep
Exe file backend is print as System.out.println(File.getAbsolutePath())
/home/pmuldoon/frysk_bin/frysk-core/./sleep
Exe file backend is print as System.out.println(File.getCanoncialPath())
/home/pmuldoon/frysk_bin/frysk-core/sleep
Exception in thread "main" java.lang.NullPointerException
at lib.dwfl.DwflModule.getSymbol(fstack)
at frysk.symtab.SymbolFactory.getSymbol(fstack)
at frysk.stack.LibunwindFrame.getSymbol(fstack)
at frysk.stack.Frame.toPrint(fstack)
at frysk.stack.StackFactory.printTaskStackTrace(fstack)
at frysk.util.StacktraceAction.printTasks(fstack)
at frysk.util.StacktraceAction.allExistingTasksCompleted(fstack)
at frysk.proc.ProcCoreAction.<init>(fstack)
at frysk.bindir.fstack.stackCore(fstack)
at frysk.bindir.fstack.access$1(fstack)
at frysk.bindir.fstack$2.parseCores(fstack)
at frysk.util.CommandlineParser.doParse(fstack)
at frysk.util.CommandlineParser.parse(fstack)
at frysk.bindir.fstack.main(fstack)
Some additional root cause analysis. I've spent the last few days working down
through the code to the dwfl native calls. Given this command reproducer:
[pmuldoon@localhost frysk-core]$ cp /bin/sleep .
[pmuldoon@localhost frysk-core]$ ./sleep 5000 &
[3] 8966
[pmuldoon@localhost frysk-core]$ ./frysk/bindir/fcore 8966
[pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.8966 ./sleep
Produces:
Exception in thread "main" java.lang.NullPointerException
at lib.dwfl.DwflModule.getSymbol(DwflModule.cxx:126)
at frysk.symtab.SymbolFactory.getSymbol(SymbolFactory.java:86)
at frysk.stack.LibunwindFrame.getSymbol(LibunwindFrame.java:185)
at frysk.stack.Frame.toPrint(Frame.java:163)
at frysk.stack.StackFactory.printTaskStackTrace(StackFactory.java:105)
at frysk.util.StacktraceAction.printTasks(StacktraceAction.java:156)
at frysk.util.StacktraceAction.allExistingTasksCompleted(StacktraceAction.java:209)
at frysk.proc.ProcCoreAction.<init>(ProcCoreAction.java:60)
at frysk.bindir.fstack.stackCore(fstack.java:145)
at frysk.bindir.fstack.access$1(fstack.java:140)
at frysk.bindir.fstack$2.parseCores(fstack.java:166)
at frysk.util.CommandlineParser.doParse(CommandlineParser.java:169)
at frysk.util.CommandlineParser.parse(CommandlineParser.java:109)
at frysk.bindir.fstack.main(fstack.java:278)
If we add the following debug statements to frysk-sys/lib/dwfl/cni/Dwfl.cxx,
in the function lib::dwfl::Dwfl::dwfl_report_module:
void
lib::dwfl::Dwfl::dwfl_report_module(jstring moduleName, jlong low, jlong high)
{
    jsize len = JvGetStringUTFLength(moduleName);
    char modName[len+1];
    JvGetStringUTFRegion(moduleName, 0, len, modName);
    modName[len] = '\0';
    printf("Reporting module: %s\n at address 0x%lx-0x%lx\n",modName,low,high);
    ::dwfl_report_module(DWFL_POINTER, modName, (::Dwarf_Addr) low,
                         (::Dwarf_Addr) high);
}
This output is produced:
Reporting module: ./sleep
at address 0x400000-0x626000
Reporting module: /lib64/librt.so.1
at address 0x3235400000-0x3235609000
Reporting module: /lib64/ld-linux-x86-64.so.2
at address 0x38ca800000-0x38caa1c000
Reporting module: /lib64/libc.so.6
at address 0x38cba00000-0x38cbd57000
Reporting module: /lib64/libpthread.so.0
at address 0x38cc600000-0x38cc81b000
Reporting module: [vdso]
at address 0x7fff997fe000-0x7fff99800000
Note that the module is reported as provided by the user: ./sleep
The backtrace eventually grounds out at getSymbol in DwflModule.cxx. Adding some
debug statements here:
void
lib::dwfl::DwflModule::getSymbol(jlong address,
                                 lib::dwfl::SymbolBuilder* symbolBuilder)
{
    Dwarf_Addr addr = (Dwarf_Addr) address;
    GElf_Sym closest_sym;
    printf("Testing error no with dwfl pointer 0x%lx address 0x%lx\n",
           (long) DWFL_MODULE_POINTER, (long) address);
    const char* methName = dwfl_module_addrsym(DWFL_MODULE_POINTER, addr,
                                               &closest_sym, NULL);
    if (errno != 0)
        printf("Erro no is %d. Message: %s\n", errno, strerror(errno));
    else
        printf("Method name is: %s\n", methName);
    jstring jMethodName;
    if (methName == NULL)
        jMethodName = NULL;
    else
        jMethodName = JvNewStringUTF(methName);
    symbolBuilder->symbol(jMethodName,
                          closest_sym.st_value,
                          closest_sym.st_size,
                          ELF64_ST_TYPE(closest_sym.st_info),
                          ELF64_ST_BIND(closest_sym.st_info),
                          closest_sym.st_other);
    printf("Returning\n");
}
Produces this output:
Testing error no with dwfl pointer 0x107b880 address 0x38cba9ac30
Method name is: __nanosleep_nocancel
Returning
Testing error no with dwfl pointer 0x107b500 address 0x402f1a
Exception in thread "main" java.lang.NullPointerException
at lib.dwfl.DwflModule.getSymbol(DwflModule.cxx:126)
Note that the first symbol lookup is fine. When a second symbol lookup occurs at
a memory address inside the module named ./sleep, the function
dwfl_module_addrsym never returns, nor does it set errno. It seems to die in the
elfutils lib. Subsequently, if you follow the exception path back up to the
toPrint() method in Frame.java, the symbol is null:
public void toPrint (PrintWriter writer, boolean printSource, boolean fullpath) {
    // the address, padded with 0s based on the task's word size, ...
    writer.write("0x");
    String addr = Long.toHexString(getAddress());
    int padding = 2 * getTask().getISA().wordSize() - addr.length();
    for (int i = 0; i < padding; ++i)
        writer.write('0');
    writer.write(addr);
    // the symbol, if known append (), ..
    Symbol symbol = getSymbol();
    writer.write(" in ");
    writer.write(symbol.getDemangledName());
so the call to getDemangledName() is made on a null symbol. The backtrace is produced.
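For illustration only, the formatting done by toPrint() can be written as a standalone method that tolerates a missing symbol name. This is a sketch under assumptions: `FrameFormat`, `format`, and the `[unknown]` placeholder are hypothetical, and it does not address why the native lookup fails in the first place.

```java
// Hypothetical standalone version of the toPrint() formatting above.
public class FrameFormat {
    // wordSize is in bytes; each byte prints as two hex digits.
    static String format(long address, int wordSize, String symbolName) {
        StringBuilder out = new StringBuilder("0x");
        String addr = Long.toHexString(address);
        int padding = 2 * wordSize - addr.length();
        // Left-pad with zeros up to the full word width.
        for (int i = 0; i < padding; ++i)
            out.append('0');
        out.append(addr);
        out.append(" in ");
        // Guard against a missing symbol instead of dereferencing null;
        // the placeholder text is an assumption for this example.
        out.append(symbolName != null ? symbolName : "[unknown]");
        return out.toString();
    }

    public static void main(String[] args) {
        // The two addresses come from the debug output in this report.
        System.out.println(format(0x38cba9ac30L, 8, "__nanosleep_nocancel"));
        System.out.println(format(0x402f1aL, 8, null));
    }
}
```

With an 8-byte word size, the first call yields `0x00000038cba9ac30 in __nanosleep_nocancel` and the second yields `0x0000000000402f1a in [unknown]`. A guard like this would only hide the failed lookup at the printing layer, so it complements rather than replaces investigating the dwfl call.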
I wanted to be sure that the dwfl_report_module was not setting errno, so I
checked it after the call with the code below. For all modules including:
[pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.8966 ./sleep
Reporting module: ./sleep
at address 0x400000-0x626000
Erro no is 0, message is Success
I get success. Code:
void
lib::dwfl::Dwfl::dwfl_report_module(jstring moduleName, jlong low, jlong high)
{
    jsize len = JvGetStringUTFLength(moduleName);
    char modName[len+1];
    JvGetStringUTFRegion(moduleName, 0, len, modName);
    modName[len] = '\0';
    printf("Reporting module: %s\n at address 0x%lx-0x%lx\n",modName,low,high);
    ::dwfl_report_module(DWFL_POINTER, modName, (::Dwarf_Addr) low,
                         (::Dwarf_Addr) high);
    printf("Erro no is %d, message is %s\n", errno, strerror(errno));
}
Upstream already has a LinuxCoreHost.java patch that converts ./sleep to the canonical name `pwd`/sleep, so this error will not occur in Frysk at present. This is neither the right place to do the conversion nor the right fix. IMO the dwfl issue should be investigated further in the dwfl lib code, as the LinuxCoreHost change masks the problem rather than solving it. I'm removing my assignment from this bug.