Summary: | frysk command line tools failing with an NPE | ||
---|---|---|---|
Product: | frysk | Reporter: | Ray Ruvinskiy <rruvinsk> |
Component: | general | Assignee: | Unassigned <frysk-bugzilla> |
Status: | NEW --- | ||
Severity: | normal | CC: | pmuldoon, scox |
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: |
Description
Ray Ruvinskiy
2008-06-10 21:52:36 UTC
Since I'm doing other fixes to fstack, I'll look at this one. I've dug into this. I hope you don't mind. This actually seems to be a bug in the readlink function rather than in frysk. The call to readlink in frysk::sys::proc::Exe::get returns a much higher byte count as a return value than the actual length of the path it fills in the buffer. The buffer also seems to be partly filled with garbage: (gdb) p file $9 = "/proc/14211/exe", '\0' <repeats 258 times>, "?\224??*\000\000\000qV_?\177\000\000\000? \224??*\000\000U\204??5", '\0' <repeats 1716 times>, "?\224??*\000\000? \000\000\000\000\000\000\000*\000\000\000\000\000\000\000?\204??5", '\0' <repeats 1723 times>, "8M??5\000\000\000\b\000\000\000\000\000\000\000\000K??5\000\000\000@U?? 5\000\000\000?C??5", '\0' <repeats 11 times>, "8M?? 5\000\000\000\b\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000?\230?? 5\000\000\000?\006@?5\000\000\000@\205?\000\000\000\000\000@\210?\000"... (gdb) p len $10 = 48 (gdb) p link $11 = "/usr/libexec/gconfd-2\000\000attachment-remi (deleted)", '\0' <repeats 849 times>, "?\224? ?*\000\000\000qV_?\177\000\000\000?\224??*\000\000U\204??5", '\0' <repeats 1716 times>, "? \224??*\000\000\000qV_?\177\000\000\000?\224??*\000\000U\204??5", '\0' <repeats 1438 times> It looks like readlink fills in garbage (?) after the first NULL, and the byte count it returns includes this garbage. This confuses frysk, causing the NPE because we a null is returned from LinuxPtraceProc.getExeFile, at line 238. I was able to reproduce this problem in a sample program outside of frysk. I have glibc 2.7-2, which appears to be the latest Fedora 8 package. Also, it appears I was mistaken in saying that this problem exists on Ubuntu. I tried fstack again today on Ubuntu 8.04 (also glibc 2.7), and this did not reproduce. Just for extra information, in Fedora 8, x86-64 I cannot reproduce on gnome-termnal. Could be a Fedora 8 specific issue? Without debug-info installed: [pmuldoon@localhost frysk_bin]$ ps -ax | grep gnome-terminal Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ 3686 ? Sl 0:00 gnome-terminal 3820 pts/0 R+ 0:00 grep gnome-terminal [pmuldoon@localhost frysk_bin]$ ./frysk-core/frysk/bindir/fstack 3686 Task #3686 #0 0x000000332f4da926 in __poll () #1 0x000000329603ab18 in [unknown] #2 0x000000329603b1ad in g_main_loop_run () #3 0x00000035edb83a98 in gtk_main () #4 0x000000000041959e in [unknown] #5 0x000000332f41e32a in __libc_start_main () #6 0x000000000040ef39 in [unknown] Task #3693 #0 0x000000333000db8b in __read () #1 0x000000329603852b in [unknown] #2 0x0000003296060434 in [unknown] #3 0x000000333000729a in start_thread () #4 0x000000332f4e42cd in __clone () With debug-info installed. [pmuldoon@localhost frysk_bin]$ ./frysk-core/frysk/bindir/fstack 3686 Task #3686 #0 0x000000332f4da926 in __poll () #1 0x000000329603ab18 in g_main_context_iterate () #2 0x000000329603b1ad in g_main_loop_run () #3 0x00000035edb83a98 in gtk_main () #4 0x000000000041959e in [unknown] #5 0x000000332f41e32a in __libc_start_main () #6 0x000000000040ef39 in [unknown] Task #3693 #0 0x000000333000db8b in __read () #1 0x000000329603852b in child_watch_helper_thread () #2 0x0000003296060434 in g_thread_create_proxy () #3 0x000000333000729a in start_thread () #4 0x000000332f4e42cd in __clone () (In reply to comment #2) > I've dug into this. I hope you don't mind. not at all! > This actually seems to be a bug in the readlink function rather than in frysk. The call to readlink in frysk::sys::proc::Exe::get returns a much higher byte count as a return value than the actual length of > the path it fills in the buffer. The buffer also seems to be partly filled with garbage: > > (gdb) p file > $9 = "/proc/14211/exe", '\0' <repeats 258 times>, "?\224??*\000\000\000qV_?\177\000\000\000? > \224??*\000\000U\204??5", '\0' <repeats 1716 times>, "?\224??*\000\000? > \000\000\000\000\000\000\000*\000\000\000\000\000\000\000?\204??5", '\0' <repeats 1723 > times>, "8M??5\000\000\000\b\000\000\000\000\000\000\000\000K??5\000\000\000@U?? > 5\000\000\000?C??5", '\0' <repeats 11 times>, "8M?? > 5\000\000\000\b\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000?\230?? > 5\000\000\000?\006@?5\000\000\000@\205?\000\000\000\000\000@\210?\000"... ah! this is sounding like the kernel; you could try "fexe" on a process showing the problem or just ls -l /proc/PID/exe? If the last one prints garbage then I think it is the kernel and we should push it to fedora. For a workaround, I guess we could sanity check using strlen and then log the problem. Here's a repo: cagney@rat$ cp /bin/sleep /tmp cagney@rat$ /tmp/sleep 1000 & [1] 28790 cagney@rat$ cagney@rat$ cagney@rat$ fexe 28790 /tmp/sleep cagney@rat$ cagney@rat$ rm /tmp/sleep cagney@rat$ fexe 28790 Error: java.lang.NullPointerException cagney@rat$ cagney@rat$ ls -l /proc/28790/exe lrwxrwxrwx 1 cagney cagney 0 2008-06-12 08:51 /proc/28790/exe -> /tmp/sleep (deleted) Hmm, I suspect this was an "it was working"; we've a relatively straight forward test case implement. This prints an error explainng what's gone wrong, rather than NPEing. commit e7fe8b0edbf5bc538a3e7fbcd37997bc800d27a5 Author: Andrew Cagney <cagney@toil.yyz.redhat.com> Date: Thu Jun 12 16:31:00 2008 -0400 When /proc/PID/exe is invalid throw an exception. frysk-core/frysk/bindir/ChangeLog 2008-06-12 Andrew Cagney <cagney@redhat.com> * TestFexe.java (testExeOfDeletedFile()): Enable. Match expected error. * fexe.java: Use Exe.getName. frysk-core/frysk/proc/live/ChangeLog 2008-06-12 Andrew Cagney <cagney@redhat.com> * LinuxPtraceProc.java (getExe()): Delete. frysk-sys/frysk/sys/ChangeLog 2008-06-12 Andrew Cagney <cagney@redhat.com> * cni/Errno.cxx (throwUserException): New. * cni/Errno.hxx (throwUserException): New. frysk-sys/frysk/sys/proc/ChangeLog 2008-06-12 Andrew Cagney <cagney@redhat.com> * Exe.java (getName): Rename get. (openReadOnly): New. * jni/Exe.cxx (Exe::getName): Update. * cni/Exe.cxx (Exe::getName): Update. frysk-sys/jnixx/ChangeLog 2008-06-12 Andrew Cagney <cagney@redhat.com> * exceptions.cxx (userException): Implement. * exceptions.hxx (userException): Declare. cagney@toil$ cp /bin/sleep . cagney@toil$ ./sleep 1000 & [2] 14879 cagney@toil$ ./frysk/bindir/fexe $! /notnfs/cagney/git/native/frysk-core/sleep cagney@toil$ rm sleep cagney@toil$ ./frysk/bindir/fexe $! Error: The link /proc/14879/exe points to the deleted file /notnfs/cagney/git/native/frysk-core/sleep In my case, the binary was not deleted. If you look at the gdb output, you will see that the buffer contains the valid path, followed by a NULL, followed by garbage which happens to include "(deleted)". The byte count went to the end of "(deleted)". Of course, printing out the buffer will result in the right output, due to the NULL at the end of the correct path. There's definitely procfs or glibc weirdness here, but it has nothing to do with frysk. It appears that as of commit 531115f2541df6ed1e8a8f16e24c519fb858b209, Errno.h includes frysk/UserException.h, but the latter does not seem to have been checked in, causing a build failure. (In reply to comment #7) > In my case, the binary was not deleted. If you look at the gdb output, you will see that the buffer contains > the valid path, followed by a NULL, followed by garbage which happens to include "(deleted)". Ah, yes, changing the code, I found an old comment (now living in frysk/sys/proc/{cni,jni}/Exe.cxx: if ((int) strlen(link) != len) { // Assume that an EXE that has somehow ended up with an embedded // NUL character is invalid. This happens when the kernel screws // up "mv a-really-long-file $exe" leaving the updated EXE string // with something like "$exe<NUL>ally-long-file (deleted)". userException(env, "The link %s is corrupt", file); } here's the actual sequence, I think I might tweak fexe a little so it prints this out: cagney@rat$ /tmp/sleep 1000 & [1] 31378 cagney@rat$ cagney@rat$ mv /tmp/a-really-long-file /tmp/sleep cagney@rat$ ls -l /proc/31378/exe lrwxrwxrwx 1 cagney cagney 0 2008-06-13 14:21 /proc/31378/exe -> /tmp/sleep cagney@rat$ (the string actually contains "/tmp/sleep<NUL>.... (deleted)"), and as you can see the exe, while looking ok, is pointing to the wrong file. Pre-link, and possibly update, can do this. (i've re-opened the bug) Can you check for this. I'm also going to tweak fexe so its easier to see this. Ah, this makes sense and explains what I've been seeing. I originally built frysk on Ubuntu. Prior to doing that, I updated quite a few packages and did not reboot or restart X. When I tried fstack, I would get the NPE. I then decided to try frysk on Fedora but wasn't able to get to it for several weeks. When I had some free time, I installed a Fedora 8 VM and proceeded to update all my packages (which was a fun project in itself...). I then built frysk and saw the same NPE. However, in those two weeks, I had rebooted my Ubuntu machine, and so when I went back to try frysk on it again, I could not reproduce the problem. Tricky little problem. |