Bug 12642

Summary: utrace: taskfinder misses events when main thread does not go through at least one quiesce
Product: systemtap Reporter: Rayson Ho <raysonlogin>
Component: runtimeAssignee: Unassigned <systemtap>
Status: RESOLVED FIXED    
Severity: normal CC: dsmith, fche, jistone, kahing, mark
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description Rayson Ho 2011-04-06 06:05:26 UTC
If a multi-threaded program with the main thread not going through at least one quiesce after systemtap/uprobes tracing starts, then Systemtap is not able to detect the events.

Example:

#include <pthread.h>
#include <stdio.h>

void spool_write_script(int jobid) __attribute__ ((noinline));

void spool_write_script(int jobid)
{
 printf("sleeping... %d sec\n", 1 + jobid % 2);
 sleep(2 + jobid % 2);
}

mythread()
{
 int i;
 for (i=0;i<30;i++)
  spool_write_script(i);
}

int main()
{
 pthread_t tid;

 printf("pid = %d\n", getpid());

#ifdef MAINWAITS
 getchar();
#endif

 pthread_create(&tid, NULL, mythread, NULL);

 pthread_join(tid, NULL);

}


Reproduce with:
stap hangs when the process is traced with:
# stap -e 'probe process("./a.out").function("spool_write_script") {printf("called")}' -x <PID>

However, when the testcase is compiled with -DMAINWAITS, and when stap is started before inputing a key (getchar()), then stap is able to get the events. (Also, starting the program with -c would make it work.)


Analysis:
In the systemtap runtime, __stp_utrace_task_finder_target_quiesce() checks if it is encountering the main thread:

        /* Call the callbacks.  Assume that if the thread is a
         * thread group leader, it is a process. */
        __stp_call_callbacks(tgt, tsk, 1, (tsk->pid == tsk->tgid));

For slave threads, tsk->pid == tsk->tgid is false, and the callback function in uprobes only handles the main thread, as the address space is shared by all the threads. In stap_uprobe_process_found():

  if (! process_p) return 0; /* ignore threads */

And thus stap_uprobe_change_plus() is not called for the threads.

(As a hack) Simply passing 1 instead of (tsk->pid == tsk->tgid) would make it work for the testcase above.

Severity:
Normal - The testcase is from a real application (daemon) that creates slave threads, and the main thread waits until clean-up time (ie. daemon restart, exit, etc...).

A workaround is to force the main thread to quiesce after stap runs, e.g. by attaching gdb.
Comment 1 Frank Ch. Eigler 2012-10-29 12:57:35 UTC
David, could this be corrected by a UTRACE_STOP sent to the main thread upon attaching to it?
Comment 2 Mark Wielaard 2012-10-29 19:43:25 UTC
Just adding a comment so people who see hotspot java probes not working against running java processes might find this bug report. The hotspot main thread does nothing except wait for all other threads to exit. Which can trigger this bug. This happens for example on a 2.6.32-220.23.1.el6.x86_64 kernel.
Comment 3 David Smith 2012-11-02 16:03:32 UTC
Depending on how you count, we've got 3 or 4 sets of utrace-like functionality here, with different behaviors:

1) the original version of utrace, which is present in RHEL5, handled by runtime/linux/task_finder.c. In this case we do send the main thread a UTRACE_STOP, which causes it to stop and we attach correctly.

Here your testcase passes.

2) Version 2 of utrace, which is present in RHEL6 (also handled by runtime/linux/task_finder.c). In this case we do send the main thread a UTRACE_STOP - however the stop doesn't/can't interrupt a system call.

Here your testcase fails.

3) On new kernels without "real" utrace, we fake it with tracepoint handlers and task_work_add().  This code is in runtime/linux/task_finder2.c and runtime/stp_utrace.c. This code uses task_work_add() to run code when the task is stopped.  This can't interrupt a system call.

Here your testcase fails.

4) The new dyninst runtime ("--runtime=dyninst") uses the dyninst library to attach, which ends up using ptrace. Attaching to a running process isn't quite there yet, but I think it should be possible to interrupt a system call.

So, for cases 2) and 3) above, we've still got some thinking to do.
Comment 4 Josh Stone 2012-11-03 17:02:56 UTC
(In reply to comment #3)
> 4) The new dyninst runtime ("--runtime=dyninst") uses the dyninst library to
> attach, which ends up using ptrace. Attaching to a running process isn't quite
> there yet, but I think it should be possible to interrupt a system call.

I'd be surprised if there was a quiesce-type issue in stapdyn, but I did find an obvious omission that caused its attach to miss -- commit 02bff02.
Comment 5 David Smith 2012-11-27 15:35:59 UTC
Commit f346b8b fixes this for everything except dyninst.

Here's what was going on with this one.

In the original utrace (RHEL5), UTRACE_STOP could interrupt the target task (by sending the task a fake signal). This caused the target task to stop even when it was sleeping.

In the newer utrace (RHEL6), that functionality was split out into UTRACE_INTERRUPT. So, for RHEL6, we now make a pass after the systemtap session is started and send all target tasks an UTRACE_INTERRUPT.

For newer kernels without utrace, I've implemented UTRACE_INTERRUPT support, so just like for RHEL6, we now make a pass after the systemtap session is started and send all target tasks an UTRACE_INTERRUPT.

I've added a test case, called 'main_quiesce.exp' that tests this issue.

This new test case passes everywhere, except with dyninst. That will need more investigation.
Comment 6 David Smith 2012-12-06 20:06:17 UTC
The dyninst problem has been moved to its own bug, #14923.