Uprobing a multithreaded app on an x86_64 SMP system shows serious serialization of the threads in the kernel's signal-handling code. In the app in question, the child threads just call a dummy function repeatedly; the uprobes module probes the dummy function's entry point.

Here's a summary of data reported by oprofile. It shows that with more than one thread running, utrace_get_signal(), get_signal_to_deliver(), and force_sig_info() are the top three consumers of CPU time. I'm guessing that the threads are serializing on task_struct->sighand->siglock (which is shared among tasks of the same process).

#CPUs: 4
                      pct (rank)         pct (rank)             pct (rank)
threads  usec/iter**  utrace_get_signal  get_signal_to_deliver  force_sig_info
1*        4.4         12.2% (1)           2.4% (13)             < 1%
1         4.0         12.0% (1)           3.5% (7)              < 1%
2         9.2         21.4% (1)          13.2% (2)               5.7% (3)
3        19.0         30.9% (1)          24.4% (2)              13.5% (3)
4        29.7         36.7% (1)          25.6% (2)              14.4% (3)

*  single-thread program -- no parent thread
** Divide by #threads to get usec per probe hit.
Percentages are of total kernel+user time.

I have no particular reason to think that this problem is specific to x86_64.

I've observed poor scaling on multithreaded apps before, but never got around to pointing oprofile at it. I was hoping it was something we could fix in uprobes. :-|
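
To make the ** footnote concrete, here is a small C sketch that converts the usec/iter figures reported above into usec per probe hit. The struct and function names are made up for illustration; only the numbers come from the report, and the 1* single-thread-program row is left out.

```c
#include <assert.h>
#include <stdio.h>

/* One measurement from the report: thread count and usec per iteration. */
struct sample {
    int threads;
    double usec_per_iter;
};

/* Rows for the multithreaded program (1 to 4 child threads). */
const struct sample samples[] = {
    {1, 4.0}, {2, 9.2}, {3, 19.0}, {4, 29.7},
};

/* Per the ** footnote: usec per probe hit = usec/iter divided by #threads. */
double usec_per_hit(struct sample s)
{
    assert(s.threads > 0);
    return s.usec_per_iter / s.threads;
}

/* Print the per-hit cost for every row, one line per thread count. */
void print_per_hit(void)
{
    size_t n = sizeof(samples) / sizeof(samples[0]);
    for (size_t i = 0; i < n; i++)
        printf("%d threads: %.1f usec per probe hit\n",
               samples[i].threads, usec_per_hit(samples[i]));
}
```

Calling print_per_hit() from a main() lists the per-hit cost for each thread count. By this arithmetic the cost goes from about 4.0 usec with one thread to about 4.6, 6.3, and 7.4 usec with two, three, and four threads, even though four CPUs are available.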
Because of utrace's internal signal-related locking, if nothing else, this problem may not be correctable. The lkml-bound uprobes should be evaluated with multithreaded programs to see whether it is affected as well.
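
For anyone repeating the evaluation, the following is a minimal sketch of a workload of the kind described in the report: child threads that do nothing but call a dummy function. It is not the original test program. The names dummy(), worker(), and run_workload() are hypothetical, and the noinline attribute and empty asm statement are GCC/Clang-specific assumptions used only to keep a real, callable entry point to probe.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical probe target; noinline so it keeps a distinct entry point. */
__attribute__((noinline)) void dummy(void)
{
    __asm__ volatile("");   /* stops the compiler from dropping the calls */
}

struct worker_arg {
    long iters;   /* how many times this thread calls dummy() */
    long calls;   /* calls actually made, filled in by the thread */
};

/* Child thread body: call dummy() repeatedly, as in the report. */
static void *worker(void *p)
{
    struct worker_arg *a = p;
    for (long i = 0; i < a->iters; i++) {
        dummy();
        a->calls++;
    }
    return NULL;
}

/*
 * Start nthreads child threads, each calling dummy() iters times.
 * Returns the total number of calls made, or -1 on failure.
 */
long run_workload(int nthreads, long iters)
{
    assert(nthreads > 0);
    pthread_t *tids = calloc(nthreads, sizeof(*tids));
    struct worker_arg *args = calloc(nthreads, sizeof(*args));
    long total = -1;

    if (tids && args) {
        int started = 0;
        long sum = 0;
        for (; started < nthreads; started++) {
            args[started].iters = iters;
            if (pthread_create(&tids[started], NULL, worker,
                               &args[started]) != 0)
                break;
        }
        /* Join every thread that did start, even after a failure. */
        for (int i = 0; i < started; i++) {
            pthread_join(tids[i], NULL);
            sum += args[i].calls;
        }
        if (started == nthreads)
            total = sum;
    }
    free(tids);
    free(args);
    return total;
}
```

A main() that calls run_workload() with the thread count under test and times the call gives a usec/iter figure comparable in spirit to the one in the report, once a probe is placed on the entry point of dummy(). Build with -pthread.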