what does 'probe process(PID_OR_NAME).clone' mean?

David Smith dsmith@redhat.com
Wed May 28 20:30:00 GMT 2008


Here's Roland's response to my original email.  (Note that I've got
Roland's permission to forward this on.)

====

Your new formulation really doesn't wash with me.
Rather than a coherent response to your message,
I'll just dump a bunch of related thoughts.

First about the terminology.  In Linuxspeak they use "pid" to mean what
the rest of us might call "tid"--an ID for an individual thread
(sometimes called a task in Linuxspeak).  They use "tgid" (thread group
ID) to mean what the rest of us normally call a PID--an ID for a whole
process of one or more threads.  It's easy to be loose with them when
the fine details aren't coming up, because in Linux the PID as users
normally experience it (aka tgid) is the same number as the tid of the
initial thread in that process.  In discussions relating at all to Linux
internals, I find it easiest to stick to "tid" and "tgid" to be clear.
In features and documentation for users and programmers who don't have
Linux on the brain, it usually makes sense to talk mostly about PIDs and
the process as a whole, and then treat the possibility of multiple
threads in a process directly and explicitly rather than tossing around
tid/tgid/pid casually.
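
To make the tid/tgid distinction concrete, here is a tiny userspace
sketch (an illustration added here, not part of the original mail;
build with -pthread, and note gettid() needs a raw syscall on glibc
older than 2.30): the initial thread's tid equals the tgid, i.e. the
PID as users see it, while a second thread gets a tid of its own.

/* Show tid (gettid) vs. PID/tgid (getpid); illustration only. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static pid_t my_tid(void)
{
	return (pid_t) syscall(SYS_gettid);
}

static void *worker(void *arg)
{
	/* Second thread: same tgid (the PID), different tid. */
	printf("worker:  pid(tgid)=%d tid=%d\n", getpid(), my_tid());
	return NULL;
}

int main(void)
{
	pthread_t t;

	/* Initial thread: tid == tgid == the PID users see. */
	printf("initial: pid(tgid)=%d tid=%d\n", getpid(), my_tid());

	pthread_create(&t, NULL, worker, NULL);
	pthread_join(t, NULL);
	return 0;
}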

At the low level, all the events are per-thread events.  (In fact, the
entirety of the utrace interface we have now is a per-thread interface
without regard to thread groups, i.e. processes.)  At the level users
want to think about, some events are naturally considered unitary
process-wide events, like exec (and maybe "process exit" as distinct
from "thread exit").  Many (most?) others (most of the time?) users do
think of as being tied to an individual thread, but want to treat
uniformly for all threads in the process, i.e., formulate as "the
process experiences the event in some thread" with the thread as a
detail of what happened rather than as the determinant of how to treat
the event.

The clone event is an event that the parent thread experiences before
the child is considered to have been born and experienced any events.
(Consider it "crowning", if you like graphic metaphors.  ;-)
It differs from a syscall event in a few ways, some semantic and
some with only performance impact.

1. At low level, there is one switch per thread to enable any kind of
   syscall tracing.  Flipping this on makes *all* syscall entries and
   *all* syscall exits by this thread go through a slow path in the
   bowels of the kernel, not the optimized fast path.  You use the
   slow exit path even if you only care about entry tracing, and vice
   versa.  If your tracing does a very quick check for just one
   syscall and ignores all others, you are going not only through your
   check but through slow entry and exit paths for each and every
   syscall that is not of interest.

   By contrast, tracing clone events imposes zero overhead on anything
   else.  Even if only a minority of clone events are of interest,
   your callback's quick check to ignore others will in fact be very
   quick in relation to a large and costly operation that just took
   place.  Percentagewise the overhead of catching and discarding
   clone events is probably entirely negligible.

2. In syscall tracing, the event is an attempt to make a syscall.
   You get an exit trace for an error return too, though that's of
   no interest.  The clone event, by contrast, only reports a clone
   that has actually succeeded in creating a new thread.

3. (For semantics, this is the kicker.)  The clone event callback is
   a unique opportunity where the child exists and can be examined
   and controlled before it ever runs.

   The most arcane special thing is the utrace_attach synchronization.
   (The details of this might change in the future.  But keep the
   issue in mind, whether or not you write code to rely on it now, and
   your input will influence whether the future utrace interface still
   tries to solve this problem for you and the details of how.)

   As you know, utrace_attach can be called on any target thread
   asynchronously from any thread.  As soon as a new thread exists,
   its tid goes into global tables and it becomes possible for anyone
   to find the thread's task_struct and call utrace_attach on it.
   This is so at the time the report_clone callback is made.  SMP or
   preemption may allow another thread to call utrace_attach before or
   simultaneously with your callback code.

   While report_clone callbacks are being made, utrace_attach has a
   special behavior related to this.  Until callbacks have completed,
   if the caller of utrace_attach is not the parent thread (the one
   running callbacks), it can block (or return -ERESTARTNOINTR).  The
   idea is that the engines attached to the parent and tracing its
   clone events get first crack at attaching to the child, before
   other randoms off the street.

   In the face of multiple engines, this could matter for callback
   order, though that was not the motivating concern.  Where it really
   matters is if you are using UTRACE_ATTACH_EXCLUSIVE with
   UTRACE_ATTACH_MATCH_* as a means of synchronization for your
   engine's own data structures and semantics.  The existing ptrace
   via utrace code does this (with UTRACE_ATTACH_MATCH_OPS) to implement its
   "one ptracer" semantics.  In ptrace, if the tracer of the parent
   uses PTRACE_O_TRACE*, it gets attached to the child and it's
   impossible for a simultaneous PTRACE_ATTACH to get in the way.

   The current code does this only for the very first utrace_attach
   call on the new thread.  So after one engine's report_clone callback
   has used utrace_attach on the new child, asynchronous utrace_attach
   calls by other threads can usurp later engines whose report_clone
   callbacks run second.  This is a stupid rule that can let something
   innocent break ptrace's semantics.  I'm sure I didn't intend it that
   way, but it's sufficient to make ptrace correct when ptrace is the
   only engine (or always the first), so I never noticed.  At least
   this much about the magic will surely change in the coming revamp.

   That's a lot said about a small bit of magic.  But the special
   synchronization among utrace_attach calls is not the key issue.

   The key feature of the report_clone callback is that this is the
   opportunity to do things to the new thread before it starts to run.
   Before this (such as at syscall entry for clone et al), there is no
   child to get hold of.  After this, the child starts running.  At any
   later point (such as at syscall exit from the creating call), the
   new thread will get scheduled and (if nothing else happened) will
   begin running in user mode.  (In the extreme, it could have run and
   died, and the tid you were told could already have been reused for a
   completely unrelated new thread.)  During report_clone, you can
   safely call
   utrace_attach on the new thread and then make it stop/quiesce,
   preventing it from doing anything once it gets started.
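
As a rough sketch of that last point (treat every identifier here as
approximate: the callback signatures, flag names, and return codes
changed between utrace revisions, so this is an illustration rather
than a working engine), a report_clone callback can attach to the new
child and ask it to quiesce before it ever runs:

/* Sketch only: attach to the new child from report_clone and make it
 * quiesce before it runs.  Names/signatures follow the utrace API of
 * roughly this era and are approximate. */
#include <linux/utrace.h>

static const struct utrace_engine_ops my_ops;	/* filled in elsewhere */

static u32 my_report_clone(struct utrace_attached_engine *engine,
			   struct task_struct *parent,
			   unsigned long clone_flags,
			   struct task_struct *child)
{
	struct utrace_attached_engine *child_engine;

	/* The child exists but has not run yet: attach now, before it can. */
	child_engine = utrace_attach(child, UTRACE_ATTACH_CREATE,
				     &my_ops, engine->data);
	if (IS_ERR(child_engine))
		return UTRACE_ACTION_RESUME;

	/* Ask for a quiesce report from the child before it reaches user
	 * mode; the real work then happens in the child's report_quiesce. */
	utrace_set_flags(child, child_engine,
			 UTRACE_EVENT(QUIESCE) | UTRACE_ACTION_QUIESCE);

	return UTRACE_ACTION_RESUME;
}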

Another note on report_clone: this callback is not the place to do much
with the child except to attach it.  If you want to do something with
the child, then attach it, quiesce it, and let the rest of clone finish
in the parent--this releases the child to be scheduled and finish its
initial kernel-side setup.  (If you want the parent to wait, then make
the parent quiesce after your callback.)  Then let the report_quiesce
callback from the child instigate whatever you want to do with the
child.  This is how PTRACE_O_TRACE* works: it attaches the child, gives
it a SIGSTOP (poor man's quiesce), and then the parent lets the child
run while it stops for a ptrace report; meanwhile the child gets
scheduled, processes its SIGSTOP, and stops for the ptrace'd signal.
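
The same attach-then-stop-then-act shape can be seen from userspace
with plain ptrace.  This small demo (added here, using only documented
ptrace(2) options) sets PTRACE_O_TRACEFORK on a tracee and then
observes the forked child arriving already attached and stopped before
it has run anything:

/* Demo of the PTRACE_O_TRACEFORK flow described above: the forked
 * child is auto-attached to the tracer and stops before running. */
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int status;
	pid_t tracee = fork();

	if (tracee == 0) {
		/* Tracee: stop so the tracer can set options, then fork. */
		ptrace(PTRACE_TRACEME, 0, NULL, NULL);
		raise(SIGSTOP);
		pid_t kid = fork();
		if (kid == 0)
			_exit(0);
		waitpid(kid, NULL, 0);
		_exit(0);
	}

	waitpid(tracee, &status, 0);			/* initial SIGSTOP */
	ptrace(PTRACE_SETOPTIONS, tracee, NULL, (void *) PTRACE_O_TRACEFORK);
	ptrace(PTRACE_CONT, tracee, NULL, NULL);

	waitpid(tracee, &status, 0);			/* PTRACE_EVENT_FORK stop */
	unsigned long newpid = 0;
	ptrace(PTRACE_GETEVENTMSG, tracee, NULL, &newpid);

	/* The new child is already attached to us and sitting in a stop
	 * before it has executed anything. */
	waitpid((pid_t) newpid, &status, 0);
	printf("new child %lu stopped before running (sig %d)\n",
	       newpid, WSTOPSIG(status));

	ptrace(PTRACE_DETACH, (pid_t) newpid, NULL, NULL);
	ptrace(PTRACE_CONT, tracee, NULL, NULL);
	waitpid(tracee, &status, 0);			/* tracee exits */
	return 0;
}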

All of that discussion was about the implementation perspective that is
Linux-centric, low-level, and per-thread, considering one thread (task
in Linuxspeak) doing a clone operation that creates another task.  In
common terms, this encompasses two distinct kinds of things: creation
of additional threads within a process (pthread_create et al), and
process creation (fork/vfork).  At the utrace level, i.e. what's
meaningful at low level in Linux, this is distinguished by the
clone_flags parameter to the report_clone callback.  Important bits:

* CLONE_THREAD set
  This is a new thread in the same process; child->tgid == parent->tgid.
* CLONE_THREAD clear
  This child has its own new thread group; child->tgid == child->pid (tid).
  For modern use, this is the marker of "new process" vs "new thread".
* CLONE_VM|CLONE_VFORK both set
  This is a vfork process creation.  The parent won't return to user
  (or syscall exit tracing) until the child dies or execs.
  (Technically CLONE_VFORK can be set without CLONE_VM and it causes
  the same synchronization.)
* CLONE_VM set
  The child shares the address space of the parent.  When set without
  CLONE_THREAD or CLONE_VFORK, this is (ancient, unsupported)
  linuxthreads, or apps doing their own private clone magic (happens).
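
Spelled out in C (just a sketch of the bullets above, not utrace code),
a callback can key on those bits like this:

/* Classify a clone by the flag bits listed above (illustration only). */
#include <linux/sched.h>	/* CLONE_THREAD, CLONE_VFORK, CLONE_VM */

static const char *classify_clone(unsigned long clone_flags)
{
	if (clone_flags & CLONE_THREAD)
		return "new thread in the same process (same tgid)";
	if (clone_flags & CLONE_VFORK)
		return "vfork-style new process (parent waits)";
	if (clone_flags & CLONE_VM)
		return "new process sharing the parent's address space";
	return "ordinary fork-style new process";
}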

For reference, old ptrace calls it "a vfork" if CLONE_VFORK is set,
calls it "a fork" if &CSIGNAL (a mask) == SIGCHLD, and otherwise calls
it "a clone".  With syscalls or normal glibc functions, common values
are:

fork	 	-- just SIGCHLD or SIGCHLD|CLONE_*TID
vfork		-- CLONE_VFORK | CLONE_VM | SIGCHLD
pthread_create	-- CLONE_THREAD | CLONE_SIGHAND | CLONE_VM
		     | CLONE_FS | CLONE_FILES
		     | CLONE_SETTLS | CLONE_PARENT_SETTID
		     | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM

Any different combination is some uncommon funny business.  (There are
more known examples I won't go into here.)  But probably just keying on
CLONE_THREAD is more than half the battle.
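
Spelled out the same way (again just a sketch), old ptrace's naming
rule from above is:

/* Old ptrace's naming of a clone; CSIGNAL masks the exit-signal bits
 * of clone_flags.  Illustration only. */
#include <linux/sched.h>	/* CLONE_VFORK, CSIGNAL */
#include <signal.h>		/* SIGCHLD */

static const char *ptrace_style_name(unsigned long clone_flags)
{
	if (clone_flags & CLONE_VFORK)
		return "vfork";
	if ((clone_flags & CSIGNAL) == SIGCHLD)
		return "fork";
	return "clone";
}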

For building up to the user's natural perspective on things, I like an
organization of a few building blocks.  First, let me describe the idea
of a "tracing group".  (For now, I'll just talk about it as a semantic
abstraction and not get into how something would implement it per se.)
By this I just mean a set of tasks (i.e. threads, in one or more
processes) that you want to treat uniformly, at least in utrace
terms.  That is, "tracing group" is the coarsest determinant of how you
treat a thread having an event of potential interest.  In utrace terms,
all threads in the group have the same event mask, the same ops vector,
and possibly the same engine->data pointer.  In systemtap terms, this
might mean all the threads for which the same probes are active in a
given systemtap session.  The key notion is that the tracing group is
the granularity at which we attach policy (and means of interaction,
i.e. channels to stapio or whatnot).
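
One way to pin the abstraction down (these names are hypothetical;
nothing below is an existing utrace or systemtap identifier) is as a
structure holding exactly the shared pieces just listed:

/* Hypothetical sketch of a "tracing group": just the shared pieces
 * described above.  None of these names exist in utrace or systemtap. */
#include <linux/list.h>

struct tracing_group {
	unsigned long event_mask;		/* same utrace event mask for every member */
	const struct utrace_engine_ops *ops;	/* same ops vector */
	void *data;				/* shared policy/engine data, channel to stapio */
	struct list_head members;		/* threads currently in the group */
};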

In that context, I think of task creation having these components:

1. clone event in the parent

   This is the place to do up to three kinds of things (a rough
   sketch in code follows this list).
   Choices can be driven by the clone_flags and/or by inspecting
   the kernel state of the new thread (which is shared with the parent,
   was copied from the parent, or is freshly initialized).

   a. Decide which groups the new task will belong to.
      i.e., if it qualifies for the group containing the parent,
      utrace_attach it now.  Or, maybe policy says for this clone
      we should spawn a new tracing group with a different policy.

   b. Do some cheap/nonblocking kind of notification and/or data
      structure setup.

   c. Decide if you want to do some heavier-weight tracing on the
      parent, and tell it to quiesce.

2. quiesce event in the parent

   This happens if 1(c) decided it should.  (For the ptrace model, this
   is where it just stays stopped awaiting PTRACE_CONT.)  After the
   revamp, this will not really be different from the syscall-exit
   event, which you might have enabled just now in the clone event
   callback.  If you are interested in the user-level program state of
   the parent that just forked/cloned, the kosher thing is to start
   inspecting it here.  (The child's tid will first be visible in the
   parent's return value register here, for example.)

3. join-group event for the child

   This "event" is an abstract idea, not a separate thing that occurs
   at low level.  The notion is similar to a systemtap "begin" probe.
   The main reason I distinguish this from the clone event and the
   child's start event (below) is to unify this way of organizing
   things with the idea of attaching to an interesting set of processes
   and threads already alive.  i.e., a join-group event happens when
   you start a session that probes a thread, as well as when a thread
   you are already probing creates another thread you choose to start
   probing from birth.

   You can think of this as the place that installs the utrace event
   mask for the thread, though that's intended to be implicit in the
   idea of what a tracing group is.  This is the place where you'd
   install any per-thread kind of tracing setup, which might include hw
   breakpoints/watchpoints.  For the attach case, where the thread was
   not part of an address space already represented in the tracing
   group, this could be the place to insert breakpoints (aka uprobes).

4. "start" event in the child

   This is not a separate low-level event, but just the first event you
   see reported by the child.  If you said you were interested (in the
   clone/join-group event), then this will usually be the quiesce event.
   But note that the child's first event might be death, if it was sent
   a SIGKILL before it had a chance to run.

   This is close to the process.start event in process.stp, but in a
   place just slightly later where it's thoroughly kosher in utrace
   terms.  Here is the first time it's become possible to change the new
   thread's user_regset state.  Everything in the kernel perspective and
   the parent's perspective about the new thread start-up has happened
   (including CLONE_CHILD_SETTID), but the thread has yet to run its
   first user instruction.
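
Tying components 1, 3, and 4 together, the sketch promised above might
look roughly like this (group_for_child, group_attach, and
want_parent_state are hypothetical helpers, and the utrace names are
approximate, as before):

/* Sketch of component 1 (the clone event in the parent) in the
 * grouping model.  group_for_child(), group_attach(), and
 * want_parent_state() are hypothetical helpers; utrace names are
 * approximate. */
static u32 group_report_clone(struct utrace_attached_engine *engine,
			      struct task_struct *parent,
			      unsigned long clone_flags,
			      struct task_struct *child)
{
	struct tracing_group *grp = engine->data;
	struct tracing_group *child_grp;

	/* 1a. Decide which group (if any) the new task belongs to. */
	child_grp = group_for_child(grp, clone_flags, child);

	/* 3. "join-group": attach the child under that group's ops and
	 *    event mask.  Its first report, usually quiesce, is then the
	 *    "start" event (4). */
	if (child_grp)
		group_attach(child_grp, child);

	/* 1b would be cheap bookkeeping/notification here. */

	/* 1c. Optionally quiesce the parent so component 2 can inspect it. */
	if (want_parent_state(grp, clone_flags))
		return UTRACE_ACTION_QUIESCE;
	return UTRACE_ACTION_RESUME;
}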

Now, let's describe the things that make sense to a user in terms of
these building blocks, in the systemtap context.  I'm still just using
this as an abstraction to describe what we do with utrace.  But it's not
just arbitrary.  I think the "grouping" model is a good middle ground
between the fundamental facilities we have to work with and the natural
programmer's view for the user that we want to get to.

Not that it would necessarily be implemented this way, but for purposes
of discussion imagine that we have the tracing group concept above as a
first-class idea in systemtap, and the primitive utrace events as probe
types.  The probe types users see are done in tapsets.  A systemtap
session starts by creating a series of tracing groups.  I think of each
group having a set of event rules (which implies its utrace event mask).
In systemtap, the rules are the probes active on that group.  I'll
describe the rules that would be implicit (i.e. in tapset code, or
whatever) and apply in addition to (before) any script probes on the
same specific low-level events (clone/exec).

When there are any global probes on utrace events, make a group we'll
call {global}.  (Add all live threads.)  Its rules are:
	clone -> child joins the group
(Down the road there may be special utrace support to optimize the
global tracing case over universal attach.)

For a process(PID) probe, make a group we'll call {process PID}.
(Add all threads of live process PID.)  Its rules are:
	clone with CLONE_THREAD -> child joins the group
	clone without CLONE_THREAD -> child leaves the group

Here I take a PID of 123 to refer to the one live process with tgid 123
at the start of the systemtap session, and not any new process that
might come to exist during the session and happen to be assigned tgid 123.

For a process.execname probe, make a group we'll call {execname "foo"}.
Its rules are:
	clone -> child joins the group
	exec -> if execname !matches "foo", leave the group

When there are any process.execname probes, then there is an implicit
global probe on exec.  In effect, {global} also has the rule:
	exec -> if execname matches "foo", join group {execname "foo"}
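
In kernel terms the execname rules above amount to comparing the
task's comm at exec time; here is a sketch using the same hypothetical
group helpers (execname-style matching looks at current->comm, which
is truncated to 15 characters):

/* Sketch of the execname rules above, run from an exec report.
 * group_join()/group_leave() are hypothetical; task->comm is what
 * execname-style matching sees. */
#include <linux/sched.h>
#include <linux/string.h>

static void apply_execname_rules(struct tracing_group *grp,
				 struct task_struct *task,
				 const char *wanted)
{
	if (strcmp(task->comm, wanted) == 0)
		group_join(grp, task);		/* {global}: join on match */
	else
		group_leave(grp, task);		/* {execname}: leave on mismatch */
}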

The probes a user wants to think about might be:

probe process.fork.any       = probe utrace.clone if (!(clone_flags & CLONE_THREAD))
probe process.fork.fork      = probe utrace.clone if (!(clone_flags & CLONE_VM))
probe process.vfork          = probe utrace.clone if (clone_flags & CLONE_VFORK)
probe process.create_thread  = probe utrace.clone if (clone_flags & CLONE_THREAD)
probe process.thread_start
probe process.child_start

The {thread,child}_start probes would be some sort of magic for running
in the report_quiesce callback of the new task after the report_clone
callback in the creator decided we wanted to attach and set up to see
that probe.

Thanks,
Roland


