Bug 3858

Summary: option to share trace buffers between probe modules
Product: systemtap
Reporter: Masami Hiramatsu <masami.hiramatsu.pt>
Component: runtime
Assignee: Unassigned <systemtap>
Status: RESOLVED FIXED
Severity: enhancement
CC: hunt, mhiramat
Priority: P2
Version: unspecified
Target Milestone: ---
Attachments:
- export stp_print* interfaces
- Sharing a buffer among several precompiled scripts
- Sharing a buffer among several scripts
- Testcase for sharing buffer.
- Sharing a buffer among several scripts (take4)
- Sharing a buffer among several scripts (take5)

Description Masami Hiramatsu 2007-01-11 11:35:17 UTC
Currently, the C source code generated by SystemTap includes
all of the runtime source code before it is compiled. I suggest
introducing an independent runtime kernel module to
SystemTap, for the reasons below.

- Output Integration: I would like to integrate the output of a
 script that is added later into the output of the script
 that is already running.
- Memory Efficiency: When several scripts are executed
 simultaneously, several instances of the common runtime
 processing code exist in the kernel. This is not
 memory-efficient.

Thanks,
Comment 1 Frank Ch. Eigler 2007-01-21 23:49:09 UTC
Integration of outputs from multiple systemtap sessions can be construed
as a part of bug #3857 or a user-space post-processor, or even perhaps
from the possible adoption of the UTT code for tracing.

Space savings from code parts of the runtime (logic for arrays, etc.)
are not significant: about 20 kB for a small probe script.  See the
run-time dmesg output of current cvs systemtap.

Is there some other hypothetical benefit?
Comment 2 Masami Hiramatsu 2007-01-22 04:24:31 UTC
(In reply to comment #1)
> Integration of outputs from multiple systemtap sessions can be construed
> as a part of bug #3857 or a user-space post-processor, or even perhaps
> from the possible adoption of the UTT code for tracing.

Sure, the integration itself can be implemented as a part of #3857.

> Space savings from code parts of the runtime (logic for arrays, etc.)
> are not significant: about 20 kB for a small probe script.  See the
> run-time dmesg output of current cvs systemtap.

OK, the code parts are small enough.
As far as I know, there are many dynamically allocated objects (for example,
symbol tables) which can be shared with other scripts.
This feature can also reduce these objects.

> Is there some other hypothetical benefit?

Another benefit is that this feature enables us to merge the runtime
routine into the mainstream kernel. If it is merged, we can reduce
maintenance costs of the code and have consistent interfaces for the
linux kernel.

Thanks,
Comment 3 Frank Ch. Eigler 2007-01-22 19:19:49 UTC
(In reply to comment #2)
> As far as I know, there are many dynamically allocated objects (for example,
> symbol tables) which can be shared with other scripts.
> This feature can also reduce these objects.

One big problem with this is that it is tantamount to putting back the
kallsyms_lookup kernel module export, which is considered verboten by
certain lkml bigwigs.  Heck, the simple module_get_byname() patch was
refused, which could have made some of the current bloat unnecessary.
Luckily, such anti-cooperation from lkml is rare, but on this subject it
is clear.


> > Is there some other hypothetical benefit?
> 
> Another benefit is that this feature enables us to merge the runtime
> routine into the mainstream kernel. If it is merged, we can reduce
> maintenance costs of the code and have consistent interfaces for the
> linux kernel.

Perhaps - though of course it comes with the price of hoping someone
else will do the maintenance; losing control over design / release
schedules.  It would not be a panacea.
Comment 4 Masami Hiramatsu 2007-01-25 12:37:15 UTC
(In reply to comment #2)
> > Integration of outputs from multiple systemtap sessions can be construed
> > as a part of bug #3857 or a user-space post-processor, or even perhaps
> > from the possible adoption of the UTT code for tracing.
> 
> Sure, the integration itself can be implemented as a part of #3857.

Sorry, I now doubt that, for the following reasons.

1) If we want to integrate the outputs of several scripts, those
  scripts have to share the interfaces which write into the shared
  (relay) buffers.
2) The interfaces shared by several modules have to be
  exported.
3) The names of the exported interfaces have to be unique.

Thus, if we integrate the outputs of several scripts, only one of
those scripts can export the unique interfaces.

However, I think there are some problematic cases.
case A) If I'm using 2 scripts and I want to change only the one
  which provides the interfaces, I cannot remove it while
  leaving the other one loaded.
case B) If I want to use 4 scripts and 2 of those scripts have to
  export the interfaces, they will conflict.

In each case, if we separate the transport module (which provides
the interfaces) from the runtime, we can share it smoothly.

I know I might merge the script sources. But sometimes (ex. running
systemtap on customer's servers), I can't do it and have to combine
the pre-compiled scripts. So, I proposed this idea.
If you have other good ideas, please tell me...

Thanks,
Comment 5 Frank Ch. Eigler 2007-01-25 13:44:55 UTC
> I know I might merge the script sources. But sometimes (ex. running
> systemtap on customer's servers), I can't do it and have to combine
> the pre-compiled scripts.

Could you explain why this merging needs to be done in kernel space?
Why not in user-space after the fact?  If the individual scripts emit
timestamps, it would be straightforward.  By merging in kernel space,
we would probably be forcing all systemtap modules to synchronize
while sharing a single set of buffers, slowing them all down.  Can
you describe a scenario where this is worth the cost?
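
A minimal sketch of the user-space merge suggested here, in Python. It assumes (hypothetically) that every script emits lines that start with an integer timestamp followed by a space, and that each script's own output is already in timestamp order. The record format and the names timestamp_of and merge_traces are illustrative assumptions, not part of systemtap.

```python
import heapq


def timestamp_of(line):
    """Return the leading integer timestamp of a trace line (assumed format)."""
    return int(line.split(" ", 1)[0])


def merge_traces(*streams):
    """Merge several timestamp-ordered line streams into one ordered list."""
    return list(heapq.merge(*streams, key=timestamp_of))


# Hypothetical output of two separately running scripts.
foo_output = ["100 foo: open", "250 foo: read", "900 foo: close"]
bar_output = ["120 bar: irq", "260 bar: irq"]

merged = merge_traces(foo_output, bar_output)
for line in merged:
    print(line)
```

This kind of merge needs no coordination between the modules in the kernel, but it only works after the fact and only if every record carries a comparable timestamp.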
Comment 6 Masami Hiramatsu 2007-01-26 13:54:10 UTC
(In reply to comment #5)
> > I know I might merge the script sources. But sometimes (ex. running
> > systemtap on customer's servers), I can't do it and have to combine
> > the pre-compiled scripts.
> 
> Could you explain why this merging needs to be done in kernel space?
> Why not in user-space after the fact?  If the individual scripts emit
> timestamps, it would be straightforward.  By merging in kernel space,
> we would probably be forcing all systemtap modules to synchronize

I think there is no need to apply this feature to all systemtap
modules. I'd like to make this feature selectable.
My idea is that you would specify a -DTRANSPORT_MERGE option when compiling
a script which supports this feature, and give staprun a -O"destination
module" option to merge its output into the destination module's buffer.

> while sharing a single set of buffers, slowing them all down.  Can
> you describe a scenario where this is worth the cost?

OK, I'll try to explain why we need to merge in kernel space instead of
user space.

My scenarios assume the kernel flight recorder (in particular, one using
relayfs buffers).

Assume that you have to use two pre-compiled scripts, foo and bar.
One problem is that we cannot know how frequently each script (each
probe point) will actually be invoked. This means that we cannot know
how much memory each script needs before using it.
So, if you can use only 32MB for kernel tracing, how much memory should
you assign to each script?
One possible solution is to assign 16MB to each. But if foo
and bar consume buffer space at 1MB/s and 100KB/s respectively, you get
data recorded by both scripts for only the latest 16 seconds.

However, if systemtap can merge the outputs of the scripts into a single
buffer, you can assign 32MB to that single buffer and get the data
recorded by both scripts for roughly the last 30 seconds.
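
The arithmetic behind these two figures can be checked with a short Python sketch. It assumes constant write rates, circular (overwrite) buffers, and 1MB = 1024KB; the function name window_seconds is just an illustrative helper.

```python
KB_PER_MB = 1024  # assumption: 1MB = 1024KB


def window_seconds(buffer_kb, rate_kb_per_s):
    """Seconds of history a circular buffer holds at a constant write rate."""
    return buffer_kb / rate_kb_per_s


total_kb = 32 * KB_PER_MB
foo_rate = 1 * KB_PER_MB   # foo writes 1MB/s
bar_rate = 100             # bar writes 100KB/s

# Separate buffers, 16MB each: the period covered by BOTH scripts is
# limited by whichever buffer wraps first.
split_window = min(window_seconds(total_kb / 2, foo_rate),
                   window_seconds(total_kb / 2, bar_rate))

# One shared 32MB buffer: both scripts fill it at their combined rate.
shared_window = window_seconds(total_kb, foo_rate + bar_rate)

print(f"separate buffers: {split_window:.1f} s covered by both scripts")
print(f"shared buffer:    {shared_window:.1f} s covered by both scripts")
```

Under these assumptions the separate buffers keep 16 seconds that both scripts cover, while the shared buffer keeps about 29 seconds, which matches the rough figure of 30 seconds given above.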

Another possible problem occurs when adding pre-compiled scripts without
enough memory.
Assume that you have to maintain a non-stop server by using a systemtap
script, and that this script has been assigned as much memory as allowed.
After the server goes into service, you notice that the script running
on the server doesn't have enough probe points, although almost all of the
probe points you need are covered by the current script.
This server is non-stop, so you must not create a gap in the
recorded data by removing the current script (the system might crash before
the new script is loaded).
Also, you can't add a new script which allocates its own buffer,
because the current script has already allocated as much memory as allowed.

In this case, if systemtap can merge the outputs of the scripts into
a single buffer, you can load an additional script and merge its output
into the buffer of the current script.

I believe this feature is very useful for a RAS tracer.

Thanks,
Comment 7 Frank Ch. Eigler 2007-01-28 15:07:15 UTC
It seems like only a combination of adverse factors all being present at once
would make shared buffering necessary:

- inability to attach (in flight recorder sense) and drain data from the running
scripts
- inability to load a new combined systemtap script due to memory shortage
- buffer space constitutes the majority of memory required by the systemtap
scripts (as opposed to data variables & code)
- compatibility of different scripts' trace data in a single buffer
- acceptability of increased contention slowdowns due to shared buffer

This seems like a rare combination.  Adding to that the challenge of sharing
variables across separately compiled sibling modules suggests that we should
look at another way.

How about closing this bug, given that only I/O buffering remains as a
sharing candidate.  Then, let's consider adding to the runtime a facility
for on-the-fly shrinking/growing of relay buffers, say during a
flight recorder attach/detach operation.  Then, the above scenario could
be handled with a sequence like this:

- recognize that old detached probe needs to be extended, has 16MB buffer
- compile new probe with the union of the old and new probes
- attach to old probe, consuming all data and shrinking its buffers down to
(say) 4MB
- start new probe, with small initial trace buffer, (say) 4MB
- stop old probe
- attach to new probe, extending its trace buffer back to 16 or even 32MB.

If this is a plausible solution to the administrative problem above, then
let's close this bug and add such work to the "systemtap control library"
or "flight recorder" bugs instead.
Comment 8 Masami Hiramatsu 2007-01-29 11:35:48 UTC
Hi Frank,

(In reply to comment #7)
> It seems like only a combination of adverse factors all being present at once
> would make shared buffering necessary:
> 
> - inability to attach (in flight recorder sense) and drain data from the running
> scripts
> - inability to load a new combined systemtap script due to memory shortage
> - buffer space constitutes the majority of memory required by the systemtap
> scripts (as opposed to data variables & code)
> - compatibility of different scripts' trace data in a single buffer
> - acceptability of increased contention slowdowns due to shared buffer

- inability to combine several pre-compiled scripts on the customer's server.

I am more concerned about the end users of SystemTap, especially
system (maintenance) engineers.
In many cases, IMHO, the gcc/kernel-debuginfo is not installed on the customer's
servers because of security and diskspace. So, we can't compile the scripts
on those servers. However, system engineers want to (or, have to) use the
systemtap to retrieve precise information on those servers.

> This seems like a rare combination.  Adding to that the challenge of sharing
> variables across separately compiled sibling modules suggests that we should
> look at another way.

I'd like to share only buffers, not variables.
I think the sharing buffer interfaces will not increase contention
slowdown. What would you think about this?

> How about closing this bug, given that only I/O buffering remains as a
> sharing candidate.

I think we might as well focus on the "I/O buffer sharing by pre-compiled
modules" issue. The title of this bug is very confusing, so I suggest that
we make a new entry to discuss this issue.

>  Then, let's consider adding to the runtime a facility
> for on-the-fly shrinking/growing of relay buffers, say during a
> flight recorder attach/detach operation.

On-the-fly shrinking/growing of buffers is a good idea.
Unfortunately, it is not enough for me.

> - compile new probe with the union of the old and new probes

This step can't be executed on a server that has no
development packages installed.

Thanks,

Comment 9 Frank Ch. Eigler 2007-02-15 15:26:41 UTC
(In reply to comment #8)
> In many cases, IMHO, the gcc/kernel-debuginfo is not installed on the customer's
> servers because of security and diskspace. So, we can't compile the scripts
> on those servers.

This situation is already addressed to some extent with our cross-compilation
capabilities.  (See the relevant war story wiki page for an example.)

> I'd like to share only buffers, not variables.
> I think the sharing buffer interfaces will not increase contention
> slowdown. What would you think about this?

Unless I am mistaken, sharing buffers by nature increases contention.
Concurrently executing probes would have to use some mutual exclusion
to write into the same buffer.  Plus they would probably have to include
some additional information with every record to identify the script
that produced it.

> I think we might as well focus on the "I/O buffer sharing by pre-compiled
> modules" issue. The title of this bug is very confusing, so I suggest that
> we make a new entry to discuss this issue.

Perhaps we can lump it in with the flight recorder functionality.
Comment 10 Masami Hiramatsu 2007-02-16 13:06:40 UTC
(In reply to comment #9)
> > In many cases, IMHO, the gcc/kernel-debuginfo is not installed on the customer's
> > servers because of security and diskspace. So, we can't compile the scripts
> > on those servers.
> 
> This situation is already addressed to some extent with our cross-compilation
> capabilities.  (See the relevant war story wiki page for an example.)

Sure. Thank you for this useful feature.
Unfortunately, even though stap has cross-compilation capabilities,
our customers might not allow us to install cross-compiled
scripts from a laptop. I worry about this situation.

> > I'd like to share only buffers, not variables.
> > I think the sharing buffer interfaces will not increase contention
> > slowdown. What would you think about this?
> 
> Unless I am mistaken, sharing buffers by nature increases contention.
> Concurrently executing probes would have to use some mutual exclusion
> to write into the same buffer.

As far as I know, the systemtap's runtime has small per-cpu buffers for 
buffering output before writing it to relay sub-buffers. There is no
mutual exclusion. You can check it at runtime/print.c.

>  Plus they would probably have to include
> some additional information with every record to identify the script
> that produced it.

Hmm, I just need this feature for integrating trace data which is
recorded in a common format, for instance LKET. In this case, I think
we don't need to identify which script recorded each entry.

As an example, I'll attach a patch which implements the minimum requirements
of this feature. Please read it.

> > I think we might as well focus on the "I/O buffer sharing by pre-compiled
> > modules" issue. The title of this bug is very confusing, so I suggest that
> > we make a new entry to discuss this issue.
> 
> Perhaps we can lump it in with the flight recorder functionality.

I'm not sure. How would you do it?

Comment 11 Masami Hiramatsu 2007-02-16 13:09:08 UTC
Created attachment 1559 [details]
export stp_print* interfaces

 This patch adds the relay channel sharing feature.
 With this patch, you can share one relay channel among several 
 pre-compiled scripts.

 To run a host script which provides the relay channel interfaces:
   stap -b -DRELAY_HOST host.stp

 To add guest scripts:
   stap -DRELAY_GUEST guest1.stp
   stap -DRELAY_GUEST guest2.stp

 This patch is just a minimum implementation. So you can NOT run 
 several host scripts concurrently.
Comment 12 Martin Hunt 2007-02-16 15:54:37 UTC
Please see my comments in http://sourceware.org/ml/systemtap/2007-q1/msg00360.html

I am trying to think of how to provide integrated buffers as an option.
Implementing them is not hard, but it requires a huge change in how systemtap
works, will have less performance, and may not be acceptable to everyone.
Comment 13 Masami Hiramatsu 2007-05-17 07:40:50 UTC
Created attachment 1848 [details]
Sharing a buffer among several precompiled scripts


This patch adds the relay channel sharing feature against systemtap-20070512.
With this patch, you can share one relay channel among several pre-compiled
scripts.

 To run a host script which provides the relay channel interfaces:
   stap -b -DRELAY_HOST host.stp

 To add guest scripts:
   stap -DRELAY_GUEST guest1.stp
   stap -DRELAY_GUEST guest2.stp

This patch is just a minimum implementation. So you can NOT run several host
scripts concurrently.

I'd like to support custom applications which consist of a group of
pre-compiled scripts and share one huge buffer among them.
I think this kind of custom application is useful for recording detailed events
on servers which have neither debuginfo nor a compilation environment of
their own.
Comment 14 Masami Hiramatsu 2007-10-17 21:45:36 UTC
Created attachment 2050 [details]
Sharing a buffer among several scripts

I updated my patch and wrote a testcase for this feature.

This patch adds '-DRELAY_HOST' and '-DRELAY_GUEST' options to systemtap.
With this patch, you can share relay channels among several scripts,
and additionally you can give the channels names. Thus, you can
run several host scripts concurrently.

For example:
To run host scripts which provide the relay channel interfaces:
 stap host1.stp -DRELAY_HOST=chan1
 stap host2.stp -DRELAY_HOST=chan2

To add guest scripts to both channels:
 stap guestA.stp -DRELAY_GUEST=chan1
 stap guestB.stp -DRELAY_GUEST=chan2

Then you can see that the output of guestA.stp is printed on host1.stp's console,
and the output of guestB.stp is printed on host2.stp's console.
Comment 15 Masami Hiramatsu 2007-10-17 21:54:15 UTC
Created attachment 2051 [details]
Testcase for sharing buffer.

Hi,

This patch adds a testcase which tests whether the buffer sharing feature works
correctly or not.
This test checks:
- host script loading
- guest script loading and outputting
- warning message output
- failure of guest script loading if it specifies a nonexistent channel name.

Thanks,
Comment 16 Frank Ch. Eigler 2007-10-17 23:02:47 UTC
Interesting solution (using module-exported functions from the host module)!
Martin, if you have no objection, let's go ahead.

One additional consideration for testing/checking: this now makes it theoretically
possible for distinct probes to call into one particular probe module's runtime
at the same time.  Even though this is limited to the I/O functions, can you
cover whether any additional mutual exclusion may be necessary?  Even on per-cpu
buffers, we have situations where probes don't run atomically any more ("begin" 
probes that don't block interrupts, only preemption).  So if the host does a

 probe begin { while(i++<1000) { log ("lots of stuff") } }

at the same time the guest does

 probe timer.profile { log ("some stuff") }

then could we have a problem?
Comment 17 Martin Hunt 2007-10-18 14:06:45 UTC
(In reply to comment #16)
> Interesting solution (using module-exported functions from the host module)!
> Martin, if you have no objection, let's go ahead.

We have discussed this approach before and I am fine with it.

> One additional consideration for testing/checking: this now makes it theoretically
> possible for distinct probes to call into one particular probe module's runtime
> at the same time.  Even though this is limited to the I/O functions, can you
> cover whether any additional mutual exclusion may be necessary?  Even on per-cpu
> buffers, we have situations where probes don't run atomically any more ("begin" 
> probes that don't block interrupts, only preemption).  So if the host does a
> 
>  probe begin { while(i++<1000) { log ("lots of stuff") } }
> 
> at the same time the guest does
> 
>  probe timer.profile { log ("some stuff") }
> 
> then could we have a problem?

Ouch.  As soon as I read that I realized we currently have a problem with
begin/end probes.  See http://sourceware.org/bugzilla/show_bug.cgi?id=5194

For the buffer sharing, I think the problem is simply that output will get mixed
together in such situations, which is not unexpected in a shared buffer situation.  
Comment 18 Masami Hiramatsu 2007-10-18 14:57:56 UTC
(In reply to comment #17)
> For the buffer sharing, I think the problem is simply that output will get mixed
> together in such situations, which is not unexpected in a shared buffer situation.

Yes, I think so, and the current code doesn't handle it yet.
Output from begin and end probes might be mixed up with other outputs. So
for now, users have to take care of it themselves. For example, when using these
options, scripts should not produce output from begin/end probes.

I think that possible solutions are:
- have begin/end probes disable interrupts, only when these options are used.
- have the output routines invoked from those probes disable interrupts.
- give each module its own local output buffer, and pass it to the host script's API.
Comment 19 Frank Ch. Eigler 2007-10-18 18:13:20 UTC
(In reply to comment #18)
> Output from begin and end probes might be mixed up with other outputs.

If this is really the worst thing that can happen, and it takes an unusual set of
scripts to trigger it, it may not be worth fixing.  If on the other hand, the
runtime might corrupt its data structure due to this sort of reentrancy, then yes,
it needs to be fixed.
Comment 20 Masami Hiramatsu 2007-10-22 19:48:48 UTC
(In reply to comment #19)
> > Output from begin and end probes might be mixed up with other outputs.
> 
> If this is really the worst thing that can happen, and it takes an unusual set of
> scripts to trigger it, it may not be worth fixing.  If on the other hand, the
> runtime might corrupt its data structure due to this sort of reentrancy, then yes,
> it needs to be fixed.

I examined that point more closely and found several cases in which
the runtime could corrupt its data structures if it were interrupted.
Since operations on relay channels (especially relay_reserve) are not
atomic, we have to disable interrupts while using a relay channel.

I'm working on a patch which fixes this problem.
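
To illustrate the kind of corruption meant here, the following is a toy model in Python, not the kernel relay code. It models a reserve step that reads the current offset and stores the new offset as two separate steps, with an "interrupt" (a second probe) arriving in between. The class ToyChannel and all of its methods are hypothetical; real relay channels differ in detail.

```python
class ToyChannel:
    """Toy buffer whose reserve step is deliberately non-atomic."""

    def __init__(self):
        self.offset = 0          # next free byte
        self.irqs_enabled = True
        self.pending_irq = None  # callable simulating an interrupting probe

    def _maybe_interrupt(self):
        # An interrupt can only be delivered while irqs are enabled.
        if self.irqs_enabled and self.pending_irq is not None:
            handler, self.pending_irq = self.pending_irq, None
            handler()

    def reserve(self, length):
        start = self.offset             # step 1: read the offset
        self._maybe_interrupt()         # an interrupt may arrive here
        self.offset = start + length    # step 2: publish the new offset
        return (start, start + length)

    def reserve_irqs_off(self, length):
        saved, self.irqs_enabled = self.irqs_enabled, False
        try:
            return self.reserve(length)
        finally:
            self.irqs_enabled = saved
            self._maybe_interrupt()     # deferred interrupt runs afterwards


def run(use_irqs_off):
    chan = ToyChannel()
    regions = []
    # The "guest" probe fires as an interrupt during the host's reserve.
    chan.pending_irq = lambda: regions.append(("guest", chan.reserve(8)))
    reserve = chan.reserve_irqs_off if use_irqs_off else chan.reserve
    regions.append(("host", reserve(16)))
    return regions


def overlaps(a, b):
    """True if two half-open byte ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


unsafe = dict(run(use_irqs_off=False))
safe = dict(run(use_irqs_off=True))
print("unsafe:", unsafe, "overlap:", overlaps(unsafe["host"], unsafe["guest"]))
print("safe:  ", safe, "overlap:", overlaps(safe["host"], safe["guest"]))
```

In the toy model the interrupted reserve hands the same bytes to both writers, while the variant that defers the interrupt until the reserve has finished gives each writer its own region.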
Comment 21 Masami Hiramatsu 2007-10-24 22:04:06 UTC
Created attachment 2058 [details]
Sharing a buffer among several scripts (take4)

I updated my patch.

- I reduced the number of exported symbols. Now only stp_print_flush is
exported from the host script.
- This patch disables irqs only for a short period of time, to write data
into the relayfs buffer safely. So each guest now owns its print buffer (Stp_pbuf).
- This patch also decreases the size of the guest scripts' relay buffer to 128KB,
because guest scripts use their relay buffer only for urgent messages.

Other functions and usage are the same as in the previous patch, so you can test it
with the previous testcase.

Thanks,
Comment 22 Masami Hiramatsu 2007-10-30 19:48:19 UTC
Created attachment 2069 [details]
Sharing a buffer among several scripts (take5)

(In reply to comment #21)
> - This patch disables irqs only for a short period of time for writing data
> into relayfs buffer safely. So, now each guest owns its print buffer(Stp_pbuf).

Since the previous patch (take4) didn't disable irqs when RELAY_HOST was defined,
there was a chance of corrupting a data structure.

This patch (take5) disables irqs when either RELAY_GUEST or RELAY_HOST is defined.

Thanks,
Comment 23 Masami Hiramatsu 2007-11-09 18:47:33 UTC
The patches were committed.