| Summary: | binary tracing | | |
|---|---|---|---|
| Product: | systemtap | Reporter: | Frank Ch. Eigler <fche> |
| Component: | translator | Assignee: | Martin Hunt <hunt> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | guanglei, guij, hiramatu, hunt, jkenisto, jrs, prasadav, zanussi |
| Priority: | P2 | | |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Last reconfirmed: | 2006-02-20 03:00:04 |

Attachments:
- Binary Tracing runtime interface and example script; stpd patch for merge
- updated binary tracing prototype (runtime interface and example script; patch for merge)
- binary tracing prototype code modified to remove hard-coded packet header
- add bin_*printf() interface for comparison
- prototype generic binary tracer using generic binary tracing interface
- somehow bin_write.c didn't get into the prototype patch - here's a new patch
Description
Frank Ch. Eigler
2006-02-09 03:55:22 UTC
The following are my thoughts about the binary trace interface. In summary, I prefer Martin Hunt's ideas. The idea Tom proposed would bring programming convenience and a performance improvement, but it has some problems:

1. It requires introducing structs into the systemtap script language.

2. It is inefficient when several fields of a struct target are referenced in different assignments, e.g.:

   trace_scsi.hostno = $cmd->device->host->host_no
   trace_scsi.channel = $cmd->device->channel
   trace_scsi.lun = $cmd->device->lun

   Each reference to $target generates a series of function calls (function__tvar_get_cmd, fetch_register, deref, etc.). This is why I don't use printf directly in LKET in most cases, but use _stp_printf() in embedded C code.

3. Another issue is what happens if some field of a struct is a string, e.g.:

   trace_process_fork.processname = str_processname

   How should systemtap generate the corresponding struct? What is the size of the string field of that struct?

So after thinking about this topic, I prefer what Martin Hunt proposed (http://sources.redhat.com/ml/systemtap/2006-q1/msg00284.html). It looks like a more generic interface, and it allows a mixture of data types (binary & string). We could implement it in the runtime library, naming it _stp_bin_printf(const char *fmt, ...). We could introduce a new function, bin_vsnprintf, based on the existing vsnprintf to support _stp_bin_printf, which would directly store the integer value instead of converting the integer into an ASCII string. Of course we should modify stpd to let it support binary data transfer, but I think BTI has already done this, so maybe only a slight change is required.

Jose also implemented a feature which uses /proc to control whether the probe handler should start/stop printing data. So we could also introduce _lket_printf(const char *fmt, ...) using bin_vsnprintf, which checks /proc to see whether it should print data now and also does some common work like log_tracedata_common().
But I think start/stop tracing via /proc is useful and should be incorporated into systemtap. With _stp_bin_printf/bin_vsnprintf/_lket_printf in place, we can introduce wrapper functions such as lket_trace(). One idea is to introduce lket_print into the translator, just like print, so that we could use lket_print("%d%d%s%d", int1, int2, str1, int3) in a script directly. I am not sure if I missed some points.

(In reply to comment #1)
> The following are my thoughts about the binary trace interface. In summary, I
> prefer Martin Hunt's ideas.

Basically, I agree with his ideas too. I suggest introducing a compressed format when consecutive arguments have the same type. For example:

   printb("%d%d%d%d", long1, long2, long3, long4);

This expression would be compressed as below:

   printb("%d4", long1, long2, long3, long4);

This compressed format is simple and reduces the amount of data transported. By the way, I also hope the name of the interface is as neutral as printf ;-).

> printb("%d4", long1, long2, long3, long4);
>
> This compressed format is simple and reduces the amount of data transported.

In fact, this is almost the same as what I did in LKET (log_tracedata_common_compact & log_tracedata_common_compact_more in logtrace.stp). It should be useful for someone, although maybe not very useful for LKET.

> By the way, I also hope the name of the interface is as neutral as printf ;-).

How about _stp_bin_printf/bin_vsnprintf, corresponding to the current _stp_printf/vsnprintf? Any good suggestions for their names? _lket_printf is intended for LKET only.

Before we go deciding on a name, we first need to decide on the general concept of the interface that's going to be implemented. I also like Martin's method for implementing binary tracing, since I find it very natural. Since we have a couple of people voting for this interface, is Martin's method the way to go?

It seems to me that an approach along these lines will not satisfy optimized tracing along the lines of IBM's original request (re. struct support in the language), nor permit much or any translator assistance to provide metadata to a binary decoder. I consider it a technically weak solution.

On the other hand, just extending printf with binary formatting directives should be easy, and should not cause adverse interference with other future designs. So, if people want to experiment with this part, go ahead. Both proposals require some changes to the transport interface to improve efficiency.

It will also improve the performance of printf to have our own optimized copy of vsnprintf() instead of using the kernel version. Adding binary support to that will be trivial. So I can add binary support to printf, and if we change our minds later and want to try the other way, we haven't really wasted any time. I'll get started on it.

Martin, we are trying to improve LKET based on the new binary trace interface, so implementing such an interface currently has first priority for us. I also did some investigation of it, e.g. the binary version of vsnprintf and the enhancement of stpd to support binary & ascii packets (mentioned by gBTI). So if you like, I can help with some of the work on this new binary trace interface.

Created attachment 899 [details]
Binary Tracing runtime interface and example script; stpd patch for merge
I've made a prototype of binary tracing (binary_trace.tar.bz2) based on the discussion on the mailing list, especially originating from Hitachi's gBTI and Martin's idea. It stores integers in binary format, handles both integer and string arguments, and uses the relayfs buffer directly. As a starting point, it is currently only tested on 2.6.14 (x86) and Redhat 2.6.9-30EL (ppc64). Your comments and suggestions are very welcome.

How to use:
--------------------------------------------------------------------
1) patch the systemtap-20060225 src with the stpd patch:
   patch -p1 <../patch-binary_trace
2) re-install systemtap
3) run the script: stap -g -b bin_trace.stp

Here is the detailed design:
--------------------------------------------------------------------
1. runtime function

A runtime function bin_strace() is provided to write a binary packet into the relayfs buffer directly:

   void bin_strace(String Str, const char *fmt, ...);

Its usage is similar to _stp_sprintf(), e.g.

   bin_strace(_stp_stdout, "Traceformat: %1b|%2b|%d|%s|", 0xAA, 0x1122, 0x55667788, "hello");

Its main features:
1) it supports "%b", "%s", "%d" and can print integers and strings
2) it stores integers in binary format, not in ascii
3) it writes trace data into the relayfs buffer directly, while in the current ATI, the trace data are formatted into the stp print buffer first and then transported to the relayfs buffer.

The binary packet design is an extension of the gBTI packet. It looks like:

   [32bit seq][8bit flag][8bit size][args ...]

   seq: shared with ATI packets by using _stp_seq_inc(); it is used to merge per-cpu outputs
   flag: always 0 for a binary packet; it is used to distinguish between binary and ascii packets
   size: total length (bytes) of the actual trace data content; it is calculated and filled in by bin_strace()
   args: integers (1, 2 or 4 bytes) or strings (the string length is in the first byte for post-processing; the ending '\0' char is removed)

2. stpd enhancement

stpd is enhanced to recognize the binary packet header and merge the binary packets as well as ATI packets. The header info (seq, flag, size) is not kept in "probe.out".

3. wrapper functions

Different wrapper functions can be defined on top of this runtime interface for different uses. Take LKET as an example: we can define special wrapper functions for each event hook, e.g. lket_syscall_trace(). They all use lket_trace() to output common data.

Known bug:
--------------------------------------------------------------------
When bin_strace() and _stp_sprintf() are used in the same handler, the order of the generated trace data may differ from the coding order, e.g.:

   probe handler { log(something1); lket_trace(something2); }

Here, something2 is stored before something1. The reason is that the content of log() is not transported to relayfs until _stp_print_flush() is called at the end of the handler, while bin_strace() stores data into the relayfs buffer directly.

Thanks in advance for your comments.

Adding interested parties to the CC list.

Created attachment 907 [details]
updated binary tracing prototype (runtime interface and example script; patch for merge)

fix a few bugs in the last one and test it on ppc64/2.6.15.4 and i686/2.6.15.4 with systemtap-20060304;
use bin_write instead of bin_trace as the interface name;
integrate the temporary solution for Bug 2387 (systemtap crash on ppc64/2.6.15.4)

(In reply to comment #11)
> Created an attachment (id=907)
> updated binary tracing prototype (runtime interface and example script; patch
> for merge)
>
> fix a few bugs in the last one and test it on ppc64/2.6.15.4 and i686/2.6.15.4
> with systemtap-20060304;
> use bin_write instead of bin_trace as the interface name;
> integrate the temporary solution for Bug 2387 (systemtap crash on ppc64/2.6.15.4)

Just a couple comments...
It would be nice if this could be made to fit into a slightly larger framework which would make it in essence _a_ systemtap trace format and API rather than _the_ systemtap trace format and API. One approach...

On the runtime side, we could add _stp_reserve(), a transport-independent function to reserve event space in the destination buffer. Using this would make your tracing code slightly less efficient because it would have to do the vararg printing into a temporary buffer first, and then memcpy it into the reserved space, but it probably shouldn't matter too much in comparison with the overhead of the bin_*write functions, and it would make the interface cleaner, more generic, and available to other tracing formats/APIs. You'd also then have access to all transports, not just relayfs. Of course, if you knew ahead of time how large your data would be, you'd avoid the memcpy and could write directly into the reserved space.

The bin_*write functions seem generic enough for any application that wants to do this kind of tracing, but maybe it would be clearer to have them called bin_*printf(). The actual trace struct and wrapper functions, e.g. stp_packet and lket_*, it seems to me should be part of an lket tapset.

The other half of this would be the user side, specifically stpd/librelay. One possibility would be to further modularize librelay and provide a hook for format-specific 'handlers' to supply functions or 'plugins' that would handle the trace data in a format-specific way. This would also be useful for handling different format versions of the same application, since formats are likely to change over time. The current format-specific code in librelay would just become another handler, specifically for the current ascii scheme, I assume.

In summary, I think there needs to be enough flexibility to allow for any number of trace formats and APIs, without locking in or giving preference to a particular one.
If the above vague API ideas make sense, I can hash them out further, but I don't think it needs to be much more than a thin layer: basically stp_reserve() in the runtime, and some generic app-specific hook for data handling in librelay.

(In reply to comment #12)
> On the runtime side, we could add _stp_reserve(), a transport-independent
> function to reserve event space in the destination buffer. Using this would
> make your tracing code slightly less efficient because it would have to do the
> vararg printing into a temporary buffer first, and then memcpy it to the
> reserved space

Thanks for your suggestion. I'll do this and compare their overhead. Before this prototype, I tried another design that modified vsnprintf() to support "%b" and kept everything else untouched, but I finally gave it up for performance reasons.

> The bin_*write functions seem generic enough for any application that wants to
> do this kind of tracing, but maybe it would be more clear to have them called
> bin_*printf().

I use bin_*write because it is not only a binary version of _stp_printf but also writes to the relayfs buffer directly. But you are right, bin_*printf() may be clearer.

> The actual trace struct and wrapper functions e.g. stp_packet and lket_* it
> seems to me should be part of an lket tapset.

lket_* is part of LKET, while the struct stp_packet_h is not. stp_packet_h is similar to gBTI and should be necessary for each binary packet, whatever the trace format. It is mainly used by stpd to merge per-cpu files into "probe.out" at the end of tracing. Among its three fields, seq is shared with the current ascii packet, flag is used to differentiate between binary packets and ascii ones, and size is used to locate the end of the current binary packet.

(In reply to comment #13)
> (In reply to comment #12)
> > On the runtime side, we could add _stp_reserve(), a transport-independent
> > function to reserve event space in the destination buffer.
> > Using this would
> > make your tracing code slightly less efficient because it would have to do the
> > vararg printing into a temporary buffer first, and then memcpy it to the
> > reserved space
>
> Thanks for your suggestion. I'll do this and compare their overhead. Before this
> prototype, I tried another design that modified vsnprintf() to support "%b" and
> kept everything else untouched, but I finally gave it up for performance reasons.
>
> > The bin_*write functions seem generic enough for any application that wants to
> > do this kind of tracing, but maybe it would be more clear to have them called
> > bin_*printf().
>
> I use bin_*write because it is not only a binary version of _stp_printf but also
> writes to the relayfs buffer directly. But you are right, bin_*printf() may be
> clearer.
>
> > The actual trace struct and wrapper functions e.g. stp_packet and lket_* it
> > seems to me should be part of an lket tapset.
>
> lket_* is part of LKET, while the struct stp_packet_h is not. stp_packet_h is
> similar to gBTI and should be necessary for each binary packet, whatever
> the trace format. It is mainly used by stpd to merge per-cpu files into
> "probe.out" at the end of tracing. Among its three fields, seq is shared with
> the current ascii packet, flag is used to differentiate between binary packets and
> ascii ones, and size is used to locate the end of the current binary packet.

I understand, but what if some other application, or even lket 6 months down the road, wants to use a different sequencing method, or doesn't need to stream alongside ascii, or wants to be able to log events larger than 256 bytes? Why force them to use this event definition when it's not necessary?

I was also suggesting that the merging code in stpd be removed from librelay and moved into a separate handler, with a suitable API to allow any application to supply merging methods based on its own sequencing method, or something like that.

It seems to me that the actual trace data can be completely opaque except to the specific trace application, and probably all that should be necessary is the information needed to distinguish between applications and, within an application, between trace format versions. How about, say, a u32 with the high 16 bits identifying the application and the low 16 for use by the application, e.g. to identify particular trace format versions?

(In reply to comment #14)
> I understand, but what if some other application, or even lket 6 months down the
> road, wants to use a different sequencing method, or doesn't need to stream
> alongside ascii, or wants to be able to log events larger than 256 bytes? Why
> force them to use this event definition when it's not necessary?

Yes, the binary packet header would be unnecessary if the application can provide a separate handler to do the merge itself; in that case all bin_*write() needs to do is write the incoming arguments to the buffer as they are. But for relayfs, I still think it should be maintained by bin_*write() and not by the application, otherwise each application has to explicitly access it and pass it as an argument.

> I was also suggesting that the merging code in stpd be removed from librelay and
> moved into a separate handler, with a suitable API to allow any application to
> supply merging methods based on its own sequencing method, or something like that.

I agree. But I think the current merge method (by seq-id) could be kept as the default choice for those users who don't want to take the trouble to rewrite the merge part.

> It seems to me that the actual trace data can be completely opaque except to the
> specific trace application, and probably all that should be necessary is the
> information needed to distinguish between applications and, within an application,
> between trace format versions. How about, say, a u32 with the high 16 bits
> identifying the application and the low 16 for use by the application, e.g. to
> identify particular trace format versions?
It looks good. If stpd has different built-in merge methods, this will help to choose the proper one. But if the merge handler is provided by the application, this should be maintained by the application itself.

(In reply to comment #11)
> Created an attachment (id=907)
> updated binary tracing prototype (runtime interface and example script; patch
> for merge)

It seems to trip over the same issue as gBTI. I think the binary transport interface should support a way to output the contents of binary packets to the console, for usability. Why don't you pass stpd the format string to tell it how to display the data?

And also, the BTI should be supported by stap as an embedded function (like printf()). Those embedded functions are handled as reserved interfaces by stap's parser (see parser::parse_symbol() in parser.cxx). Thus, the BTI will be able to handle variable arguments naturally.

(In reply to comment #16)
> It seems to trip over the same issue as gBTI.

Thanks for the ideas from gBTI.

> I think the binary transport interface should support a way to output the
> contents of binary packets to the console, for usability.
> Why don't you pass stpd the format string to tell it how to display the data?

I think the conversion of binary to console output is important, but it is better implemented by tapset providers than in stpd. As Tom suggested above, the application can provide handlers for merging; it's also possible for the application to provide handlers for conversion. The job of the binary transport interface and stpd is to provide a generic and efficient infrastructure. Different events require different trace formats. If we let stpd handle the details of a trace line, it may cause unnecessary overhead for some applications or add uncomfortable limits for others.

> And also, the BTI should be supported by stap as an embedded function (like
> printf()). Those embedded functions are handled as reserved interfaces by stap's
> parser (see parser::parse_symbol() in parser.cxx). Thus, the BTI will be able to
> handle variable arguments naturally.

I agree.

This talk about other "applications" and APIs needs to be made more specific. Do you imagine something linked into stpd? Or stpd just dumping the raw data to disk for some other program to parse after the fact? I don't expect to have more than a handful of tracing interfaces on the scripting side, and I don't expect 256 or more different sequencing or wrapping schemes.

If the transport code in the runtime will provide a fixed-size buffer reservation function, this will enable the higher-performance, compiled style of tracing discussed earlier. As long as that scheme can coexist with the plain ascii trace stream, and with this intermediate one, I will be satisfied.

(In reply to comment #18)
> This talk about other "applications" and APIs needs to be made more specific.
> Do you imagine something linked into stpd? Or stpd just dumping the raw data to
> disk for some other program to parse after the fact? I don't expect to have
> more than a handful of tracing interfaces on the scripting side, and I don't
> expect 256 or more different sequencing or wrapping schemes.

The way I see this working is stpd just dumping the raw data and nothing else. Then we can use other applications to analyse and visualize the data.

(In reply to comment #18)
> This talk about other "applications" and APIs needs to be made more specific.
> Do you imagine something linked into stpd? Or stpd just dumping the raw data to
> disk for some other program to parse after the fact? I don't expect to have
> more than a handful of tracing interfaces on the scripting side, and I don't
> expect 256 or more different sequencing or wrapping schemes.
>
> If the transport code in the runtime will provide a fixed-size buffer
> reservation function, this will enable the higher-performance, compiled style of
> tracing discussed earlier. As long as that scheme can coexist with the plain
> ascii trace stream, and with this intermediate one, I will be satisfied.

I've always thought of stpd as just one instance of many possible systemtap applications; in fact stpd.c is really just a front-end that provides command-line parsing, with the real work being done in librelay.c. So the user side of systemtap is already structured on the assumption that there will be other applications; there should be nothing to stop someone from writing his own command by cloning stpd.c and creating a new application, or even a new system command, built on top of systemtap.

So it should be possible for me to write a special-purpose standalone trace utility that, for example, doesn't require any sequence ids at all (also, what about the case of tracing on a UP system? No need for sequence ids there either), doesn't use any ascii data, so no ascii/binary flag is needed to distinguish between them, and requires event sizes larger than 256 bytes. Or maybe my application only has 2 events and each of those is never more than 128 bytes, so I can stick both my event id and event size into a single byte. So if I'm forced into using the suggested trace API, I'm stuck with writing a packet header I'll never use for each event. All I'm asking for is the ability to define my own trace format for my own 'application'. I think the bin_printf() addition to the language should work fine in all cases, but just don't force a mandatory header to be part of it. The 'application' can decide exactly what goes into the trace event from byte 0 to the end of the event.
One thing that probably needs to be done to preserve the current functionality is to make incrementing and retrieving the sequence counter a tapset function like get_cycles(), available for those who want to use it when applicable.

As far as implementation goes, tracing applications could be implemented as standalone commands, e.g. an 'lket' command, or stpd could be enhanced to handle different trace formats depending on the 'application'. stpd could handle a default case of, say, the get_sequence_id() sequence id in position 0 as it does now, basically defining its own format which it understands how to decipher just as any other application would, and merging the per-cpu files into a single file (some applications might not want to merge per-cpu output files at all, but rather provide a database-like API to efficiently retrieve data from the per-cpu files; imagine huge trace files). But if we're thinking about handling other trace formats in stpd, why not just have them as separate applications in the first place? In most cases they'd be dedicated to handling only the data output from their bin_printf() calls and won't need to deal with a commingled ascii/binary data stream as stpd does, i.e. they're basically single-minded tracing applications that don't need to deal with the generic ascii output that stpd does. So IMHO, the work that needs to be done is properly modularizing librelay.c, along with some small changes to the runtime, to make it easy to create new output-specific applications; or, if stpd is meant to be the only systemtap application, modifying it to accommodate multiple trace applications.

I don't mean to modify stpd greatly. I just want stpd to provide some capability to allow new packet formats to be easily added alongside the current plain ascii packets.

When stpd merges per-cpu files, it has to know which packet comes first and where the end of the current packet is. The former is decided by the sequence id. For an ascii packet, the latter is easily decided by the ending null terminator. But things are different for a binary packet, for there are '0's everywhere. Thus the size should be passed to stpd by some method. That's why 'size' is introduced in the packet header. If such a header is rejected, stpd can also use some API like get_packet_length() to get this. get_sequence_id() proposed above is also a good API candidate, but I prefer implementing the plain sequencing scheme as mandatory in bin_*write().

These APIs are implemented in the "applications". I consider stpd an "application", but mainly the part implemented by the tapset provider or the user. I don't want to replace stpd or modify it greatly. If necessary, they can define packet-format-specific APIs and thus change the default behavior of stpd. They can also let stpd skip the merge and later use a standalone program to process the raw data. By default, stpd will do the merge with the built-in plain sequencing scheme. If we can do this, then stpd can stay untouched when new binary trace formats are adopted. Moreover, different tapset providers are free to support their own packet formats without patching the stpd source every time.

Roughly speaking, for LKET, if bin_*write() rejects the stp_packet_h structure, lket_trace() will be responsible for handling {flag, size} and defining get_packet_length() to tell stpd the length of the current packet. But here comes another problem: is it worth the overhead of calculating the packet length in lket_trace()?

(In reply to comment #21)
> I don't mean to modify stpd greatly. I just want stpd to provide some
> capability to allow new packet formats to be easily added alongside the current
> plain ascii packets.
>
> When stpd merges per-cpu files, it has to know which packet comes first and
> where the end of the current packet is. The former is decided by the sequence
> id. For an ascii packet, the latter is easily decided by the ending null
> terminator. But things are different for a binary packet, for there are '0's
> everywhere. Thus the size should be passed to stpd by some method. That's why
> 'size' is introduced in the packet header.

To me the obvious solution for this problem is for stpd not to do merges of binary packets at all, and to let ascii output be logged to the console while binary data gets logged to its own per-cpu file. An application can later assemble the packets from the files in whatever way it sees fit.

> Roughly speaking, for LKET, if bin_*write() rejects the stp_packet_h structure,
> lket_trace() will be responsible for handling {flag, size} and defining
> get_packet_length() to tell stpd the length of the current packet. But here
> comes another problem: is it worth the overhead of calculating the packet
> length in lket_trace()?

This should be a job for lket; I agree with Tom that a generic interface should not force a format on the user.

I prefer binary_write to use the stp buffer instead of the relayfs buffer.

disadvantage:
1. the additional overhead of using the stp buffer is the memcpy, which should be negligible and will be offset if there are multiple write calls in a single handler.

advantage:
1. it could use different transport layers, not being confined to relayfs.
2. it could use the stp buffer to reduce the number of operations on the relayfs layer. Suppose 5 binary_write calls are made inside a single probe handler: the overhead of using the relayfs buffer directly will be larger than using the stp buffer.
3. it won't have the problems that Guijian mentioned (Known bug:) in comment #9

> Roughly speaking, for LKET, If bin_*write() reject stp_packet_h structure,
> lket_trace() will be responsible to handle {flag, size} and define
> get_packet_length() to tell stpd the length of current packet. But here comes
> another problem: is it worthy of the overhead of calculating the packet length
> in lket_trace()?
for some events, e.g. the backtrace, the actual length of the data to be printed
can't be known until we finish scanning from beginning to end.
But this can be solved by writing into a pre-reserved len field at the
beginning of the stp buffer (say, 2 bytes) when we finish the scanning (vsnprintfb).
Or we can use some delimiters to mark the boundary of each event.
Anyway, all of the above solutions can be done by each individual trace
application.
So I agree with Tom now: the stp_packet_h is not necessary and it should be
application specific. Maybe we need to put some hooks into the runtime and let
each app register its format-specific callback.
Created attachment 915 [details]
binary tracing prototype code modified to remove hard-coded packet header
This is a modified version of the prototype which removes the dedicated packet
header. I also modified lket_trace() to add the removed header data to the
bin_write() call instead (see below). To enable this, I added a new %n format
specifier (i.e. %n (same as %4n), %1n, %2n, %4n) which tells bin_write() to
write the 'written' length at the specified position after it's all done. Since
it doesn't actually write any data from the param list, I also added a function
called this_event_len(), which by design does nothing but provides the user with
something reasonable-sounding to put in that position in the param list.
Here's what the lket_trace() function looks like now with the header moved
into the rest of the event (it's basically a drop-in replacement for the old
code, minimally tested but it works for me):
/*
 * Example scripts to use the runtime interface above
 */
%{
#define lket_trace(hookID, fmt, args...) do { \
	struct timeval tv; \
	do_gettimeofday (&tv); \
	bin_write("%b%1b%1n|LKET:%b|%b|%b|%b|%b|%1b|%1b|"fmt"\n", \
		(int64_t)next_seq(), (int64_t)STP_BIN_PACKET, \
		(int64_t)this_event_len(), \
		(int64_t)tv.tv_sec, (int64_t)tv.tv_usec, \
		(int64_t)current->tgid, \
		(int64_t)current->parent->pid, (int64_t)current->pid, \
		(int64_t)current->thread_info->cpu, \
		(int64_t)hookID, args); \
} while(0)
%}
I think it's convenient to be able to do this, but I don't think it's actually
necessary to have an event length - with a proper event description, it should
always be possible to make sense of all the data from the beginning of an
event to the end. A length makes it easier and quicker to get to the next
event, which is useful at times, so I think having %n or something like it
makes sense.
Anyway, at this point I think it would be useful to think through all the
details (e.g. event descriptions, transport/librelay changes if any, etc.) by
doing an end-to-end walk-through for a couple of concrete examples. I'll
take a stab at this over the next couple of days...
> I also modified lket_trace() to add the removed header data to the
> bin_write() call instead (see below). To enable this, I added a new %n format
> specifier (i.e. %n (same as 4n), %1n, %2n, %4n) which tells bin_write() to
> write
> the 'written' length at the specified position after it's all done. Since it
> doesn't actually write any data from the param list, I also added a function
> called this_event_len(), which by design does nothing but provides the user
> something reasonable-sounding to put in that position in the param list.
I looked at your bin_write.stp, and found that bin_vsnwrite should change the
way it handles the %n specifier:
root:/home/lgl> diff -uprN bin_write.stp bin_write_mod.stp
--- bin_write.stp 2006-03-12 16:04:56.000000000 +0800
+++ bin_write_mod.stp 2006-03-13 14:21:20.000000000 +0800
@@ -141,6 +141,21 @@ int bin_vsnwrite(char *buf, size_t size,
temp = va_arg(args, int64_t);
write_len = str;
write_len_width = field_width;
+ switch(write_len_width) {
+ case 1:
+ str+=1;
+ written+=1;
+ break;
+ case 2:
+ str+=2;
+ written+=2;
+ break;
+ case 4:
+ str+=4;
+ written+=4;
+ default: // "%4b" by default
+ break;
+ }
break;
default:
if (str <= end)
Oops, change the handling of %4n & %n to:
> + case 4:
> + default: // "%4b" by default
> + str+=4;
> + written+=4;
> + break;
> + }
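The corrected width handling boils down to the following standalone sketch of the semantics (not the runtime code itself): the writer must skip past the reserved length slot so later fields don't overwrite it.

```c
#include <assert.h>

/* Sketch of the corrected "%n" width handling: return how many bytes
 * the writer must reserve (and advance past) for the length slot.
 * A bare "%n" is treated as "%4n". */
static int reserved_len_bytes(int write_len_width)
{
	switch (write_len_width) {
	case 1:
		return 1;
	case 2:
		return 2;
	case 4:
	default:		/* "%n" defaults to a 4-byte slot */
		return 4;
	}
}
```

Letting case 4 fall through to the default (rather than the other way around) is what makes %4n and %n behave identically.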
(In reply to comment #25)
> This is a modified version of the prototype which removes the dedicated packet
> header. I also modified lket_trace() to add the removed header data to the
> bin_write() call instead (see below). To enable this, I added a new %n format
> specifier (i.e. %n (same as 4n), %1n, %2n, %4n) which tells bin_write() to
> write the 'written' length at the specified position after it's all done.

I think "%n" is a really good idea for solving the packet length problem.

In bin_vsnwrite(), "written" is the total number of bytes that would be
generated for the given input, without regard to the "size" limit. Thus it
would differ from the actual packet length when the packet crosses the subbuf
boundary. If the packet length is greater than the subbuf size, the packet has
to be truncated. This rarely happens, but it should be considered.

Created attachment 917 [details]
add bin_*printf() interface for comparison
This is modified from Tom's code. Based on the discussions above, I added a new
binary trace interface, bin_*printf(), for comparison.
Main changes compared with bin_*write():
1) It is similar to _stp_*printf() and not limited to relayfs mode.
2) It gives up writing to the relayfs buffer directly and uses stp_print_buf
instead.
3) It shares stp_print_buf with _stp_*print() and thus can be used together
with it to generate a single packet.
4) It avoids the known bug mentioned in #9.
5) It is consistent with the current plain-ASCII scheme.
As for stpd, I added two new functions: get_sequence_id() and get_packet_len().
Thanks in advance for your comments.
One other request: the original ascii-vs-binary printing benchmark that
prompted binary tracing should be rerun with this new code, and also sent to
wcohen for inclusion in the forthcoming benchmark suite.

(In reply to comment #29)
> Created an attachment (id=917)
> add bin_*printf() interface for comparison
>
> This is modified based on Tom's code. I add a new binary trace interface
> bin_*printf() for comparison based on discussions above.
>
> Main changes compared with bin_*write():
> 1) It is similar to _stp_*printf() and not only limited to relayfs mode.
> 2) It gives up writing to relayfs buffer directly and uses stp_print_buf
> instead.
> 3) It shares stp_print_buf with _stp_*print() and thus can be used together to
> generate a single packet.
> 4) It avoids the known bug mentioned in #9
> 5) It is consistent with current plain ascii scheme
>
> As for stpd, I added two new functions: get_sequence_id() and get_packet_len().
>
> Thanks in advance for your comments.

I noticed in the new code that you removed the sequence id, but unless I'm
missing something, you still need a sequence for each event (or you should be
able to use the timestamp you're already logging instead, see below);
otherwise you could have event a at time 0 logged on cpu0, event b at time 1
on cpu1, and event c at time 2 on cpu0, but appearing in the output as acb or
bac. The sequence number generated by _stp_flush() would give you the ordering
between events if there was only one event per flush, but since you're
buffering them up, the sequence is correct only relative to the events on that
cpu; it doesn't tell you the ordering between events on different cpus. But
anyway, you're logging a timestamp, so you should be able to use that for
ordering, e.g.
we should be able to generalize the merge in stpd and do something similar to
the compare() function used with qsort():

static inline int __compare(long a, long b)
{
	if (a < b)
		return -1;
	else if (a > b)
		return 1;
	return 0;
}

#define SEC_OFFSET (sizeof(u_int8_t))
#define USEC_OFFSET (SEC_OFFSET + sizeof(long))

int compare(const void *event_a, const void *event_b)
{
	long a = *(long *)((const char *)event_a + SEC_OFFSET);
	long b = *(long *)((const char *)event_b + SEC_OFFSET);

	if (a != b)
		return __compare(a, b);
	a = *(long *)((const char *)event_a + USEC_OFFSET);
	b = *(long *)((const char *)event_b + USEC_OFFSET);
	return __compare(a, b);
}

As for the bug mentioned in #9, is that really a bug? Why couldn't you just do:

lket_trace(something1);
lket_trace(something2);

instead of

log(something1);
lket_trace(something2);

i.e. it seems to me that the ascii and binary streams should be viewed as
logically separate streams even though they actually go over the same 'wire',
and since you can log ascii as well over the binary stream, why do you need to
correlate with the ascii stream?

(In reply to comment #31)
> I noticed in the new code that you removed the sequence id, but it seems to me
> unless I'm missing something you still need a sequence for each event (or you
> should be able to use the timestamp you're already logging instead, see below),
> otherwise you could have event a at time 0 logged on cpu0, event b at time 1 on
> cpu1 and event c at time 2 on cpu 0, but appearing in the output as acb or bac.
> The sequence number generated by _stp_flush() would give you the ordering
> between events if there was only 1 event per flush, but since you're buffering
> them up the sequence is correct relative to the events on that cpu, but it
> doesn't tell you the ordering between events on different cpus. But anyway,
> you're logging a timestamp, so you should be able to use that for ordering, e.g.
> we should be able to generalize the merge in stpd and do something similar to
> the compare() function in qsort:

Correct, one of the purposes of the timestamp is the sequencing of events.
Sequencing is something we would do in user space.

> As for the bug mentioned in #9, is that really a bug? Why couldn't you just do:
>
> lket_trace(something1);
> lket_trace(something2);
>
> instead of
>
> log(something1);
> lket_trace(something2);
>
> i.e. it seems to me that the ascii and binary streams should be viewed as
> logically separate streams even though they actually go over the same 'wire',
> and since you can log ascii as well over the binary stream, why do you need to
> correlate with the ascii stream?

Correct again... I'm not sure why Gui Jian views this as a bug. I do not see a
need to correlate binary and ascii streams for tracing purposes.

(In reply to comment #31)
> As for the bug mentioned in #9, is that really a bug? Why couldn't you just do:
> lket_trace(something1);
> lket_trace(something2);
>
> instead of
>
> log(something1);
> lket_trace(something2);
>
> i.e. it seems to me that the ascii and binary streams should be viewed as
> logically separate streams even though they actually go over the same 'wire',
> and since you can log ascii as well over the binary stream, why do you need to
> correlate with the ascii stream?

I think if the binary trace interface can be integrated into the runtime and
provided to the user just like _stp_*printf(), we should allow the user to use
them freely. For the user's convenience, we should try to minimize the
constraints on where, how, and in what order to use it. The user may have no
idea about stp_*flush() or the subtle difference between _stp_*printf() and
bin_*printf(). It is reasonable for him to expect that what he gets is what he
codes :-) This is why I said it is a bug.
Anyway, if we don't need to correlate the binary stream with the ascii stream
at all, or don't expect the user to use them in similar ways, I agree it
should not be considered a bug.

(In reply to comment #31)
> I noticed in the new code that you removed the sequence id, but it seems to me
> unless I'm missing something you still need a sequence for each event (or you
> should be able to use the timestamp you're already logging instead, see below),
> otherwise you could have event a at time 0 logged on cpu0, event b at time 1 on
> cpu1 and event c at time 2 on cpu 0, but appearing in the output as acb or bac.
> The sequence number generated by _stp_flush() would give you the ordering
> between events if there was only 1 event per flush, but since you're buffering
> them up the sequence is correct relative to the events on that cpu, but it
> doesn't tell you the ordering between events on different cpus.

The original bin_*write() will generate a binary packet each time it is
called, because it forces a binary packet header. But the design of
bin_*printf() is different: it only writes data into the stp-print-buffer and
lets _stp_*flush() handle the sequence, as you mentioned. This is to be
consistent with _stp_*printf() and to get rid of the restriction to relayfs
mode.

> > you're logging a timestamp, so you should be able to use that for ordering, e.g.
> we should be able to generalize the merge in stpd and do something similar to
> the compare() function in qsort

For LKET, it is good to use the timestamp for merging, and the sequence
information is redundant. But in general, taking data space into
consideration, merging by timestamp is expensive (8 bytes on ppc32 or 16 bytes
on ppc64). So I agree the comparison should be modularized, but more schemes
should be considered.

I think one important question is: what do we want the binary trace interface
to be:
something like _stp_*sprintf() except that it stores data in binary format, or
something totally controlled by the user to store data as they want? I think
this is also the main difference between bin_*printf() and bin_*write().

(In reply to comment #33)
> (In reply to comment #31)
> > As for the bug mentioned in #9, is that really a bug? Why couldn't you just do:
> > lket_trace(something1);
> > lket_trace(something2);
> >
> > instead of
> >
> > log(something1);
> > lket_trace(something2);
> >
> > i.e. it seems to me that the ascii and binary streams should be viewed as
> > logically separate streams even though they actually go over the same 'wire',
> > and since you can log ascii as well over the binary stream, why do you need to
> > correlate with the ascii stream?
>
> I think if the binary trace interface can be integrated into runtime and
> provided to the user just like _stp_*printf(), we should allow the user to use
> them freely. For the user's convenience, we should try minimize the constraints
> about where, how and in what order to use it. The user may has no idea about
> stp_*flush() or the subtle difference between _stp_*printf() and bin_*printf().
> It reasonable for him to require what he gets is what he codes :-) This is why I
> said it is a bug.
> Anyway, if we don't need to correlate binary stream with ascii stream at all or
> don't assume the user to use them in similar ways, I agree it should not be
> considered as a bug.

Just my opinion, but as a user, when I think of printf() I think of data going
to a console, and when I think of log() or trace(), I think of binary data
which I probably don't want to go to the console. If I want to trace to the
console, I'd just use printf(). So maybe there should be a clear difference
made to the user between the two. log(), though, seems like it may be
misnamed, i.e. it sounds like a binary function but it's actually not.
The printf() part of bin_printf() is ok for me because it's an analogue of
printf(), but since it's prefixed with bin_ it's clear to me what it's for.

(In reply to comment #34)
> (In reply to comment #31)
> > I noticed in the new code that you removed the sequence id, but it seems to me
> > unless I'm missing something you still need a sequence for each event (or you
> > should be able to use the timestamp you're already logging instead, see below),
> > otherwise you could have event a at time 0 logged on cpu0, event b at time 1 on
> > cpu1 and event c at time 2 on cpu 0, but appearing in the output as acb or bac.
> > The sequence number generated by _stp_flush() would give you the ordering
> > between events if there was only 1 event per flush, but since you're buffering
> > them up the sequence is correct relative to the events on that cpu, but it
> > doesn't tell you the ordering between events on different cpus.
>
> The original bin_*write() will generate a binary packet each time it is called,
> because it forces a binary packet header. But the design of bin_*printf() is
> different. It only writes data into the stp-print-buffer and lets _stp_*flush()
> handle the sequence as you mentioned. It is to be consistent with
> _stp_*printf() and to get rid of the limit to relayfs mode.
>
> > > you're logging a timestamp, so you should be able to use that for ordering, e.g.
> > we should be able to generalize the merge in stpd and do something similar to
> > the compare() function in qsort
>
> For LKET, it is good to use timestamp for merging and sequence information is
> redundant. But in general, taking data space into consideration, it is luxurious
> to merge by timestamp (8 bytes for ppc32 or 16 bytes for ppc64). So I agree the
> comparison is modularized but more schemes should be considered.
I agree, it should be up to the application to log whatever it needs to do its
merge, and the compare function should also be tailored for the application,
or were you thinking of something else?

> I think one important question is: what we want the binary trace interface to
> be? something like _stp_*sprintf() except that it stores data in binary format,
> or something totally controlled by the user to store data as they want? I think
> this is also the main difference between bin_*printf() and bin_*write().

I thought it was supposed to be something that would give the user complete
control over a trace. At least, if it's not too hard, why not make it as
generic and flexible as possible? Now that the header is gone, it's almost
there; the only other thing that needs to be removed is the sequence number
being generated by _stp_print_flush(), since it's redundant if you're tracing
your own sequences or timestamps.

For stpd, an int still needs to be in that same position in order to coexist
with the ascii stream, so maybe for a binary event you can define an
_stp_bin_print_flush() that doesn't log the sequence counter, add a %b to the
beginning of your event and log a 0 there in place of the sequence counter (0
is never used as a sequence counter in stpd), and store your own sequence or
timestamp in your event as you already do. So your code stays basically the
same and you don't have to log a redundant sequence counter, you get the
transport-independence, and other applications that don't have to coexist with
an ascii stream can call bin_printf() and won't have to log the sequence
counter or anything other than what they want. You could then also get rid of
the binary/ascii flag, since you can use the 0 sequence counter for that.

Maybe there's a better way, or maybe it's not worth the effort, but it would
be nice to have a trace interface that allows applications to log only the
data they want and nothing more.
(In reply to comment #35)
> Just my opinion, but as a user when I think of printf() I think of data going to
> a console, and when I think of log() or trace(), I think of binary data which I
> probably don't want to go to the console. If I want to trace to the console,
> I'd just use printf(). So maybe there should be a clear difference made to the

This is my opinion as well. I personally don't think there is any use in
printing binary data to the console. I guess one could make stpd able to
decode the data and print it to the console in formatted fashion, but what
would be the point of that? That's what printf() is for.

> user between the two. log(), though, seems like it may be misnamed i.e. it
> sounds like a binary function but it's actually not. The printf() part of
> bin_printf() is ok for me because it's an analogue of printf() but since it's
> prefixed with bin_ it's clear to me what it's for.

In my opinion the whole bin_printf() is a bit misleading, since to me the
"print" part implies that we are using printable characters. I like using
write() or bin_write() because this implies that the data is just being
written to a buffer without any form of formatting involved.

I've been trying to catch up with all the comments.

I don't like the idea of having a "binary output" and an "ascii output". There
should be one output, and a way to always write console messages. We mostly
have that now.

With procfs, if you write binary data (including mixed binary and ascii), you
want it to go to a file or to another application which will display it. You
can redirect to a file with "stap foo.stp > out" or "stap -o out foo.stp", or
pipe it to another app.

Once we have binary output, I agree that we can remove the timestamp and
interleaving code for relayfs. The output would be the per-cpu files. Nothing
would get displayed on the console from the bulk data in relayfs, so
redirecting the output would be meaningless for relayfs.
Right now, we have two functions, warn() and error(), that always write to the
console. Actually, stpd writes their messages to stderr. We could easily make
log() do the same, without signalling an error. Or call it print_console()?

log() is obsolete and should be deprecated or redefined to do something useful.

So in the end, I would still like to see things as simple as possible, with
only the following:

print()
printf() (which recognizes %b binary extensions)
print_console() - always short ASCII text messages. Should not be used for
data from the script output.

(In reply to comment #38)
> I've been trying to catch up with all the comments.
>
> I don't like the idea of having a "binary output" and an "ascii output". There
> should be one output. And a way to always write console messages. We mostly have
> that now.
>
> With procfs, if you write binary data (including mixed binary and ascii), you
> want it to go to a file or to another application which will display it. You can
> redirect to a file with "stap foo.stp > out" or "stap -o out foo.stp" Or pipe it
> to another app.
>
> Once we have binary output, I agree that we can remove the timestamp and
> interleaving code for relayfs. The output would be the per-cpu files.
> Nothing would get displayed on the console from the bulk data in relayfs.
> So redirecting the output would be meaningless for relayfs.
>
> Right now, we have two functions warn() and error() that always write to the
> console. Actually stpd writes their messages to stderr. We could easily make
> log() do the same, without signalling an error. Or call it print_console()?
>
> log() is obsolete and should be deprecated or redefined to do something useful.
>
> So in the end, I would still like to see things as simple as possible with only
> the following
>
> print()
> printf() (which recognizes %b binary extensions)
> print_console() - always short ASCII text messages. Should not be used for data
> from the script output.
It sounds good to me - if I understand correctly, any print() or printf() data
for a script invoked as "stap foo.stp" goes to the console or can be
redirected to a file, while with "stap -b foo.stp" it goes to per-cpu files,
and any print_console() data or warnings basically goes through the OOB
channel and isn't part of the data.

The one question I have is that if the script is run with -b, it needs to use
timestamps or a sequence for ordering, and right now there's code in
_stp_print_flush() that writes one, so everything will work fine if that's
there; but if we wanted to remove it, scripts that don't do their own
timestamping and are run with -b won't be ordered correctly. So it seems to me
we'd need to either not worry about it and mandate the timestamp for -b, or by
default have the timestamp generated but provide an alternative function
that's exactly the same but doesn't write the timestamp, for scripts that want
to manage it themselves, e.g. __printf().

Subject: Re: binary tracing

On Tue, 2006-03-14 at 21:56 +0000, zanussi at us dot ibm dot com wrote:

> It sounds good to me - if I understand correctly, any print() or printf() data
> for a script invoked as "stap foo.stp" goes to the console or can be redirected
> to a file, while with "stap -b foo.stp" it goes to per-cpu files, and any
> print_console() data or warnings basically goes through the OOB channel and
> isn't part of the data.

Yes.

> The one question I have is that if the script is run
> with -b, it needs to use timestamps or a sequence for ordering, and right now,
> there's the code in _stp_print_flush() that writes one, so everything will work
> fine if that's there, but if we wanted to remove it, scripts that don't do their
> own timestamping and are run with -b won't be ordered correctly.
> So it seems to
> me we'd need to either not worry about it and mandate the timestamp for -b, or
> by default have the timestamp generated but provide an alternative function
> that's exactly the same, but doesn't write the timestamp, for scripts that want
> to manage it themselves, e.g. __printf().

I assume relayfs users are probably using systemtap as part of a larger tool,
and they want the files in a particular format. So I don't think we should
enforce any sequence number or timestamp. In that case, the output would be a
group of files, one per cpu. There would be no automatic combining into a
single file. Perhaps "stap -b -o xyzzy foo.stp" would create files
xyzzy.[cpu_number].

Is there a case for automatically (as an option or default) generating
timestamps or sequence numbers? I put that code in there to do it, but that
was when we were trying to make relayfs do everything. It seems to me that if
you want sequential output, you would use procfs. But I rarely use relayfs, so
I may be misunderstanding the issue.

(In reply to comment #40)
> Subject: Re: binary tracing
>
> On Tue, 2006-03-14 at 21:56 +0000, zanussi at us dot ibm dot com wrote:
>
> > It sounds good to me - if I understand correctly, any print() or printf() data
> > for a script invoked as "stap foo.stp" goes to the console or can be redirected
> > to a file, while with "stap -b foo.stp" it goes to per-cpu files, and any
> > print_console() data or warnings basically goes through the OOB channel and
> > isn't part of the data.
>
> Yes.
>
> > The one question I have is that if the script is run
> > with -b, it needs to use timestamps or a sequence for ordering, and right now,
> > there's the code in _stp_print_flush() that writes one, so everything will work
> > fine if that's there, but if we wanted to remove it, scripts that don't do their
> > own timestamping and are run with -b won't be ordered correctly.
> > So it seems to
> > me we'd need to either not worry about it and mandate the timestamp for -b, or
> > by default have the timestamp generated but provide an alternative function
> > that's exactly the same, but doesn't write the timestamp, for scripts that want
> > to manage it themselves, e.g. __printf().
>
> I assume relayfs users are probably using systemtap as part of a larger
> tool and they want the files in a particular format. So I don't think we
> should enforce any sequence number or timestamp. In that case, the
> output would be a group of files, one per cpu. There would be no
> automatic combining that into a single file. Perhaps "stap -b -o xyzzy
> foo.stp" would create files xyzzy.[cpu_number].
>
> Is there a case for automatically (as an option or default)
> generating timestamps or sequence numbers? I put that code in there to
> do it but that was when we were trying to make relayfs do everything. It
> seems to me that if you want sequential output, you would use procfs.

I think so. If you use the -b option with a script that doesn't print its own
binary timestamps, it should still work (as it does now).

Created attachment 936 [details]
prototype generic binary tracer using generic binary tracing interface
To test out the prototype binary trace interface, I wrote a prototype
generic binary tracer that uses it, which I've attached here. It's
based on Jian Gui's latest bin_write.stp, but at his and Jose's
suggestion, I've split things up and moved the runtime part to
runtime/bin_write.c and renamed it, etc.
Basically, the trace format it implements uses the first byte to
identify the event, and to start off, there's only one event, which is
the 'define new event' event, used to define new events during the
trace. read() and open() are instrumented here as examples.
Event types are defined within the probe (in begin()), actual probe
hits log the matching binary events in their handlers, and it all gets
generically sorted out at the end of the run in stpd. This version
replaces the merge() in stpd with a function called print_binary()
that also merges the output but uses the event descriptions to
translate the binary events into generic text versions that are then
displayed to the console. Of course, the output is generic and
therefore not pretty (see below for example), but it shouldn't be too
hard to add 'format strings' to event descriptions that would be used
in place of the generically formatted print function. Also, it should
be easy for anyone to write their own replacement for print_binary.c,
which wouldn't need to print at all e.g. import_into_db_and_query.c
(the generic descriptions should make this easy) or
postprocess_and_graph.c.
It also makes changes to the runtime, removing the need to manipulate
the relayfs buffers in bin_write - basically what it does is use
relay_reserve() to get 'buffers' that replace the _stp_pbuf static
buffers.
Again, it's only been minimally tested and probably breaks non -b
tracing, and there are most likely other problems; this is all just
prototype code meant for demonstration and discussion purposes, so
buyer beware.
Some thoughts on the interface after having gone through the exercise
of actually logging binary data using it...
- keeping data positions straight when dealing with long format
strings is really a pain, but I guess no different from regular
printf in that respect. I can't really think of a better way,
though, other than having a way to break up the logging for an
event into separate calls.
- defining event descriptions is also a pain; there should be some
higher-level way of doing it e.g. define_event("event name"),
add_field("field name", type). Of course this is a format-specific
thing, so not really on-topic here.
- tailoring the printed trace output could be accomplished by
supplying a format string to the event description as mentioned
above. Wanting to do this at all is an application-specific thing
though i.e. only a generically-printing back-end would want it, so
again, it's probably off-topic for a generic trace interface.
- there doesn't seem to be a way to log variable-length data, other
than strings. To log non-string variable-length data e.g. arrays or
the contents of target variables, there could be some function like
bin_log(object). This could in turn be defined as e.g
__bin_log(object, length). This would log a length followed by
'length' number of bytes from the object. If an object already has
a length associated with it e.g. a systemtap array, bin_log(array)
could be used and internally translated to __bin_log(array,
array.len), which would log the count of elements in the array
followed by the array elements. To log the contents of a target
struct, you could say for example bin_log(*$file) if $file referred
to a struct file *, which would map to the same thing as
__bin_log($file, sizeof(*$file)); I assume the dwarf info would make
that possible. To make all this accessible to printf() we could add
another format specifier, say '%o'. If the object passed in has
length information associated with it, '%o' would be able to
automatically log the object contents. In cases where there is no
associated length for an object, or you want to log a subset, maybe
there could be a function called make_print_object(object, len).
This would create an object that could then be passed to printf()
and have the necessary length information associated with it.
- For the binary vs. console stream, maybe it makes sense to add an
fprintf() and define printf() and print_console() in terms of that,
so everything stays the same except that binary data could be made
to go to a separate file if wanted, by using fprintf(). printf()
would remain as it is but would be defined as fprintf(stdout),
print_console() would internally be fprintf(stderr) and binary data
could be made to go to file by fprintf(stdfile). There wouldn't be
any corresponding fopen(); stdfile would be the only file stream
available and that would refer to normal stap file output.
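The length-prefixed logging suggested above for __bin_log(object, length) can be sketched in userspace C; the function name and the 32-bit length prefix are assumptions drawn from the description, not an existing runtime API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical __bin_log() sketch: write a 32-bit length followed by
 * 'len' raw bytes of the object, so a reader can decode or skip
 * variable-length data.  Returns the number of bytes emitted.
 * Native byte order is assumed. */
static size_t bin_log_object(uint8_t *buf, const void *obj, uint32_t len)
{
	memcpy(buf, &len, sizeof(len));		/* length prefix */
	memcpy(buf + sizeof(len), obj, len);	/* raw object bytes */
	return sizeof(len) + len;
}
```

A reader walks the stream by reading the prefix, then consuming (or skipping) exactly that many bytes, which is what makes arrays and dumped target structs decodable without a fixed-size field.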
Example output:
<0> read: id: 1, timeval: 1142896381.115226, tgid: 13495, parent pid: 13408,
pid: 13495, name: read, fd: 3, count: 8192,
<0> read: id: 1, timeval: 1142896381.115233, tgid: 13495, parent pid: 13408,
pid: 13495, name: read, fd: 3, count: 8192,
<0> open: id: 2, timeval: 1142896381.875678, tgid: 2739, parent pid: 2256, pid:
2739, filename: /dev/hdc, execname: hald-addon-stor,
<0> read: id: 1, timeval: 1142896383.876072, tgid: 3532, parent pid: 3529, pid:
3532, name: read, fd: 3, count: 16384,
<1> read: id: 1, timeval: 1142896383.876199, tgid: 13495, parent pid: 13408,
pid: 13495, name: read, fd: 3, count: 8192,
<0> open: id: 2, timeval: 1142896383.879811, tgid: 2739, parent pid: 2256, pid:
2739, filename: /dev/hdc, execname: hald-addon-stor,
events_processed: 8
Created attachment 937 [details]
somehow bin_write.c didn't get into the prototype patch - here's a new patch
Oops, somehow bin_write.c didn't get included, here's an updated tarball with
the right patch.
This prototype works on my i686/2.6.15.6 with systemtap-20060318.

But when I used _stp_*print() in my script, the system always crashed. I found
the reason is that the _stp_pbuf[cpu] pointers are not initialized before they
are used in _stp_sprintf(), as they are in bin_swrite().

I think the following small patch should be integrated into this prototype. If
I am wrong about something, please let me know.

diff -Nur src-20060318.bin_write/runtime/runtime.h src-modified/runtime/runtime.h
--- src-20060318.bin_write/runtime/runtime.h	2006-03-23 01:38:24.000000000 +0800
+++ src-modified/runtime/runtime.h	2006-03-23 01:42:27.000000000 +0800
@@ -59,8 +59,8 @@
 #endif /* RELAYFS */

 #include "print.c"
-#include "string.c"
 #include "bin_write.c"
+#include "string.c"
 #include "arith.c"
 #include "copy.c"
 #include "sym.h"
diff -Nur src-20060318.bin_write/runtime/string.c src-modified/runtime/string.c
--- src-20060318.bin_write/runtime/string.c	2005-11-01 03:23:21.000000000 +0800
+++ src-modified/runtime/string.c	2006-03-23 01:42:27.000000000 +0800
@@ -59,9 +59,12 @@
 	int num;
 	va_list args;
 	if (str == _stp_stdout) {
+		char *buf;
+		int size;
 		int cpu = smp_processor_id();
-		char *buf = &_stp_pbuf[cpu][STP_PRINT_BUF_START] + _stp_pbuf_len[cpu];
-		int size = STP_PRINT_BUF_LEN - _stp_pbuf_len[cpu] + 1;
+		check_null_buffer(cpu);
+		buf = &_stp_pbuf[cpu][STP_PRINT_BUF_START] + _stp_pbuf_len[cpu];
+		size = STP_PRINT_BUF_LEN - _stp_pbuf_len[cpu] + 1;
 		va_start(args, fmt);
 		num = vsnprintf(buf, size, fmt, args);
 		va_end(args);

(In reply to comment #44)
> This prototype works in my i686/2.6.15.6 with systemtap-20060318.
>
> But when I used _stp_*print() in my script, system always crashed.
> I found the reason is that _stp_pbuf[cpu] pointers are not
> initialized before they are used in _stp_sprintf() as in bin_swrite().
>
> I think the following small patch should be integrated into this
> prototype. If I am wrong with something, please let me know.
> [patch snipped]

Yeah, that should work for now. The buffers should be set up when the
transport is initialized anyway, so check_null_buffer() should go away; I'll
do that in the next version. For this demo I only modified what needed to be
modified, so I didn't go through the rest of the runtime making sure
everything else was right.

I compared the overhead of bin_write vs. _stp_printf. Here are the results.
The values are the average time per getuid() call in the loop, as shown in
test_getuid.c.
Results on ppc64/2.6.9-34.EL:

  No stap:                    333 ns
  Empty probe:               1492 ns
  100 char binary write:     3067 ns
  100 char ascii printf:     2849 ns
  5 integers binary write:   2045 ns
  5 integers ascii printf:   5077 ns
  10 integers binary write:  2451 ns
  10 integers ascii printf:  8661 ns

Results on i686/2.6.15.6:

  No stap:                    570 ns
  Empty probe:               2258 ns
  100 char binary write:     3486 ns
  100 char ascii printf:     2774 ns
  5 integers binary write:   2427 ns
  5 integers ascii printf:   3447 ns
  10 integers binary write:  2984 ns
  10 integers ascii printf:  4522 ns

As we expected, the binary trace interface is more efficient than
_stp_printf when writing integers, and the advantage grows as the number of
integer arguments increases. But bin_write is not as good as _stp_printf
when writing only characters, which is a little strange.

I have tried to do this with the new benchmark framework
(http://sourceware.org/ml/systemtap/2006-q1/msg00790.html), but still have
problems getting it to work with the current runtime patch. Another problem
I encountered is that the system always crashes when I increase LOOPS in
test_getuid.c, e.g. to 100000LL. I am still working on these.

Here are the stp script and the C file I used to do the comparison. I use
another shell script to run "stap -g -b -DTESTNUM=n getuid.stp -c
./test_getuid" iteratively.
/* getuid.stp */
function trace_func() %{
#if (TESTNUM==1)
	bin_write("1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890");
#elif (TESTNUM==2)
	_stp_printf("1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890");
#elif (TESTNUM==3)
	bin_write("%1b%2b%4b%d%8b", 0x11LL, 0x2233LL, 0x44556677LL,
		  0x44556677LL, 0x1122334455667788LL);
#elif (TESTNUM==4)
	_stp_printf("%d%d%d%d%lld", 0x11LL, 0x2233LL, 0x44556677LL,
		    0x44556677LL, 0x1122334455667788LL);
#elif (TESTNUM==5)
	bin_write("%1b%2b%4b%d%8b%1b%2b%4b%d%8b",
		  0x11LL, 0x2233LL, 0x44556677LL, 0x44556677LL, 0x1122334455667788LL,
		  0x11LL, 0x2233LL, 0x44556677LL, 0x44556677LL, 0x1122334455667788LL);
#elif (TESTNUM==6)
	_stp_printf("%d%d%d%d%lld%d%d%d%d%lld",
		    0x11LL, 0x2233LL, 0x44556677LL, 0x44556677LL, 0x1122334455667788LL,
		    0x11LL, 0x2233LL, 0x44556677LL, 0x44556677LL, 0x1122334455667788LL);
#else
	// empty probe
	;
#endif
%}

probe kernel.function("sys_getuid") { trace_func() }

/* test_getuid.c */
#include <stdio.h>	/* needed for printf() */
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>

#define LOOPS 10000LL

int main(void)
{
	struct timeval tv[2];
	int i;
	double time;

	gettimeofday(&tv[0], NULL);
	for (i = 0; i < LOOPS; i++)
		getuid();
	gettimeofday(&tv[1], NULL);

	time = ((tv[1].tv_sec - tv[0].tv_sec) * 1000000.0 +
		(tv[1].tv_usec - tv[0].tv_usec)) * 1000 / LOOPS;
	printf("%d ns\n", (int)time);
	return 0;
}

See http://sources.redhat.com/ml/systemtap/2006-q2/msg00084.html

I've checked in a new printf() to implement the features discussed in this
PR. If I missed anything, let me know. When writing lots of numbers, the
speed increase is huge. There are still some optimizations that could be
done, including Tom's proposed translator optimization to inline some
printfs.

Nothing left to do here. There is bug 3217 for compiled binary tracing as a
further enhancement.