[PATCH 01/19] include: new header ctf.h: file format description

Fri May 3 11:15:00 GMT 2019

[Sorry about the response delay: two-day family thing.]

On 1 May 2019, Jim Wilson spake thusly:

> On Wed, May 1, 2019 at 9:57 AM Nick Clifton <nickc@redhat.com> wrote:
>> > +/* CTF format description.
>> > +   Copyright (C) 2004-2019 Free Software Foundation, Inc.
>>
>> Copyright starting from 2004, really ?
>
> Looks like CTF is part of dtrace which Oracle inherited from Sun.
> Wikipedia tells me that the first release of dtrace was in Jan 2005,
> so a 2004 copyright looks right if this is the original sources from
> Sun subsequently modified by Oracle.

Exactly.

>> > +/* CTF - Compact ANSI-C Type Format
>> ANSI-C ?  Isn't everyone using ISO-C these days ?
>
> I was going to say the same thing.

Historical naming wart. I'm happy to adjust it. (The original headers
were inconsistent here and sometimes said ANSI-C and sometimes just C
and sometimes just 'Compact' with no language at all! Only the last is
definitely wrong.)

>> Also - does this format explicitly exclude other languages like C++ or Go or Rust ?
>
> Apparently doesn't explicitly exclude them, it just doesn't explicitly
> include them, and with only 64 possible type classes, it looks like
> you could run out without some clever encoding for other languages.

I had some ideas half an hour ago which should allow substantially more
format flexibility without making the libctf codebase horrifically
unreadable (in fact it should increase the readability of the codebase
by dropping most of the casts in it): this would let us have not only an
even more compact version of the ctf_stype_t for common C cases, but
also a longer ctt_info word for non-C cases with, oh, is 2^32 type
classes enough, or should I go to 2^64? ;) There will obviously be a
slight cost in space, but not a large one.

At this point I am mostly worried about the complexity of speccing
things like C++ out. I'm fairly sure the format can expand to handle
them in future (without breaking existing users) but I'm not so sure my
brain can!

A bigger question where multi-language support is concerned is whether
we need to handle more than one language in a given hierarchy of CTF
sections: in effect, allowing for multi-language translation units.

This would mean we could deduplicate types together for different
languages, but I doubt this would be useful for many language pairs
(which would have largely distinct language-specific type kinds). It
would increase compactness a bit more to say "dammit, if you have two
languages in your project you should have two CTF section hierarchies",
and come up with names like .ctf.cpp and .ctf.rust or something for the
other languages.

If we might handle additional languages in a one-language-per-container way,
we might want to reserve a word in the header to indicate language even
though we don't plan to add any other languages yet, just to make it
possible to add them in future without another backward-compatibility
break.
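
A rough sketch of what reserving such a word could look like. Everything
here is hypothetical: the struct, the field names and the SKETCH_LANG_C
value are invented for illustration and are not the layout in the patch.
The one design choice it shows is giving C the value zero, so that a
zero-filled field reads as C.

```c
#include <stdint.h>

/* Hypothetical language codes.  Only C exists today; zero is chosen so
   that a zero-filled field means C.  */
#define SKETCH_LANG_C 0u

/* Hypothetical header fragment, NOT the real CTF header layout.  */
struct sketch_header
{
  uint16_t magic;     /* Magic number identifying the format.  */
  uint8_t version;    /* Format version.  */
  uint8_t flags;      /* Flag bits.  */
  uint32_t language;  /* Reserved word: source language of this hierarchy.  */
};

/* Return nonzero if this reader understands the container's language.
   A reader that only knows C rejects anything else up front, rather
   than misinterpreting language-specific type kinds.  */
static int
sketch_lang_supported (const struct sketch_header *h)
{
  return h->language == SKETCH_LANG_C;
}
```

With the word present from the start, a later .ctf.cpp or .ctf.rust
section would only need a new language value, not a new header layout.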

>> > +#define CTF_VERSION_3 4
>> > +#define CTF_VERSION CTF_VERSION_3 /* Current version.  */
>>
>> Hang on - so the value of CTF_VERSION_3 is 4 ?  Does this mean that the
>> full version number is 3.4, or 4.0 or just 4 ?  I am a bit confused...
>
> Looks like there was a version 1+ which took number 2.
> https://github.com/oracle/libdtrace-ctf/blob/master/include/sys/ctf.h#L149

The history is... complicated, and all my fault, I'm afraid.

When we took libctf into the DTrace for Linux project, it was already at
v2: v1 by then was an ancient Sun-era thing which had literally nothing but
the version number surviving in the codebase, much like you see above. I
reset it to v1, but after a few years its limitations became fairly
extreme: it only allowed 2^16 types in one program, only 1023 members in
any one structure or union, we were running out of type kinds, etc. So I
introduced a v2... but v2 boosted the set of types to 2^32, and thus changed
the boundary between parent and child type IDs, since type-parenthood is
indicated by the most significant bit in the type ID.
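
To make the boundary problem concrete, here is a minimal sketch. The
SKETCH_ names and helpers are made up for illustration and are not the
real ctf.h macros; the two limits assume a 16-bit type ID space for v1
and a 32-bit one for the widened format, each split in half by the top
bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only.  */
#define SKETCH_MAX_PTYPE_V1   0x7fffu      /* Highest parent ID, v1 (assumed).  */
#define SKETCH_MAX_PTYPE_WIDE 0x7fffffffu  /* Highest parent ID, widened (assumed).  */

/* An ID above the parent maximum has the top bit of its ID space set,
   so it names a type in the child container.  */
static bool
sketch_type_is_child (uint32_t id, uint32_t parmax)
{
  return id > parmax;
}

/* Mask off the child bit to get the index into the owning container's
   type table.  */
static uint32_t
sketch_type_to_index (uint32_t id, uint32_t parmax)
{
  return id & parmax;
}
```

Under these assumptions the ID 0x8000 is the first child type with the
v1 boundary but an ordinary parent type with the wide one, so the same
stored ID means different things depending on which boundary is in
force.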

We upgrade old formats to new ones in memory aggressively at open time
to avoid duplicating codepaths for old formats, so this change in
parent/child boundary would have required the backward-compatibility
code to *renumber all the types* at the same time. This seemed
excessive, given that CTF containers are read-only after creation, so an
upgraded container couldn't ever have enough types in it for that
renumbering to be necessary: but we needed to note the fact that the
parent/child boundary was lower in some persistent form, in case the
user opened an old container (upgrading it in the process), then wrote
it back out again: we had to preserve the knowledge that this had once
been a v1 container, with a v1 parent/child boundary, *somewhere*.

So as a backward-compatibility hack I decreed that v1 when upgraded to
v2 would gain the CTF_VERSION_1_UPGRADED_3 version number, which was
interpreted as 'just like v2, except the parent/child boundary is like
v1'. If I'd been starting from scratch, a family of feature flags or
something might have been neater... but this works and the maintenance
burden is minimal (one conditional to note the existence of
CTF_VERSION_1_UPGRADED_3 and set the parent/child boundary
appropriately).
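
Roughly, that conditional amounts to something like the sketch below.
The SKETCH_ names and the helper are invented for illustration; only
CTF_VERSION_3 being 4 comes from the patch, while the other version
values and both boundary constants are assumptions rather than quotes
from libctf.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the on-disk version numbers.  */
#define SKETCH_VERSION_1            1  /* Assumed.  */
#define SKETCH_VERSION_1_UPGRADED_3 2  /* Assumed: the "1+" slot.  */
#define SKETCH_VERSION_3            4  /* As in the patch.  */

/* Pick the highest parent type ID for a container, given the version
   recorded in its header.  Containers that started life as v1 keep the
   small boundary even after being upgraded and written back out, so
   their existing type IDs never need renumbering.  */
static uint32_t
sketch_parent_max (int version)
{
  if (version == SKETCH_VERSION_1
      || version == SKETCH_VERSION_1_UPGRADED_3)
    return 0x7fffu;      /* v1-style boundary (assumed value).  */

  return 0x7fffffffu;    /* Wide boundary (assumed value).  */
}
```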

> I don't have any expertise with CTF, I was just curious, so did a
> little looking around for more info and found the version number
> encoding.  I also found a FreeBSD man page which has some useful intro
> data.
> https://www.freebsd.org/cgi/man.cgi?query=ctf&sektion=5&manpath=freebsd-release-ports

Yep, that's the old v1 format all right (Sun format v2). Too small for
some real projects, even in the presence of aggressive deduplication.