libctf: new enum-related API functions: request for better names

Tue May 21 10:13:46 GMT 2024

[Jose, Indu: your only interest here might be my musing about
 identifying what mysterious values in memory dumps are: search for
 "#define". If this were to happen it wouldn't be soon, would definitely
 not be on by default, but it would need compiler help to do something
 much like -g3 does now.]

On 20 May 2024, Stephen Brennan spake thusly:

> Hi Nick,
>
> I'm not subscribed here but found the mailto: link with In-Reply-To
> header set on the archive page; hopefully this reply works as expected.

It does! I like your suggested API and am switching straight to that
instead.

It's such a good API that my eye skipped over one function and I thought
'oh, we're missing that' and proposed a new one with the exact same name
and parameters in the same order before noticing it was already in your
list. :)

> Nick Alcock writes:
>> So Stephen Brennan pointed out many years ago that libctf's handling of
>> enumeration constants is needlessly unhelpful: it treats them as if they
>> are scoped within a given enum: you can only query from constant name ->
>> value and back within a given enum's scope, so if you don't already know
>> what enum something is part of you have to walk over every enum in the
>> dict hunting for it.
>>
>> Worse yet, we do not consider enum constants with clashing values to be
>> a sign of a type conflict, so can easily end up with multiple distinct
>> enums containing enumeration constants with the *same name* appearing
>> in the shared dict. This definitely violates the principle of least
>> surprise and the (largely unstated) assumption that the shared dict
>> should be "as if" the entire C program's non-conflicting types were
>> declared in a single giant file which was compiled with -gctf: you can't
>> write a C file that declares the same enumeration constant twice!
>>
>> Half of this is easy to fix: libctf, and in particular the deduplicator,
>> should track enumeration constant names just like it does all other
>> identifiers, and push enums with clashing names into child dicts. (This
>> might eat a lot of space when the enums have many other enumerators, but
>> most of that space is identical strings, which means we can win nearly
>> all the space back in v4 via the string-saving trick that is the second
>> entry in <https://sourceware.org/binutils/wiki/CTF/Todo/Compactness>.)
>>
>> But I'm having trouble figuring out names for the new API functions
>> we'll need for the rest.  Right now libctf has these:
>>
>> /* Convert the specified value to the corresponding enum tag name, if a
>>    matching name can be found.  Otherwise NULL is returned.  */
>>
>> const char *ctf_enum_name (ctf_dict_t *fp, ctf_id_t type, int value);
>>
>> /* Convert the specified enum tag name to the corresponding value, if a
>>    matching name can be found.  Otherwise CTF_ERR is returned.  */
>>
>> int ctf_enum_value (ctf_dict_t *fp, ctf_id_t type, const char *name,
>> 		    int *valp);
>>
>> /* Iterate over the members of an ENUM.  We pass the string name and
>>    associated integer value of each enum element to the specified callback
>>    function.  */
>>
>> int ctf_enum_iter (ctf_dict_t *fp, ctf_id_t type, ctf_enum_f *func, void *arg);
>>
>> /* Iterate over the members of an enum TYPE, returning each enumerand's NAME or
>>    NULL at end of iteration or error, and optionally passing back the
>>    enumerand's integer VALue.  */
>>
>> const char *ctf_enum_next (ctf_dict_t *fp, ctf_id_t type, ctf_next_t **it,
>>     	                   int *val);
>>
>> At the very least we want something like dict-wide equivalents of the
>> first two: but ctf_enum_name has the very annoying behaviour of just
>> picking the first name if there are multiple conflicting ones with the
>> same value, and on a dict-wide basis there will be huge numbers of these
>> (can you imagine how many enumeration constants have the value 1? :) )
>
> I've never personally had a use-case for ctf_enum_name(), looking up an
> enumerator by the integer value. However, I can understand why you might
> want to do it if you know the type ID already (e.g. a debugger may want
> to represent an enum variable with the symbolic name).

I mostly put it there for completeness -- it's something you can't do
without a global view of the type system, which you *can* do with one.

> But I can't imagine a case where:
>
>   a. I have an integer value, and I know it's an enum, but
>   b. I don't know which enum type it belongs to, and yet

Indeed -- this alone suggests you have no idea what it's being passed
to, so where did you get it from? Usually in my case this is "memory
dumps but I'm not quite sure what type it is" and I want to know if some
huge mysterious magic number is actually an enum -- but for this to be
really useful we also need to translate #defines of integers into
something enum-like (a single big "enum" named "#DEFINE" perhaps, or
some other C-invalid name). Hmmmm...

>   c. I *do* know which CTF dictionary it belongs to...
>   d. And I want to get the list of all possible enumerators it could be

... but this is a good point. Sounds like a cross-archive one would be
more useful, but still not very useful :) I'll drop it for now.

> Maybe I'm deficient in imagination. I'm sure this use case could come
> up, but is it something that libctf really ought to optimize for?  I'd
> argue that there are only two use cases that really ought to be
> supported at the dict-wide level:
>
> 1. Lookup an enumerator by name. This is something that users of C do
> constantly, implicitly, just by using the constant name in their source
> code. So libctf really ought to support it efficiently with some sort of
> string index. (IMO, it should be supported at the dict and archive
> level).

And this is something we need to track *anyway* to prevent people adding
the same enumerator identifier repeatedly (in a given dict), you know
like they can now without libctf complaining at all, even though it's
totally impossible in C.

We probably do need to provide an archive-wide iterator version of this
as well, so we can look up all enumerators with a given name -- and that
would also handle the unfortunate existing case of multiple identifiers
with different values, coming from different enums, in the same dict.
Like the one you propose below!

> 2. Iterate over all enumerators. This one is already supported quite
> well in with libctf today:

... of course this too only works on a per-dict level, but that's
probably what you're after, since usually you're looking at things from
the perspective of a particular child dict.

> You could introduce an API to eliminate some of the boilerplate, which
> could be nice enough. As an existing user, I probably wouldn't take
> advantage of the new function, since I need to support older libctf
> versions.

For now :)

> function. However, a new API for this would be much less flexible... The
> above function allows me to run code for each enum type, before and
> after handling all of the enumerators for that type. A
> ctf_enumerator_next() function cannot really give me that information.
[...]

I think it could if properly defined, though the definition would be a
bit odd. Something like:

/* Iterate over the members of an enum TYPE, returning each enumerand's NAME or
   NULL at end of iteration or error, and optionally passing back the
   enumerand's integer VALue.  On end of iteration, sets ECTF_NEXT_END.
   At end of each enum, sets ECTF_NEXT_ENUM_END (and iteration
   continues).  */

const char *ctf_enumerator_next (ctf_dict_t *fp, ctf_next_t **it,
    	                         ctf_id_t *enum, int *val);

With that in place, you can do this:

const char *enumrator;
ctf_id_t enum_id;
int64_t value;

while ((enumerator = ctf_enumerator_next (dict, &next, &enum_id, &value)) != NULL
       || ctf_errno (dict) == ECTF_NEXT_ENUM_END) {
{
	if (ctf_errno (dict) == ECTF_NEXT_ENUM_END) {
		/* end-of-this-enum-type stuff */
		continue;
	}
	/* one-enumerand stuff... */
}
if (ctf_errno (dict) != ECTF_NEXT_END) {
	ctf_next_destroy (next);
	/* oops, error... */
}

Now possibly this is too different from the way other iterators work,
I'm not sure... the repetition of ECTF_NEXT_ENUM_END is ugly but there
is probably a less ugly way if I just thought for a moment :)

Compared to your example, hm, I'm not sure if this new API would buy us
enough to be worth it, but it's at least *possible* to do the same thing
looks pretty similar, and means you don't need to filter out non-enums:

> ctf_next_t *next = NULL, *enum_next;
> ctf_id_t id;
> int isroot, enum_value;
> const char *enum_name;
> while ((id = ctf_type_next(dict, &next, &isroot, 1)) != CTF_ERR) {
>     if (ctf_type_kind(dict, id) != CTF_K_ENUM)
>         continue;
>     enum_next = NULL;
>     while ((enum_name = ctf_enum_next(dict, id, &enum_next, &enum_value))) {
>         /* do something with (dict, id, enum_name, enum_value, isroot) */
>     }
> }

Maybe I'm just overdesigning again. I'm not proposing adding this one
yet, not until you tell me it might be useful.

> I'd argue it would be better to let users manually do their own
> iteration. Especially since they could combine all their iteration needs
> into a single ctf_type_next() loop.

True!

>> I'm very unsatisfied with the naming: to me, ctf_enumerator_* does not
>> read "like ctf_enum_* but dict-wide": but ctf_dict_enum_* feels wrong
>> too, as if it were dealing with enums *of* dicts. Suggestions?
>
> Given my (maybe not so humble) opinion above, I think that naming can
> become a bit clearer if you don't try to handle so many use cases. To

I knew my problem here was wild overdesign, but I wasn't sure what
direction I was overdesigning :) this is very valuable feedback, thank
you.

> me, this is a clear case of a "lookup". The word "lookup" to me implies
> a wide search, not simply within a single enum type. And if you only
> support name lookup, then the functions can be called:

Good point.

> ctf_id_t ctf_lookup_enumerator(ctf_dict_t *, const char *, int64_t *enum_value);
> ctf_id_t ctf_lookup_enumerator_next(ctf_dict_t *, const char *,
>                                     int64_t *enum_value, ctf_next_t **next);
>
> This would also make it easier to perform the (in my opinion, equally
> important) archive-level lookup:
>
> ctf_id_t ctf_arc_lookup_enumerator_next(ctf_archive_t *, const char *,
>                                         int64_t *enum_value, ctf_dict_t **dict,
>                                         ctf_next_t **next);

Very nice! I hereby ditch my design and switch to these, with one slight
rearrangement of parameters to satisfy the not-terribly-well-documented
parameter ordering rule for _next iterators ("dict, query, iterator,
out"):

ctf_id_t ctf_lookup_enumerator_next(ctf_dict_t *, const char *,
                                    ctf_next_t **next, int64_t *enum_value);

ctf_id_t ctf_arc_lookup_enumerator_next(ctf_archive_t *, const char *,
                                        ctf_next_t **next, int64_t *enum_value,
                                        ctf_dict_t **dict);

(but do we want to return the *enum type*, or the value? I guess the
type, as you do above, because you can do more with it.)

ctf_lookup_enumerator has an annoying naming difference from the
one-enum-type existing version, ctf_enum_value, but so be it. It's too
common an operation to force people to use iterators every time.

One question remains about ctf_lookup_enumerator: if it finds more than
one enumerator with that name (i.e. an old dict, before the deduplicator
considered such things conflicts), should we error or just pick one to
return? I'm wavering towards erroring with ECTF_DUPLICATE so the caller
can switch to using the iterator (or just choose not to, since such
cases are rare even now and will get rarer).

> The reason this feels so important for me is that, from a debugger's
> perspective, we don't frequently know which dictionary to search.

Yes indeed. I want to provide better "find an appropriate dict holding
this conflicting THING" functions, but I was holding off doing that
until after v4 and until I could think up a way to let the caller impose
an ordering over dicts -- in every case I know of so far the caller
knows that some child dicts, if they exist, should take priority over
others if the THING is found in them, so a function doing the search in
an arbitrary order would be useless.

> Frequently a user is just saying "give me the constant FOO", with no
> scope or anything to give a hint. Looking up a constant at the dict
> level is good enough for the 95% of the time when the constant lives in
> the parent dict. But in the remaining 5% it really stinks that you would
> need to go through each dict and re-do the search (which would likely
> repeat the failed search in the parent dict).

Hell yes.

>  => Note: with these archive-level lookups, though, it would be really
>  nice to avoid returning a brand new ctf_dict_t *. I don't know how the
>  semantics would work: only search dictionaries that already have an
>  open handle? Reference count the dictionaries and simply return the
>  same one if it's already open?

This is already handled :) dicts are alrady refcounted (which is where
the huge memory leak you spotted a while back came from...). We cache
the dicts internally to the ctf_archive_t, so you can act as if they're
new dicts, close them when you're done, and not pay any of the opening
costs except for the first time.

All this is done for you for all dicts returned from ctf_archive_next,
so everything that uses that to walk over dicts will get automatic
caching. See ctf-archive.c:ctf_dict_open_cached.

> If you do stick with your proposed API, which is good too, then I

Nah, yours is much better! I was just seduced by the previously
impossible operation of looking up enumerator constants given values :)
but as noted it's also a pretty useless operation most of the time, so
we can put it off until we actually have a use for it (and a better
interface, and a way to turn #defines into something similar).

> Hopefully something in my ramblings above proves helpful!

Absolutely!