Bug 30221

Summary: Negative cache should differentiate failure types
Product: elfutils
Component: debuginfod
Reporter: Vicki Pfau <vi>
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED WORKSFORME
Severity: normal
Priority: P2
Version: unspecified
Target Milestone: ---
CC: amerey, elfutils-devel, fche, mark

Description Vicki Pfau 2023-03-11 01:32:33 UTC
Having a negative cache is essential for performant lookups with debuginfod, but it doesn't differentiate between different types of failure, e.g. transient (timeout, interrupted, etc) and non-transient (server responded 404 or 410). A 404 error is unlikely to be resolved in as short a timespan as a transient error, and some servers don't have great debug info coverage, not to mention local builds. It's very annoying to have to hit tons of builds only to get a 404 so frequently, so it'd be a good idea to have different cache timeouts for different failure types.

I looked into how the negative cache works, and it looks like it's just the presence of a 0 byte file. Unfortunately, this means there's no metadata associated at all, and I'm not sure how tacking on metadata should work in the first place. xattrs perhaps?
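
A minimal sketch of the requested behaviour, assuming a two-way split of failures. This is not libdebuginfod code: the `MISS_TTL_S` table, its values, and the `classify` and `miss_is_fresh` helpers are hypothetical names chosen for illustration.

```python
# Hypothetical per-cause retention times in seconds; the values are
# illustrative only and are not taken from elfutils.
MISS_TTL_S = {"transient": 600, "not_found": 6000}

def classify(http_status=None):
    """Map a failed lookup to a hypothetical failure class."""
    if http_status in (404, 410):
        return "not_found"  # the server answered: the artifact is absent
    return "transient"      # timeout, interruption, or any other failure

def miss_is_fresh(age_s, cause):
    """True while a negative-cache entry of this cause should suppress a re-query."""
    return age_s < MISS_TTL_S[cause]
```

With these assumed values, a 700-second-old entry would still suppress a query after a 404 but not after a timeout.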
Comment 1 Frank Ch. Eigler 2023-03-13 16:40:48 UTC
Even a 404 error may be transient, as a server may just not have gotten around to indexing new content yet.  Other transient errors may persist awhile.  I don't know of any unambiguous winning policy here.

As to the question of how, if such a policy were formulated, the results could be represented in the filesystem:  xattrs, yeah maybe.  But even simpler would be to have the code set the mtime or ctime of the 0-length file to a cause-related artificial timestamp that will inform the "cache_miss_s" expiry calculations.
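
A rough sketch of the artificial-timestamp idea, assuming the expiry rule is "marker age greater than cache_miss_s". The constant `CACHE_MISS_S`, the `EXTRA_RETENTION_S` table, and both helper functions are hypothetical and do not reflect how libdebuginfod is actually written. Only mtime is shown, since ctime cannot be set directly from user space.

```python
import os
import time

CACHE_MISS_S = 600  # assumed baseline expiry, in seconds

# Hypothetical extra retention per cause, added on top of CACHE_MISS_S.
EXTRA_RETENTION_S = {"transient": 0, "not_found": 5400}

def record_miss(path, cause, now=None):
    """Create the 0-length marker and shift its mtime by a cause-related offset."""
    now = time.time() if now is None else now
    with open(path, "wb"):
        pass  # create or truncate; the file carries no content
    stamp = now + EXTRA_RETENTION_S[cause]
    os.utime(path, (stamp, stamp))  # sets (atime, mtime)

def miss_expired(path, now=None):
    """Unchanged expiry rule: compare the marker's age against CACHE_MISS_S."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime > CACHE_MISS_S
```

Pushing the mtime into the future lets the existing age check stay as it is while a "not_found" marker outlives a "transient" one.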
Comment 2 Vicki Pfau 2023-03-14 01:47:28 UTC
404 and the like *may* be transient, but the fact of the matter is that *most* of the time it won't be. And it's a cache, not a definitive answer saying this will never exist. Having a 404 cache for 10x the amount of time as a Ctrl-C would be a benefit to users 99% of the time, if not more. You don't need to overgeneralize to a surefire 100% of the time for something that's already "soft" like a cache. I'm already dealing with gdb taking well over 30 seconds to start running a program with a bunch of shared object dependencies that aren't in debuginfod...only to have to do that again in 10 minutes because there's no way for the cache to say "this probably won't appear in the short term." Setting cache_miss_s higher works, but is a workaround.

Using an artificial timestamp to fake out the cache_miss_s expiry is a hack. There's no other way of describing it. You're trying to wedge down additional information to a dumber system instead of making the system smarter if you go for that approach. Your filesystem representation works for the small, simple case you have here, but it won't scale if you try and extend the system with any metadata at all. You have one inode per negative cache file instead of one entry in, e.g. a SQLite database, which you can add additional columns to. xattrs are still a bit of a kludge but at least aren't trying to spoof information to fool a system unaware of complexity existing.
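
For comparison, a sketch of the database-backed alternative mentioned above: one row per miss, with the cause as an ordinary column. The table name, column names, `TTL_S` values, and helper functions are all assumptions made for illustration; the debuginfod client cache is not stored this way.

```python
import sqlite3

# Hypothetical retention per cause, in seconds (illustrative values).
TTL_S = {"transient": 600, "not_found": 6000}

def open_cache(path=":memory:"):
    """Open (or create) a hypothetical negative-cache database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS negative_cache ("
        " buildid TEXT NOT NULL,"
        " artifact TEXT NOT NULL,"
        " cause TEXT NOT NULL,"
        " recorded_at REAL NOT NULL,"
        " PRIMARY KEY (buildid, artifact))"
    )
    return db

def record_miss(db, buildid, artifact, cause, now):
    """Insert or overwrite the miss record for one (buildid, artifact) pair."""
    db.execute(
        "INSERT OR REPLACE INTO negative_cache VALUES (?, ?, ?, ?)",
        (buildid, artifact, cause, now),
    )
    db.commit()

def is_cached_miss(db, buildid, artifact, now):
    """True if a miss is recorded and its cause-specific TTL has not elapsed."""
    row = db.execute(
        "SELECT cause, recorded_at FROM negative_cache"
        " WHERE buildid = ? AND artifact = ?",
        (buildid, artifact),
    ).fetchone()
    return row is not None and now - row[1] < TTL_S[row[0]]
```

Further metadata would then be an extra column rather than another filesystem convention.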
Comment 3 Vicki Pfau 2023-03-17 01:00:28 UTC
I have a proof of concept patch that I can attach here or submit to the mailing list if you think the xattrs approach is a good way to go. Alternatively, a metadata directory could be added under each buildid for per-file info, which would work in the absence of functional xattrs, but be slightly more complex.
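
The proof-of-concept patch is not shown in this report, so the following is only a sketch of the two options described: an xattr on the marker file, with a per-buildid metadata directory as the fallback when xattrs are unavailable. The attribute name `user.debuginfod.miss_cause` and the `.metadata` directory layout are hypothetical.

```python
import os

XATTR_NAME = "user.debuginfod.miss_cause"  # hypothetical attribute name

def _sidecar(path):
    # Hypothetical layout: <buildid dir>/.metadata/<file name>
    meta_dir = os.path.join(os.path.dirname(path), ".metadata")
    return os.path.join(meta_dir, os.path.basename(path))

def set_cause(path, cause):
    """Attach the failure cause to a marker file; report which store was used."""
    try:
        os.setxattr(path, XATTR_NAME, cause.encode())
        return "xattr"
    except (AttributeError, OSError):
        # os.setxattr is missing on this platform, or the filesystem
        # refuses user xattrs: fall back to the metadata directory.
        side = _sidecar(path)
        os.makedirs(os.path.dirname(side), exist_ok=True)
        with open(side, "w") as f:
            f.write(cause)
        return "sidecar"

def get_cause(path):
    """Return the recorded cause, or None if nothing was recorded."""
    try:
        return os.getxattr(path, XATTR_NAME).decode()
    except (AttributeError, OSError):
        try:
            with open(_sidecar(path)) as f:
                return f.read()
        except FileNotFoundError:
            return None
```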
Comment 4 Frank Ch. Eigler 2023-03-17 01:08:20 UTC
(In reply to Vicki Pfau from comment #3)
> I have a proof of concept patch that I can attach here or submit to the
> mailing list if you think the xattrs approach is a good way to go.
> Alternatively, a metadata directory could be added under each buildid for
> per-file info, which would work in the absence of functional xattrs, but be
> slightly more complex.

Have you considered the idea of encoding the retention deadline in the boring inode mtime or ctime?
Comment 5 Vicki Pfau 2023-03-17 01:16:26 UTC
(In reply to Frank Ch. Eigler from comment #4)
> (In reply to Vicki Pfau from comment #3)
> > I have a proof of concept patch that I can attach here or submit to the
> > mailing list if you think the xattrs approach is a good way to go.
> > Alternatively, a metadata directory could be added under each buildid for
> > per-file info, which would work in the absence of functional xattrs, but be
> > slightly more complex.
> 
> Have you considered the idea of encoding the retention deadline in the
> boring inode mtime or ctime?

I did, and in comment 2 I already explained why I think it's a bad idea.
Comment 6 Frank Ch. Eigler 2023-03-17 01:20:27 UTC
(In reply to Vicki Pfau from comment #2)
> 404 and the like *may* be transient, but the fact of the matter is that
> *most* of the time it won't be. And it's a cache, not a definitive answer
> saying this will never exist. Having a 404 cache for 10x the amount of time
> as a Ctrl-C

I don't understand - a ctrl-C should not result in a cached artifact at all.
If that's happening, we should fix that.
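
The invariant stated here can be sketched as follows. This is not the libdebuginfod implementation: `NotFound`, `query_with_negative_cache`, and the `fetch` callback are hypothetical stand-ins, and treating only a definitive miss as cacheable is an assumption.

```python
class NotFound(Exception):
    """Hypothetical signal that the server definitively lacks the artifact."""

def query_with_negative_cache(marker_path, fetch):
    """Run fetch(); create the 0-byte marker only for a definitive miss."""
    try:
        return fetch()
    except NotFound:
        with open(marker_path, "wb"):
            pass  # record the miss as an empty file
        return None
    # KeyboardInterrupt and any other error propagate without touching the cache.
```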

> I'm already dealing with gdb taking well
> over 30 seconds to start running a program with a bunch of shared object
> dependencies that aren't in debuginfod...

Uncached misses from debuginfod tend to take on the order of milliseconds,
much less than seconds.  Do you have a trace of what's happening?
(DEBUGINFOD_VERBOSE=1 or something like that?)

> [...] because there's no way for the cache to say "this probably won't
> appear in the short term." Setting cache_miss_s higher works, but is a
> workaround.

That workaround is precisely the parameter for the quantity you seek.

> Your filesystem representation works
> for the small, simple case you have here, but it won't scale if you try and
> extend the system with any metadata at all.

That's fine.  We can revisit when rationale exists for more metadata.
Comment 7 Vicki Pfau 2023-03-17 01:28:54 UTC
I have a proof of concept patch that I can attach here or submit to the mailing list if you think the xattrs approach is a good way to go. Alternatively, a metadata directory could be added under each buildid for per-file info, which would work in the absence of functional xattrs, but be slightly more complex.

(In reply to Frank Ch. Eigler from comment #4)
> (In reply to Vicki Pfau from comment #3)
> > I have a proof of concept patch that I can attach here or submit to the
> > mailing list if you think the xattrs approach is a good way to go.
> > Alternatively, a metadata directory could be added under each buildid for
> > per-file info, which would work in the absence of functional xattrs, but be
> > slightly more complex.
> 
> Have you considered the idea of encoding the retention deadline in the
> boring inode mtime or ctime?

I did, and in comment 2 I already explained why I think it's a bad idea.

(In reply to Frank Ch. Eigler from comment #6)
> (In reply to Vicki Pfau from comment #2)
> > 404 and the like *may* be transient, but the fact of the matter is that
> > *most* of the time it won't be. And it's a cache, not a definitive answer
> > saying this will never exist. Having a 404 cache for 10x the amount of time
> > as a Ctrl-C
> 
> I don't understand - a ctrl-C should not result in a cached artifact at all.
> If that's happening, we should fix that.

Okay, that is kinda weird then. I'm seeing it in gdb--perhaps it's a gdb issue then.

> > I'm already dealing with gdb taking well
> > over 30 seconds to start running a program with a bunch of shared object
> > dependencies that aren't in debuginfod...
> 
> Uncached misses from debuginfod tend to take on the order of milliseconds,
> much less than seconds.  Do you have a trace of what's happening?
> (DEBUGINFOD_VERBOSE=1 or something like that?)

The issue appears to be the debuginfod server taking a not-insignificant amount of time per request (500ms - 2s I'd estimate) to report the absence of an associated artifact. Perhaps this is just an issue with how the server is configured. I'm using the elfutils server, but I've seen the same issue on Arch's server (the distro I'm using). It's worth noting too that some users will undoubtedly have higher latency. A way to asynchronously initiate requests so you can have multiple going at once would be great to try and alleviate this somewhat, but it doesn't look like there's a way to do this yet.

> > [...] because there's no way for the cache to say "this probably won't
> > appear in the short term." Setting cache_miss_s higher works, but is a
> > workaround.
> 
> That workaround is precisely the parameter for the quantity you seek.

Assuming the Ctrl-C issue I mentioned above is resolved, you could well be right. It's definitely the biggest source of the "transient" issues I mentioned, though things like timeouts might still qualify.

> > Your filesystem representation works
> > for the small, simple case you have here, but it won't scale if you try and
> > extend the system with any metadata at all.
> 
> That's fine.  We can revisit when rationale exists for more metadata.

Sounds good.
Comment 8 Vicki Pfau 2023-03-17 01:30:47 UTC
Apologies for the double-post of the first part of that comment. I reloaded the page and apparently hitting the reply button didn't clear the comment at the top and I didn't notice until I replied.
Comment 9 Aaron Merey 2023-03-17 16:16:01 UTC
(In reply to Vicki Pfau from comment #7)
> I did, and in comment 2 I already explained why I think it's a bad idea.
> (In reply to Frank Ch. Eigler from comment #6)
> > (In reply to Vicki Pfau from comment #2)
> > > 404 and the like *may* be transient, but the fact of the matter is that
> > > *most* of the time it won't be. And it's a cache, not a definitive answer
> > > saying this will never exist. Having a 404 cache for 10x the amount of time
> > > as a Ctrl-C
> > 
> > I don't understand - a ctrl-C should not result in a cached artifact at all.
> > If that's happening, we should fix that.
> 
> Okay, that is kinda weird then. I'm seeing it in gdb--perhaps it's a gdb
> issue then.

The issue was in libdebuginfod itself. I merged a fix for this:
https://sourceware.org/pipermail/elfutils-devel/2023q1/006050.html

> > > I'm already dealing with gdb taking well
> > > over 30 seconds to start running a program with a bunch of shared object
> > > dependencies that aren't in debuginfod...
> > 
> > Uncached misses from debuginfod tend to take on the order of milliseconds,
> > much less than seconds.  Do you have a trace of what's happening?
> > (DEBUGINFOD_VERBOSE=1 or something like that?)
> 
> The issue appears to be the debuginfod server taking a not-insignificant
> amount of time per request (500ms - 2s I'd estimate) to report the absence
> of an associated artifact. Perhaps this is just an issue with how the server
> is configured. I'm using the elfutils server, but I've seen the same issue
> on Arch's server (the distro I'm using). It's worth noting too that some
> users will undoubtedly have higher latency. A way to asynchronously initiate
> requests so you can have multiple going at once would be great to try and
> alleviate this somewhat, but it doesn't look like there's a way to do this
> yet.

There has been some discussion about gdb downloading from debuginfod in
background worker threads.  I would like to get this feature added eventually.
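
A small sketch of why concurrent requests would help with per-request latency. It does not use gdb or libdebuginfod: `lookup` is a hypothetical stand-in that only sleeps to simulate a server round trip, and `lookup_all` is an assumed helper name.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def lookup(buildid, latency_s=0.05):
    """Stand-in for one blocking debuginfod query; always reports a miss."""
    time.sleep(latency_s)  # simulated server round trip
    return buildid, None

def lookup_all(buildids, max_workers=8):
    """Issue the queries concurrently and collect results keyed by build ID."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lookup, buildids))
```

With enough workers, the total wait approaches the latency of the slowest single query instead of the sum of all of them.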
Comment 10 Aaron Merey 2023-03-17 16:53:38 UTC
(In reply to Vicki Pfau from comment #7)
> The issue appears to be the debuginfod server taking a not-insignificant
> amount of time per request (500ms - 2s I'd estimate) to report the absence
> of an associated artifact. 

Long-lived TCP connections to debuginfod servers were added to GDB 11.1. Before that we'd set up and tear down a connection for each query which added unnecessary latency. So if you are using an older version of GDB this could explain some of the delay.
Comment 11 Vicki Pfau 2023-03-17 23:39:47 UTC
I am using 11.1, but I think part of the problem is that Arch adopted debuginfod relatively recently and hasn't backfilled packages. I updated my packages yesterday and it took forever to start gdb today, but I think it was actually downloading most of those packages so it shouldn't leave negative cache this time. I don't know how debuginfod federation works either, but I'd absolutely believe that Arch's server is just slow for one reason or another, and somehow that's causing issues even though I'm querying the sourceware one directly.

When I updated gdb a few days ago I did notice that the way information was presented to the user changed, and it did seem faster, but I'm unsure if that was just placebo effect due to the fact that it was telling me more information now.
Comment 12 Frank Ch. Eigler 2023-03-24 11:22:31 UTC
There was a wild performance regression in sqlite 3.41 that archlinux's debuginfod server got hit with.  This was identified and corrected yesterday.  (It had nothing to do with caching.)  https://sqlite.org/forum/forumpost/a284a63124
Comment 13 Mark Wielaard 2023-04-08 13:29:51 UTC
Has this issue been fixed with the fix from comment #9? https://sourceware.org/cgit/elfutils/commit/?id=5527216460c6131527c27b06dada015b67525966
And/or was it caused by the sqlite performance regression mentioned in comment #12?
Comment 14 Frank Ch. Eigler 2023-04-21 02:04:58 UTC
We believe the current code behaves better with respect to aborted downloads.  Thank you for your report.