Summary:          Negative cache should differentiate failure types
Product:          elfutils
Component:        debuginfod
Status:           RESOLVED WORKSFORME
Severity:         normal
Priority:         P2
Version:          unspecified
Target Milestone: ---
Reporter:         Vicki Pfau <vi>
Assignee:         Not yet assigned to anyone <unassigned>
CC:               amerey, elfutils-devel, fche, mark
Description
Vicki Pfau 2023-03-11 01:32:33 UTC

Even a 404 error may be transient, as a server may just not have gotten around to indexing new content yet. Other transient errors may persist a while. I don't know of any unambiguous winning policy here. As to the question of how, if such a policy were formulated, the results could be represented in the filesystem: xattrs, yeah, maybe. But even simpler would be to have the code set the mtime or ctime of the 0-length file to a cause-related artificial timestamp that informs the "cache_miss_s" expiry calculations.

404 and the like *may* be transient, but the fact of the matter is that *most* of the time they won't be. And it's a cache, not a definitive answer saying this will never exist. Having a 404 cached for 10x the amount of time as a Ctrl-C would be a benefit to users 99% of the time, if not more. You don't need to overgeneralize to a surefire 100% of the time for something that's already "soft" like a cache. I'm already dealing with gdb taking well over 30 seconds to start running a program with a bunch of shared object dependencies that aren't in debuginfod... only to have to do that again in 10 minutes, because there's no way for the cache to say "this probably won't appear in the short term." Setting cache_miss_s higher works, but it's a workaround.

Using an artificial timestamp to fake out the cache_miss_s expiry is a hack; there's no other way of describing it. With that approach you're trying to wedge additional information into a dumber system instead of making the system smarter. The filesystem representation works for the small, simple case you have here, but it won't scale if you try to extend the system with any metadata at all: you have one inode per negative-cache file instead of one entry in, e.g., a SQLite database, to which you can add additional columns. xattrs are still a bit of a kludge, but at least they aren't trying to spoof information to fool a system unaware of the complexity.
I have a proof of concept patch that I can attach here or submit to the mailing list if you think the xattrs approach is a good way to go. Alternatively, a metadata directory could be added under each buildid for per-file info, which would work in the absence of functional xattrs, but be slightly more complex.

(In reply to Vicki Pfau from comment #3)
> I have a proof of concept patch that I can attach here or submit to the
> mailing list if you think the xattrs approach is a good way to go.

Have you considered the idea of encoding the retention deadline in the boring inode mtime or ctime?

(In reply to Frank Ch. Eigler from comment #4)
> Have you considered the idea of encoding the retention deadline in the
> boring inode mtime or ctime?

I did, and in comment 2 I already explained why I think it's a bad idea.

(In reply to Vicki Pfau from comment #2)
> 404 and the like *may* be transient, but the fact of the matter is that
> *most* of the time it won't be. And it's a cache, not a definitive answer
> saying this will never exist.
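The xattrs approach being weighed here could look something like the sketch below. This is not the proof-of-concept patch itself; the attribute name `user.debuginfod.cause` is invented for illustration, and real code would need to handle filesystems without user xattr support (the "absence of functional xattrs" case mentioned above).

```c
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Record why a negative-cache entry exists (e.g. "404", "timeout")
   as a user extended attribute on the 0-length file.  The attribute
   name is invented for this sketch.  */
static int set_miss_cause (const char *path, const char *cause)
{
  return setxattr (path, "user.debuginfod.cause", cause, strlen (cause), 0);
}

/* Read the cause back; returns its length, or -1 (e.g. attribute
   missing, or the filesystem does not support user xattrs).  */
static ssize_t get_miss_cause (const char *path, char *buf, size_t len)
{
  ssize_t n = getxattr (path, "user.debuginfod.cause", buf, len - 1);
  if (n >= 0)
    buf[n] = '\0';
  return n;
}
```

Unlike the backdated-mtime trick, the cause survives as explicit, extensible metadata, at the cost of depending on xattr support in the cache filesystem.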
> Having a 404 cache for 10x the amount of time as a Ctrl-C

I don't understand - a ctrl-C should not result in a cached artifact at all. If that's happening, we should fix that.

> I'm already dealing with gdb taking well over 30 seconds to start running a
> program with a bunch of shared object dependencies that aren't in
> debuginfod...

Uncached misses from debuginfod tend to take on the order of milliseconds, much less than seconds. Do you have a trace of what's happening? (DEBUGINFOD_VERBOSE=1 or something like that?)

> [...] because there's no way for the cache to say "this probably won't
> appear in the short term." Setting cache_miss_s higher works, but is a
> workaround.

That workaround is precisely the parameter for the quantity you seek.

> Your filesystem representation works for the small, simple case you have
> here, but it won't scale if you try and extend the system with any metadata
> at all.

That's fine. We can revisit when rationale exists for more metadata.

(In reply to Frank Ch. Eigler from comment #6)
> I don't understand - a ctrl-C should not result in a cached artifact at
> all. If that's happening, we should fix that.

Okay, that is kinda weird then. I'm seeing it in gdb, so perhaps it's a gdb issue.

> Uncached misses from debuginfod tend to take on the order of milliseconds,
> much less than seconds. Do you have a trace of what's happening?
> (DEBUGINFOD_VERBOSE=1 or something like that?)

The issue appears to be the debuginfod server taking a not-insignificant amount of time per request (500 ms - 2 s, I'd estimate) to report the absence of an associated artifact. Perhaps this is just an issue with how the server is configured. I'm using the elfutils server, but I've seen the same issue on Arch's server (the distro I'm using). It's worth noting too that some users will undoubtedly have higher latency. A way to asynchronously initiate requests so you can have multiple going at once would be great to try and alleviate this somewhat, but it doesn't look like there's a way to do this yet.

> That workaround is precisely the parameter for the quantity you seek.

Assuming the Ctrl-C issue I mentioned above is resolved, you could well be right. It's definitely the biggest source of the "transient" issues I mentioned, though things like timeouts might still qualify.
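For reference, the cache_miss_s knob being discussed is read from a one-line file in the debuginfod client cache directory, commonly $HOME/.cache/debuginfod_client/cache_miss_s (the default is 600 seconds in recent elfutils; check debuginfod-client-config(7) for your version's exact path and default). To keep misses cached for an hour, the file would contain just:

```
3600
```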
> > Your filesystem representation works for the small, simple case you have
> > here, but it won't scale if you try and extend the system with any
> > metadata at all.
>
> That's fine. We can revisit when rationale exists for more metadata.

Sounds good.

Apologies for the double-post of the first part of that comment. I reloaded the page, and apparently hitting the reply button didn't clear the comment at the top; I didn't notice until I replied.

(In reply to Vicki Pfau from comment #7)
> Okay, that is kinda weird then. I'm seeing it in gdb, so perhaps it's a gdb
> issue.

The issue was in libdebuginfod itself. I merged a fix for this: https://sourceware.org/pipermail/elfutils-devel/2023q1/006050.html

> The issue appears to be the debuginfod server taking a not-insignificant
> amount of time per request (500 ms - 2 s, I'd estimate) to report the
> absence of an associated artifact. Perhaps this is just an issue with how
> the server is configured.
> I'm using the elfutils server, but I've seen the same issue on Arch's
> server (the distro I'm using). It's worth noting too that some users will
> undoubtedly have higher latency. A way to asynchronously initiate requests
> so you can have multiple going at once would be great to try and alleviate
> this somewhat, but it doesn't look like there's a way to do this yet.

There has been some discussion about gdb downloading from debuginfod in background worker threads. I would like to get this feature added eventually.

(In reply to Vicki Pfau from comment #7)
> The issue appears to be the debuginfod server taking a not-insignificant
> amount of time per request (500 ms - 2 s, I'd estimate) to report the
> absence of an associated artifact.

Long-lived TCP connections to debuginfod servers were added to GDB 11.1. Before that, we'd set up and tear down a connection for each query, which added unnecessary latency. So if you are using an older version of GDB, this could explain some of the delay.

I am using 11.1, but I think part of the problem is that Arch adopted debuginfod relatively recently and hasn't backfilled packages. I updated my packages yesterday and it took forever to start gdb today, but I think it was actually downloading most of those packages, so it shouldn't leave negative-cache entries this time. I don't know how debuginfod federation works either, but I'd absolutely believe that Arch's server is just slow for one reason or another, and somehow that's causing issues even though I'm querying the sourceware one directly. When I updated gdb a few days ago I did notice that the way information was presented to the user changed, and it did seem faster, but I'm unsure if that was just a placebo effect due to it telling me more information now.

There was a wild performance regression in sqlite 3.41 that Arch Linux's debuginfod server got hit with. This was identified and corrected yesterday. (It had nothing to do with caching.)
https://sqlite.org/forum/forumpost/a284a63124

Has this issue been fixed by the fix from comment #9 (https://sourceware.org/cgit/elfutils/commit/?id=5527216460c6131527c27b06dada015b67525966)? And/or was it caused by the sqlite performance regression mentioned in comment #9?

We believe the current code behaves better with respect to aborted downloads. Thank you for your report.
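For context, the aborted-download behavior referred to in the resolution comes down to not leaving a 0-length file behind when a download is interrupted. A minimal sketch of the idea follows; this is not the actual elfutils patch, and the function and file names are invented.

```c
#include <unistd.h>

/* Sketch: after a download attempt, decide what the cache should keep.
   A genuine 404 leaves the empty target file as a negative-cache entry;
   an interrupted (e.g. Ctrl-C'd) download must not, or the interruption
   would be cached as if the server had said "not found".  */
static int finish_negative_cache (const char *target_path, int interrupted)
{
  if (interrupted)
    {
      unlink (target_path); /* drop the would-be negative-cache entry */
      return -1;
    }
  return 0; /* keep the 0-length file; it expires per cache_miss_s */
}
```

With this separation in place, only deliberate server answers populate the negative cache, which removes the worst of the "transient failure cached for 10 minutes" cases discussed above.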