When a big debuginfod server starts grooming and finds stale data (archives or files that have been removed), its self-cleaning efforts can take a long time: a single sqlite nuke query has been observed to take O(seconds). In the metrics, see the sqlite3_milliseconds_count...{"nuke..."} ones. The groom() function checks every file for staleness until interrupted by a SIGUSR1, so O(50000) stale files could take a whole day. During all this time the server can still service buildid requests, so it's not that bad, but it cannot scan for new files.

We should investigate whether a more time-bounded groom operation could serve about as well. We could limit grooming to a certain fraction of time, say 1 hr/day, then abort. (We'd have to traverse the file list in some stateful or random way so as not to just recheck the same files over and over.) The post-loop cleanup ops ("nuke orphan buildids" through the end of the function) are relatively quick and not worth worrying about at this time.

Alternately, there may be a way to accelerate the individual nuke queries, perhaps with more indexes, at the cost of more storage.
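A minimal sketch of the time-bounded, stateful-traversal idea. This is not the actual debuginfod implementation; check_stale() and nuke_file() are hypothetical stand-ins for the per-file staleness query and deletion, and the resume offset represents the state carried between groom cycles so successive bounded passes eventually cover the whole file list:

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real per-file groom work.
static bool check_stale (const std::string&) { return false; }
static void nuke_file (const std::string&) {}

// Groom at most until the wall-clock budget runs out, starting from
// resume_offset.  Returns the offset to resume from next cycle, so no
// file is rechecked before the whole list has been visited once.
size_t
groom_bounded (const std::vector<std::string>& files,
               size_t resume_offset,
               std::chrono::seconds budget)
{
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now () + budget;
  size_t i = 0;
  for (; i < files.size (); ++i)
    {
      if (clock::now () >= deadline)
        break;                      // time budget exhausted; stop early
      const std::string& f = files[(resume_offset + i) % files.size ()];
      if (check_stale (f))
        nuke_file (f);
    }
  return (resume_offset + i) % files.size ();
}
```

With a generous budget the pass covers every file and the returned offset wraps back to where it started; with a tight budget it stops partway and the next cycle picks up from there instead of re-examining the same prefix.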
commit c1e8c8c6b25cb2b5c16553609f19a9ed5dd4e146
Author: Frank Ch. Eigler <fche@redhat.com>
Date:   Thu Nov 4 13:08:35 2021 -0400

    PR28514: debuginfod: limit groom operation times