[PATCH] PR31265 - rework debuginfod archive-extract fdcache

Aaron Merey amerey@redhat.com
Tue Feb 13 16:36:10 GMT 2024


Hi Frank,

Overall the patch LGTM. You've rigorously tested it so I think it's
ready for merging.

Aaron

On Mon, Feb 12, 2024 at 11:07 AM Frank Ch. Eigler <fche@elastic.org> wrote:
>
> Hi -
>
> I plan to roll out this improvement to all the servers I look after,
> via COPR builds or such, it's that impactful.
>
>
> commit 7bfc10acc7f0e00c5bc45416d1bf8ee183d79ff3 (HEAD -> master)
> Author: Frank Ch. Eigler <fche@redhat.com>
> Date:   Thu Feb 8 19:33:55 2024 -0500
>
>     PR31265 - rework debuginfod archive-extract fdcache
>
>     Completely replace the "fdcache" algorithms in debuginfod, which
>     manages files extracted from archives.  Previous logic was a LRU queue
>     for files requested by users, and a separate LRU queue for prefetched
>     files found nearby the ones users requested.  The code did not handle
>     annoying edge cases like infrequently accessed but very costly
>     extraction of files like fedora kernels' vdso.debug.  In addition, the
>     queue was searched linearly for normal lookups.  It was also
>     unceremoniously dropped at each groom cycle.
>
>     New code replaces this with an indexed datastructure for quick
>     lookups, and extra metadata for use during eviction decisions.  Each
>     entry tracks size and such, but now also tracks how recently and how
>     many times it was requested, how long it took to originally extract.
>     The new code combines these quantities in a score, by which eviction
>     eligibility is ranked.  Intuitively, the effect is to prefer to hoard
>     small / slow-to-access files, and prefer to jettison large / fast /
>     never accessed ones.
>
>     It's a tricky thing to balance.  The parameters in this configuration
>     were tested by timing-accurate replaying a few days' worth of actual
>     traffic of the main fedora debuginfod server.  The peer
>     debuginfod.stg.fedoraproject.org runs the new code.  It shows good
>     performance, excellent use of the cache storage, and strong preference
>     to hold onto those vdso.debug files.  But who knows, it might need
>     tweaking later.  The new code adds more prometheus metrics to make it
>     possible to grok the effectiveness of the few remaining
>     fdcache-related options.
>
>     Patch includes doc updates and NEWS.  The changes are invisible to the
>     testsuite (except with respect to the new metrics).  Code changes are
>     focused on all the member functions of class libarchive_fdcache, and
>     their callers.  Unused parameters are removed, with previous command
>     line options hidden/accepted/ignored.  Some other minor error-path
>     tempfile-gc was fixed in the extraction paths.
>
>     Signed-Off-By: Frank Ch. Eigler <fche@redhat.com>
>
> diff --git a/NEWS b/NEWS
> index 98dc78f5130b..21263456b9b7 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -1,5 +1,8 @@
>  Version 0.191 (after 0.190)
>
> +debuginfod: Caching eviction logic improvements to improve retention
> +            of small/frequent/slow files such as Fedora's vdso.debug.
> +
>  srcfiles: Can now fetch the source files of a DWARF/ELF file and
>            place them into a zip.
>
> diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
> index 6b21f46f70ea..63424e59c28c 100644
> --- a/debuginfod/debuginfod.cxx
> +++ b/debuginfod/debuginfod.cxx
> @@ -1,5 +1,5 @@
>  /* Debuginfo-over-http server.
> -   Copyright (C) 2019-2021 Red Hat, Inc.
> +   Copyright (C) 2019-2024 Red Hat, Inc.
>     Copyright (C) 2021, 2022 Mark J. Wielaard <mark@klomp.org>
>     This file is part of elfutils.
>
> @@ -69,6 +69,8 @@ extern "C" {
>  #include <unistd.h>
>  #include <fcntl.h>
>  #include <netdb.h>
> +#include <math.h>
> +#include <float.h>
>
>
>  /* If fts.h is included before config.h, its indirect inclusions may not
> @@ -417,7 +419,7 @@ static const struct argp_option options[] =
>     { "verbose", 'v', NULL, 0, "Increase verbosity.", 0 },
>     { "regex-groom", 'r', NULL, 0,"Uses regexes from -I and -X arguments to groom the database.",0},
>  #define ARGP_KEY_FDCACHE_FDS 0x1001
> -   { "fdcache-fds", ARGP_KEY_FDCACHE_FDS, "NUM", 0, "Maximum number of archive files to keep in fdcache.", 0 },
> +   { "fdcache-fds", ARGP_KEY_FDCACHE_FDS, "NUM", OPTION_HIDDEN, NULL, 0 },
>  #define ARGP_KEY_FDCACHE_MBS 0x1002
>     { "fdcache-mbs", ARGP_KEY_FDCACHE_MBS, "MB", 0, "Maximum total size of archive file fdcache.", 0 },
>  #define ARGP_KEY_FDCACHE_PREFETCH 0x1003
> @@ -425,11 +427,9 @@ static const struct argp_option options[] =
>  #define ARGP_KEY_FDCACHE_MINTMP 0x1004
>     { "fdcache-mintmp", ARGP_KEY_FDCACHE_MINTMP, "NUM", 0, "Minimum free space% on tmpdir.", 0 },
>  #define ARGP_KEY_FDCACHE_PREFETCH_MBS 0x1005
> -   { "fdcache-prefetch-mbs", ARGP_KEY_FDCACHE_PREFETCH_MBS, "MB", 0,"Megabytes allocated to the \
> -      prefetch cache.", 0},
> +   { "fdcache-prefetch-mbs", ARGP_KEY_FDCACHE_PREFETCH_MBS, "MB", OPTION_HIDDEN, NULL, 0},
>  #define ARGP_KEY_FDCACHE_PREFETCH_FDS 0x1006
> -   { "fdcache-prefetch-fds", ARGP_KEY_FDCACHE_PREFETCH_FDS, "NUM", 0,"Number of files allocated to the \
> -      prefetch cache.", 0},
> +   { "fdcache-prefetch-fds", ARGP_KEY_FDCACHE_PREFETCH_FDS, "NUM", OPTION_HIDDEN, NULL, 0},
>  #define ARGP_KEY_FORWARDED_TTL_LIMIT 0x1007
>     {"forwarded-ttl-limit", ARGP_KEY_FORWARDED_TTL_LIMIT, "NUM", 0, "Limit of X-Forwarded-For hops, default 8.", 0},
>  #define ARGP_KEY_PASSIVE 0x1008
> @@ -482,12 +482,9 @@ static regex_t file_include_regex;
>  static regex_t file_exclude_regex;
>  static bool regex_groom = false;
>  static bool traverse_logical;
> -static long fdcache_fds;
>  static long fdcache_mbs;
>  static long fdcache_prefetch;
>  static long fdcache_mintmp;
> -static long fdcache_prefetch_mbs;
> -static long fdcache_prefetch_fds;
>  static unsigned forwarded_ttl_limit = 8;
>  static bool scan_source_info = true;
>  static string tmpdir;
> @@ -495,7 +492,9 @@ static bool passive_p = false;
>  static long scan_checkpoint = 256;
>
>  static void set_metric(const string& key, double value);
> -// static void inc_metric(const string& key);
> +static void inc_metric(const string& key);
> +static void add_metric(const string& metric,
> +                       double value);
>  static void set_metric(const string& metric,
>                         const string& lname, const string& lvalue,
>                         double value);
> @@ -653,7 +652,7 @@ parse_opt (int key, char *arg,
>        regex_groom = true;
>        break;
>      case ARGP_KEY_FDCACHE_FDS:
> -      fdcache_fds = atol (arg);
> +      // deprecated
>        break;
>      case ARGP_KEY_FDCACHE_MBS:
>        fdcache_mbs = atol (arg);
> @@ -673,14 +672,10 @@ parse_opt (int key, char *arg,
>        source_paths.insert(string(arg));
>        break;
>      case ARGP_KEY_FDCACHE_PREFETCH_FDS:
> -      fdcache_prefetch_fds = atol(arg);
> -      if ( fdcache_prefetch_fds < 0)
> -        argp_failure(state, 1, EINVAL, "fdcache prefetch fds");
> +      // deprecated
>        break;
>      case ARGP_KEY_FDCACHE_PREFETCH_MBS:
> -      fdcache_prefetch_mbs = atol(arg);
> -      if ( fdcache_prefetch_mbs < 0)
> -        argp_failure(state, 1, EINVAL, "fdcache prefetch mbs");
> +      // deprecated
>        break;
>      case ARGP_KEY_PASSIVE:
>        passive_p = true;
> @@ -1438,76 +1433,79 @@ class libarchive_fdcache
>  private:
>    mutex fdcache_lock;
>
> +  typedef pair<string,string> key; // archive, entry
>    struct fdcache_entry
>    {
> -    string archive;
> -    string entry;
> -    string fd;
> +    string fd; // file name (probably in $TMPDIR), not an actual open fd (EMFILE)
>      double fd_size_mb; // slightly rounded up megabytes
> +    time_t freshness; // when was this entry created or requested last
> +    unsigned request_count; // how many requests were made; or 0=prefetch only
> +    double latency; // how many seconds it took to extract the file
>    };
> -  deque<fdcache_entry> lru; // @head: most recently used
> +
> +  map<key,fdcache_entry> entries; // optimized for lookup
> +  time_t last_cleaning;
>    long max_fds;
> -  deque<fdcache_entry> prefetch; // prefetched
>    long max_mbs;
> -  long max_prefetch_mbs;
> -  long max_prefetch_fds;
>
>  public:
>    void set_metrics()
>    {
>      double fdcache_mb = 0.0;
>      double prefetch_mb = 0.0;
> -    for (auto i = lru.begin(); i < lru.end(); i++)
> -      fdcache_mb += i->fd_size_mb;
> -    for (auto j = prefetch.begin(); j < prefetch.end(); j++)
> -      prefetch_mb += j->fd_size_mb;
> +    unsigned fdcache_count = 0;
> +    unsigned prefetch_count = 0;
> +    for (auto &i : entries) {
> +      if (i.second.request_count) {
> +        fdcache_mb += i.second.fd_size_mb;
> +        fdcache_count ++;
> +      } else {
> +        prefetch_mb += i.second.fd_size_mb;
> +        prefetch_count ++;
> +      }
> +    }
>      set_metric("fdcache_bytes", fdcache_mb*1024.0*1024.0);
> -    set_metric("fdcache_count", lru.size());
> +    set_metric("fdcache_count", fdcache_count);
>      set_metric("fdcache_prefetch_bytes", prefetch_mb*1024.0*1024.0);
> -    set_metric("fdcache_prefetch_count", prefetch.size());
> +    set_metric("fdcache_prefetch_count", prefetch_count);
>    }
>
> -  void intern(const string& a, const string& b, string fd, off_t sz, bool front_p)
> +  void intern(const string& a, const string& b, string fd, off_t sz,
> +              bool requested_p, double lat)
>    {
>      {
>        unique_lock<mutex> lock(fdcache_lock);
> -      // nuke preexisting copy
> -      for (auto i = lru.begin(); i < lru.end(); i++)
> -        {
> -          if (i->archive == a && i->entry == b)
> -            {
> -              unlink (i->fd.c_str());
> -              lru.erase(i);
> -              inc_metric("fdcache_op_count","op","dequeue");
> -              break; // must not continue iterating
> -            }
> -        }
> -      // nuke preexisting copy in prefetch
> -      for (auto i = prefetch.begin(); i < prefetch.end(); i++)
> +      time_t now = time(NULL);
> +      // there is a chance it's already in here, just wasn't found last time
> +      // if so, there's nothing to do but count our luck
> +      auto i = entries.find(make_pair(a,b));
> +      if (i != entries.end())
>          {
> -          if (i->archive == a && i->entry == b)
> -            {
> -              unlink (i->fd.c_str());
> -              prefetch.erase(i);
> -              inc_metric("fdcache_op_count","op","prefetch_dequeue");
> -              break; // must not continue iterating
> -            }
> +          inc_metric("fdcache_op_count","op","redundant_intern");
> +          if (requested_p) i->second.request_count ++; // repeat prefetch doesn't count
> +          i->second.freshness = now;
> +          // We need to nuke the temp file, since interning passes
> +          // responsibility over the path to this structure.  It is
> +          // possible that the caller still has an fd open, but that's
> +          // OK.
> +          unlink (fd.c_str());
> +          return;
>          }
>        double mb = (sz+65535)/1048576.0; // round up to 64K block
> -      fdcache_entry n = { a, b, fd, mb };
> -      if (front_p)
> -        {
> -          inc_metric("fdcache_op_count","op","enqueue");
> -          lru.push_front(n);
> -        }
> +      fdcache_entry n = { .fd=fd, .fd_size_mb=mb,
> +                          .freshness=now, .request_count = requested_p?1U:0U,
> +                          .latency=lat};
> +      entries.insert(make_pair(make_pair(a,b),n));
> +
> +      if (requested_p)
> +        inc_metric("fdcache_op_count","op","enqueue");
>        else
> -        {
> -          inc_metric("fdcache_op_count","op","prefetch_enqueue");
> -          prefetch.push_front(n);
> -        }
> +        inc_metric("fdcache_op_count","op","prefetch_enqueue");
> +
>        if (verbose > 3)
>          obatched(clog) << "fdcache interned a=" << a << " b=" << b
> -                       << " fd=" << fd << " mb=" << mb << " front=" << front_p << endl;
> +                       << " fd=" << fd << " mb=" << mb << " front=" << requested_p
> +                       << " latency=" << lat << endl;
>
>        set_metrics();
>      }
> @@ -1517,10 +1515,9 @@ class libarchive_fdcache
>        {
>          inc_metric("fdcache_op_count","op","emerg-flush");
>          obatched(clog) << "fdcache emergency flush for filling tmpdir" << endl;
> -        this->limit(0, 0, 0, 0); // emergency flush
> +        this->limit(0); // emergency flush
>        }
> -    else if (front_p)
> -      this->limit(max_fds, max_mbs, max_prefetch_fds, max_prefetch_mbs); // age cache if required
> +    this->limit(max_mbs); // age cache if required
>    }
>
>    int lookup(const string& a, const string& b)
> @@ -1528,156 +1525,170 @@ class libarchive_fdcache
>      int fd = -1;
>      {
>        unique_lock<mutex> lock(fdcache_lock);
> -      for (auto i = lru.begin(); i < lru.end(); i++)
> +      auto i = entries.find(make_pair(a,b));
> +      if (i != entries.end())
>          {
> -          if (i->archive == a && i->entry == b)
> -            { // found it; move it to head of lru
> -              fdcache_entry n = *i;
> -              lru.erase(i); // invalidates i, so no more iteration!
> -              lru.push_front(n);
> -              inc_metric("fdcache_op_count","op","requeue_front");
> -              fd = open(n.fd.c_str(), O_RDONLY);
> -              break;
> -            }
> -        }
> -      // Iterate through prefetch while fd == -1 to ensure that no duplication between lru and
> -      // prefetch occurs.
> -      for ( auto i = prefetch.begin(); fd == -1 && i < prefetch.end(); ++i)
> -        {
> -          if (i->archive == a && i->entry == b)
> -            { // found it; take the entry from the prefetch deque to the lru deque, since it has now been accessed.
> -              fdcache_entry n = *i;
> -              prefetch.erase(i);
> -              lru.push_front(n);
> -              inc_metric("fdcache_op_count","op","prefetch_access");
> -              fd = open(n.fd.c_str(), O_RDONLY);
> -              break;
> +          if (i->second.request_count == 0) // was a prefetch!
> +            {
> +              inc_metric("fdcache_prefetch_saved_milliseconds_count");
> +              add_metric("fdcache_prefetch_saved_milliseconds_sum", i->second.latency*1000.);
>              }
> +          i->second.request_count ++;
> +          i->second.freshness = time(NULL);
> +          // brag about our success
> +          inc_metric("fdcache_op_count","op","prefetch_access"); // backward compat
> +          inc_metric("fdcache_saved_milliseconds_count");
> +          add_metric("fdcache_saved_milliseconds_sum", i->second.latency*1000.);
> +          fd = open(i->second.fd.c_str(), O_RDONLY);
>          }
>      }
>
> +    if (fd >= 0)
> +      inc_metric("fdcache_op_count","op","lookup_hit");
> +    else
> +      inc_metric("fdcache_op_count","op","lookup_miss");
> +
>      if (statfs_free_enough_p(tmpdir, "tmpdir", fdcache_mintmp))
>        {
>          inc_metric("fdcache_op_count","op","emerg-flush");
>          obatched(clog) << "fdcache emergency flush for filling tmpdir" << endl;
> -        this->limit(0, 0, 0, 0); // emergency flush
> +        this->limit(0); // emergency flush
>        }
>      else if (fd >= 0)
> -      this->limit(max_fds, max_mbs, max_prefetch_fds, max_prefetch_mbs); // age cache if required
> +      this->limit(max_mbs); // age cache if required
>
>      return fd;
>    }
>
> -  int probe(const string& a, const string& b) // just a cache residency check - don't modify LRU state, don't open
> +  int probe(const string& a, const string& b) // just a cache residency check - don't modify state, don't open
>    {
>      unique_lock<mutex> lock(fdcache_lock);
> -    for (auto i = lru.begin(); i < lru.end(); i++)
> -      {
> -        if (i->archive == a && i->entry == b)
> -          {
> -            inc_metric("fdcache_op_count","op","probe_hit");
> -            return true;
> -          }
> -      }
> -    for (auto i = prefetch.begin(); i < prefetch.end(); i++)
> -      {
> -        if (i->archive == a && i->entry == b)
> -          {
> -            inc_metric("fdcache_op_count","op","prefetch_probe_hit");
> -            return true;
> -          }
> -      }
> -    inc_metric("fdcache_op_count","op","probe_miss");
> -    return false;
> +    auto i = entries.find(make_pair(a,b));
> +    if (i != entries.end()) {
> +      inc_metric("fdcache_op_count","op","probe_hit");
> +      return true;
> +    } else {
> +      inc_metric("fdcache_op_count","op","probe_miss");
> +      return false;
> +   }
>    }
> -
> +
>    void clear(const string& a, const string& b)
>    {
>      unique_lock<mutex> lock(fdcache_lock);
> -    for (auto i = lru.begin(); i < lru.end(); i++)
> -      {
> -        if (i->archive == a && i->entry == b)
> -          { // found it; erase it from lru
> -            fdcache_entry n = *i;
> -            lru.erase(i); // invalidates i, so no more iteration!
> -            inc_metric("fdcache_op_count","op","clear");
> -            unlink (n.fd.c_str());
> -            set_metrics();
> -            return;
> -          }
> -      }
> -    for (auto i = prefetch.begin(); i < prefetch.end(); i++)
> -      {
> -        if (i->archive == a && i->entry == b)
> -          { // found it; erase it from lru
> -            fdcache_entry n = *i;
> -            prefetch.erase(i); // invalidates i, so no more iteration!
> -            inc_metric("fdcache_op_count","op","prefetch_clear");
> -            unlink (n.fd.c_str());
> -            set_metrics();
> -            return;
> -          }
> -      }
> +    auto i = entries.find(make_pair(a,b));
> +    if (i != entries.end()) {
> +      inc_metric("fdcache_op_count","op",
> +                 i->second.request_count > 0 ? "clear" : "prefetch_clear");
> +      unlink (i->second.fd.c_str());
> +      entries.erase(i);
> +      set_metrics();
> +      return;
> +    }
>    }
>
> -  void limit(long maxfds, long maxmbs, long maxprefetchfds, long maxprefetchmbs , bool metrics_p = true)
> +  void limit(long maxmbs, bool metrics_p = true)
>    {
> -    if (verbose > 3 && (this->max_fds != maxfds || this->max_mbs != maxmbs))
> -      obatched(clog) << "fdcache limited to maxfds=" << maxfds << " maxmbs=" << maxmbs << endl;
> +    time_t now = time(NULL);
> +
> +    // avoid overly frequent limit operations
> +    if (maxmbs > 0 && (now - this->last_cleaning) < 10) // probably not worth parametrizing
> +      return;
> +    this->last_cleaning = now;
> +
> +    if (verbose > 3 && (this->max_mbs != maxmbs))
> +      obatched(clog) << "fdcache limited to maxmbs=" << maxmbs << endl;
>
>      unique_lock<mutex> lock(fdcache_lock);
> -    this->max_fds = maxfds;
> +
>      this->max_mbs = maxmbs;
> -    this->max_prefetch_fds = maxprefetchfds;
> -    this->max_prefetch_mbs = maxprefetchmbs;
> -    long total_fd = 0;
>      double total_mb = 0.0;
> -    for (auto i = lru.begin(); i < lru.end(); i++)
> +
> +    map<double, pair<string,string>> sorted_entries;
> +    for (auto &i: entries)
>        {
> -        // accumulate totals from most recently used one going backward
> -        total_fd ++;
> -        total_mb += i->fd_size_mb;
> -        if (total_fd > this->max_fds || total_mb > this->max_mbs)
> -          {
> -            // found the cut here point!
> -
> -            for (auto j = i; j < lru.end(); j++) // close all the fds from here on in
> -              {
> -                if (verbose > 3)
> -                  obatched(clog) << "fdcache evicted a=" << j->archive << " b=" << j->entry
> -                                 << " fd=" << j->fd << " mb=" << j->fd_size_mb << endl;
> -                if (metrics_p)
> -                  inc_metric("fdcache_op_count","op","evict");
> -                unlink (j->fd.c_str());
> -              }
> +        total_mb += i.second.fd_size_mb;
> +
> +        // need a scalar quantity that combines these inputs in a sensible way:
> +        //
> +        // 1) freshness of this entry (last time it was accessed)
> +        // 2) size of this entry
> +        // 3) number of times it has been accessed (or if just prefetched with 0 accesses)
> +        // 4) latency it required to extract
> +        //
> +        // The lower the "score", the earlier garbage collection will
> +        // nuke it, so to prioritize entries for preservation, the
> +        // score should be higher, and vice versa.
> +        time_t factor_1_freshness = (now - i.second.freshness); // seconds
> +        double factor_2_size = i.second.fd_size_mb; // megabytes
> +        unsigned factor_3_accesscount = i.second.request_count; // units
> +        double factor_4_latency = i.second.latency; // seconds
> +
> +        #if 0
> +        double score = - factor_1_freshness; // simple LRU
> +        #endif
> +
> +        double score = 0.
> +          - log1p(factor_1_freshness)                // penalize old file
> +          - log1p(factor_2_size)                     // penalize large file
> +          + factor_4_latency * factor_3_accesscount; // reward slow + repeatedly read files
> +
> +        if (verbose > 4)
> +          obatched(clog) << "fdcache scored score=" << score
> +                         << " a=" << i.first.first << " b=" << i.first.second
> +                         << " f1=" << factor_1_freshness << " f2=" << factor_2_size
> +                         << " f3=" << factor_3_accesscount << " f4=" << factor_4_latency
> +                         << endl;
> +
> +        sorted_entries.insert(make_pair(score, i.first));
> +      }
>
> -            lru.erase(i, lru.end()); // erase the nodes generally
> -            break;
> -          }
> +    unsigned cleaned = 0;
> +    unsigned entries_original = entries.size();
> +    double cleaned_score_min = DBL_MAX;
> +    double cleaned_score_max = DBL_MIN;
> +
> +    // drop as many entries[] as needed to bring total mb down to the threshold
> +    for (auto &i: sorted_entries) // in increasing score order!
> +      {
> +        if (this->max_mbs > 0 // if this is not a "clear entire table"
> +            && total_mb < this->max_mbs) // we've cleared enough to meet threshold
> +          break; // stop clearing
> +
> +        auto j = entries.find(i.second);
> +        if (j == entries.end())
> +          continue; // should not happen
> +
> +        if (cleaned == 0)
> +          cleaned_score_min = i.first;
> +        cleaned++;
> +        cleaned_score_max = i.first;
> +
> +        if (verbose > 3)
> +          obatched(clog) << "fdcache evicted score=" << i.first
> +                         << " a=" << i.second.first << " b=" << i.second.second
> +                         << " fd=" << j->second.fd << " mb=" << j->second.fd_size_mb
> +                         << " rq=" << j->second.request_count << " lat=" << j->second.latency
> +                         << " fr=" << (now - j->second.freshness)
> +                         << endl;
> +        if (metrics_p)
> +          inc_metric("fdcache_op_count","op","evict");
> +
> +        total_mb -= j->second.fd_size_mb;
> +        unlink (j->second.fd.c_str());
> +        entries.erase(j);
>        }
> -    total_fd = 0;
> -    total_mb = 0.0;
> -    for(auto i = prefetch.begin(); i < prefetch.end(); i++){
> -      // accumulate totals from most recently used one going backward
> -        total_fd ++;
> -        total_mb += i->fd_size_mb;
> -        if (total_fd > this->max_prefetch_fds || total_mb > this->max_prefetch_mbs)
> -          {
> -            // found the cut here point!
> -            for (auto j = i; j < prefetch.end(); j++) // close all the fds from here on in
> -              {
> -                if (verbose > 3)
> -                  obatched(clog) << "fdcache evicted from prefetch a=" << j->archive << " b=" << j->entry
> -                                 << " fd=" << j->fd << " mb=" << j->fd_size_mb << endl;
> -                if (metrics_p)
> -                  inc_metric("fdcache_op_count","op","prefetch_evict");
> -                unlink (j->fd.c_str());
> -              }
>
> -            prefetch.erase(i, prefetch.end()); // erase the nodes generally
> -            break;
> -          }
> -    }
> +    if (metrics_p)
> +      inc_metric("fdcache_op_count","op","evict_cycle");
> +
> +    if (verbose > 1 && cleaned > 0)
> +      {
> +        obatched(clog) << "fdcache evicted num=" << cleaned << " of=" << entries_original
> +                       << " min=" << cleaned_score_min << " max=" << cleaned_score_max
> +                       << endl;
> +      }
> +
>      if (metrics_p) set_metrics();
>    }
>
> @@ -1686,7 +1697,7 @@ class libarchive_fdcache
>    {
>      // unlink any fdcache entries in $TMPDIR
>      // don't update metrics; those globals may be already destroyed
> -    limit(0, 0, 0, 0, false);
> +    limit(0, false);
>    }
>  };
>  static libarchive_fdcache fdcache;
> @@ -1701,7 +1712,8 @@ static libarchive_fdcache fdcache;
>
>  static int
>  extract_section (int elf_fd, int64_t parent_mtime,
> -                const string& b_source, const string& section)
> +                const string& b_source, const string& section,
> +                 const timespec& extract_begin)
>  {
>    /* Search the fdcache.  */
>    struct stat fs;
> @@ -1771,29 +1783,41 @@ extract_section (int elf_fd, int64_t parent_mtime,
>
>               /* Create temporary file containing the section.  */
>               char *tmppath = NULL;
> -             rc = asprintf (&tmppath, "%s/debuginfod.XXXXXX", tmpdir.c_str());
> +             rc = asprintf (&tmppath, "%s/debuginfod-section.XXXXXX", tmpdir.c_str());
>               if (rc < 0)
>                 throw libc_exception (ENOMEM, "cannot allocate tmppath");
>               defer_dtor<void*,void> tmmpath_freer (tmppath, free);
>               fd = mkstemp (tmppath);
>               if (fd < 0)
>                 throw libc_exception (errno, "cannot create temporary file");
> +
>               ssize_t res = write_retry (fd, data->d_buf, data->d_size);
> -             if (res < 0 || (size_t) res != data->d_size)
> +             if (res < 0 || (size_t) res != data->d_size) {
> +                close (fd);
> +                unlink (tmppath);
>                 throw libc_exception (errno, "cannot write to temporary file");
> +              }
>
>               /* Set mtime to be the same as the parent file's mtime.  */
>               struct timespec tvs[2];
> -             if (fstat (elf_fd, &fs) != 0)
> +             if (fstat (elf_fd, &fs) != 0) {
> +                close (fd);
> +                unlink (tmppath);
>                 throw libc_exception (errno, "cannot fstat file");
> -
> +              }
> +
>               tvs[0].tv_sec = 0;
>               tvs[0].tv_nsec = UTIME_OMIT;
>               tvs[1] = fs.st_mtim;
>               (void) futimens (fd, tvs);
>
> +              struct timespec extract_end;
> +              clock_gettime (CLOCK_MONOTONIC, &extract_end);
> +              double extract_time = (extract_end.tv_sec - extract_begin.tv_sec)
> +                + (extract_end.tv_nsec - extract_begin.tv_nsec)/1.e9;
> +
>               /* Add to fdcache.  */
> -             fdcache.intern (b_source, section, tmppath, data->d_size, true);
> +             fdcache.intern (b_source, section, tmppath, data->d_size, true, extract_time);
>               break;
>             }
>         }
> @@ -1817,10 +1841,14 @@ handle_buildid_f_match (bool internal_req_t,
>                          int *result_fd)
>  {
>    (void) internal_req_t; // ignored
> +
> +  struct timespec extract_begin;
> +  clock_gettime (CLOCK_MONOTONIC, &extract_begin);
> +
>    int fd = open(b_source0.c_str(), O_RDONLY);
>    if (fd < 0)
>      throw libc_exception (errno, string("open ") + b_source0);
> -
> +
>    // NB: use manual close(2) in error case instead of defer_dtor, because
>    // in the normal case, we want to hand the fd over to libmicrohttpd for
>    // file transfer.
> @@ -1843,7 +1871,7 @@ handle_buildid_f_match (bool internal_req_t,
>
>    if (!section.empty ())
>      {
> -      int scn_fd = extract_section (fd, s.st_mtime, b_source0, section);
> +      int scn_fd = extract_section (fd, s.st_mtime, b_source0, section, extract_begin);
>        close (fd);
>
>        if (scn_fd >= 0)
> @@ -1918,6 +1946,9 @@ handle_buildid_r_match (bool internal_req_p,
>                          const string& section,
>                          int *result_fd)
>  {
> +  struct timespec extract_begin;
> +  clock_gettime (CLOCK_MONOTONIC, &extract_begin);
> +
>    struct stat fs;
>    int rc = stat (b_source0.c_str(), &fs);
>    if (rc != 0)
> @@ -1948,7 +1979,7 @@ handle_buildid_r_match (bool internal_req_p,
>         {
>           int scn_fd = extract_section (fd, fs.st_mtime,
>                                         b_source0 + ":" + b_source1,
> -                                       section);
> +                                       section, extract_begin);
>           close (fd);
>           if (scn_fd >= 0)
>             fd = scn_fd;
> @@ -2009,6 +2040,7 @@ handle_buildid_r_match (bool internal_req_p,
>          archive_decoder = arch.second;
>        }
>    FILE* fp;
> +
>    defer_dtor<FILE*,int>::dtor_fn dfn;
>    if (archive_decoder != "cat")
>      {
> @@ -2079,7 +2111,7 @@ handle_buildid_r_match (bool internal_req_p,
>
>        // extract this file to a temporary file
>        char* tmppath = NULL;
> -      rc = asprintf (&tmppath, "%s/debuginfod.XXXXXX", tmpdir.c_str());
> +      rc = asprintf (&tmppath, "%s/debuginfod-fdcache.XXXXXX", tmpdir.c_str());
>        if (rc < 0)
>          throw libc_exception (ENOMEM, "cannot allocate tmppath");
>        defer_dtor<void*,void> tmmpath_freer (tmppath, free);
> @@ -2106,13 +2138,18 @@ handle_buildid_r_match (bool internal_req_p,
>        tvs[1].tv_nsec = archive_entry_mtime_nsec(e);
>        (void) futimens (fd, tvs);  /* best effort */
>
> +      struct timespec extract_end;
> +      clock_gettime (CLOCK_MONOTONIC, &extract_end);
> +      double extract_time = (extract_end.tv_sec - extract_begin.tv_sec)
> +        + (extract_end.tv_nsec - extract_begin.tv_nsec)/1.e9;
> +
>        if (r != 0) // stage 3
>          {
>            // NB: now we know we have a complete reusable file; make fdcache
>            // responsible for unlinking it later.
>            fdcache.intern(b_source0, fn,
>                           tmppath, archive_entry_size(e),
> -                         false); // prefetched ones go to the prefetch cache
> +                         false, extract_time); // prefetched ones go to the prefetch cache
>            prefetch_count --;
>            close (fd); // we're not saving this fd to make a mhd-response from!
>            continue;
> @@ -2122,13 +2159,13 @@ handle_buildid_r_match (bool internal_req_p,
>        // responsible for unlinking it later.
>        fdcache.intern(b_source0, b_source1,
>                       tmppath, archive_entry_size(e),
> -                     true); // requested ones go to the front of lru
> +                     true, extract_time); // requested ones go to the front of the line
>
>        if (!section.empty ())
>         {
>           int scn_fd = extract_section (fd, b_mtime,
>                                         b_source0 + ":" + b_source1,
> -                                       section);
> +                                       section, extract_begin);
>           close (fd);
>           if (scn_fd >= 0)
>             fd = scn_fd;
> @@ -2583,14 +2620,12 @@ set_metric(const string& metric, double value)
>    unique_lock<mutex> lock(metrics_lock);
>    metrics[metric] = value;
>  }
> -#if 0 /* unused */
>  static void
>  inc_metric(const string& metric)
>  {
>    unique_lock<mutex> lock(metrics_lock);
>    metrics[metric] ++;
>  }
> -#endif
>  static void
>  set_metric(const string& metric,
>             const string& lname, const string& lvalue,
> @@ -2618,7 +2653,6 @@ add_metric(const string& metric,
>    unique_lock<mutex> lock(metrics_lock);
>    metrics[key] += value;
>  }
> -#if 0
>  static void
>  add_metric(const string& metric,
>             double value)
> @@ -2626,7 +2660,6 @@ add_metric(const string& metric,
>    unique_lock<mutex> lock(metrics_lock);
>    metrics[metric] += value;
>  }
> -#endif
>
>
>  // and more for higher arity labels if needed
> @@ -3479,7 +3512,7 @@ archive_classify (const string& rps, string& archive_extension, int64_t archivei
>
>            // extract this file to a temporary file
>            char* tmppath = NULL;
> -          rc = asprintf (&tmppath, "%s/debuginfod.XXXXXX", tmpdir.c_str());
> +          rc = asprintf (&tmppath, "%s/debuginfod-classify.XXXXXX", tmpdir.c_str());
>            if (rc < 0)
>              throw libc_exception (ENOMEM, "cannot allocate tmppath");
>            defer_dtor<void*,void> tmmpath_freer (tmppath, free);
> @@ -3490,8 +3523,10 @@ archive_classify (const string& rps, string& archive_extension, int64_t archivei
>            defer_dtor<int,int> minifd_closer (fd, close);
>
>            rc = archive_read_data_into_fd (a, fd);
> -          if (rc != ARCHIVE_OK)
> +          if (rc != ARCHIVE_OK) {
> +            close (fd);
>              throw archive_exception(a, "cannot extract file");
> +          }
>
>            // finally ... time to run elf_classify on this bad boy and update the database
>            bool executable_p = false, debuginfo_p = false;
> @@ -4199,9 +4234,12 @@ void groom()
>    sqlite3_db_release_memory(dbq); // ... for both connections
>    debuginfod_pool_groom(); // and release any debuginfod_client objects we've been holding onto
>
> -  fdcache.limit(0,0,0,0); // release the fdcache contents
> -  fdcache.limit(fdcache_fds, fdcache_mbs, fdcache_prefetch_fds, fdcache_prefetch_mbs); // restore status quo parameters
> -
> +#if 0 /* PR31265: don't jettison cache unnecessarily */
> +
> +  fdcache.limit(0); // release the fdcache contents
> +  fdcache.limit(fdcache_mbs); // restore status quo parameters
> +#endif
> +
>    clock_gettime (CLOCK_MONOTONIC, &ts_end);
>    double deltas = (ts_end.tv_sec - ts_start.tv_sec) + (ts_end.tv_nsec - ts_start.tv_nsec)/1.e9;
>
> @@ -4414,7 +4452,6 @@ main (int argc, char *argv[])
>      fdcache_mbs = sfs.f_bavail * sfs.f_bsize / 1024 / 1024 / 4; // 25% of free space
>    fdcache_mintmp = 25; // emergency flush at 25% remaining (75% full)
>    fdcache_prefetch = 64; // guesstimate storage is this much less costly than re-decompression
> -  fdcache_fds = (concurrency + fdcache_prefetch) * 2;
>
>    /* Parse and process arguments.  */
>    int remaining;
> @@ -4423,17 +4460,10 @@ main (int argc, char *argv[])
>        error (EXIT_FAILURE, 0,
>               "unexpected argument: %s", argv[remaining]);
>
> -  // Make the prefetch cache spaces a fraction of the main fdcache if
> -  // unspecified.
> -  if (fdcache_prefetch_fds == 0)
> -    fdcache_prefetch_fds = fdcache_fds / 2;
> -  if (fdcache_prefetch_mbs == 0)
> -    fdcache_prefetch_mbs = fdcache_mbs / 2;
> -
>    if (scan_archives.size()==0 && !scan_files && source_paths.size()>0)
>      obatched(clog) << "warning: without -F -R -U -Z, ignoring PATHs" << endl;
>
> -  fdcache.limit(fdcache_fds, fdcache_mbs, fdcache_prefetch_fds, fdcache_prefetch_mbs);
> +  fdcache.limit(fdcache_mbs);
>
>    (void) signal (SIGPIPE, SIG_IGN); // microhttpd can generate it incidentally, ignore
>    (void) signal (SIGINT, signal_handler); // ^C
> @@ -4615,15 +4645,12 @@ main (int argc, char *argv[])
>      obatched(clog) << "rescan time " << rescan_s << endl;
>      obatched(clog) << "scan checkpoint " << scan_checkpoint << endl;
>    }
> -  obatched(clog) << "fdcache fds " << fdcache_fds << endl;
>    obatched(clog) << "fdcache mbs " << fdcache_mbs << endl;
>    obatched(clog) << "fdcache prefetch " << fdcache_prefetch << endl;
>    obatched(clog) << "fdcache tmpdir " << tmpdir << endl;
>    obatched(clog) << "fdcache tmpdir min% " << fdcache_mintmp << endl;
>    if (! passive_p)
>      obatched(clog) << "groom time " << groom_s << endl;
> -  obatched(clog) << "prefetch fds " << fdcache_prefetch_fds << endl;
> -  obatched(clog) << "prefetch mbs " << fdcache_prefetch_mbs << endl;
>    obatched(clog) << "forwarded ttl limit " << forwarded_ttl_limit << endl;
>
>    if (scan_archives.size()>0)
> diff --git a/doc/debuginfod.8 b/doc/debuginfod.8
> index c17f1519a600..42e0fc9fbb34 100644
> --- a/doc/debuginfod.8
> +++ b/doc/debuginfod.8
> @@ -234,29 +234,26 @@ loops in the symbolic directory tree might lead to \fIinfinite
>  traversal\fP.
>
>  .TP
> -.B "\-\-fdcache\-fds=NUM"  "\-\-fdcache\-mbs=MB"
> +.B "\-\-fdcache\-mbs=MB"
>  Configure limits on a cache that keeps recently extracted files from
> -archives.  Up to NUM requested files and up to a total of MB megabytes
> -will be kept extracted, in order to avoid having to decompress their
> -archives over and over again. The default NUM and MB values depend on
> -the concurrency of the system, and on the available disk space on the
> -$TMPDIR or \fB/tmp\fP filesystem.  This is because that is where the
> -most recently used extracted files are kept.  Grooming cleans out this
> -cache.
> +archives.  Up to a total of MB megabytes will be kept extracted, in
> +order to avoid having to decompress their archives over and over
> +again. The default MB values depend on the concurrency of the system,
> +and on the available disk space on the $TMPDIR or \fB/tmp\fP
> +filesystem.  (This is because that is where the most recently used
> +extracted files are kept.)  While previous versions used plain LRU,
> +the cache now attempts to preserve more frequently & recently accessed
> +files, and especially those that took a long time to extract (e.g.,
> +vdso.debug!), and penalizes large / old files.
>
>  .TP
> -.B "\-\-fdcache\-prefetch\-fds=NUM"  "\-\-fdcache\-prefetch\-mbs=MB"
> -.B "\-\-fdcache\-prefetch=NUM2"
> -
> -In addition to the main fdcache, up to NUM2 other files from an
> -archive may be prefetched into another cache before they are even
> -requested.  Configure how many file descriptors (fds) and megabytes
> -(mbs) are allocated to the prefetch fdcache. If unspecified, these
> -values depend on concurrency of the system and on the available disk
> -space on the $TMPDIR.  Allocating more to the prefetch cache will
> -improve performance in environments where different parts of several
> -large archives are being accessed.  This cache is also cleaned out
> -during grooming.
> +.B "\-\-fdcache\-prefetch=NUM"
> +Up to NUM other files from an archive may be prefetched into the
> +cache before they are even requested.  If unspecified, these values
> +depend on concurrency of the system and on the available disk space on
> +the $TMPDIR.  Allocating more will improve performance in environments
> +where multiple different parts of several large archives are being
> +accessed.
>
>  .TP
>  .B "\-\-fdcache\-mintmp=NUM"
>



More information about the Elfutils-devel mailing list