This is the mail archive of the
mailing list for the elfutils project.
Re: PR25394 patch: debuginfod groom vs. scan race condition
- From: Mark Wielaard <mark at klomp dot org>
- To: "Frank Ch. Eigler" <fche at redhat dot com>, elfutils-devel at sourceware dot org
- Date: Mon, 20 Jan 2020 18:03:31 +0100
- Subject: Re: PR25394 patch: debuginfod groom vs. scan race condition
- References: <20200120014347.GA19106@redhat.com>
On Sun, 2020-01-19 at 20:43 -0500, Frank Ch. Eigler wrote:
> This was a doozie. As the commit text hints, I haven't found a way
> testing this on an elfutils/tests snack-sized data set (except perhaps
> by injecting long sleeps?), but the changes work on the larger
> archives such as the debuginfod.systemtap.org set.
> commit 34e67018914cf9ebbef07065965755b6554fd66e (HEAD -> master, origin/fche/debuginfod-PR25394)
> Author: Frank Ch. Eigler <firstname.lastname@example.org>
> Date: Sun Jan 19 20:33:32 2020 -0500
> PR25394: debuginfod mutex between grooming and scanning
> Extended the work-queue concept with "idlers" - other threads that
> block on the work queue until it becomes empty (rather than normal
> consumers that block on it until it becomes non-empty). Use this
> facility for the groomer thread to avoid working at the same time as
> the scanner threads. Use this for the fts traversal thread for
> similar reasons. One user-visible effect: response to SIGUSR1 and
> SIGUSR2 will wait until the work queue runs empty, but the man page
> was unspecific so does not need changing.
> It's not obvious how to test this with a tests/ dataset so small that
> scanning takes negligible time, so the former races are very tight.
> P.S. We also evaluated using sqlite level transactions to isolate the
> scanner thread groups-of-operations from the groomer. These
> experiments failed to produce a nominally concurrent debuginfod,
> having triggered "database locked" type errors. So we remain
> single-threaded (fully serialized) at the sqlite API level.
> diff --git a/debuginfod/ChangeLog b/debuginfod/ChangeLog
> index 35130b2a5d85..83b94a1108a6 100644
> --- a/debuginfod/ChangeLog
> +++ b/debuginfod/ChangeLog
> @@ -1,3 +1,14 @@
> +2020-01-19 Frank Ch. Eigler <email@example.com>
> + * debuginfod.cxx (scanq): Rework to let groomer/fts threads
> + synchronize with an empty workqueue, and lock out workqueue
> + consumers.
> + (thread_groom): Adopt new scanq idle APIs to lock out scanners.
> + (thread_main_fts_source_paths): Adopt new scanq idler API to
> + avoid being restarted while scanners haven't even finished yet.
> + (thread_main_*): Increment thread_work_total metric only after
> + a work cycle is completed, not when it begins.
Nice work. And I think I almost understand it. I am just a little
confused about thread_main_fts_source_paths also being an "idler":
> while (! interrupted)
> - set_metric("thread_timer", "role","traverse", rescan_timer);
> - // set_metric("thread_forced_total", "role","traverse", forced_rescan_count);
> - if (rescan_s && rescan_timer > rescan_s)
> - rescan_timer = 0;
> + sleep (1);
> + scanq.wait_idle(); // don't start a new traversal while scanners haven't finished the job
> + scanq.done_idle(); // release the hounds
> + if (interrupted) break;
I guess the comment "release the hounds" gets me confused. I don't know precisely what that means. Is this needed for correctness, or is it simply to have an hint that the scanners have ran because you register as an idler and so at this point you could only run if no scanner is running? But given you immediately tell them you are done a scanner could in theory start their work up again while you are doing your own traversal?