This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: what to do with flaky tests?
On 03/05/2019 16:50, Florian Weimer wrote:
> * Szabolcs Nagy:
>
>> On 03/05/2019 11:29, Florian Weimer wrote:
>>> * Szabolcs Nagy:
>>>
>>>> currently these tests are flaky on the aarch64 buildbot:
>>>>
>>>> FAIL: malloc/tst-mallocfork2
>>>> FAIL: nptl/tst-eintr1
>>>>
>>>> the first one is a timeout: the test can take unbounded time
>>>> since there is no guarantee that malloc ever gets interrupted
>>>> by the signal (so increasing the timeout limit does not help:
>>>> sometimes it finishes immediately sometimes it take several
>>>> minutes).
>>>
>>> I think we can count the total delivered signals instead, so the run
>>> time is more predictable. I can write a patch.
>>
>> yeah if there is a guaranteed limit that works for me.
>> (i guess the test can be marked UNSUPPORTED if it cannot
>> trigger the right condition)
>
> The patch below uses process-shared barriers for synchronization. I see
> numbers like these when running it:
>
> $ time malloc/tst-mallocfork2
> info: signals received during fork: 833
> info: signals received during free: 924
> info: signals received during malloc: 3625
>
> real 0m5.996s
> user 0m2.715s
> sys 0m2.883s
>
> Due to code churn, it's not easy to verify if the changes interfere with
> the original test objective. However, I temporarily added this at the
> beginning of the test:
>
> {
> auto void *cb(void *closure) { pause (); return NULL; }
> xpthread_create (NULL, cb, NULL);
> }
>
> This puts the process into multi-threaded mode, and this still results
> in a deadlock because none of the single-thread code paths are taken
> anymore. So I think the test still works as intended.
>
> Does this patch make the test pass for you?
yes, the test now seems to pass reliably.
on the problematic machine i see
$ time ./testrun.sh malloc/tst-mallocfork2
info: signals received during fork: 103
info: signals received during free: 482
info: signals received during malloc: 75
real 0m22.477s
user 0m10.366s
sys 0m12.284s
(the time does not vary much so the 100s
timeout setting works now)
thanks.
>>>> the second one can break make check: kernel side task
>>>> accounting is broken and even though no more than 100 threads
>>>> are alive at a time the cgroup task limit can be hit (even
>>>> if that's >20000 tasks) and then all fork/clone fails so
>>>> make check cannot continue and test results are lost.
>>>
>>> Is this with a current kernel? We haven't seen this in a while, after
>>> encountering it rather persistently.
>>
>> the kernel is ubuntu 4.15.0-34-generic
>> do you know since what kernel version was this improved?
>
> Sorry, I don't know.
>
> Thanks,
> Florian
>
>
> malloc/tst-mallocfork2: Use process-shared barriers
>
> This synchronization method has a lower overhead and makes
> it more likely that the signal arrives during one of the critical
> functions.
>
> Also test for fork deadlocks explicitly.
>
> 2019-05-03 Florian Weimer <fweimer@redhat.com>
>
> malloc/tst-mallocfork2: Use process-shared barriers.
> * malloc/tst-mallocfork2.c: Switch to <support/test-driver.c>.
> (signal_count, sigusr1_sender_pid): Remove.
> (iterations): Define constant.
> (shared): New variable.
> (sigusr1_received): Update comment.
> (sigusr1_handler): Do not send SIGSTOP to the sender process.
> (signal_sender): Optional use barriers to avoid sending signals
> during irrelevant times.
> (do_it): Initialize variable shared. Use xfork for error
> checking. Launch multiple SIGUSR1-sending subprocesses. Limit
> the iteration count, independent of signal delivery. Check for
> deadlocks in fork. Introduce barriers for reducing signal
> traffic. Do not send SIGCONT to the SIGUSR1-sending processes;
> replaced by the barriers. Count signals during fork/free/malloc
> and report them.
> * malloc/Makefile (tst-mallocfork): Link with libpthread.
>
> diff --git a/malloc/Makefile b/malloc/Makefile
> index aadf602dfd..d2fba29953 100644
> --- a/malloc/Makefile
> +++ b/malloc/Makefile
> @@ -252,3 +252,4 @@ $(objpfx)tst-dynarray-fail-mem.out: $(objpfx)tst-dynarray-fail.out
>
> $(objpfx)tst-malloc-tcache-leak: $(shared-thread-library)
> $(objpfx)tst-malloc_info: $(shared-thread-library)
> +$(objpfx)tst-mallocfork2: $(shared-thread-library)
> diff --git a/malloc/tst-mallocfork2.c b/malloc/tst-mallocfork2.c
> index 474b0cea4a..30e7abe6fd 100644
> --- a/malloc/tst-mallocfork2.c
> +++ b/malloc/tst-mallocfork2.c
> @@ -34,6 +34,11 @@
> #include <sys/wait.h>
> #include <time.h>
> #include <unistd.h>
> +#include <array_length.h>
> +#include <support/check.h>
> +#include <support/support.h>
> +#include <support/xthread.h>
> +#include <support/xunistd.h>
>
> /* How many malloc objects to keep arond. */
> enum { malloc_objects = 1009 };
> @@ -41,19 +46,16 @@ enum { malloc_objects = 1009 };
> /* The maximum size of an object. */
> enum { malloc_maximum_size = 70000 };
>
> -/* How many signals need to be delivered before the test exits. */
> -enum { signal_count = 1000 };
> +/* How many iterations the test performs before exiting. */
> +enum { iterations = 10000 };
>
> -static int do_test (void);
> -#define TIMEOUT 100
> -#define TEST_FUNCTION do_test ()
> -#include "../test-skeleton.c"
> -
> -/* Process ID of the subprocess which sends SIGUSR1 signals. */
> -static pid_t sigusr1_sender_pid;
> +/* Barrier for synchronization with the processes sending SIGUSR1
> + signals, to make it more likely that the signals arrive during a
> + fork/free/malloc call. */
> +static struct { pthread_barrier_t barrier; } *shared;
>
> /* Set to 1 if SIGUSR1 is received. Used to detect a signal during
> - malloc/free. */
> + fork/free/malloc. */
> static volatile sig_atomic_t sigusr1_received;
>
> /* Periodically set to 1, to indicate that the process is making
> @@ -63,17 +65,6 @@ static volatile sig_atomic_t progress_indicator = 1;
> static void
> sigusr1_handler (int signo)
> {
> - /* Let the main program make progress, by temporarily suspending
> - signals from the subprocess. */
> - if (sigusr1_received)
> - return;
> - /* sigusr1_sender_pid might not be initialized in the parent when
> - the first SIGUSR1 signal arrives. */
> - if (sigusr1_sender_pid > 0 && kill (sigusr1_sender_pid, SIGSTOP) != 0)
> - {
> - write_message ("error: kill (SIGSTOP)\n");
> - abort ();
> - }
> sigusr1_received = 1;
>
> /* Perform a fork with a trivial subprocess. */
> @@ -108,6 +99,8 @@ liveness_signal_handler (int signo)
> write_message ("warning: process seems to be stuck\n");
> }
>
> +/* Send SIGNO to the parent process. If SLEEP, wait a second between
> + signals, otherwise use barriers to delay sending signals. */
> static void
> __attribute__ ((noreturn))
> signal_sender (int signo, bool sleep)
> @@ -115,6 +108,8 @@ signal_sender (int signo, bool sleep)
> pid_t target = getppid ();
> while (true)
> {
> + if (!sleep)
> + xpthread_barrier_wait (&shared->barrier);
> if (kill (target, signo) != 0)
> {
> dprintf (STDOUT_FILENO, "error: kill: %m\n");
> @@ -123,14 +118,17 @@ signal_sender (int signo, bool sleep)
> if (sleep)
> usleep (1 * 1000 * 1000);
> else
> - /* Reduce the rate at which we send signals. */
> - sched_yield ();
> + xpthread_barrier_wait (&shared->barrier);
> }
> }
>
> static int
> do_test (void)
> {
> + /* shared->barrier is intialized along with sigusr1_sender_pids
> + below. */
> + shared = support_shared_allocate (sizeof (*shared));
> +
> struct sigaction action =
> {
> .sa_handler = sigusr1_handler,
> @@ -150,48 +148,74 @@ do_test (void)
> return 1;
> }
>
> - pid_t sigusr2_sender_pid = fork ();
> + pid_t sigusr2_sender_pid = xfork ();
> if (sigusr2_sender_pid == 0)
> signal_sender (SIGUSR2, true);
> - sigusr1_sender_pid = fork ();
> - if (sigusr1_sender_pid == 0)
> - signal_sender (SIGUSR1, false);
> +
> + /* Send SIGUSR1 signals from several processes. Hopefully, one
> + signal will hit one of the ciritical functions. Use a barrier to
> + avoid sending signals while not running fork/free/malloc. */
> + pid_t sigusr1_sender_pids[5];
> + {
> + pthread_barrierattr_t attr;
> + xpthread_barrierattr_init (&attr);
> + xpthread_barrierattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);
> + xpthread_barrier_init (&shared->barrier, &attr,
> + array_length (sigusr1_sender_pids) + 1);
> + xpthread_barrierattr_destroy (&attr);
> + }
> + for (size_t i = 0; i < array_length (sigusr1_sender_pids); ++i)
> + {
> + sigusr1_sender_pids[i] = fork ();
> + if (sigusr1_sender_pids[i] == 0)
> + signal_sender (SIGUSR1, false);
> + }
>
> void *objects[malloc_objects] = {};
> - unsigned signals = 0;
> + unsigned int fork_signals = 0;
> + unsigned int free_signals = 0;
> + unsigned int malloc_signals = 0;
> unsigned seed = 1;
> - time_t last_report = 0;
> - while (signals < signal_count)
> + for (int i = 0; i < iterations; ++i)
> {
> progress_indicator = 1;
> int slot = rand_r (&seed) % malloc_objects;
> size_t size = rand_r (&seed) % malloc_maximum_size;
> - if (kill (sigusr1_sender_pid, SIGCONT) != 0)
> +
> + /* Occasionally do a fork first, to catch deadlocks there as
> + well (see bug 24161). */
> + bool do_fork = (rand_r (&seed) % 7) == 0;
> +
> + xpthread_barrier_wait (&shared->barrier);
> + if (do_fork)
> {
> - printf ("error: kill (SIGCONT): %m\n");
> - signal (SIGUSR1, SIG_IGN);
> - kill (sigusr1_sender_pid, SIGKILL);
> - kill (sigusr2_sender_pid, SIGKILL);
> - return 1;
> + sigusr1_received = 0;
> + pid_t pid = xfork ();
> + if (sigusr1_received)
> + ++fork_signals;
> + if (pid == 0)
> + _exit (0);
> + int status;
> + int ret = TEMP_FAILURE_RETRY (waitpid (pid, &status, 0));
> + if (ret < 0)
> + FAIL_EXIT1 ("waitpid: %m");
> + TEST_COMPARE (status, 0);
> }
> - sigusr1_received = false;
> + sigusr1_received = 0;
> free (objects[slot]);
> + if (sigusr1_received)
> + ++free_signals;
> + sigusr1_received = 0;
> objects[slot] = malloc (size);
> if (sigusr1_received)
> - {
> - ++signals;
> - time_t current = time (0);
> - if (current != last_report)
> - {
> - printf ("info: SIGUSR1 signal count: %u\n", signals);
> - last_report = current;
> - }
> - }
> + ++malloc_signals;
> + xpthread_barrier_wait (&shared->barrier);
> +
> if (objects[slot] == NULL)
> {
> printf ("error: malloc: %m\n");
> - signal (SIGUSR1, SIG_IGN);
> - kill (sigusr1_sender_pid, SIGKILL);
> + for (size_t i = 0; i < array_length (sigusr1_sender_pids); ++i)
> + kill (sigusr1_sender_pids[i], SIGKILL);
> kill (sigusr2_sender_pid, SIGKILL);
> return 1;
> }
> @@ -201,11 +225,20 @@ do_test (void)
> for (int slot = 0; slot < malloc_objects; ++slot)
> free (objects[slot]);
>
> - /* Terminate the signal-sending subprocess. The SIGUSR1 handler
> - should no longer run because it uses sigusr1_sender_pid. */
> - signal (SIGUSR1, SIG_IGN);
> - kill (sigusr1_sender_pid, SIGKILL);
> + for (size_t i = 0; i < array_length (sigusr1_sender_pids); ++i)
> + kill (sigusr1_sender_pids[i], SIGKILL);
> kill (sigusr2_sender_pid, SIGKILL);
>
> + printf ("info: signals received during fork: %u\n", fork_signals);
> + printf ("info: signals received during free: %u\n", free_signals);
> + printf ("info: signals received during malloc: %u\n", malloc_signals);
> +
> + /* Do not destroy the barrier because of the SIGKILL above, which
> + may have left the barrier in an inconsistent state. */
> + support_shared_free (shared);
> +
> return 0;
> }
> +
> +#define TIMEOUT 100
> +#include <support/test-driver.c>
>