This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: what to do with flaky tests?


On 03/05/2019 16:50, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
>> On 03/05/2019 11:29, Florian Weimer wrote:
>>> * Szabolcs Nagy:
>>>
>>>> currently these tests are flaky on the aarch64 buildbot:
>>>>
>>>> FAIL: malloc/tst-mallocfork2
>>>> FAIL: nptl/tst-eintr1
>>>>
>>>> the first one is a timeout: the test can take unbounded time
>>>> since there is no guarantee that malloc ever gets interrupted
>>>> by the signal (so increasing the timeout limit does not help:
>>>> sometimes it finishes immediately sometimes it take several
>>>> minutes).
>>>
>>> I think we can count the total delivered signals instead, so the run
>>> time is more predictable.  I can write a patch.
>>
>> yeah if there is a guaranteed limit that works for me.
>> (i guess the test can be marked UNSUPPORTED if it cannot
>> trigger the right condition)
> 
> The patch below uses process-shared barriers for synchronization.  I see
> numbers like these when running it:
> 
> $ time malloc/tst-mallocfork2 
> info: signals received during fork: 833
> info: signals received during free: 924
> info: signals received during malloc: 3625
> 
> real    0m5.996s
> user    0m2.715s
> sys     0m2.883s
> 
> Due to code churn, it's not easy to verify if the changes interfere with
> the original test objective.  However, I temporarily added this at the
> beginning of the test:
> 
>   {
>     auto void *cb(void *closure) { pause (); return NULL; }
>     xpthread_create (NULL, cb, NULL);
>   }
> 
> This puts the process into multi-threaded mode, and this still results
> in a deadlock because none of the single-thread code paths are taken
> anymore.  So I think the test still works as intended.
> 
> Does this patch make the test pass for you?

yes, the test now seems to pass reliably.
on the problematic machine i see

$ time ./testrun.sh malloc/tst-mallocfork2
info: signals received during fork: 103
info: signals received during free: 482
info: signals received during malloc: 75

real	0m22.477s
user	0m10.366s
sys	0m12.284s

(the time does not vary much so the 100s
timeout setting works now)

thanks.


>>>> the second one can break make check: kernel side task
>>>> accounting is broken and even though no more than 100 threads
>>>> are alive at a time the cgroup task limit can be hit (even
>>>> if that's >20000 tasks) and then all fork/clone fails so
>>>> make check cannot continue and test results are lost.
>>>
>>> Is this with a current kernel?  We haven't seen this in a while, after
>>> encountering it rather persistently.
>>
>> the kernel is ubuntu 4.15.0-34-generic
>> do you know since what kernel version was this improved?
> 
> Sorry, I don't know.
> 
> Thanks,
> Florian
> 
> 
> malloc/tst-mallocfork2: Use process-shared barriers
> 
> This synchronization method has a lower overhead and makes
> it more likely that the signal arrives during one of the critical
> functions.
> 
> Also test for fork deadlocks explicitly.
> 
> 2019-05-03  Florian Weimer  <fweimer@redhat.com>
> 
> 	malloc/tst-mallocfork2: Use process-shared barriers.
> 	* malloc/tst-mallocfork2.c: Switch to <support/test-driver.c>.
> 	(signal_count, sigusr1_sender_pid): Remove.
> 	(iterations): Define constant.
> 	(shared): New variable.
> 	(sigusr1_received): Update comment.
> 	(sigusr1_handler): Do not send SIGSTOP to the sender process.
> 	(signal_sender): Optional use barriers to avoid sending signals
> 	during irrelevant times.
> 	(do_it): Initialize variable shared.  Use xfork for error
> 	checking.  Launch multiple SIGUSR1-sending subprocesses.  Limit
> 	the iteration count, independent of signal delivery.  Check for
> 	deadlocks in fork.  Introduce barriers for reducing signal
> 	traffic.  Do not send SIGCONT to the SIGUSR1-sending processes;
> 	replaced by the barriers.  Count signals during fork/free/malloc
> 	and report them.
> 	* malloc/Makefile (tst-mallocfork): Link with libpthread.
> 
> diff --git a/malloc/Makefile b/malloc/Makefile
> index aadf602dfd..d2fba29953 100644
> --- a/malloc/Makefile
> +++ b/malloc/Makefile
> @@ -252,3 +252,4 @@ $(objpfx)tst-dynarray-fail-mem.out: $(objpfx)tst-dynarray-fail.out
>  
>  $(objpfx)tst-malloc-tcache-leak: $(shared-thread-library)
>  $(objpfx)tst-malloc_info: $(shared-thread-library)
> +$(objpfx)tst-mallocfork2: $(shared-thread-library)
> diff --git a/malloc/tst-mallocfork2.c b/malloc/tst-mallocfork2.c
> index 474b0cea4a..30e7abe6fd 100644
> --- a/malloc/tst-mallocfork2.c
> +++ b/malloc/tst-mallocfork2.c
> @@ -34,6 +34,11 @@
>  #include <sys/wait.h>
>  #include <time.h>
>  #include <unistd.h>
> +#include <array_length.h>
> +#include <support/check.h>
> +#include <support/support.h>
> +#include <support/xthread.h>
> +#include <support/xunistd.h>
>  
>  /* How many malloc objects to keep arond.  */
>  enum { malloc_objects = 1009 };
> @@ -41,19 +46,16 @@ enum { malloc_objects = 1009 };
>  /* The maximum size of an object.  */
>  enum { malloc_maximum_size = 70000 };
>  
> -/* How many signals need to be delivered before the test exits.  */
> -enum { signal_count = 1000 };
> +/* How many iterations the test performs before exiting.  */
> +enum { iterations = 10000 };
>  
> -static int do_test (void);
> -#define TIMEOUT 100
> -#define TEST_FUNCTION do_test ()
> -#include "../test-skeleton.c"
> -
> -/* Process ID of the subprocess which sends SIGUSR1 signals.  */
> -static pid_t sigusr1_sender_pid;
> +/* Barrier for synchronization with the processes sending SIGUSR1
> +   signals, to make it more likely that the signals arrive during a
> +   fork/free/malloc call.  */
> +static struct { pthread_barrier_t barrier; } *shared;
>  
>  /* Set to 1 if SIGUSR1 is received.  Used to detect a signal during
> -   malloc/free.  */
> +   fork/free/malloc.  */
>  static volatile sig_atomic_t sigusr1_received;
>  
>  /* Periodically set to 1, to indicate that the process is making
> @@ -63,17 +65,6 @@ static volatile sig_atomic_t progress_indicator = 1;
>  static void
>  sigusr1_handler (int signo)
>  {
> -  /* Let the main program make progress, by temporarily suspending
> -     signals from the subprocess.  */
> -  if (sigusr1_received)
> -    return;
> -  /* sigusr1_sender_pid might not be initialized in the parent when
> -     the first SIGUSR1 signal arrives.  */
> -  if (sigusr1_sender_pid > 0 && kill (sigusr1_sender_pid, SIGSTOP) != 0)
> -    {
> -      write_message ("error: kill (SIGSTOP)\n");
> -      abort ();
> -    }
>    sigusr1_received = 1;
>  
>    /* Perform a fork with a trivial subprocess.  */
> @@ -108,6 +99,8 @@ liveness_signal_handler (int signo)
>      write_message ("warning: process seems to be stuck\n");
>  }
>  
> +/* Send SIGNO to the parent process.  If SLEEP, wait a second between
> +   signals, otherwise use barriers to delay sending signals.  */
>  static void
>  __attribute__ ((noreturn))
>  signal_sender (int signo, bool sleep)
> @@ -115,6 +108,8 @@ signal_sender (int signo, bool sleep)
>    pid_t target = getppid ();
>    while (true)
>      {
> +      if (!sleep)
> +        xpthread_barrier_wait (&shared->barrier);
>        if (kill (target, signo) != 0)
>          {
>            dprintf (STDOUT_FILENO, "error: kill: %m\n");
> @@ -123,14 +118,17 @@ signal_sender (int signo, bool sleep)
>        if (sleep)
>          usleep (1 * 1000 * 1000);
>        else
> -        /* Reduce the rate at which we send signals.  */
> -        sched_yield ();
> +        xpthread_barrier_wait (&shared->barrier);
>      }
>  }
>  
>  static int
>  do_test (void)
>  {
> +  /* shared->barrier is intialized along with sigusr1_sender_pids
> +     below.  */
> +  shared = support_shared_allocate (sizeof (*shared));
> +
>    struct sigaction action =
>      {
>        .sa_handler = sigusr1_handler,
> @@ -150,48 +148,74 @@ do_test (void)
>        return 1;
>      }
>  
> -  pid_t sigusr2_sender_pid = fork ();
> +  pid_t sigusr2_sender_pid = xfork ();
>    if (sigusr2_sender_pid == 0)
>      signal_sender (SIGUSR2, true);
> -  sigusr1_sender_pid = fork ();
> -  if (sigusr1_sender_pid == 0)
> -    signal_sender (SIGUSR1, false);
> +
> +  /* Send SIGUSR1 signals from several processes.  Hopefully, one
> +     signal will hit one of the ciritical functions.  Use a barrier to
> +     avoid sending signals while not running fork/free/malloc.  */
> +  pid_t sigusr1_sender_pids[5];
> +  {
> +    pthread_barrierattr_t attr;
> +    xpthread_barrierattr_init (&attr);
> +    xpthread_barrierattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);
> +    xpthread_barrier_init (&shared->barrier, &attr,
> +                           array_length (sigusr1_sender_pids) + 1);
> +    xpthread_barrierattr_destroy (&attr);
> +  }
> +  for (size_t i = 0; i < array_length (sigusr1_sender_pids); ++i)
> +    {
> +      sigusr1_sender_pids[i] = fork ();
> +      if (sigusr1_sender_pids[i] == 0)
> +        signal_sender (SIGUSR1, false);
> +    }
>  
>    void *objects[malloc_objects] = {};
> -  unsigned signals = 0;
> +  unsigned int fork_signals = 0;
> +  unsigned int free_signals = 0;
> +  unsigned int malloc_signals = 0;
>    unsigned seed = 1;
> -  time_t last_report = 0;
> -  while (signals < signal_count)
> +  for (int i = 0; i < iterations; ++i)
>      {
>        progress_indicator = 1;
>        int slot = rand_r (&seed) % malloc_objects;
>        size_t size = rand_r (&seed) % malloc_maximum_size;
> -      if (kill (sigusr1_sender_pid, SIGCONT) != 0)
> +
> +      /* Occasionally do a fork first, to catch deadlocks there as
> +         well (see bug 24161).  */
> +      bool do_fork = (rand_r (&seed) % 7) == 0;
> +
> +      xpthread_barrier_wait (&shared->barrier);
> +      if (do_fork)
>          {
> -          printf ("error: kill (SIGCONT): %m\n");
> -          signal (SIGUSR1, SIG_IGN);
> -          kill (sigusr1_sender_pid, SIGKILL);
> -          kill (sigusr2_sender_pid, SIGKILL);
> -          return 1;
> +          sigusr1_received = 0;
> +          pid_t pid = xfork ();
> +          if (sigusr1_received)
> +            ++fork_signals;
> +          if (pid == 0)
> +            _exit (0);
> +          int status;
> +          int ret = TEMP_FAILURE_RETRY (waitpid (pid, &status, 0));
> +          if (ret < 0)
> +            FAIL_EXIT1 ("waitpid: %m");
> +          TEST_COMPARE (status, 0);
>          }
> -      sigusr1_received = false;
> +      sigusr1_received = 0;
>        free (objects[slot]);
> +      if (sigusr1_received)
> +        ++free_signals;
> +      sigusr1_received = 0;
>        objects[slot] = malloc (size);
>        if (sigusr1_received)
> -        {
> -          ++signals;
> -          time_t current = time (0);
> -          if (current != last_report)
> -            {
> -              printf ("info: SIGUSR1 signal count: %u\n", signals);
> -              last_report = current;
> -            }
> -        }
> +        ++malloc_signals;
> +      xpthread_barrier_wait (&shared->barrier);
> +
>        if (objects[slot] == NULL)
>          {
>            printf ("error: malloc: %m\n");
> -          signal (SIGUSR1, SIG_IGN);
> -          kill (sigusr1_sender_pid, SIGKILL);
> +          for (size_t i = 0; i < array_length (sigusr1_sender_pids); ++i)
> +            kill (sigusr1_sender_pids[i], SIGKILL);
>            kill (sigusr2_sender_pid, SIGKILL);
>            return 1;
>          }
> @@ -201,11 +225,20 @@ do_test (void)
>    for (int slot = 0; slot < malloc_objects; ++slot)
>      free (objects[slot]);
>  
> -  /* Terminate the signal-sending subprocess.  The SIGUSR1 handler
> -     should no longer run because it uses sigusr1_sender_pid.  */
> -  signal (SIGUSR1, SIG_IGN);
> -  kill (sigusr1_sender_pid, SIGKILL);
> +  for (size_t i = 0; i < array_length (sigusr1_sender_pids); ++i)
> +    kill (sigusr1_sender_pids[i], SIGKILL);
>    kill (sigusr2_sender_pid, SIGKILL);
>  
> +  printf ("info: signals received during fork: %u\n", fork_signals);
> +  printf ("info: signals received during free: %u\n", free_signals);
> +  printf ("info: signals received during malloc: %u\n", malloc_signals);
> +
> +  /* Do not destroy the barrier because of the SIGKILL above, which
> +     may have left the barrier in an inconsistent state.  */
> +  support_shared_free (shared);
> +
>    return 0;
>  }
> +
> +#define TIMEOUT 100
> +#include <support/test-driver.c>
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]