sem_init() fails (when used in a certain way)

Tue Mar 29 13:41:00 GMT 2011

The python regression test suite randomly fails in test_multiprocessing, e.g.
(with the python-test package installed):

$ python -m test.regrtest test_multiprocessing
test_multiprocessing
sem_init: Device or resource busy
[...]

Note that with cygwin versions prior to 1.7.9, sem_init() doesn't set errno
[1], so the error is misreported, normally as 'Resource temporarily unavailable'.

I've also seen this failure with the piglit OpenGL test suite, and google
seems to find some similar reports with other python programs.

After some debugging, I reduced the issue to the following testcase:

$ cat sem_init_malloc_testcase.c

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <semaphore.h>

#define CHECK_STATUS(name)  do { if (status != 0) perror(name); } while (0)

#define SEMAPHORE_SIZE 36  /* size of cygwin's internal class semaphore */

int
main()
{
  int status;

  {
    void **lock1 = (void **)malloc(sizeof(void *));
    void *sem1 = malloc(SEMAPHORE_SIZE);
    void **lock2 = (void **)malloc(sizeof(void *));
    void **sem2 = (void **)malloc(SEMAPHORE_SIZE);

    *sem2 = sem1;

    free(sem2);
    free(lock2);
    free(sem1);
    free(lock1);
  }

  // The above primes the random data in the heap, so that the
  // memory we allocate for lock2 just happens to already point
  // to the internal class semaphore object which was allocated
  // when we sem_init-ed lock1

  {
    sem_t *lock1 = (sem_t *)malloc(sizeof(sem_t));
    status = sem_init(lock1,0,1);
    CHECK_STATUS("sem_init");
  }

  {
    sem_t *lock2 = (sem_t *)malloc(sizeof(sem_t));
    // uncommenting this makes test succeed, but isn't required
    // by the standard
    // memset(lock2, 0, sizeof(sem_t));
    status = sem_init(lock2,0,1);
    CHECK_STATUS("sem_init");
  }

  return 0;
}

$ gcc sem_init_malloc_testcase.c -o sem_init_malloc_testcase.exe

$ ./sem_init_malloc_testcase
sem_init: Device or resource busy

Obviously, this test case depends on both the internal implementation of sem_t
on cygwin (as a pointer to a heap object of class semaphore), and the
implementation details of malloc() and how it recycles free()d blocks.

Note that we are not initializing the same semaphore twice, which would be
undefined behaviour.  We are allocating a second sem_t on the heap, but its
uninitialized contents just happen to be a stale pointer to the first
semaphore object, and sem_init() rejects that.
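
To illustrate the mechanism, here is a toy model of a pointer-style handle
with a validity check.  It is not cygwin's actual code: the toy_ names, the
magic tag and the struct layout are all made up for this sketch, and I'm only
assuming the real check behaves roughly like this.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Arbitrary tag value, invented for this sketch. */
#define TOY_MAGIC 0x5e3a4f01u

/* Stand-in for the internal semaphore object that lives on the heap. */
struct toy_object
{
  uint32_t magic;
  unsigned value;
};

/* The user-visible handle is just a pointer to the internal object. */
typedef struct toy_object *toy_sem_t;

/* The handle "looks initialized" if it points at a tagged object. */
static int
toy_is_good_object (const toy_sem_t *handle)
{
  return *handle != NULL && (*handle)->magic == TOY_MAGIC;
}

/* Returns 0 on success, or an errno-style code on failure. */
int
toy_sem_init (toy_sem_t *handle, unsigned value)
{
  assert (handle != NULL);

  /* A stale copy of a live pointer is indistinguishable from a handle
     that really was initialized before, so it is rejected here. */
  if (toy_is_good_object (handle))
    return EBUSY;

  struct toy_object *obj = malloc (sizeof *obj);
  if (obj == NULL)
    return ENOMEM;

  obj->magic = TOY_MAGIC;
  obj->value = value;
  *handle = obj;
  return 0;
}
```

In this model, a second handle whose storage happens to hold a copy of the
first handle's pointer fails with EBUSY, while a zeroed handle succeeds.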

I would suggest this is non-conformant, as the specification of sem_init()
does not put any constraints on the contents of the sem_t it is handed.

This failure is not quite as unlikely as it perhaps seems: it's quite common
to put pointers to malloc()ed memory in other bits of malloc()ed memory, and
python does that a lot :-)

I'm not sure how to fix this:

Changing sem_t from a pointer to an instance of class semaphore is not a good
idea, as it would change a lot of code.  It is also a non-starter because it
breaks the ABI by changing sizeof(sem_t), and I have to assume there is a
reason it was implemented using a pointer in the first place.

Removing the is_good_object() check from semaphore::init() (and thus changing
the undefined behaviour when sem_init() is called twice on the same semaphore
from 'return EBUSY' to 'leak some memory') would work.  Perhaps the error
could be downgraded to strace output, e.g. "potential repeated semaphore
initialization"?

Since all the failures I'm aware of, apart from my test case above, occur with
python code, it might be ok just to patch python to zero-initialize the
malloc()ed memory before calling sem_init().  But I don't think zeroing a
sem_t before sem_init() is common practice, so the same failure may well
still exist in other programs.
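
For what it's worth, the application-side workaround could look something
like the sketch below.  The helper names new_zeroed_semaphore() and
free_semaphore() are hypothetical (they are not from python or cygwin); the
only point is that calloc() hands sem_init() zeroed storage, which has the
same effect as the commented-out memset() in the test case.

```c
#include <assert.h>
#include <semaphore.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a sem_t from zeroed memory, so that no
   stale heap contents can be mistaken for an already-initialized
   semaphore, then initialize it as a process-private semaphore. */
sem_t *
new_zeroed_semaphore (unsigned int initial_value)
{
  sem_t *sem = calloc (1, sizeof (sem_t));
  if (sem == NULL)
    return NULL;

  if (sem_init (sem, 0, initial_value) != 0)
    {
      /* errno is set by sem_init(); the caller can report it. */
      free (sem);
      return NULL;
    }

  return sem;
}

/* Hypothetical counterpart: destroy the semaphore and release its storage. */
void
free_semaphore (sem_t *sem)
{
  assert (sem != NULL);
  sem_destroy (sem);
  free (sem);
}
```

This only papers over the problem for callers that adopt it; any program
that passes plain malloc()ed memory to sem_init() can still hit the failure.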

I hope someone has some better ideas?

[1] http://cygwin.com/ml/cygwin-patches/2011-q1/msg00069.html