Created attachment 8128 [details]
A minimal working example illustrating the issue

Description:
Since my distribution's update from glibc 2.20 to 2.21, a mixed 32-bit/64-bit application using POSIX semaphores (sem_open, sem_post, ...) has started aborting inside sem_post. "Mixed" here means that a 32-bit application sets up a named semaphore using sem_open and waits for input from a 64-bit application, which calls sem_post upon completion of a certain task.

Running the executable under strace reveals the following futex call as the cause (glibc calls abort() after the failed syscall):

futex(0x7f5281f83000, 0xf7784081 /* FUTEX_??? */, 1) = -1 ENOSYS (Function not implemented)

Unfortunately the code for this binary (as well as its build chain) is somewhat convoluted and ill-suited for debugging (I can provide it, along with build instructions, if necessary). Some debugging shows that the issue also occurs with other 32-bit/64-bit program combinations, so I produced the minimal working example attached to this report. It does not crash like the original binary, but simply fails to work as expected:

Steps to Reproduce:
* Unpack the tarball (tar xf minimal-working-example.tar.gz)
* Build all binaries (make all)
* Execute ./mwe and ./mwec in two terminals for the reference run
* One will receive an interleaved execution: mwe sleeps/waits once for each call of mwec-unlock
* Execute ./mwe_cleanup to reset/delete the named semaphore that was used
* Execute ./mwe-32 in one terminal and ./mwec in another one
* The semaphore unlock is not recognized by ./mwe-32

Looking into the sources, my best guess is that the performance optimization for "new_sem" in 2.21 changed the layout of struct new_sem in "sysdeps/nptl/internaltypes.h" for 64-bit binaries (as AMD64 provides "__HAVE_64B_ATOMICS"), while the 32-bit version does not provide those, so the struct layouts differ.
In the case of named semaphores these structs seem to be written to a file, which is then mmapped into the processes' address spaces by sem_open. As the struct layouts now differ, 32-bit and 64-bit binaries are no longer interoperable.

I've produced a patch that seems to fix the issue for me by reordering struct members (see file 0001-Amortize-layout-of-struct-new_sem.patch attached). Please note that this patch is more of a "proof of concept", as it only takes AMD64 into account and most likely will not play nicely with big-endian architectures...

The problem persists with:
* glibc-2.21
* git-master (latest commit is 3f293d614c9e641a0d96d347df5c1c5ee687762f)

Note: Bugzilla seems to allow only one attachment per bug report, so I will attach the patch in a comment below.
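To make the suspected layout difference concrete, here is a minimal sketch of the two variants of struct new_sem as I understand them. The struct names, the `private_` spelling and the helper functions are made up for illustration; the authoritative definition is the one in "sysdeps/nptl/internaltypes.h". Fixed-width types are used so that both layouts can be inspected from a single build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed mirror of the layout selected when __HAVE_64B_ATOMICS is set
   (e.g. x86_64).  Names are illustrative, not glibc's.  */
struct new_sem_64b {
    uint64_t data;      /* token count and waiter count packed together */
    int32_t  private_;  /* process-private vs. shared flag */
    int32_t  pad;
};

/* Assumed mirror of the layout used when only 32-bit atomics exist
   (e.g. i386).  */
struct new_sem_32b {
    uint32_t value;     /* token count plus a "waiters present" bit */
    int32_t  private_;
    int32_t  pad;
    uint32_t nwaiters;
};

/* Both variants occupy the same number of bytes...  */
_Static_assert(sizeof(struct new_sem_64b) == sizeof(struct new_sem_32b),
               "both sketches should be 16 bytes");

/* ...but the private flag lives at a different offset in each.  */
size_t private_offset_64b(void) {
    return offsetof(struct new_sem_64b, private_);
}

size_t private_offset_32b(void) {
    return offsetof(struct new_sem_32b, private_);
}

/* Print the offsets side by side.  */
void print_layouts(void) {
    printf("64b atomics: private at %zu, size %zu\n",
           private_offset_64b(), sizeof(struct new_sem_64b));
    printf("32b atomics: private at %zu, size %zu\n",
           private_offset_32b(), sizeof(struct new_sem_32b));
}
```

If these assumptions hold, a process built with one layout reads the private flag (which feeds into the futex operation) from a location where the other layout stores something else. That would be consistent with the nonsensical futex operation seen in the strace output, though I have not verified that this is the exact mechanism.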
Created attachment 8129 [details] A patch reordering the members of "struct new_sem" for better interoperability
The issue has been introduced by the following commit:

commit 042e1521c794a945edc43b5bfa7e69ad70420524
Author: Carlos O'Donell <carlos@systemhalted.org>
Date:   Wed Jan 21 00:46:16 2015 -0500

    Fix semaphore destruction (bug 12674).

    This commit fixes semaphore destruction by either using 64b atomic
    operations (where available), or by using two separate fields when only
    32b atomic operations are available. In the latter case, we keep a
    conservative estimate of whether there are any waiting threads in one
    bit of the field that counts the number of available tokens, thus
    allowing sem_post to atomically both add a token and determine whether
    it needs to call futex_wake.

    See:
    https://sourceware.org/ml/libc-alpha/2014-12/msg00155.html

It is still reproducible with glibc 2.23 or current git master branch.
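The two encodings described in the commit message can be sketched as follows. This is only an illustration of the idea, not glibc's code: the function names are hypothetical, and the shift of 32 for the waiter count in the 64-bit variant and the shift of 1 for the token count in the 32-bit variant are my assumptions about how the fields are packed.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit atomics variant (assumed): tokens in the low 32 bits,
   number of waiters in the high 32 bits of a single word.  */
uint64_t sem64_pack(uint32_t tokens, uint32_t nwaiters) {
    return ((uint64_t) nwaiters << 32) | tokens;
}

uint32_t sem64_tokens(uint64_t data) {
    return (uint32_t) data;
}

/* 32-bit atomics variant (assumed): tokens shifted left by one,
   lowest bit is the conservative "there may be waiters" flag.  */
uint32_t sem32_pack(uint32_t tokens, int maybe_waiters) {
    return (tokens << 1) | (maybe_waiters ? 1u : 0u);
}

uint32_t sem32_tokens(uint32_t value) {
    return value >> 1;
}

int sem32_maybe_waiters(uint32_t value) {
    return (int) (value & 1u);
}

/* What a 32-bit reader would see in its value field after a 64-bit
   writer stored `data`, assuming a little-endian machine where the
   low half of the 64-bit word sits at offset 0.  */
uint32_t sem32_view_of_sem64(uint64_t data) {
    return (uint32_t) data;
}
```

Under these assumptions, a 64-bit sem_post on an empty semaphore stores one token as the value 1, which a 32-bit reader decodes as zero tokens with the waiters flag set. That would match the observation that ./mwe-32 never sees the unlock.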
The type sem_t has never been compatible between 32-bit and 64-bit processes; neither its size nor its alignment matches.
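One way to check the size and alignment claim on a given system is a small helper like the one below, compiled once for each programming model (for example with -m32 and -m64 on a multilib GCC setup, which is an assumption about the toolchain). The function names are hypothetical.

```c
#include <assert.h>
#include <semaphore.h>
#include <stddef.h>
#include <stdio.h>

/* Size of the public semaphore type for the current build.  */
size_t sem_t_size(void) {
    return sizeof(sem_t);
}

/* Alignment requirement of the public semaphore type (C11).  */
size_t sem_t_align(void) {
    return (size_t) _Alignof(sem_t);
}

/* Print both values; compare the output of the 32-bit and 64-bit builds.  */
void print_sem_t_layout(void) {
    printf("sizeof(sem_t) = %zu, alignof(sem_t) = %zu\n",
           sem_t_size(), sem_t_align());
}
```

If the two builds print different numbers, the public type itself already differs between the programming models, independently of the internal struct new_sem layout.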
(In reply to Andreas Schwab from comment #3)
> The type sem_t has never been compatible between 32-bit and 64-bit
> processes, neither size nor alignment match.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/c99.html

"Conforming applications shall not attempt to link together object files compiled for different programming models. Applications shall also be aware that binary data placed in shared memory or in files might not be recognized by applications built for other programming models."

As Andreas says, this was never supported, and POSIX backs that position.
In my opinion it is worth reconsidering this issue. Public semaphores are "interprocess communication objects". The underlying concept of a "process" does not take into account possible differences in the internal implementation of each instance. Most operating systems that support multiple subsystems hide the particular subsystem of a process very well; many people have worked hard to achieve this goal. Processes can usually access the system's resources and interoperate regardless of their internal details. E.g. on Linux all processes use the filesystem and networking in the same way; they can communicate via sockets, pipes, files or shared memory without any restrictions. They can share file descriptors too, inheriting them from their parent in a fully transparent way. The same happens on Windows.

Distributing only one binary for both 32-bit and 64-bit platforms is very common. There may be many good reasons to maintain a single binary, and sometimes it is the only possible option. If that binary exposes a public interface by means of "standard" interprocess communication objects, one would expect this public interface to be accessible without any artificial restrictions. It should be enough to say in the documentation "wait for POSIX semaphore A, write data to file B, release semaphore A". Glibc's implementation of POSIX semaphores breaks this rule in exchange for (my guess) a modest performance gain.

At least on x86_64 platforms, it is easy to arrange things in such a way that both 32-bit and 64-bit processes interoperate. I don't know whether things are as easy on other CPUs, but if there is an acceptable trade-off between performance and interoperability, interoperability should be implemented.

Another problem with the current implementation of POSIX semaphores is that an access from the wrong subsystem not only fails, it also messes up the object's state. And this is due to a "legitimate" API call.

In conclusion, please reconsider this issue.