There is a recent regression when running a Linux/m68k userland on qemu-user: applications crash with futex errors:

dh_md5sums -a -O--builddirectory=build -O--dbgsym-migration=aptitude-dbg
dh_builddeb -a -O--builddirectory=build -O--dbgsym-migration=aptitude-dbg
dpkg-deb: building package 'aptitude-dbgsym' in '../aptitude-dbgsym_0.8.13-4_m68k.deb'.
dpkg-deb: building package 'aptitude' in '../aptitude_0.8.13-4_m68k.deb'.
The futex facility returned an unexpected error code.
qemu: uncaught target signal 6 (Aborted) - core dumped
dpkg-deb: error: <compress> from tar -cf subprocess was killed by signal (Aborted)
dh_builddeb: error: dpkg-deb --root-owner-group --build debian/.debhelper/aptitude/dbgsym-root .. returned exit code 2
dh_builddeb: error: Aborting due to earlier error
make: *** [debian/rules:22: binary-arch] Error 2

The error was reported to QEMU upstream [1] but eventually turned out to be a glibc problem [2]. According to the comment by Richard Henderson, glibc must enforce 32-bit alignment of __libc_lock_t.

> [1] https://gitlab.com/qemu-project/qemu/-/issues/957
> [2] https://gitlab.com/qemu-project/qemu/-/issues/957#note_1081526226
I can confirm that the following change fixes the problem for me:

--- glibc-2.34.orig/sysdeps/nptl/libc-lockP.h
+++ glibc-2.34/sysdeps/nptl/libc-lockP.h
@@ -34,7 +34,7 @@
 #include <tls.h>
 
 /* Mutex type. */
-typedef int __libc_lock_t;
+typedef int __libc_lock_t __attribute__((aligned(4)));
 typedef struct { pthread_mutex_t mutex; } __rtld_lock_recursive_t;
 typedef pthread_rwlock_t __libc_rwlock_t;

However, I am not sure whether it will have any negative side effects, such as breaking the ABI.
It seems a real issue, but I am puzzled why we have not seen any problems so far. I take it most runs were done on single-core setups, where the hardware did not enforce 32-bit alignment for atomic operations.

A better change would be to use:

diff --git a/sysdeps/nptl/libc-lockP.h b/sysdeps/nptl/libc-lockP.h
index d3a6837fd2..9efe962588 100644
--- a/sysdeps/nptl/libc-lockP.h
+++ b/sysdeps/nptl/libc-lockP.h
@@ -34,7 +34,7 @@
 #include <tls.h>
 
 /* Mutex type. */
-typedef int __libc_lock_t;
+typedef int __libc_lock_t __LOCK_ALIGNMENT;
 typedef struct { pthread_mutex_t mutex; } __rtld_lock_recursive_t;
 typedef pthread_rwlock_t __libc_rwlock_t;

since __LOCK_ALIGNMENT is defined per architecture where required. HPPA also requires 16-byte alignment for locks, although that is just a historical artifact kept for compatibility with the old implementation.
(In reply to Adhemerval Zanella from comment #2)
> It seems a real issue, but I am puzzled why we have not see any issue so
> far. I take mostly runs were done in single-core, where hardware did not
> enforce 32-bit alignment with atomic operations.
> 
> A better change would be to use:
> 
> diff --git a/sysdeps/nptl/libc-lockP.h b/sysdeps/nptl/libc-lockP.h
> index d3a6837fd2..9efe962588 100644
> --- a/sysdeps/nptl/libc-lockP.h
> +++ b/sysdeps/nptl/libc-lockP.h
> @@ -34,7 +34,7 @@
>  #include <tls.h>
> 
>  /* Mutex type. */
> -typedef int __libc_lock_t;
> +typedef int __libc_lock_t __LOCK_ALIGNMENT;
>  typedef struct { pthread_mutex_t mutex; } __rtld_lock_recursive_t;
>  typedef pthread_rwlock_t __libc_rwlock_t;

Does this impact any externally visible ABIs?

> Since __LOCK_ALIGNMENT is defined per architecture if required. The HPPA
> also requires a 16-byte alignment for locks, although it is just a
> historical artifact to keep compatibility with old implementation.

If an architecture needs higher alignment for locks, then I strongly suggest
out-of-line locking in the kernel, as we did for parisc. We have an array of
locks that we use to scale the futex locking operations. We pick a lock
based on a hash of the low bits of the futex address, and we use that
consistently in our kernel helper (kernel-aided emulation of
compare-and-swap) and inside the kernel. This yields consistent behaviour on
SMP, where the userspace CAS uses the same set of lock words for a given
address as the kernel-side futex CAS. Those lock words are 16-byte aligned
because only load-and-clear-word (ldcw) exists on parisc, and it requires
the extra alignment for the hardware atomicity to hold.
The futex alignment requirement is only artificial; it is not a hardware requirement.
(In reply to Carlos O'Donell from comment #3)
> (In reply to Adhemerval Zanella from comment #2)
> > It seems a real issue, but I am puzzled why we have not see any issue so
> > far. I take mostly runs were done in single-core, where hardware did not
> > enforce 32-bit alignment with atomic operations.
> > 
> > A better change would be to use:
> > 
> > diff --git a/sysdeps/nptl/libc-lockP.h b/sysdeps/nptl/libc-lockP.h
> > index d3a6837fd2..9efe962588 100644
> > --- a/sysdeps/nptl/libc-lockP.h
> > +++ b/sysdeps/nptl/libc-lockP.h
> > @@ -34,7 +34,7 @@
> >  #include <tls.h>
> > 
> >  /* Mutex type. */
> > -typedef int __libc_lock_t;
> > +typedef int __libc_lock_t __LOCK_ALIGNMENT;
> >  typedef struct { pthread_mutex_t mutex; } __rtld_lock_recursive_t;
> >  typedef pthread_rwlock_t __libc_rwlock_t;
> 
> Does this impact any externally visible ABIs?
> 
> > Since __LOCK_ALIGNMENT is defined per architecture if required. The HPPA
> > also requires a 16-byte alignment for locks, although it is just a
> > historical artifact to keep compatibility with old implementation.
> 
> If an architecture needs higher alignment for locks then I strongly suggest
> out-of-line locking in the kernel as we did for parisc. We have an array of
> locks that we use to scale the futex locking operations. We pick a lock
> based on the low-bit hash of the futex address, and we use that consistently
> in our kernel helper (kernel aided emulation of compare-and-swap) and inside
> the kernel. This yields a consistent behaviour on SMP where the userspace
> CAS uses the same set of lock words for the address as the kernel-side futex
> CAS. Those lock words are 16-byte aligned because only load-and-clear-word
> (ldcw) exists on parisc and requires the extra alignment for the hardware
> atomicity to hold.

I was not aware of the HPPA limitation; it seems we will need to propagate
it to internal locks as well.
The m68k issue, afaiu, is different, because int has 2-byte alignment on m68k:

$ cat test.c
_Static_assert (_Alignof (int) == 4, "_Alignof (int) != 4");
$ x86_64-linux-gnu-gcc test.c -c
$ /home/azanella/toolchain/install/compilers/m68k-linux-gnu/bin/m68k-glibc-linux-gnu-gcc test.c -c
test.c:1:1: error: static assertion failed: "_Alignof (int) != 4"
    1 | _Static_assert (_Alignof (int) == 4, "_Alignof (int) != 4");
      | ^~~~~~~~~~~~~~

And the futex syscall enforces 4-byte alignment for its input addresses (as Andreas noted).
(In reply to Adhemerval Zanella from comment #5)
> (In reply to Carlos O'Donell from comment #3)
> > (In reply to Adhemerval Zanella from comment #2)
> > > It seems a real issue, but I am puzzled why we have not see any issue so
> > > far. I take mostly runs were done in single-core, where hardware did not
> > > enforce 32-bit alignment with atomic operations.
> > > 
> > > A better change would be to use:
> > > 
> > > diff --git a/sysdeps/nptl/libc-lockP.h b/sysdeps/nptl/libc-lockP.h
> > > index d3a6837fd2..9efe962588 100644
> > > --- a/sysdeps/nptl/libc-lockP.h
> > > +++ b/sysdeps/nptl/libc-lockP.h
> > > @@ -34,7 +34,7 @@
> > >  #include <tls.h>
> > > 
> > >  /* Mutex type. */
> > > -typedef int __libc_lock_t;
> > > +typedef int __libc_lock_t __LOCK_ALIGNMENT;
> > >  typedef struct { pthread_mutex_t mutex; } __rtld_lock_recursive_t;
> > >  typedef pthread_rwlock_t __libc_rwlock_t;
> > 
> > Does this impact any externally visible ABIs?
> > 
> > > Since __LOCK_ALIGNMENT is defined per architecture if required. The HPPA
> > > also requires a 16-byte alignment for locks, although it is just a
> > > historical artifact to keep compatibility with old implementation.
> > 
> > If an architecture needs higher alignment for locks then I strongly suggest
> > out-of-line locking in the kernel as we did for parisc. We have an array of
> > locks that we use to scale the futex locking operations. We pick a lock
> > based on the low-bit hash of the futex address, and we use that consistently
> > in our kernel helper (kernel aided emulation of compare-and-swap) and inside
> > the kernel. This yields a consistent behaviour on SMP where the userspace
> > CAS uses the same set of lock words for the address as the kernel-side futex
> > CAS. Those lock words are 16-byte aligned because only load-and-clear-word
> > (ldcw) exists on parisc and requires the extra alignment for the hardware
> > atomicity to hold.
> 
> I was not aware of HPPA limitation, it seems that we will need to propagate
> it to internal locks as well.

Internal locks on hppa go through a custom atomic.h, which uses the kernel
CAS helper, so the alignment requirement is tied to the natural ABI word of
the architecture. Nothing needs to be done here, and I did this on purpose
to avoid exposing the oddities directly to userspace. All userspace on hppa
has to use the kernel atomic helpers to carry out CAS on naturally aligned
words.

> The m68k issues, afaiu, is different because int has a 2-byte alignment on
> m68k:

Yes, the natural alignment is lower, so we need to raise it to the alignment
the kernel expects. For hppa the alignment is much higher, and I hide that
requirement inside the kernel CAS helper.

> $ cat test.c
> _Static_assert (_Alignof (int) == 4, "_Alignof (int) != 4");
> $ x86_64-linux-gnu-gcc test.c -c
> $ /home/azanella/toolchain/install/compilers/m68k-linux-gnu/bin/m68k-glibc-linux-gnu-gcc test.c -c
> test.c:1:1: error: static assertion failed: "_Alignof (int) != 4"
>     1 | _Static_assert (_Alignof (int) == 4, "_Alignof (int) != 4");
>       | ^~~~~~~~~~~~~~
> 
> And futex syscall enforces 4-bytes alignments for the input addresses (as
> Andreas noted).
(In reply to Adhemerval Zanella from comment #2)
> It seems a real issue, but I am puzzled why we have not see any issue so
> far. I take mostly runs were done in single-core, where hardware did not
> enforce 32-bit alignment with atomic operations.
> 
> A better change would be to use:
> 
> diff --git a/sysdeps/nptl/libc-lockP.h b/sysdeps/nptl/libc-lockP.h
> index d3a6837fd2..9efe962588 100644
> --- a/sysdeps/nptl/libc-lockP.h
> +++ b/sysdeps/nptl/libc-lockP.h
> @@ -34,7 +34,7 @@
>  #include <tls.h>
> 
>  /* Mutex type. */
> -typedef int __libc_lock_t;
> +typedef int __libc_lock_t __LOCK_ALIGNMENT;
>  typedef struct { pthread_mutex_t mutex; } __rtld_lock_recursive_t;
>  typedef pthread_rwlock_t __libc_rwlock_t;
> 
> Since __LOCK_ALIGNMENT is defined per architecture if required. The HPPA
> also requires a 16-byte alignment for locks, although it is just a
> historical artifact to keep compatibility with old implementation.

I've added this to glibc-2.35, recompiled and reinstalled glibc, and was then
able to

* update my chroot to the newest packages
* have it rebuild itself at MAKEOPTS="-j17"

without any issues. (That is, whereas building Python always failed before, I
have now built python-3.10 and python-3.11 twice each without problems.)

So, LGTM.
(In reply to Carlos O'Donell from comment #3)
> > Since __LOCK_ALIGNMENT is defined per architecture if required. The HPPA
> > also requires a 16-byte alignment for locks, although it is just a
> > historical artifact to keep compatibility with old implementation.
> 
> If an architecture needs higher alignment for locks then I strongly suggest
> out-of-line locking in the kernel as we did for parisc. We have an array of
> locks that we use to scale the futex locking operations. We pick a lock
> based on the low-bit hash of the futex address, and we use that consistently
> in our kernel helper (kernel aided emulation of compare-and-swap) and inside
> the kernel. This yields a consistent behaviour on SMP where the userspace
> CAS uses the same set of lock words for the address as the kernel-side futex
> CAS. Those lock words are 16-byte aligned because only load-and-clear-word
> (ldcw) exists on parisc and requires the extra alignment for the hardware
> atomicity to hold.

I just want to say that the parisc CAS implementation is subtle, and it took
years to get right. Page faults need to be disabled around the CAS
operation, and some mechanism is needed to handle COW breaks in the kernel
outside the critical region. Fortunately, parisc has a stby,e instruction:
when the effective address is the leftmost byte of a word, it does not write
to the effective address, but it triggers all the exceptions a write would.
It can therefore be used to trigger a COW break without actually writing to
memory.

Having 16-byte alignment for locks is somewhat problematic: malloc data
becomes over-aligned, which causes problems for some packages that check
alignment.
Fixed on 2.37.
(In reply to Adhemerval Zanella from comment #9)
> Fixed on 2.37.

Great, thank you!

Would it be possible to backport the fix to 2.35 and 2.36, or would that be a
breaking change?
(In reply to John Paul Adrian Glaubitz from comment #10)
> (In reply to Adhemerval Zanella from comment #9)
> > Fixed on 2.37.
> 
> Great, thank you!
> 
> Would it be possible to backport the fix to 2.35 and 2.36 or would that be a
> breaking change?

Alright, I will do.