Bug 32151 - pthread_cancel causes deadlock in syslog function
Summary: pthread_cancel causes deadlock in syslog function
Status: WAITING
Alias: None
Product: glibc
Classification: Unclassified
Component: nptl (show other bugs)
Version: 2.26
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-09-08 02:12 UTC by 邢 启阳
Modified: 2024-09-14 14:35 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2024-09-09 00:00:00
fweimer: security-


Attachments
pthread_cancel causes deadlock in syslog test demo (759 bytes, text/x-csrc)
2024-09-08 02:12 UTC, 邢 启阳
Details

Note You need to log in before you can comment on or make changes to this bug.
Description 邢 启阳 2024-09-08 02:12:50 UTC
Created attachment 15695 [details]
pthread_cancel causes deadlock in syslog test demo

Descriptions:
Create two threads, A and B, each calling the syslog function, which will probabilistically cause a deadlock in the syslog function when thread B is cancelled by the pthread_cancel function.Test demo Please refer to the attachment "syslog_pthreadcancel.c". Test demo gcc compilation command as follows.
/opt/Petalinux/2018.3/tools/linux-i386/aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc -o syslog.out  syslog_pthreadcancel.c  -g -lpthread 

Reproduction Environment:
AARCH64 and AARCH32

GCC version :
root@20f5ea71ff60:~# /opt/Petalinux/2018.3/tools/linux-i386/aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=/opt/Petalinux/2018.3/tools/linux-i386/aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/opt/Petalinux/2018.3/tools/linux-i386/aarch64-linux-gnu/bin/../libexec/gcc/aarch64-linux-gnu/7.3.1/lto-wrapper
Target: aarch64-linux-gnu
Configured with: /proj/esdt_sdk/gnu_abe/ABE/builds/lin/arm/arm-linux/snapshots/gcc-linaro-7.3-2018.04-rc3/configure SHELL=/bin/sh --with-bugurl=https://bugs.linaro.org --with-mpc=/proj/esdt_sdk/gnu_abe/ABE/builds/lin/aarch64/aarch64-linux/builds/destdir/x86_64-unknown-linux-gnu --with-mpfr=/proj/esdt_sdk/gnu_abe/ABE/builds/lin/aarch64/aarch64-linux/builds/destdir/x86_64-unknown-linux-gnu --with-gmp=/proj/esdt_sdk/gnu_abe/ABE/builds/lin/aarch64/aarch64-linux/builds/destdir/x86_64-unknown-linux-gnu --with-gnu-as --with-gnu-ld --disable-libstdcxx-pch --disable-libmudflap --with-cloog=no --with-ppl=no --with-isl=no --disable-nls --enable-c99 --enable-gnu-indirect-function --disable-multilib --with-arch=armv8-a --enable-fix-cortex-a53-835769 --enable-fix-cortex-a53-843419 --enable-multiarch CFLAGS=-O2 CXXFLAGS=-O2 --disable-silent-rules --enable-libquadmath --enable-libg2c --enable-symvers=gnu --enable-libstdcxx-pch --enable-libssp --disable-libmudflap --enable-checking=release --enable-cheaders=c_global --enable-poison-system-directories --enable-clocale=generic --enable-shared --enable-threads=posix --disable-multilib --enable-c99 --without-local-prefix --enable-lto --disable-bootstrap --with-linker-hash-style=gnu --with-ppl=no --with-cloog=no --without-isl --enable-nls --enable-__cxa_atexit --with-build-sysroot=/proj/esdt_sdk/gnu_abe/ABE/builds/lin/aarch64/aarch64-linux/builds/destdir/x86_64-unknown-linux-gnu/aarch64-linux-gnu/libc --enable-plugins --enable-linker-build-id --enable-long-long --enable-shared --with-sysroot=/proj/esdt_sdk/gnu_abe/ABE/builds/lin/aarch64/aarch64-linux/builds/destdir/x86_64-unknown-linux-gnu/aarch64-linux-gnu/libc --enable-languages=c,c++,lto --enable-checking=yes --disable-bootstrap --with-bugurl=https://bugs.linaro.org --build=x86_64-unknown-linux-gnu --host=x86_64-unknown-linux-gnu --target=aarch64-linux-gnu --prefix=/proj/esdt_sdk/gnu_abe/ABE/builds/lin/aarch64/aarch64-linux/builds/destdir/x86_64-unknown-linux-gnu
Thread model: posix
gcc version 7.3.1 20180314 (Linaro GCC 7.3-2018.04-rc3)
root@20f5ea71ff60:~#

I debugged with GDB and found a deadlock inside the syslog function, the call stack is as follows.

(gdb) thread apply all bt

Thread 2 (Thread 0xb6da7460 (LWP 749) "syslog.out"):
#0  0xb6e837e8 in __lll_lock_wait_private () from /lib/libc.so.6
#1  0xb6e728bc in __vsyslog_chk () from /lib/libc.so.6
#2  0xb6e729a4 in syslog () from /lib/libc.so.6
#3  0x000107e4 in run (arg=0x0) at syslog_pthread_cancel.c:40
#4  0xb6eecd7c in start_thread () from /lib/libpthread.so.0
#5  0xb6e763e8 in ?? () from /lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0xb6fa6010 (LWP 748) "syslog.out"):
#0  0xb6e83820 in __lll_lock_wait_private () from /lib/libc.so.6
#1  0xb6e728bc in __vsyslog_chk () from /lib/libc.so.6
#2  0xb6e729a4 in syslog () from /lib/libc.so.6
#3  0x00010abc in main () at syslog_pthread_cancel.c:103
(gdb)

I checked the syslog function source code and found that the cancel_handler function is registered via the __libc_cleanup_push macro using __attribute__ ((__cleanup__ ())) before calling __libc_lock_lock to get the lock.
Inside the cancel_handler function the __libc_lock_unlock is called to release the lock, but I don't know why the cancel_handler function is not called after the cancellation point.
Comment 1 Florian Weimer 2024-09-09 09:21:38 UTC
Looks like this may have been fixed as part of the changes for bug 26100?

commit c4e4b2e149705559d28b16a9b47ba2f6142d6a6c
Author: Andreas Schwab <schwab@suse.de>
Date:   Tue Jun 23 12:55:49 2020 +0200

    Correct locking and cancellation cleanup in syslog functions (bug 26100)
    
    Properly serialize the access to the global state shared between the
    syslog functions, to avoid races in multithreaded processes.  Protect a
    local allocation in the __vsyslog_internal function from leaking during
    cancellation.

Would you please check if this fixes the issue for you? Thanks.
Comment 2 邢 启阳 2024-09-10 12:54:12 UTC
Based on glibc-2.26, I merged part of the commit c4e4b2e149705559d28b16a9b47ba2f6142d6a6c into syslog.c . but after testing, I found that it still deadlocks. 

code changes are as follows
=================================================
diff --git a/misc/syslog.c b/misc/syslog.c
index 6922ad6..78d7021 100644
--- a/misc/syslog.c
+++ b/misc/syslog.c
@@ -91,13 +91,19 @@ struct cleanup_arg
 static void
 cancel_handler (void *ptr)
 {
-#ifndef NO_SIGPIPE
-  /* Restore the old signal handler.  */
-  struct cleanup_arg *clarg = (struct cleanup_arg *) ptr;
-
-  if (clarg != NULL && clarg->oldaction != NULL)
-    __sigaction (SIGPIPE, clarg->oldaction, NULL);
-#endif
+   /* Restore the old signal handler.  */
+   struct cleanup_arg *clarg = (struct cleanup_arg *) ptr;
+
+   if (clarg != NULL)
+     {
+ #ifndef NO_SIGPIPE
+       if (clarg->oldaction != NULL)
+         __sigaction (SIGPIPE, clarg->oldaction, NULL);
+ #endif
+
+       /* Free the memstream buffer,  */
+       free (clarg->buf);
+     }

   /* Free the lock.  */
   __libc_lock_unlock (syslog_lock);
@@ -155,9 +161,15 @@ __vsyslog_chk(int pri, int flag, const char *fmt, va_list ap)
                pri &= LOG_PRIMASK|LOG_FACMASK;
        }

+    struct cleanup_arg clarg;
+    clarg.buf = NULL;
+    clarg.oldaction = NULL;
+    __libc_cleanup_push (cancel_handler, &clarg);
+    __libc_lock_lock (syslog_lock);
+
        /* Check priority against setlogmask values. */
        if ((LOG_MASK (LOG_PRI (pri)) & LogMask) == 0)
-               return;
+               goto out;

        /* Set default facility if none specified. */
        if ((pri & LOG_FACMASK) == 0)
@@ -224,6 +236,7 @@ __vsyslog_chk(int pri, int flag, const char *fmt, va_list ap)
            /* Close the memory stream; this will finalize the data
               into a malloc'd buffer in BUF.  */
            fclose (f);
+        clarg.buf = buf;
          }

        /* Output to stderr if requested. */
@@ -240,22 +253,11 @@ __vsyslog_chk(int pri, int flag, const char *fmt, va_list ap)
                    v->iov_base = (char *) "\n";
                    v->iov_len = 1;
                  }
-
-               __libc_cleanup_push (free, buf == failbuf ? NULL : buf);
-
-               /* writev is a cancellation point.  */
+
+        /* writev is a cancellation point.  */
                (void)__writev(STDERR_FILENO, iov, v - iov + 1);
-
-               __libc_cleanup_pop (0);
-       }
-
-       /* Prepare for multiple users.  We have to take care: open and
-          write are cancellation points.  */
-       struct cleanup_arg clarg;
-       clarg.buf = buf;
-       clarg.oldaction = NULL;
-       __libc_cleanup_push (cancel_handler, &clarg);
-       __libc_lock_lock (syslog_lock);
+
+    }

 #ifndef NO_SIGPIPE
        /* Prepare for a broken connection.  */
@@ -308,7 +310,7 @@ __vsyslog_chk(int pri, int flag, const char *fmt, va_list ap)
        if (sigpipe == 0)
                __sigaction (SIGPIPE, &oldaction, (struct sigaction *) NULL);
 #endif
-
+out:
        /* End of critical section.  */
        __libc_cleanup_pop (0);
        __libc_lock_unlock (syslog_lock);
@@ -428,9 +430,11 @@ int
 setlogmask (int pmask)
 {
        int omask;
-
+    /* Protect against multiple users.  */
+    __libc_lock_lock (syslog_lock);
        omask = LogMask;
        if (pmask != 0)
                LogMask = pmask;
+    __libc_lock_unlock (syslog_lock);
        return (omask);
 }
root@MSI:~/workproject/glibc-2.26_build/glibc-2.26#
Comment 3 Adhemerval Zanella 2024-09-10 14:10:50 UTC
Instead of using a backport, could you check with the master branch?
Comment 4 邢 启阳 2024-09-12 08:03:32 UTC
I'm not testing with the new glibc version on the mainline right now。
But some new clues were discovered today based on version 2.26.
I debugging this issue using GDB, and set breakpoint at unwind_stop function.
this is the stack trace we can see:
(gdb) thread apply all bt

Thread 3 (Thread 0x7fbed561f0 (LWP 2546)):
#0  unwind_stop (version=1, actions=26, exc_class=0, exc_obj=0x7fbed56660, context=0x7fbed55370, stop_parameter=0x7fbed55a28) at unwind.c:43
#1  0x0000007fbe541008 in ?? () from /lib/libgcc_s.so.1
#2  0x0000007fbe541364 in _Unwind_ForcedUnwind () from /lib/libgcc_s.so.1
#3  0x0000007fbf6b5fd8 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#4  0x0000007fbf65ac68 in __pthread_unwind (buf=<optimized out>) at forward.c:206
#5  0x0000007fbf62daf8 in __do_cancel () at pthreadP.h:297
#6  __libc_enable_asynccancel () at ../nptl/cancellation.c:49
#7  0x0000007fbf622054 in __libc_send (fd=3, buf=0x7fb0001000, len=73, flags=-2313, flags@entry=16384) at ../sysdeps/unix/sysv/linux/send.c:28
#8  0x0000007fbf61d24c in _GI___vsyslog_chk (pri=<optimized out>, flag=flag@entry=-1, fmt=0x400f88 "this is %u process 6\n", ap=...) at ../misc/syslog.c:281
#9  0x0000007fbf61d43c in syslog (pri=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:123
#10 0x0000000000400a64 in cancel_run (arg=0x0) at syslog_pthreadcancel.c:42
#11 0x0000007fbf6acf70 in start_thread (arg=0x7ffffff6f6) at pthread_create.c:465
#12 0x0000007fbf62117c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 (Thread 0x7fbf5561f0 (LWP 2545)):
#0  0x0000007fbf62da34 in __lll_lock_wait_private (futex=0x7fbf6a2ea4 <syslog_lock>) at ./lowlevellock.c:33
#1  0x0000007fbf61d374 in _GI___vsyslog_chk (pri=6, flag=flag@entry=-1, fmt=0x400fb8 "this is %u process 8\n", ap=...) at ../misc/syslog.c:168
#2  0x0000007fbf61d43c in syslog (pri=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:123
#3  0x0000000000400bf8 in run (arg=0x0) at syslog_pthreadcancel.c:70
#4  0x0000007fbf6acf70 in start_thread (arg=0x7ffffff6ff) at pthread_create.c:465
#5  0x0000007fbf62117c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0x7fbf6f8010 (LWP 2542)):
#0  0x0000007fbf62da5c in __lll_lock_wait_private (futex=0x7fbf6a2ea4 <syslog_lock>) at ./lowlevellock.c:30
#1  0x0000007fbf61d374 in _GI___vsyslog_chk (pri=6, flag=flag@entry=-1, fmt=0x4010f0 "%u process send cancel signal\n", ap=...) at ../misc/syslog.c:168
#2  0x0000007fbf61d43c in syslog (pri=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:123
#3  0x0000000000400d98 in main () at syslog_pthreadcancel.c:110
(gdb) b __longjmp
(gdb) c
Continuing.
Thread 3 "syslog.out" hit Breakpoint 3, __longjmp () at ../sysdeps/aarch64/__longjmp.S:51
51              ../sysdeps/aarch64/__longjmp.S: No such file or directory.
(gdb) thread apply all bt

Thread 3 (Thread 0x7fbed561f0 (LWP 2546)):
#0  __longjmp () at ../sysdeps/aarch64/__longjmp.S:51
#1  0x0592482a06d21a88 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0x7fbf5561f0 (LWP 2545)):
#0  0x0000007fbf62da34 in __lll_lock_wait_private (futex=0x7fbf6a2ea4 <syslog_lock>) at ./lowlevellock.c:33
#1  0x0000007fbf61d374 in _GI___vsyslog_chk (pri=6, flag=flag@entry=-1, fmt=0x400fb8 "this is %u process 8\n", ap=...) at ../misc/syslog.c:168
#2  0x0000007fbf61d43c in syslog (pri=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:123
#3  0x0000000000400bf8 in run (arg=0x0) at syslog_pthreadcancel.c:70
#4  0x0000007fbf6acf70 in start_thread (arg=0x7ffffff6ff) at pthread_create.c:465
#5  0x0000007fbf62117c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0x7fbf6f8010 (LWP 2542)):
#0  0x0000007fbf62da5c in __lll_lock_wait_private (futex=0x7fbf6a2ea4 <syslog_lock>) at ./lowlevellock.c:30
#1  0x0000007fbf61d374 in _GI___vsyslog_chk (pri=6, flag=flag@entry=-1, fmt=0x4010f0 "%u process send cancel signal\n", ap=...) at ../misc/syslog.c:168
#2  0x0000007fbf61d43c in syslog (pri=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:123
#3  0x0000000000400d98 in main () at syslog_pthreadcancel.c:110
(gdb) n
52         in ../sysdeps/aarch64/__longjmp.S
(gdn) n
.
.
.
start_thread (arg=0x7fbed561f0) at pthread_create.c:438
438 pthread_create.c:No such file or directory
(gdb) n
472       in pthread_create.c
(gdb) n
475       in pthread_create.c
(gdb) n
478       in pthread_create.c
(gdb) n
483       in pthread_create.c
(gdb) n
488       in pthread_create.c
(gdb) n
519       in pthread_create.c
(gdb) n
556       in pthread_create.c
(gdb) n
559       in pthread_create.c 
(gdb) n
560       in pthread_create.c
(gdb) n
562       in pthread_create.c
(gdb) n
581       in pthread_create.c
(gdb) n
584       in pthread_create.c
(gdb) n
608       in pthread_create.c
(gdb) n
[Thread 0x7fbed561f0 (LWP 2546) exited]


In the GDB backtrace we can see it unwinds through the pthread cancel signal frame,__libc_enable_asynccancel all the way to our thread routine cancel_run thread. because more function in libgcc_s.so.1 can’t be displayed in gdb, to really ensure the GCC unwinder is working as excepted,let GDB run continue,make sure to run to the breakpoint unwind_stop function every time 。In debugging i found that if the thread is canceled after calling the __libc_send function, the unwinder is bailing out early.
when looking at the glibc build I noticed the  .eh_frame section is missing from socket/send.o
I modified the socket/Makefile as follow, rebuild, no deadlock after testing
root@MSI:~/workproject/glibc-2.26_build/glibc-2.26# git diff diff --git a/socket/Makefile b/socket/Makefile index 1e2555d..a019e24 100644 --- a/socket/Makefile +++ b/socket/Makefile @@ -36,3 +36,5 @@ tests := tst-accept4 aux := sa_len include ../Rules +CFLAGS-send.c = -fexceptions -fasynchronous-unwind-tables + root@MSI:~/workproject/glibc-2.26_build/glibc-2.26#
Comment 5 邢 启阳 2024-09-12 08:12:38 UTC
root@MSI:~/workproject/glibc-2.26_build/glibc-2.26# git diff 
diff --git a/socket/Makefile b/socket/Makefile 
index 1e2555d..a019e24 100644 --- a/socket/Makefile 
+++ b/socket/Makefile 
@@ -36,3 +36,5 @@ tests := tst-accept4 
aux := sa_len 
include ../Rules 
+CFLAGS-send.c = -fexceptions -fasynchronous-unwind-tables 
+ 
root@MSI:~/workproject/glibc-2.26_build/glibc-2.26#
Comment 6 邢 启阳 2024-09-14 14:35:45 UTC
I've checked the master branch code and commit logs , found that this issue may have been fixed in the ccommit: 78d1724d53e23fcbbf4df832015aae269b98249c "nptl: Remove send from libpthread", which was included in glibc 2.34, and I'll be testing glibc 2.33 and 2.34 separately later.