This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug nptl/12674] New: sem_post/sem_wait race causing sem_post to return EINVAL
- From: "dhatch at ilm dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sources dot redhat dot com
- Date: Thu, 14 Apr 2011 06:33:27 +0000
- Subject: [Bug nptl/12674] New: sem_post/sem_wait race causing sem_post to return EINVAL
- Auto-submitted: auto-generated
http://sourceware.org/bugzilla/show_bug.cgi?id=12674
Summary: sem_post/sem_wait race causing sem_post to return
EINVAL
Product: glibc
Version: unspecified
Status: NEW
Severity: critical
Priority: P2
Component: nptl
AssignedTo: drepper.fsp@gmail.com
ReportedBy: dhatch@ilm.com
Created attachment 5671
--> http://sourceware.org/bugzilla/attachment.cgi?id=5671
the test program, to be run in gdb as described
There appears to be a race in the implementation of sem_post/sem_wait on AMD64
(nptl/sysdeps/unix/sysv/linux/x86_64/sem_post.S in the source code)
which sometimes causes sem_post to access freed memory
and to fail with EINVAL.
In a nutshell, if sem_post happens to go to sleep
right after it increments sem->value
but before it looks at sem->nwaiters,
another thread can sail through a sem_wait without blocking
and destroy the semaphore,
so that when the sem_post thread wakes up and looks at sem->nwaiters,
it is looking at already-freed (and possibly unmapped) memory.
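To make the window concrete, here is a simplified C model of the two steps as described above. It is only a sketch, not the glibc assembly: `struct model_sem` and `model_sem_post` are hypothetical names, the layout just mirrors the offsets seen in the gdb session (a 4-byte value at offset 0, an 8-byte nwaiters at offset 8 on x86-64), and the real FUTEX_WAKE syscall is replaced by a return code.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the semaphore object; not glibc's definition. */
struct model_sem {
    atomic_uint  value;     /* 4-byte count, offset 0 */
    unsigned int pad;       /* padding so nwaiters lands at offset 8 */
    atomic_ulong nwaiters;  /* 8-byte waiter count, offset 8 on x86-64 */
};

/*
 * Model of the sequence in sem_post as described in this report.
 * Returns 1 if a wake-up would be issued, 0 otherwise.
 */
int model_sem_post(struct model_sem *sem)
{
    /* Step 1: make the token available. From this point on, another
     * thread can pass sem_wait without blocking. */
    atomic_fetch_add(&sem->value, 1);

    /* <-- the window: if this thread is descheduled here, the waiter may
     * already have destroyed and freed *sem before step 2 runs. */

    /* Step 2: read nwaiters. In the failing case this reads freed memory. */
    if (atomic_load(&sem->nwaiters) != 0)
        return 1;   /* the real code makes the FUTEX_WAKE syscall here */

    return 0;
}
```

With the object still alive, the model behaves as expected: a post with no waiters returns 0, and a post with a nonzero nwaiters returns 1. The problem arises only when the object is gone between the two steps.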
The bug was originally filed as gentoo bug 93366
( http://bugs.gentoo.org/show_bug.cgi?id=93366 ).
It's extremely hard to reproduce,
and I don't have a simple program that can demonstrate the problem reliably
by just running it (for less than a million years).
But it can be reproduced consistently
either by hacking up the sem_post source code
and adding a sleep() at a crucial point,
or by carefully stopping and resuming the threads
in a debugger with thread-specific breakpoints.
I'll include instructions for doing the latter using gdb >=7.1.
We're observing the problem on an AMD64 machine
running RHEL 5.3 Linux,
with glibc-2.5-34.el5_3.1
and gcc-4.1.2-44.el5.
I know that is ancient,
but I also downloaded the most current glibc source code today,
compiled sem_post.S and sem_wait.S from it,
and can still reproduce the problem using those.
Here are the instructions for reproducing the problem
using gdb 7.1 or 7.2 on the attached program
(gdb 7.0.1 and earlier fail with a supposed syntax error
on the "b *(sem_post+18) thread 3").
% gcc -Wall -g semtest.c -lpthread -o semtest
% gdb ./semtest
# per http://sourceware.org/gdb/onlinedocs/gdb/Non_002dStop-Mode.html ...
# Enable the async interface.
set target-async 1
# If using the CLI, pagination breaks non-stop.
set pagination off
# Finally, turn it on!
set non-stop on
b waiter
b poster
r
# thread 2 stops in waiter
# thread 3 stops in poster
t 2
b sem_wait thread 2
c
# thread 2 (waiter) stops at the beginning of sem_wait(varsem)
disas sem_post
# look for the "cmpq $0x0,0x8(%rdi)" and put a breakpoint there.
# in older versions it's sem_post+4;
# in newer versions it's sem_post+18.
t 3
# (use sem_post+4, or whatever offset disas shows, in place of sem_post+18)
b *(sem_post+18) thread 3
c
# thread 3 (poster) stops at the breakpoint inside sem_post,
# after incrementing varsem->value (4-byte value, 0 bytes into the object)
# but before looking at varsem->nwaiters (8-byte value, 8 bytes into the object)
t 2
b free thread 2
c
# thread 2 (waiter) sails through the sem_wait without blocking,
# calls sem_destroy(varsem),
# trashes the memory,
# and stops at the beginning of free
t 3
c
# thread 3 (poster) resumes in the middle of sem_post,
# looks at varsem->nwaiters and sees it's nonzero (trash),
# so it makes the FUTEX_WAKE syscall, which returns EINVAL,
# and the program exits with the error message
# "sem_post() in poster: Invalid argument"
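For readers without the attachment, here is a minimal sketch of the usage pattern the session above exercises. It is not the attached semtest.c: the names `waiter`, `poster` and `varsem` are taken from the gdb session, while `run_once`, the 0xff fill and the error handling are assumptions made for illustration. Run by itself it will almost never hit the window; it only shows which calls are involved.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static sem_t *varsem;   /* heap-allocated so the waiter can free it */

static void *poster(void *arg)
{
    (void)arg;
    /* The window described in this report lies inside this call. */
    if (sem_post(varsem) != 0) {
        perror("sem_post() in poster");
        return (void *)1;
    }
    return NULL;
}

static void *waiter(void *arg)
{
    (void)arg;
    if (sem_wait(varsem) != 0) {
        perror("sem_wait() in waiter");
        return (void *)1;   /* semaphore is leaked on this error path */
    }
    /* sem_wait has returned, so the waiter considers the semaphore done. */
    sem_destroy(varsem);
    memset(varsem, 0xff, sizeof *varsem);   /* trash the memory (assumed fill) */
    free(varsem);
    return NULL;
}

/* Hypothetical driver: returns 0 if both threads finished without error. */
int run_once(void)
{
    pthread_t w, p;
    void *wres = NULL, *pres = NULL;

    varsem = malloc(sizeof *varsem);
    if (varsem == NULL)
        return -1;
    if (sem_init(varsem, 0, 0) != 0) {
        free(varsem);
        return -1;
    }
    if (pthread_create(&w, NULL, waiter, NULL) != 0) {
        sem_destroy(varsem);
        free(varsem);
        return -1;
    }
    if (pthread_create(&p, NULL, poster, NULL) != 0) {
        sem_post(varsem);       /* release the waiter so it can clean up */
        pthread_join(w, NULL);
        return -1;
    }
    pthread_join(p, &pres);
    pthread_join(w, &wres);
    return (pres == NULL && wres == NULL) ? 0 : 1;
}
```

Calling `run_once()` from a `main` and linking with -lpthread, as in the gcc line above, gives a program that can be driven through the same breakpoints; a nonzero return corresponds to the "Invalid argument" failure.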
I hope I am not overinflating this bug's severity by calling
it "critical" ("major" would feel more appropriate to me,
but there seems to be no "major" option, only "normal" and "critical").
Although failure is rare,
we are about to be forced to implement our own semaphores
rather than using the posix semaphores because of this bug,
so it does seem rather severe.