[ECOS] Deadlock scenario on close() in io/fileio

Fri Dec 12 00:38:00 GMT 2008

Rutger Hofman <rutger@cs.vu.nl> writes:

> Good morning list,
> 
> I ran into a deadlock with io/fileio on invoking close(fd) in
> io/fileio/current/src/fd.cxx.
> 
> The problem arises because close() grabs fd_lock, then calls
> cyg_fd_free() which in turn calls the file system's fo_close(), which
> will flush any pending data; this flush invokes write() which will
> again try to grab fd_lock. Twice the same lock in one thread ->
> deadlock.
> 
> The stack trace on deadlock clearly shows this:
> 
> close()
>    cyg_fd_free()
>      lock(fdlock)       <======================================
>      fd_close()
>        fp_ucount_dec()
>          cyg_yaffs_fo_close()
>            yaffs_close()
>              yaffs_FlushFile()
>                yaffs_UpdateObjectHeader()
>                  readwritev()
>                    cyg_fp_get()
>                      lock(fdlock)  <===========================
> 
> The code of close() even has a comment in the call to cyg_fd_free()
> that points out that the file's fo_close may be called.

Is YAFFS recursively calling back into its own entrypoints? Or is it
calling a different filesytem? I hope the latter, it should be using
it's own internal interfaces to do anything else.


> I think there are a few possible solutions:
> 1) a full-fledged continuation mechanism where all locks are released
> when a layer is left;
> 2) allow fdlock to be grabbed recursively, as in Java-style
> synchronized locking;

These both look horrible. Recursive mutexes are evil and are to be
resisted at all costs -- they are generally a quick workaround for
something that should be solved with a bit more thought.


> 3) check if this is the only occurrence of this deadlock scenario, and
> check if the lock can be released in fp_ucount_dec without impairing
> atomicity

As it happens I fixed this, and other similar problems, in the
eCosCentric sources some time ago. I'll make a proper contribution
later, but for now try this out and see if it solves your problem:


Index: ChangeLog
===================================================================
RCS file: /cvs/ecos/ecos/packages/io/fileio/current/ChangeLog,v
retrieving revision 1.68
diff -u -5 -r1.68 ChangeLog

--- ChangeLog	1 May 2008 09:31:09 -0000	1.68
+++ ChangeLog	11 Dec 2008 14:25:41 -0000
@@ -1,5 +1,12 @@
+2008-12-11  Nick Garnett  <nickg@ecoscentric.com>
+
+	* src/fd.cxx (fp_ucount_dec, fd_close): Fix locking strategy so
+	that the fdlock is free when filesystem close entry is
+	called. This allows other threads to peform descriptor operations
+	during the close.
+
 2008-04-02 Xinghua Yang <yxinghua@sunnorth.com.cn>
 	   Andrew Lunn <andrew.lunn@ascom.ch>
 
 	* cdl/fileio.cdl: Use CYGPKG_FILEIO_DIRENT_DTYPE to enable/disable
 	d_type field of struct dirent.
Index: src/fd.cxx
===================================================================
RCS file: /cvs/ecos/ecos/packages/io/fileio/current/src/fd.cxx,v
retrieving revision 1.7
diff -u -5 -r1.7 fd.cxx
--- src/fd.cxx	7 Apr 2004 11:18:54 -0000	1.7
+++ src/fd.cxx	11 Dec 2008 14:25:41 -0000
@@ -147,24 +147,45 @@
 // These must all be called with the fdlock already locked.
 
 //--------------------------------------------------------------------------
 // Decrement the use count on a file object and if it goes to zero,
 // close the file and deallocate the file object.
+//
+// A word on locking here: It is necessary for the filesystem
+// fo_close() function to be called with the file lock claimed, but
+// the fdlock released, to permit other threads to perform fd-related
+// operations. The original code here took the file lock and released
+// the fdlock before the call and then locked the fdlock and released
+// the file lock after. The idea was that there was no point at which
+// a lock of some sort was not held. However, if two threads are
+// running through this code simultaneously, this could lead to
+// deadlock, particularly if the filesystem's syncmode specifies fstab
+// or mtab level locking. So the code now unlocks the file lock before
+// reclaiming the fdlock. This leaves a small window where no locks
+// are held, where in theory some other thread could jump in and mess
+// things up. However, this is benign; if the other thread is
+// accessing some other file object there will be no conflict and by
+// definition no other thread can access this file object since we are
+// executing here because no file descriptors point to this file
+// object any longer. Additionally, the file object is only marked
+// free, by zeroing the f_flag field, once the fdlock has been
+// reclaimed.
 
 static int fp_ucount_dec( cyg_file *fp )
 {
     int error = 0;
     if( (--fp->f_ucount) <= 0 )
     {        
         cyg_file_lock( fp, fp->f_syncmode );
+        FILEIO_MUTEX_UNLOCK(fdlock);
         
         error = fp->f_ops->fo_close(fp);
 
         cyg_file_unlock( fp, fp->f_syncmode );
+        FILEIO_MUTEX_LOCK(fdlock);        
             
-        if( error == 0 )
-            fp->f_flag = 0;
+        fp->f_flag = 0;
     }
 
     return error;
 }
 
@@ -178,21 +199,20 @@
     cyg_file *fp;
 
     CYG_ASSERT(((0 <= fd) && (fd<CYGNUM_FILEIO_NFD)), "fd out of range");    
 
     fp = desc[fd];
+    desc[fd] = FD_ALLOCATED;
 
     if( fp != FD_ALLOCATED && fp != NULL)
     {
         // The descriptor is occupied, decrement its usecount and
         // close the file if it goes zero.
 
         error = fp_ucount_dec( fp );
     }
 
-    desc[fd] = FD_ALLOCATED;
-
     return error;
 }
 
 
 //==========================================================================



-- 
Nick Garnett                                      eCos Kernel Architect
eCosCentric Limited    http://www.eCosCentric.com      The eCos experts
Barnwell House, Barnwell Drive, Cambridge, UK.     Tel: +44 1223 245571
Registered in England and Wales:                        Reg No: 4422071


-- 
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss