Created attachment 6778 [details]
demonstration of the bug
Per POSIX, close() is valid on a file descriptor with pending AIO operations:
"When there is an outstanding cancelable asynchronous I/O operation against fildes when close() is called, that I/O operation may be canceled. An I/O operation that is not canceled completes as if the close() operation had not yet occurred. All operations that are not canceled shall complete as if the close() blocked until the operations completed. The close() operation itself need not block awaiting such I/O completion. Whether any I/O operation is canceled, and which I/O operation may be canceled upon close(), is implementation-defined."
My reading of this text is that you cannot assume anything about the integrity of data pending for write on a given file descriptor if you close that file descriptor, but that the behavior of calling close in this situation is not undefined, and certainly is not permitted to corrupt other files.
However, as the attached test program shows, glibc's AIO implementation DOES corrupt other files when close is called on a file descriptor with pending AIO operations and the file descriptor number gets reused. I've used pipes to control the timing in this example (and sometimes it still requires a few tries to hit the bug), but it could happen just as well with regular files.
As long as AIO is being implemented with threads on top of regular POSIX file operations, rather than via direct kernel support, I believe one of the following two solutions must be used:
1. Modify close() to attempt to cancel any pending AIO requests and block until they have all successfully completed or cancelled. This is very difficult, since close() is required to be async-signal-safe.
2. Have the AIO implementation duplicate any file descriptor it's going to work with, using fcntl with F_DUPFD_CLOEXEC, and always use the duplicate. In this case, close() must still be responsible for dissociating the file descriptor number from its AIO work queue so that AIO requests on a new file descriptor don't get appended to the old work queue but instead result in a new one. This still sounds difficult to do in a way that's async-signal-safe, however.
I just noticed that in my test program, I accidentally passed the same AIO control buffer to aio_write twice; this is invalid. Using two copies of the control buffer does not, however, fix the problem. I can submit a revised test case if desired.
Third possibility is generate unique identifier for each open and cancel when identifier mismatches.
Open is not the only way a file descriptor can be obtained; generating the unique id at each fd-creation seems almost impossible. For example, an fd can be inherited across exec, or could be passed over a unix socket or fifo, or could come to exist from a multitude of syscalls.
Also, I think my F_DUPFD_CLOEXEC approach is flawed in that it breaks fcntl locking. Closing the duplicate file descriptor will release any locks on the original file descriptor due to the idiotic design of fcntl locking...
I do see a possible new approach to AIO: have AIO threads call unshare(CLONE_FILES) so that calling close from elsewhere in the program cannot affect the pending AIO operations.