This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
- From: Florian Weimer <fweimer at redhat dot com>
- To: Roland McGrath <roland at hack dot frob dot com>, Rich Felker <dalias at libc dot org>
- Cc: Paul Eggert <eggert at cs dot ucla dot edu>, libc-alpha at sourceware dot org, "Carlos O'Donell" <carlos at redhat dot com>, Mark Wielaard <mjw at redhat dot com>
- Date: Mon, 18 May 2015 12:43:06 +0200
- Subject: Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
- Authentication-results: sourceware.org; auth=none
- References: <20150424134516 dot 6795441F484D0 at oldenburg dot str dot redhat dot com> <554927F9 dot 7080509 at redhat dot com> <5549C097 dot 50505 at redhat dot com> <554A9A46 dot 2050806 at cs dot ucla dot edu> <20150506233055 dot GQ17573 at brightrain dot aerifal dot cx> <20150507181942 dot E71202C3B93 at topped-with-meat dot com> <554BB77D dot 1040805 at redhat dot com>
On 05/07/2015 09:05 PM, Florian Weimer wrote:
> On 05/07/2015 08:19 PM, Roland McGrath wrote:
>>> If I'm not mistaken ftruncate could still reduce the file size if it
>>> races with another operation that would extend the file. This is also
>>> a data loss bug.
>>
>> I concur.
>
> It happens with length == 0. We could error out with EINVAL instead of
> calling ftruncate.
>
> Daniel Berrange pointed me to these bugs:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=17322
> https://bugzilla.redhat.com/show_bug.cgi?id=1140250
> https://bugzilla.redhat.com/show_bug.cgi?id=1077068
Another very recent example is here:
https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
> This suggests that people actually rely on the current allocation
> behavior. Combined with my previous analysis that applications will
> start to fail if we remove the fallback and return EINVAL, I now think
> we need to keep the allocation loop.
This is the patch I currently have. It fixes the avoidable bugs. I
still think we are in a bad situation here, that even a compatibility
symbol cannot fix.
--
Florian Weimer / Red Hat Product Security
From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001
Message-Id: <f25f46c0e0223d9f6e9e8ead27a37caa34f33631.1431945166.git.fweimer@redhat.com>
From: Florian Weimer <fweimer@redhat.com>
Date: Mon, 18 May 2015 11:32:44 +0100
Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ
#15661]
To: libc-alpha@sourceware.org
Handle signed integer overflow correctly. Detect and reject O_APPEND.
Document drawbacks of emulation.
This does not completely address bug 15661, but improves the situation
somewhat.
---
ChangeLog | 9 ++++
manual/filesys.texi | 94 +++++++++++++++++++++++++++++++++++++++
sysdeps/posix/posix_fallocate.c | 67 ++++++++++++++++++++--------
sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++--------
4 files changed, 199 insertions(+), 38 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 4de8a25..603847b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,12 @@
+2015-05-18 Florian Weimer <fweimer@redhat.com>
+
+ [BZ #15661]
+ * sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64):
+ Check for overflow properly. Check for O_APPEND. Ignore large
+ file system block sizes. Add comments about problems.
+ * sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise.
+ * manual/filesys.texi (Storage Allocation): New node.
+
2015-05-18 Arjun Shankar <arjun.is@lostca.se>
* include/stdio.h: Define __need_wint_t.
diff --git a/manual/filesys.texi b/manual/filesys.texi
index 7d55b43..0f2e3dc 100644
--- a/manual/filesys.texi
+++ b/manual/filesys.texi
@@ -1723,6 +1723,7 @@ modify the attributes of a file.
access a file.
* File Times:: About the time attributes of a file.
* File Size:: Manually changing the size of a file.
+* Storage Allocation:: Allocate backing storage for files.
@end menu
@node Attribute Meanings
@@ -3233,6 +3234,99 @@ is a requirement of @code{mmap}. The program has to keep track of the
real size, and when it has finished a final @code{ftruncate} call should
set the real size of the file.
+@node Storage Allocation
+@subsection Storage Allocation
+@cindex allocating file storage
+@cindex file allocation
+@cindex storage allocating
+
+@cindex file fragmentation
+@cindex fragmentation of files
+@cindex sparse files
+@cindex files, sparse
+Most file systems support allocating large files in a non-contiguous
+fashion: the file is split into @emph{fragments} which are allocated
+sequentially, but the fragments themselves can be scattered across the
+disk. File systems generally try to avoid such fragmentation because it
+decreases performance, but if a file gradually increases in size, there
+might be no other option than to fragment it. In addition, many file
+systems support @emph{sparse files} with @emph{holes}: regions of null
+bytes for which no backing storage has been allocated by the file
+system. When the holes are finally overwritten with data, fragmentation
+can occur as well.
+
+Explicit allocation of storage for yet-unwritten parts of the file can
+help the system to avoid fragmentation. Additionally, if storage
+pre-allocation fails, it is possible to report the out-of-disk error
+early, often without filling up the entire disk. However, due to
+deduplication, copy-on-write semantics, and file compression, such
+pre-allocation may not reliably prevent the out-of-disk-space error from
+occurring later. Checking for write errors is still required, and
+writes to memory-mapped regions created with @code{mmap} can still
+result in @code{SIGBUS}.
+
+@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length})
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+@c If the file system does not support allocation,
+@c @code{posix_fallocate} has a race with file extension (if
+@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if
+@c @var{length} is positive).
+
+Allocate backing store for the region of @var{length} bytes starting at
+byte @var{offset} in the file for the descriptor @var{fd}. The file
+length is increased to @samp{@var{length} + @var{offset}} if necessary.
+
+@var{fd} must be a regular file opened for writing, or @code{EBADF} is
+returned. If there is insufficient disk space to fulfill the allocation
+request, @code{ENOSPC} is returned.
+
+@strong{Note:} If @code{fallocate} is not available (because the file
+system does not support it), @code{posix_fallocate} is emulated, which
+has the following drawbacks:
+
+@itemize @bullet
+@item
+It is very inefficient because all file system blocks in the requested
+range need to be examined (even if they have been allocated before) and
+potentially rewritten. In contrast, with proper @code{fallocate}
+support (see below), the file system can examine the internal file
+allocation data structures and eliminate holes directly, maybe even
+using unwritten extents (which are pre-allocated but uninitialized on
+disk).
+
+@item
+There is a race condition if another thread or process modifies the
+underlying file in the to-be-allocated area. Non-null bytes could be
+overwritten with null bytes.
+
+@item
+If @var{fd} has been opened with the @code{O_APPEND} flag, the function
+will fail with an @code{errno} value of @code{EBADF}.
+
+@item
+If @var{length} is zero, @code{ftruncate} is used to increase the file
+size as requested, without allocating file system blocks. There is a
+race condition which means that @code{ftruncate} can accidentally
+truncate the file if it has been extended concurrently.
+@end itemize
+
+On Linux, if an application does not benefit from emulation or if the
+emulation is harmful due to its inherent race conditions, the
+application can use the Linux-specific @code{fallocate} function, with a
+zero flag argument. For the @code{fallocate} function, @theglibc{} does
+not perform allocation emulation if the file system does not support
+allocation. Instead, an @code{EOPNOTSUPP} is returned to the caller.
+
+@end deftypefun
+
+@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset})
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+This function is a variant of @code{posix_fallocate64} which accepts
+64-bit file offsets on all platforms.
+
+@end deftypefun
+
@node Making Special Files
@section Making Special Files
@cindex creating special files
diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c
index d15d603..e7fe201 100644
--- a/sysdeps/posix/posix_fallocate.c
+++ b/sysdeps/posix/posix_fallocate.c
@@ -18,26 +18,36 @@
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
+#include <stdint.h>
+#include <sys/fcntl.h>
#include <sys/stat.h>
#include <sys/statfs.h>
-/* Reserve storage for the data of the file associated with FD. */
+/* Reserve storage for the data of the file associated with FD. This
+ emulation is far from perfect, but the kernel cannot do not much
+ better for network file systems, either. */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
struct stat64 st;
- struct statfs f;
- /* `off_t' is a signed type. Therefore we can determine whether
- OFFSET + LEN is too large if it is a negative value. */
if (offset < 0 || len < 0)
return EINVAL;
- if (offset + len < 0)
+
+ /* Perform overflow check. The outer cast relies on a GCC
+ extension. */
+ if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
return EFBIG;
- /* First thing we have to make sure is that this is really a regular
- file. */
+ /* pwrite below will not do the right thing in O_APPEND mode. */
+ {
+ int flags = __fcntl (fd, F_GETFL, 0);
+ if (flags < 0 || (flags & O_APPEND) != 0)
+ return EBADF;
+ }
+
+ /* We have to make sure that this is really a regular file. */
if (__fxstat64 (_STAT_VER, fd, &st) != 0)
return EBADF;
if (S_ISFIFO (st.st_mode))
@@ -47,6 +57,8 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
if (len == 0)
{
+ /* This is racy, but there is no good way to satisfy a
+ zero-length allocation request. */
if (st.st_size < offset)
{
int ret = __ftruncate (fd, offset);
@@ -58,19 +70,36 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
return 0;
}
- /* We have to know the block size of the filesystem to get at least some
- sort of performance. */
- if (__fstatfs (fd, &f) != 0)
- return errno;
-
- /* Try to play safe. */
- if (f.f_bsize == 0)
- f.f_bsize = 512;
-
- /* Write something to every block. */
- for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
+ /* Minimize data transfer for network file systems, by issuing
+ single-byte write requests spaced by the file system block size.
+ (Most local file systems have fallocate support, so this fallback
+ code is not used there.) */
+
+ unsigned increment;
+ {
+ struct statfs64 f;
+
+ if (__fstatfs64 (fd, &f) != 0)
+ return errno;
+ if (f.f_bsize == 0)
+ increment = 512;
+ else if (f.f_bsize < 4096)
+ increment = f.f_bsize;
+ else
+ /* NFS does not propagate the block size of the underlying
+ storage and may report a much larger value which would still
+ leave holes after the loop below, so we cap the increment at
+ 4096. */
+ increment = 4096;
+ }
+
+ /* Write a null byte to every block. This is racy; we currently
+ lack a better option. Compare-and-swap against a file mapping
+ might additional local races, but requires interposition of a
+ signal handler to catch SIGBUS. */
+ for (offset += (len - 1) % increment; len > 0; offset += increment)
{
- len -= f.f_bsize;
+ len -= increment;
if (offset < st.st_size)
{
diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c
index b845df7..ee32679 100644
--- a/sysdeps/posix/posix_fallocate64.c
+++ b/sysdeps/posix/posix_fallocate64.c
@@ -18,26 +18,36 @@
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
+#include <stdint.h>
+#include <sys/fcntl.h>
#include <sys/stat.h>
#include <sys/statfs.h>
-/* Reserve storage for the data of the file associated with FD. */
+/* Reserve storage for the data of the file associated with FD. This
+ emulation is far from perfect, but the kernel cannot do not much
+ better for network file systems, either. */
int
__posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
{
struct stat64 st;
- struct statfs64 f;
- /* `off64_t' is a signed type. Therefore we can determine whether
- OFFSET + LEN is too large if it is a negative value. */
if (offset < 0 || len < 0)
return EINVAL;
- if (offset + len < 0)
+
+ /* Perform overflow check. The outer cast relies on a GCC
+ extension. */
+ if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
return EFBIG;
- /* First thing we have to make sure is that this is really a regular
- file. */
+ /* pwrite64 below will not do the right thing in O_APPEND mode. */
+ {
+ int flags = __fcntl (fd, F_GETFL, 0);
+ if (flags < 0 || (flags & O_APPEND) != 0)
+ return EBADF;
+ }
+
+ /* We have to make sure that this is really a regular file. */
if (__fxstat64 (_STAT_VER, fd, &st) != 0)
return EBADF;
if (S_ISFIFO (st.st_mode))
@@ -47,6 +57,8 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
if (len == 0)
{
+ /* This is racy, but there is no good way to satisfy a
+ zero-length allocation request. */
if (st.st_size < offset)
{
int ret = __ftruncate64 (fd, offset);
@@ -58,19 +70,36 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
return 0;
}
- /* We have to know the block size of the filesystem to get at least some
- sort of performance. */
- if (__fstatfs64 (fd, &f) != 0)
- return errno;
-
- /* Try to play safe. */
- if (f.f_bsize == 0)
- f.f_bsize = 512;
-
- /* Write something to every block. */
- for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
+ /* Minimize data transfer for network file systems, by issuing
+ single-byte write requests spaced by the file system block size.
+ (Most local file systems have fallocate support, so this fallback
+ code is not used there.) */
+
+ unsigned increment;
+ {
+ struct statfs64 f;
+
+ if (__fstatfs64 (fd, &f) != 0)
+ return errno;
+ if (f.f_bsize == 0)
+ increment = 512;
+ else if (f.f_bsize < 4096)
+ increment = f.f_bsize;
+ else
+ /* NFS clients do not propagate the block size of the underlying
+ storage and may report a much larger value which would still
+ leave holes after the loop below, so we cap the increment at
+ 4096. */
+ increment = 4096;
+ }
+
+ /* Write a null byte to every block. This is racy; we currently
+ lack a better option. Compare-and-swap against a file mapping
+ might address local races, but requires interposition of a signal
+ handler to catch SIGBUS. */
+ for (offset += (len - 1) % increment; len > 0; offset += increment)
{
- len -= f.f_bsize;
+ len -= increment;
if (offset < st.st_size)
{
--
2.1.0