Bug 18515 - posix_fallocate disastrous fallback behavior is no longer mandated by POSIX and should be fixed
Status: RESOLVED WONTFIX
Alias: None
Product: glibc
Classification: Unclassified
Component: libc
Version: 2.21
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2015-06-11 01:28 UTC by Pat
Modified: 2015-10-02 16:20 UTC

fweimer: security-


Description Pat 2015-06-11 01:28:47 UTC
The fallback behavior of posix_fallocate() is a disaster. The entire point of this function is to improve system performance by allowing applications to preallocate large extents, helping the O/S to minimize on-disk fragmentation.

But the behavior of this call on file systems lacking unwritten extents ultimately results in two writes to every "preallocated" block: Once to "preallocate", and once to populate with actual data. This is the exact opposite of a performance improvement.

Since there is no portable way to tell when glibc will fall back to the performance-killing mode, there is no way to use this function without risking making the performance worse precisely when you were trying to make it better. Conventional wisdom is thus to avoid this call altogether and use a mishmash of platform-specific #ifdef-selected variants instead (see e.g. http://stackoverflow.com/q/14063046/).

Earlier versions of POSIX did not provide any way for this call to fail due to lack of file system support, so glibc's unfortunate fallback behavior was effectively mandated by the spec.

However, the current version of POSIX (http://pubs.opengroup.org/stage7tc1/functions/posix_fallocate.html#tag_16_366_05) has extended the meaning of EINVAL to include "...or the underlying file system does not support this operation". It is therefore now possible for glibc to do the sane thing and simply return EINVAL when the fallocate() system call fails with EOPNOTSUPP.

Best of all, this improvement can be made in the best possible way: By deleting a bunch of code.
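
A minimal sketch of the behavior requested above, assuming Linux and glibc's fallocate() wrapper. The name posix_fallocate_nofallback is hypothetical and is not part of glibc; the sketch maps EOPNOTSUPP, the error Linux fallocate() reports when the file system lacks support, to the EINVAL that current POSIX permits.

```c
#define _GNU_SOURCE             /* for fallocate() */
#include <errno.h>
#include <fcntl.h>

/* Hypothetical function, not a glibc interface.
   Returns 0 on success or an error number, as posix_fallocate does. */
int posix_fallocate_nofallback(int fd, off_t offset, off_t len)
{
    if (offset < 0 || len <= 0)
        return EINVAL;

    /* Mode 0: allocate the range and extend the file size if needed. */
    if (fallocate(fd, 0, offset, len) == 0)
        return 0;

    /* No write-based emulation: report the lack of support instead. */
    if (errno == EOPNOTSUPP)
        return EINVAL;

    return errno;
}
```

A caller that sees EINVAL from such a function would still have to decide for itself whether to continue without preallocation or to reserve space some other way.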
Comment 1 Carlos O'Donell 2015-06-12 02:21:32 UTC
(In reply to Pat from comment #0)
> The fallback behavior of posix_fallocate() is a disaster. The entire point
> of this function is to improve system performance by allowing applications
> to preallocate large extents, helping the O/S to minimize on-disk
> fragmentation.

While it is true that there is a performance aspect to calling posix_fallocate, I have never read anywhere that the purpose of the function was to improve system performance (preallocation).

The standard's description of posix_fallocate talks explicitly about making sure the space you need is present, and that is the key thing here: the backing store is allocated. Therefore subsequent writes don't fail for lack of space, and mmap followed by memory reads and writes doesn't SIGBUS.

Having said that, I agree that the fallback code is racy, and seeing Issue 7 of POSIX provide a way out is good.

Are you not worried that removing the fallback code will simply push the problem into the application, where you will get lots of bespoke attempts to ensure the file has backing store allocated by doing writes? Should glibc provide a posix_fallocate_ng that continues to use the fallback so applications that need the fallback can use it?
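
For illustration, a bespoke write-based attempt of the kind described above might look like the sketch below. This is an assumption about what application code could do, not glibc's actual fallback; the names touch_byte and emulate_allocate_by_writing are hypothetical. It touches one byte per st_blksize step, skips bytes that already hold non-zero data, and does not check offset + len for overflow.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Make sure the block holding byte `pos` has backing store. */
static int touch_byte(int fd, off_t pos)
{
    unsigned char c = 0;
    ssize_t n = pread(fd, &c, 1, pos);
    if (n < 0)
        return errno;
    if (n == 1 && c != 0)
        return 0;   /* non-zero data: assume the block is already allocated */

    /* Racy: another writer may store data here between pread and pwrite. */
    ssize_t w = pwrite(fd, "", 1, pos);
    if (w != 1)
        return w < 0 ? errno : EIO;
    return 0;
}

/* Hypothetical emulation: returns 0 or an error number. */
int emulate_allocate_by_writing(int fd, off_t offset, off_t len)
{
    struct stat st;
    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;

    off_t step = st.st_blksize > 0 ? st.st_blksize : 512;
    off_t end = offset + len;

    for (off_t pos = offset; pos < end; pos += step) {
        int err = touch_byte(fd, pos);
        if (err != 0)
            return err;
    }
    /* The last block may start after the final step; touch it as well.
       This also extends the file to offset + len when it was shorter. */
    return touch_byte(fd, end - 1);
}
```

The window between pread and pwrite is the kind of race mentioned above, and every touched block is written once here and again when the real data arrives.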
Comment 2 Florian Weimer 2015-10-01 12:55:50 UTC
(In reply to Pat from comment #0)
> The fallback behavior of posix_fallocate() is a disaster. The entire point
> of this function is to improve system performance by allowing applications
> to preallocate large extents, helping the O/S to minimize on-disk
> fragmentation.
> 
> But the behavior of this call on file systems lacking unwritten extents
> ultimately results in two writes to every "preallocated" block: Once to
> "preallocate", and once to populate with actual data. This is the exact
> opposite of a performance improvement.

Callers who want to avoid double-writes can use fallocate instead.  This is explained in the glibc manual.
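
As a rough sketch of that approach, assuming Linux and glibc's fallocate() wrapper: the helper name try_preallocate is hypothetical, and the use of FALLOC_FL_KEEP_SIZE (reserve blocks without changing the file size) is one possible choice for a pure performance hint, not something this report prescribes.

```c
#define _GNU_SOURCE             /* for fallocate() and FALLOC_FL_KEEP_SIZE */
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper. Returns 1 if blocks were reserved, 0 if the file
   system cannot preallocate (the caller just proceeds without the hint),
   and -1 with errno set on any other failure. */
int try_preallocate(int fd, off_t len)
{
    int r;
    do {
        /* KEEP_SIZE reserves blocks but leaves st_size untouched. */
        r = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len);
    } while (r != 0 && errno == EINTR);

    if (r == 0)
        return 1;
    if (errno == EOPNOTSUPP)
        return 0;               /* no double-write fallback is attempted */
    return -1;
}
```

Because fallocate() never falls back to writing, a return of 0 here costs nothing beyond the failed system call.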

We analyzed this issue recently, and we decided that we have to preserve the fallback (note that I completely changed my opinion after reviewing how application code uses posix_fallocate right now):

https://sourceware.org/ml/libc-alpha/2015-04/msg00309.html
https://sourceware.org/ml/libc-alpha/2015-05/msg00058.html

So we really can't fix this bug, even though fallback is problematic.
Comment 3 Carlos O'Donell 2015-10-02 03:17:48 UTC
(In reply to Florian Weimer from comment #2)
> (In reply to Pat from comment #0)
> > The fallback behavior of posix_fallocate() is a disaster. The entire point
> > of this function is to improve system performance by allowing applications
> > to preallocate large extents, helping the O/S to minimize on-disk
> > fragmentation.
> > 
> > But the behavior of this call on file systems lacking unwritten extents
> > ultimately results in two writes to every "preallocated" block: Once to
> > "preallocate", and once to populate with actual data. This is the exact
> > opposite of a performance improvement.
> 
> Callers who want to avoid double-writes can use fallocate instead.  This is
> explained in the glibc manual.
> 
> We analyzed this issue recently, and we decided that we have to preserve the
> fallback (note that I completely changed my opinion after reviewing how
> application code uses posix_fallocate right now):
> 
> https://sourceware.org/ml/libc-alpha/2015-04/msg00309.html
> https://sourceware.org/ml/libc-alpha/2015-05/msg00058.html
> 
> So we really can't fix this bug, even though fallback is problematic.

Agreed. I'll submit an update to the Linux kernel man pages to mention this there also.

c.
Comment 4 Pat 2015-10-02 16:20:41 UTC
It is unfortunate that this one interface tries to serve two distinct purposes:

 - reducing fragmentation in high-performance applications
 - guaranteeing future writes do not fail for lack of space

As a consequence of the original POSIX mis-specification -- the ultimate cause of the WONTFIX disposition for this bug -- posix_fallocate() is now unusable in portable code for any purpose at all.

People like me who want high performance can never use posix_fallocate() because glibc will never change its slow fallback behavior.

People who want to guarantee future writes do not fail also cannot use this interface in portable code, because current POSIX allows EINVAL when the underlying file system does not support pre-allocation.

Although I disagree with the glibc maintainers' decision here, I understand and respect it. In any case, the blame belongs to the original POSIX spec.

Thanks.