This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 4/4] Remove broken posix_fallocate, posix_fallocate64 fallback code [BZ#15661]


On 05/05/2015 10:28 PM, Carlos O'Donell wrote:
> On 04/24/2015 08:53 AM, Florian Weimer wrote:
>> The previous implementation could result in silent data corruption,
>> and this has been observed to happen with application code.
> 
> In principle I agree with the removal of all of the fallback fallocate
> code, it simply can't work reliably, and a reliable solution is ridiculously
> expensive (see Rich's comments in the BZ about CAS over all the mmap'd pages).

It's also not covered by the memory model, I think.

> The bug with O_APPEND files is real, and yet another reason to remove the
> fallback code.

We should handle that better at the very least.

We could clear O_APPEND, but only in single-threaded mode; I don't think
it's worth the effort.  Re-opening the descriptor through /proc/self/fd
does not work because closing that descriptor would release POSIX
advisory locks.
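
To illustrate why clearing O_APPEND is only an option in single-threaded
mode, here is a minimal sketch using fcntl.  The helper names clear_append
and restore_flags are hypothetical and not part of glibc.  F_SETFL changes
the open file description, so every descriptor sharing it (via dup or fork)
loses append semantics until the flags are restored.

```c
#include <fcntl.h>

/* Hypothetical helper: clear O_APPEND on FD.  Returns the previous
   file status flags, or -1 on error (errno is set by fcntl).  */
static int
clear_append (int fd)
{
  int flags = fcntl (fd, F_GETFL);
  if (flags == -1)
    return -1;
  /* The change applies to the open file description, so any other
     thread or process writing through it is affected as well.  */
  if ((flags & O_APPEND) != 0
      && fcntl (fd, F_SETFL, flags & ~O_APPEND) == -1)
    return -1;
  return flags;
}

/* Hypothetical helper: put back the flags saved by clear_append.  */
static int
restore_flags (int fd, int flags)
{
  return fcntl (fd, F_SETFL, flags);
}
```

A write issued by another thread between the two calls would land at the
current file offset instead of at the end of the file.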

> What worries me though is that this change could break existing systems
> that relied on this emulation to do something sensible for filesystems
> that don't support fallocate. These binaries could easily be single threaded
> systems with no other process touching their files and writing to filesystems
> that don't support fallocate. If that is a sensible class of users, then we
> need to version the interface, with the old version continuing to call the
> fallback code and the new version not calling the fallback code.

After sleeping on your comments, I actually did my homework.  The gist
is that we cannot remove the fallback, I think, not even with a
compatibility symbol.

Various file systems do not support fallocate.  This includes NFS, where
even the most recent version makes it optional to implement in the server.

SQLite ignores the posix_fallocate return value, but MariaDB does not.
A recompiled MariaDB would suddenly start to fail, and the DBA would
have to disable pre-allocation in the configuration.  If I read the
source correctly, systemd-journald will stop logging, and there is no
knob to turn off fallocate.  Same for libvirt, it will fail to create
backing files for storage devices.
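
For reference, this is roughly what moving the decision into an
application could look like.  It is only a sketch: try_preallocate is a
hypothetical name, and the assumption that a file system without fallocate
support surfaces as EOPNOTSUPP is mine, not something an existing
application is known to rely on.  Note that posix_fallocate returns the
error number directly and does not set errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical application-side wrapper.  Returns 0 if the caller may
   continue, otherwise an error number.  *RESERVED is set to 1 only if
   the space was actually preallocated.  */
static int
try_preallocate (int fd, off_t offset, off_t len, int *reserved)
{
  *reserved = 0;
  int err = posix_fallocate (fd, offset, len);
  if (err == 0)
    {
      *reserved = 1;
      return 0;
    }
  /* Assumption: "not supported by this file system" is reported as
     EOPNOTSUPP.  Treat it as non-fatal and continue without
     preallocation.  */
  if (err == EOPNOTSUPP)
    return 0;
  return err;
}
```

An application using this pattern keeps working on NFS, but gives up the
guarantee that later writes cannot fail with ENOSPC.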

Both MariaDB and libvirt are often run on NFS storage, so a glibc change
that removes the fallback would actually affect them.  For the code we
ship, we can move the fallback to the applications, but there is no good
way to make sure that happens with third-party applications.  I do not
believe the compatibility symbol mechanism is a good alternative because
the breakage will be file-system-dependent and may not be noticed during
testing.  (I'm generally skeptical of using compatibility symbols this way.)

Maybe we could remove the write loop and perform only an ftruncate call
which (hopefully) increases the file size.  This would take care of the
O_APPEND issue and remove most of the races.  Using posix_fallocate to
avoid ENOSPC later would not work, but with thin provisioning,
deduplicating storage and compression going around these days, I don't
think writing zero blocks has that effect in practice anyway
(particularly not on NFS).  I'll ask around.
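
A minimal sketch of the ftruncate-only idea, under stated assumptions:
fallback_extend_only is a hypothetical name, this is not the glibc code,
and __builtin_add_overflow is a GCC/Clang builtin used here only to keep
the offset arithmetic well defined.  ftruncate is not affected by
O_APPEND, and no data blocks are written.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fallback: make sure the file is at least OFFSET + LEN
   bytes long without writing any data.  Returns 0 or an error number,
   following the posix_fallocate convention.  */
static int
fallback_extend_only (int fd, off_t offset, off_t len)
{
  struct stat st;
  off_t end;

  if (offset < 0 || len <= 0)
    return EINVAL;
  if (__builtin_add_overflow (offset, len, &end))
    return EFBIG;
  if (fstat (fd, &st) != 0)
    return errno;
  /* Never shrink the file.  */
  if (st.st_size >= end)
    return 0;
  /* A race remains: if another process extends the file beyond END
     between fstat and ftruncate, this call truncates it.  */
  if (ftruncate (fd, end) != 0)
    return errno;
  return 0;
}
```

This only increases the reported file size; it reserves no disk space, so
a later write can still fail with ENOSPC.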

-- 
Florian Weimer / Red Hat Product Security

