This is the mail archive of the libc-hacker@sources.redhat.com mailing list for the glibc project.

Note that libc-hacker is a closed list. You may look at the archives of this list, but subscription and posting are not open.



libio mmap changes


I have just checked in some changes to libio that I think address all of
the concerns recently raised here.  There are two changes to the mmap plan
that are actually independent.

First, the decision to mmap is no longer made at open time, but instead at
the time of the first read.  The benefits of this are:

* Save the overhead of stat+mmap if you fopen the file and never read it.
* You no longer see a lag if you fopen, wait while the file changes,
  and then read (not that you could ever have relied on it before, but
  now it behaves the same with or without mmap).
* No spurious atime update (on Linux, caused by the mmap call) at open time.
* The atime is set on first read under the full range of conforming mmap
  behaviors without any extra read syscall (since the mmap is not done
  until first read, when it's immediately followed by the actual access).

I implemented this in what seemed the clean way, which was to have a
separate "maybe_mmap" jump table that is used when a possibly-mmap'able
file is first opened.  The functions in this jump table used for reading
perform the stat and mmap attempt, switch the jump tables to either the
mmap ones or the plain file ones depending on whether mmap could be used,
and then punt to the chosen flavor's real read routines.
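
To make the table switching concrete, here is a minimal sketch of the idea.
It is not the libio code: struct stream, struct jump_table, the can_mmap
flag (standing in for "the stat and mmap attempt succeeded"), and all the
function names are hypothetical, and the real jump tables have many more
entries than the single underflow slot shown.

```c
#include <assert.h>
#include <stddef.h>

struct stream;

/* Hypothetical per-stream operation table with a single read entry. */
struct jump_table {
    const char *name;
    int (*underflow)(struct stream *);
};

struct stream {
    const struct jump_table *ops;
    int can_mmap;   /* stand-in for "stat + mmap succeeded" */
};

/* Placeholder "real" read routines for the two flavors. */
static int mmap_underflow(struct stream *s) { (void)s; return 'm'; }
static int file_underflow(struct stream *s) { (void)s; return 'f'; }
static int maybe_mmap_underflow(struct stream *s);

static const struct jump_table mmap_ops = { "mmap", mmap_underflow };
static const struct jump_table file_ops = { "file", file_underflow };
static const struct jump_table maybe_mmap_ops =
    { "maybe_mmap", maybe_mmap_underflow };

/* First read: decide once, switch tables, then punt to the chosen flavor. */
static int maybe_mmap_underflow(struct stream *s)
{
    s->ops = s->can_mmap ? &mmap_ops : &file_ops;
    assert(s->ops != &maybe_mmap_ops);  /* the decision is never revisited */
    return s->ops->underflow(s);
}

/* Callers always dispatch through whatever table is installed. */
static int stream_read(struct stream *s)
{
    return s->ops->underflow(s);
}
```

A stream starts with ops pointing at maybe_mmap_ops; after the first
stream_read call it points at mmap_ops or file_ops, so later reads pay
nothing for the decision.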


The second set of changes is to the behavior on fflush and reaching EOF.
I made it quite generous, but with what I think is a small amount of
overhead.  Basically, any time you try to read more than stdio thinks is
there, it does a stat to re-check the file size and remaps the file if
necessary.  After an fflush, it does that check on the next read
attempt.  If the file grows too big, or if mmap stops working for some
reason, it will quit using mmap and switch to regular file methods
(though it will never switch back after that, should the file get
smaller or whatever). 
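
The re-check can be sketched roughly as follows.  This is an illustration
under my own assumptions, not the code that was checked in: struct
mapped_file, recheck_size, and demo_growth are hypothetical names, the whole
file is mapped from offset 0, and the "file grows too big" limit is left out.
A return of -1 is where a real implementation would switch to the plain
file methods.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct mapped_file {
    int fd;
    void *base;     /* NULL when nothing is mapped */
    size_t len;     /* size the mapping was made with */
};

/* Returns 1 if the file size changed and the mapping was redone,
   0 if the size is unchanged, -1 if the caller should stop using mmap. */
static int recheck_size(struct mapped_file *mf)
{
    struct stat st;

    assert(mf->fd >= 0);
    if (fstat(mf->fd, &st) != 0 || !S_ISREG(st.st_mode))
        return -1;
    if ((size_t)st.st_size == mf->len)
        return 0;

    if (mf->base != NULL)
        munmap(mf->base, mf->len);
    mf->base = NULL;
    mf->len = 0;
    if (st.st_size == 0)
        return 1;               /* empty file: nothing to map */

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED,
                   mf->fd, 0);
    if (p == MAP_FAILED)
        return -1;
    mf->base = p;
    mf->len = (size_t)st.st_size;
    return 1;
}

/* Demo: map a 5-byte temp file, append 3 bytes, re-check.
   Returns the final mapped length, or 0 if any step failed. */
static size_t demo_growth(void)
{
    char path[] = "/tmp/remap-XXXXXX";
    struct mapped_file mf = { -1, NULL, 0 };
    size_t result = 0;

    mf.fd = mkstemp(path);
    if (mf.fd < 0)
        return 0;
    unlink(path);
    if (write(mf.fd, "hello", 5) == 5 && recheck_size(&mf) == 1
        && mf.len == 5
        && write(mf.fd, "abc", 3) == 3 && recheck_size(&mf) == 1
        && memcmp(mf.base, "helloabc", 8) == 0)
        result = mf.len;
    if (mf.base != NULL)
        munmap(mf.base, mf.len);
    close(mf.fd);
    return result;
}
```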

Given the POSIX.1 8.2.3 rules for synchronizing the file position, and
the "underlying functions" clause with its monumentally vague phrase
"certain traits", and a strong urge to fly, one can make the tenuous
leap to analogs of the read vs write guarantees of what's visible when
after appropriate stdio synchronization; i.e., that once you're
guaranteed the file positions are synchronized, you're guaranteed that
the next stdio read will behave as read does vis a vis a prior write.
Now, I am not going to try to argue that this is what the standard
requires.  But it definitely describes the behavior that all stdio
implementations have had until now.

My implementation is even a bit more generous than that, in that the
8.2.3 rules mention an implicit synchronization point when feof() is
true while I also provide one when you have read exactly all of the file
and not incurred a stdio EOF condition.  This faithfully replicates the
observed behavior of doing just that when mmap is not used (i.e. reading
exactly the full contents but not hitting an EOF condition, then having
someone else extend the file, then continuing the read without fflush or
anything else first).

I have added a few test programs that exercise some of these cases.  I
don't claim these programs strictly conform to POSIX, but they do
demonstrate the assumptions that the reasoning above leads to.  I think
it is important to realize when thinking about these cases that the
modifications done through the separate file descriptor could just as
well be done by some unrelated process on the machine and the results,
especially in the cases using fflush or having hit EOF, show what people
certainly expect to happen when they try to read random files on the system.
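
For readers without the test sources at hand, the "read exactly
everything, then someone extends the file" case looks roughly like the
sketch below.  This is not one of the checked-in tests; the function name
and temp-file path are made up, and the result shown is the behavior
observed with a read-based stdio, not something I claim POSIX guarantees.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Reads exactly the 5 bytes of a file through stdio, extends the file
   through a separate descriptor, then reads once more without any
   fflush.  Returns the character obtained, or EOF. */
static int demo_read_after_extend(void)
{
    char path[] = "/tmp/eofsync-XXXXXX";
    char buf[5];
    int c = EOF;
    int fd = mkstemp(path);     /* plays the part of the other writer */
    FILE *fp;

    if (fd < 0)
        return EOF;
    fp = fopen(path, "r");
    if (fp != NULL && write(fd, "hello", 5) == 5
        && fread(buf, 1, sizeof buf, fp) == sizeof buf) {
        assert(memcmp(buf, "hello", sizeof buf) == 0);
        /* All of the contents were consumed, but no read came up
           short, so no stdio EOF condition has been incurred yet. */
        if (!feof(fp) && write(fd, "X", 1) == 1)
            c = fgetc(fp);
    }
    if (fp != NULL)
        fclose(fp);
    close(fd);
    unlink(path);
    return c;
}
```

With a read-based stdio the final fgetc goes back to the kernel and
returns the newly appended 'X'; the point of the extra synchronization
described above is that the mmap flavor gives the same answer.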

I am still concerned by cases involving truncation (or IO error, but
that is sufficiently rare to ignore), such as what happens in the
tst-mmap2-eofsync program when you remove the final fflush call.  Here
is an example equivalent to a simple reader of some file when another
user might possibly come along and overwrite it.  The behavior without
mmap is to return stale data instead of EOF.  The behavior now is to
crash in fgetc with SIGBUS.  The former state of affairs is what
everyone always presumed when they used stdio to read files without
outside synchronization: no guarantees about synchronization, but you
will get either data that was once at that offset in the file or you
will get EOF and there are no other outcomes.  The new state of affairs
might be rather distressing.  Should cat or less or whatever program
that uses stdio to read a file have the possibility to crash if it
happens to be trying to read the wrong part at the same time another
user is truncating the file?  In the absence of MAP_COPY (or a non-POSIX
filesystem that is only written with atomic-supersede semantics), there
is no way to avoid the possibility of this fault signal.  Myself, I
would be perfectly happy to have libc check the signal for faults in its
mapped files, turn them into a C++-style exception, and have every data
access (including the getc macro) prepared to handle the exception.  But
this is not something we can make happen today, and methods of fault
handling other than C++-style exception annotations are too costly
(since they have overhead every time instead of just the once in a blue
moon when a fault actually happens).  Moreover, old binaries using the
getc macro will never have fault handling for its buffer accesses.

Incidentally, it occurs to me that we should probably tune the heuristic
with some performance tests.  I imagine that for files smaller than a
page or two, doing stat + mmap + munmap might be worse than the normal
case where it's a single read call with a small amount of data copying
(vs MMU twiddling overhead) and you're done.

