Bug 5227

Summary: O_ATOMICLOOKUP vs O_CLOEXEC problems with RHEL4 and RHEL5 kernels
Product: glibc Reporter: John Salmon <john>
Component: libc Assignee: Ulrich Drepper <drepper.fsp>
Severity: normal CC: Axel.Thimm, glibc-bugs
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:
Attachments: a new test for opendir

Description John Salmon 2007-10-28 01:21:38 UTC
First, I'd like to point out that O_CLOEXEC is not mentioned in the glibc-2.7
manuals.  Nor are the fopen mode flags 'm', 'c' or 'e'.  Should the
manuals be updated to reflect these new features?

I'm experiencing problems with O_CLOEXEC and opendir in glibc-2.7.
There are probably also problems with other glibc functions that use
O_CLOEXEC, but I haven't explored them in detail.

I have two machines with different kernels:
salmonj@drdws032.nyc$ cat /proc/version
Linux version (root@drdwsfe0.nyc.deshaw.com) (gcc version 3.4.4
20050721 (Red Hat 3.4.4-2)) #1 SMP Tue Feb 27 15:02:11 EST 2007

salmonj@drda0047.nyc$ cat /proc/version
Linux version 2.6.9-5.ELsmp (bhcompile@thor.perf.redhat.com) (gcc version 3.4.3
20041212 (Red Hat 3.4.3-9.EL4)) #1 SMP Wed Jan 5 19:29:47 EST 2005

On the first, open silently ignores the O_CLOEXEC bit.  I.e., it
doesn't "work", but it doesn't cause any problems either.  It looks
like glibc-2.7's use of O_CLOEXEC has been coded with this
possibility in mind.  I configured and built glibc-2.7 on this machine.

On the second, calling open with the O_CLOEXEC bit set causes open to
return -1 with errno=530.  This can cause unexpected failures wherever
glibc uses O_CLOEXEC.  For all I know, this is a kernel bug.  Even so,
the glibc maintainers should make a conscious decision about whether
glibc should cleanly work around it or not.

The one case I looked at in detail is opendir.  I believe it fails
because of this code:

#ifdef O_CLOEXEC
  flags |= O_CLOEXEC;
#endif
  int fd = open_not_cancel_2 (name, flags);
  if (__builtin_expect (fd, 0) < 0)
    return NULL;

Here's some code that demonstrates the problem:

salmonj@drda0047.nyc$ cat tstopendir.c
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
#include <fcntl.h>

int main(int argc, char **argv){
    DIR *d;
    int fd;
    d = opendir(argv[1]);
    if(d == 0)
        perror("opendir");
    else
        printf("opendir(\"%s\") OK\n", argv[1]);

    return 0;
}
salmonj@drda0047.nyc$ xmk tstopendir
cc -std=c99 -I/proj/desres/root/Linux/x86_64/glibc/2.7-01/include
tstopendir.c   -o tstopendir
salmonj@drda0047.nyc$ ls -ld junk
drwxrwsr-x  2 salmonj salmonj 4096 Oct 27 20:46 junk/
salmonj@drda0047.nyc$ ./tstopendir junk
opendir: Unknown error 530
### This is very strange.  "." and ".." don't have the same problem:
salmonj@drda0047.nyc$ ./tstopendir .
opendir(".") OK

# And here's evidence that O_CLOEXEC is the underlying problem.
salmonj@drda0047.nyc$ cat tstcloexec.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv){
    int fd = open(argv[0], O_RDONLY|O_CLOEXEC);
    if(fd < 0)
        perror("open");
    else
        printf("open(%s, O_RDONLY|O_CLOEXEC) OK\n", argv[0]);
    return 0;
}
salmonj@drda0047.nyc$ xmk tstcloexec
cc -std=c99 -I/proj/desres/root/Linux/x86_64/glibc/2.7-01/include
tstcloexec.c   -o tstcloexec
salmonj@drda0047.nyc$ tstcloexec
open: Unknown error 530

John Salmon
Comment 1 Ulrich Drepper 2007-10-28 04:48:23 UTC
There will be no work-around to kernel bugs.
Comment 2 John Salmon 2007-10-29 03:26:24 UTC
Created attachment 2064 [details]
a new test for opendir

Fails when run with RHEL kernel.  Passes when run with 2.6.18 kernel.
Comment 3 John Salmon 2007-10-29 03:28:06 UTC
Fair enough - glibc doesn't work around kernel bugs.

How about widely deployed kernel "enhancements"?  It turns out that the problem
is that the machine on which opendir fails has a RedHat EL kernel with TUX
enhancements, and that kernel (and presumably thousands like it) was compiled
with the following in fcntl.h:

#define O_NOATIME	01000000
#define O_ATOMICLOOKUP	02000000 /* TUX */

So my kernel thinks that the 02000000 bit of OFLAGS is a request for an
ATOMICLOOKUP, but glibc (and the mainline Linux kernels since 2.6.something)
thinks that it's a request to set the close-on-exec bit.  Wonderful.

I can understand if the glibc maintenance team simply refuses to deal with
non-standard kernels.  You have to draw the line somewhere.  

But at least let's have a test so people who run 'make check' won't think
they've got a working library when they don't.  I've attached 'opendir-tst2.c'
that fails on my RHEL system but that works fine on my 2.6.18 system.  Note that
the first opendir succeeds - we just created tmpXXXX, so it's very likely in the
dentry_cache, and hence O_ATOMICLOOKUP has no problem.  Trying to
opendir("tmpXXX/doesnotexist") on the other hand is a miss in the dentry cache
and fails the ATOMICLOOKUP test, leading to a non-standard errno which we can
test for. Lucky for us that the TUX patches set errno to the crazy value of 530
when the dentry cache lookup fails.  If not for that, I wouldn't know how to
reliably reproduce the problem.
Comment 4 Ulrich Drepper 2007-10-29 04:25:58 UTC
Stop reopening.  This is no bug.  There will be no support for nonstandard kernels.
Comment 5 John Salmon 2007-10-29 04:41:05 UTC
"glibc does not support nonstandard kernels".

Is that a reason to ignore a straightforward test, modeled after the other tests
in dirent/, that passes when glibc is working correctly and that fails on some
systems which happen to be unsupported?
Comment 6 Axel Thimm 2007-12-28 01:53:51 UTC
Note that this is still the case with RHEL5 as well.

I agree that vendors undefining or redefining constants is a very bad thing, but
at this point this kind of setup is really widely deployed. Given that Ulrich is
even working for that vendor, could there be some solution/workaround for
RHEL4/RHEL5 users? Maybe something like the *ASSUME_KERNEL environment setting?

A consequence of this bug is that all build systems based on RHEL5 carrying
Fedora chroots or for that matter any glibc 2.7 system are randomly breaking
with unknown error 530 (actually I don't understand why this is random and not
always, but that's probably another story). Other systems seem to suffer in a
similar way and browsing through google's "unknown error 530" hits one sees that
no user is even close to suspecting a kernel/glibc ABI incompatibility.

Is there any way to keep those Fedora 8/9 chroots running on a RHEL5 kernel? I'm
reopening not to pin that as a glibc bug, but as a request for a workaround or
advice for action to take. People desperately googling for "unknown error 530"
(like I did) will eventually find this report and would like to see what they
can do to fix it. Help! :)

I also filed this bug against the vendor as he should also take action to fix
these problems with the next kernel release (which according to the vendor's
schedule would happen at the earliest three months after a submission):


Thanks for any advice in advance!
Comment 7 Ulrich Drepper 2007-12-28 03:45:07 UTC
Stop reopening the bug.  This is entirely a kernel problem.  Just use correct
kernels.
Comment 8 Axel Thimm 2007-12-28 09:18:37 UTC
(In reply to comment #7)
> Stop reopening the bug.  This is entirely a kernel problem.  Just use correct
> kernels.

So the recommendation is to not use RHEL???
Comment 9 Jakub Jelinek 2007-12-28 09:35:26 UTC
Using correct kernels doesn't imply that.  You can get fixed kernels for RHEL5
e.g. from http://people.redhat.com/dzickus/el5/62.el5/ and it will surely
eventually make it into official updates.
Comment 10 Axel Thimm 2007-12-28 19:53:28 UTC
Thanks Jakub! I updated the bugzilla.redhat.com entry with that information and
will knock myself out with the 62 kernels :)