First, I'd like to point out that O_CLOEXEC is not mentioned in the glibc-2.7 manuals. Nor are the fopen mode flags 'm', 'c' or 'e'. Should the manuals be updated to reflect these new features?

I'm experiencing problems with O_CLOEXEC and opendir in glibc-2.7. There are probably also problems with other glibc functions that use O_CLOEXEC, but I haven't explored them in detail.

I have two machines with different kernels:

    salmonj@drdws032.nyc$ cat /proc/version
    Linux version 2.6.18.1-5smp (root@drdwsfe0.nyc.deshaw.com) (gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)) #1 SMP Tue Feb 27 15:02:11 EST 2007

    salmonj@drda0047.nyc$ cat /proc/version
    Linux version 2.6.9-5.ELsmp (bhcompile@thor.perf.redhat.com) (gcc version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)) #1 SMP Wed Jan 5 19:29:47 EST 2005

On the first, open silently ignores the O_CLOEXEC bit. I.e., it doesn't "work", but it doesn't cause any problems either. It looks like glibc-2.7's use of O_CLOEXEC has been coded with this possibility in mind. I configured and built glibc-2.7 on this machine.

On the second, calling open with the O_CLOEXEC bit set causes open to return -1 with errno=530. This can cause unexpected failures wherever glibc uses O_CLOEXEC. For all I know, this is a kernel bug. Even so, the glibc maintainers should make a conscious decision about whether glibc should cleanly work around it or not.

The one case I looked at in detail is opendir.
I believe it fails because of this code:

    int flags = O_RDONLY|O_NDELAY|EXTRA_FLAGS|O_LARGEFILE;
    #ifdef O_CLOEXEC
    flags |= O_CLOEXEC;
    #endif
    int fd = open_not_cancel_2 (name, flags);
    if (__builtin_expect (fd, 0) < 0)
      return NULL;

Here's some code that demonstrates the problem:

    salmonj@drda0047.nyc$ cat tstopendir.c
    #include <sys/types.h>
    #include <dirent.h>
    #include <stdio.h>
    #include <fcntl.h>

    int main(int argc, char **argv){
        DIR *d;
        int fd;
        d = opendir(argv[1]);
        if(d == 0)
            perror("opendir");
        else
            printf("opendir(\"%s\") OK\n", argv[1]);
        return 0;
    }
    salmonj@drda0047.nyc$
    salmonj@drda0047.nyc$ xmk tstopendir
    cc -std=c99 -I/proj/desres/root/Linux/x86_64/glibc/2.7-01/include -Wl,-dynamic-linker=/proj/desres/root/Linux/x86_64/glibc/2.7-01/lib/ld-2.7.so tstopendir.c -o tstopendir
    salmonj@drda0047.nyc$ ls -ld junk
    drwxrwsr-x 2 salmonj salmonj 4096 Oct 27 20:46 junk/
    salmonj@drda0047.nyc$ ./tstopendir junk
    opendir: Unknown error 530
    ### This is very strange. "." and ".." don't have the same problem:
    salmonj@drda0047.nyc$ ./tstopendir .
    opendir(".") OK
    salmonj@drda0047.nyc$ # And here's evidence that O_CLOEXEC is the underlying problem.
    salmonj@drda0047.nyc$ cat tstcloexec.c
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(int argc, char **argv){
        int fd = open(argv[0], O_RDONLY|O_CLOEXEC);
        if(fd<0)
            perror("open");
        else
            /* pass argv[0] so that %s has a matching argument */
            printf("open(%s, O_RDONLY|O_CLOEXEC) OK\n", argv[0]);
        return 0;
    }
    salmonj@drda0047.nyc$
    salmonj@drda0047.nyc$ xmk tstcloexec
    cc -std=c99 -I/proj/desres/root/Linux/x86_64/glibc/2.7-01/include -Wl,-dynamic-linker=/proj/desres/root/Linux/x86_64/glibc/2.7-01/lib/ld-2.7.so tstcloexec.c -o tstcloexec
    salmonj@drda0047.nyc$ tstcloexec
    open: Unknown error 530

John Salmon
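The report above describes two different behaviors: one kernel silently ignores O_CLOEXEC, the other makes open fail. A minimal sketch of how the three possible outcomes could be told apart, assuming Linux and a libc that defines O_CLOEXEC under _GNU_SOURCE. The helper name cloexec_honored is hypothetical and is not part of glibc.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not a glibc function.
   Returns  1 if a descriptor opened with O_CLOEXEC has FD_CLOEXEC set,
            0 if open() succeeded but the kernel ignored the flag,
           -1 if open() or fcntl() failed (errno describes the failure). */
int cloexec_honored(const char *path)
{
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;  /* the report above saw errno 530 here on one kernel */

    /* Ask the kernel whether close-on-exec is really set on fd. */
    int fdflags = fcntl(fd, F_GETFD);
    int saved_errno = errno;   /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;

    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```

Calling cloexec_honored("/dev/null") on each machine would be one way to tell which of the situations described in the report applies.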
There will be no work-around to kernel bugs.
Created attachment 2064
a new test for opendir

Fails when run with RHEL kernel. Passes when run with 2.6.18 kernel.
Fair enough - glibc doesn't work around kernel bugs. How about widely deployed kernel "enhancements"?

It turns out that the problem is that the machine on which opendir fails has a Red Hat EL kernel with TUX enhancements, and that kernel (and presumably thousands like it) was compiled with the following in fcntl.h:

    #define O_NOATIME       01000000
    #define O_ATOMICLOOKUP  02000000 /* TUX */

So my kernel thinks that the 02000000 bit of OFLAGS is a request for an ATOMICLOOKUP, but glibc (and the mainline Linux kernels since 2.6.something) thinks that it's a request to set the close-on-exec bit. Wonderful.

I can understand if the glibc maintenance team simply refuses to deal with non-standard kernels. You have to draw the line somewhere. But at least let's have a test so people who run 'make check' won't think they've got a working library when they don't.

I've attached 'opendir-tst2.c', which fails on my RHEL system but works fine on my 2.6.18 system. Note that the first opendir succeeds - we just created tmpXXXX, so it's very likely in the dentry cache, and hence O_ATOMICLOOKUP has no problem. Trying to opendir("tmpXXXX/doesnotexist"), on the other hand, is a miss in the dentry cache and fails the ATOMICLOOKUP test, leading to a non-standard errno which we can test for. Lucky for us that the TUX patches set errno to the crazy value of 530 when the dentry cache lookup fails. If not for that, I wouldn't know how to reliably reproduce the problem.
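The attachment itself is not reproduced in this thread. A rough sketch of a test along the lines described above, not the actual opendir-tst2.c: it creates a fresh directory, expects opendir on it to succeed, then expects opendir on a missing entry inside it to fail with ENOENT rather than some other errno. The function name check_opendir_errno and the /tmp template are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical test sketch. Returns 0 on the expected behavior, 1 otherwise. */
int check_opendir_errno(void)
{
    char tmpl[] = "/tmp/opendir-tst.XXXXXX";   /* assumed location */
    char missing[sizeof tmpl + sizeof "/doesnotexist"];
    int result = 0;

    if (mkdtemp(tmpl) == NULL) {
        perror("mkdtemp");
        return 1;
    }

    /* A directory that was just created should always be openable. */
    DIR *d = opendir(tmpl);
    if (d == NULL) {
        perror("opendir on fresh directory");
        result = 1;
    } else {
        closedir(d);
    }

    /* A missing entry must fail with ENOENT, not a non-standard errno. */
    snprintf(missing, sizeof missing, "%s/doesnotexist", tmpl);
    errno = 0;
    d = opendir(missing);
    if (d != NULL) {
        fprintf(stderr, "opendir unexpectedly succeeded on %s\n", missing);
        closedir(d);
        result = 1;
    } else if (errno != ENOENT) {
        fprintf(stderr, "opendir failed with errno %d, expected ENOENT\n", errno);
        result = 1;
    }

    rmdir(tmpl);
    return result;
}
```

On a kernel that interprets the 02000000 bit as described above, the second check is where a non-standard errno such as 530 would be expected to show up.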
Stop reopening. This is no bug. There will be no support for nonstandard kernels.
"glibc does not support nonstandard kernels". Is that a reason to ignore a straightforward test, modeled after the other tests in dirent/, that passes when glibc is working correctly and that fails on some systems which happen to be unsupported?
Note that this is still the case with RHEL5 as well. I agree that vendors undefining or redefining constants is a very bad thing, but at this point this kind of setup is really widely deployed. Given that Ulrich even works for that vendor, could there be some solution/workaround for RHEL4/RHEL5 users? Maybe something like the *ASSUME_KERNEL environment setting?

A consequence of this bug is that all build systems based on RHEL5 carrying Fedora chroots, or for that matter any glibc 2.7 system, are randomly breaking with "unknown error 530" (actually I don't understand why this is random and not always, but that's probably another story). Other systems seem to suffer in a similar way, and browsing through Google's hits for "unknown error 530" one sees that no user is even close to suspecting a kernel/glibc ABI incompatibility.

Is there any way to keep those Fedora 8/9 chroots running on a RHEL5 kernel?

I'm reopening not to pin that as a glibc bug, but as a request for a workaround or advice for action to take. People desperately googling for "unknown error 530" (like I did) will eventually find this report and would like to see what they can do to fix it. Help! :)

I also filed this bug against the vendor, as they should also take action to fix these problems with the next kernel release (which, according to the vendor's schedule, would happen at the earliest three months after a submission):

https://bugzilla.redhat.com/show_bug.cgi?id=426890

Thanks in advance for any advice!
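For application code that calls open directly, one possible stopgap is to retry without O_CLOEXEC and set the flag afterwards with fcntl. This is only a sketch under stated assumptions: the helper open_cloexec_fallback is hypothetical, it assumes flags does not include O_CREAT (no mode argument is passed), and it does nothing for calls made inside glibc itself, such as opendir. It also gives up the atomicity that O_CLOEXEC exists to provide, so in a multithreaded program another thread could fork and exec between the open and the fcntl.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not a glibc function. Opens path with close-on-exec
   set, falling back to open() + fcntl() if the O_CLOEXEC attempt fails.
   Assumes flags does not contain O_CREAT. Returns the fd, or -1 with errno set. */
int open_cloexec_fallback(const char *path, int flags)
{
    int fd = open(path, flags | O_CLOEXEC);
    if (fd < 0)
        fd = open(path, flags & ~O_CLOEXEC);   /* retry without the bit */
    if (fd < 0)
        return -1;                             /* a genuine failure */

    /* Make sure close-on-exec is really set; this also covers kernels
       that silently ignore O_CLOEXEC. */
    int fdflags = fcntl(fd, F_GETFD);
    if (fdflags < 0 ||
        (!(fdflags & FD_CLOEXEC) &&
         fcntl(fd, F_SETFD, fdflags | FD_CLOEXEC) < 0)) {
        int saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}
```

Retrying on every failure keeps the sketch simple at the cost of a second system call whenever the file really cannot be opened.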
Stop reopening the bug. This is entirely a kernel problem. Just use correct kernels.
(In reply to comment #7)
> Stop reopening the bug. This is entirely a kernel problem. Just use correct
> kernels.

So the recommendation is to not use RHEL???
Using correct kernels doesn't imply that. You can get fixed kernels for RHEL5 e.g. from http://people.redhat.com/dzickus/el5/62.el5/ and it will surely eventually make it into official updates.
Thanks Jakub! I updated the bugzilla.redhat.com entry with that information and will knock myself out with the 62 kernels :)