[PATCH 1/3] Linux: Add close_range

Wed Dec 23 12:33:21 GMT 2020

On Tue, Dec 22, 2020 at 12:41:50PM +0100, Florian Weimer via Libc-alpha wrote:
> * Adhemerval Zanella:
> 
> >> I think we generally use int for file descriptors, following POSIX.
> >
> > The Linux interface uses unsigned integer, meaning that negative values
> > won't really trigger an invalid usage (kernel will return EINVAL if 
> > second argument is larger than first one though).
> >
> > I would prefer for syscall wrapper to be close as possible of the kernel
> > interface, otherwise it would require to add additional semantic to
> > handle potential pitfall cases.
> >
> > On this case, if we go for 'int' as argument should we return EBADF 
> > for invalid handles?
> 
> Hmm.  I think unsigned int is needed for ~0U to work, which is what you
> used in the tests.  If that's what applications use today when issuing
> the syscall directly, I think we need to stick to unsigned int.

For the record, the initial reason we chose unsigned int for
close_range() was that the close() syscall uses unsigned int too:

SYSCALL_DEFINE1(close, unsigned int, fd)
SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
		unsigned int, flags)

Furthermore, the kernel doesn't care whether the fd is negative. It
simply checks whether the passed-in fd number falls within the possible
range of the fdtable, i.e. fd < fdt->max_fds. If it does, the kernel
checks whether there's a file open at that position in the fdtable. So
from the kernel's perspective "negative" fd values aren't special in any
way.
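
To illustrate the point, here is a minimal userspace sketch. It is not
kernel code: fd_in_table_range() and as_kernel_sees() are hypothetical
names, and max_fds is passed in as a plain parameter standing in for
fdt->max_fds. It only shows that a "negative" fd is an ordinary large
unsigned value that fails the range check like any other out-of-range
number.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the range check described above
 * (fd < fdt->max_fds). max_fds is an illustrative parameter. */
static bool fd_in_table_range(unsigned int fd, unsigned int max_fds)
{
	return fd < max_fds;
}

/* What an unsigned int parameter receives when userspace passes a
 * signed fd: the usual int -> unsigned int conversion, so -1
 * becomes UINT_MAX. */
static unsigned int as_kernel_sees(int fd)
{
	return (unsigned int)fd;
}
```

With, say, max_fds = 1024, as_kernel_sees(-1) is UINT_MAX and is rejected
by the same comparison that rejects fd 5000. No sign test is involved.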

But from what I can tell, a lot of (non-libc) userspace isn't aware that
the kernel uses unsigned int, which can lead to some confusion. A little
while ago I had to talk to Lennart about this when systemd added support
for close_range(), which they had been waiting for. Their syscall wrapper
is now documented with:

/* Kernel-side the syscall expects fds as unsigned integers (just like close() actually), while
 * userspace exclusively uses signed integers for fds. We don't know just yet how glibc is going to
 * wrap this syscall, but let's assume it's going to be similar to what they do for close(),
 * i.e. make the same unsigned → signed type change from the raw kernel syscall compared to the
 * userspace wrapper. There's only one caveat for this: unlike for close() there's the special
 * UINT_MAX fd value for the 'end_fd' argument. Let's safely map that to -1 here. And let's refuse
 * any other negative values. */
https://github.com/systemd/systemd/commit/441e0fdb900b49888fb6d7901a2b5aa92c0a2017
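
A rough sketch of the mapping that comment describes might look like the
following. This is not systemd's or glibc's code; map_close_range_args()
is a hypothetical helper, and the choice of -EINVAL for refused values is
an assumption made for illustration only.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical helper: translate signed userspace fds into the
 * unsigned pair the raw syscall expects. Returns 0 on success or
 * -EINVAL (an assumed error code) if an argument is refused. */
static int map_close_range_args(int start_fd, int end_fd,
				unsigned int *first, unsigned int *last)
{
	/* A negative start fd is never meaningful: refuse it. */
	if (start_fd < 0)
		return -EINVAL;

	/* -1 is the only negative end fd accepted; it stands for the
	 * special UINT_MAX value. */
	if (end_fd < -1)
		return -EINVAL;

	*first = (unsigned int)start_fd;
	*last = (end_fd == -1) ? UINT_MAX : (unsigned int)end_fd;
	return 0;
}
```

The resulting pair could then be handed to the raw syscall unchanged,
e.g. (3, -1) becomes (3, UINT_MAX), i.e. "close everything from fd 3
upwards".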

Christian