systemd uses pidfd_spawn/posix_spawn these days for process invocation. But there's one more thing we are missing: we want to pass ambient capabilities to invoked processes. This is a bit messy right now, because we have to raise the ambient caps *before* we invoke pidfd_spawn/posix_spawn, where they affect the original process, even though we really don't want that. We'd much rather have them affect only the invoked process, i.e. raise them between the clone() and the execve() in the child. Related to this: https://github.com/systemd/systemd/pull/32937
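For illustration, the workaround described above could look roughly like the following minimal sketch. `spawn_with_ambient_cap()` is a hypothetical helper, not systemd's actual code; it only shows that the raise has to happen in the calling process, and that undoing it afterwards is best-effort.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/capability.h>
#include <spawn.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: raise one ambient capability in the *caller*,
 * spawn the child, then lower the capability again.
 * Returns 0 on success or a positive errno-style value on failure. */
static int spawn_with_ambient_cap(int cap, const char *path,
                                  char *const argv[], pid_t *ret_pid) {
    /* Remember whether the capability was already raised in the caller. */
    int was_set = prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, cap, 0, 0);
    if (was_set < 0)
        return errno;

    /* Side effect on the calling process: this is the part we would
     * like to avoid. It fails with EPERM unless the capability is in
     * both the permitted and the inheritable set. */
    if (!was_set &&
        prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0) < 0)
        return errno;

    /* The child inherits the ambient set of the caller. */
    int r = posix_spawn(ret_pid, path, NULL, NULL, argv, environ);

    /* Best-effort undo, so the caller ends up where it started. */
    if (!was_set)
        (void) prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_LOWER, cap, 0, 0);

    return r;
}
```

Between the raise and the lower, the calling thread itself carries the ambient capability, which is exactly the window the request wants to get rid of.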
What are the relevant interfaces you need? I'm not sure if the systemd PR shows all of them. I just see capability_ambient_set_apply. That is implemented using PR_CAPBSET_READ, PR_CAP_AMBIENT/PR_CAP_AMBIENT_RAISE and PR_CAP_AMBIENT/PR_CAP_AMBIENT_IS_SET, and involves conditional execution. This is way beyond the limits we have in the current spawn file actions.
Yeah, it's mostly about PR_CAP_AMBIENT_RAISE for us (the other related calls we can probably do ahead of time). I guess glibc would have to be a bit more generic, and would probably need to be able to lower/clear flags in the ambient set too... And yes, I am fully aware that glibc doesn't really do caps stuff at all right now, it's all in libcap/libcap-ng. But I figure the correct place for this really has to be glibc if we live in a pidfd_spawn()/posix_spawn() world, given it has to be done between clone() and execve() if the parent shall not be affected by these things.
We could add a way to do generic prctl calls in the new process. Would that help? Obviously, the call would be unconditional. If the staged prctl fails, the whole posix_spawn operation fails, without a good way to identify whether it was due to that or some other Linux extension (but that's not a new problem).
Hmm, yeah, being able to schedule prctl calls would probably suffice for our primary use case, i.e. PR_CAP_AMBIENT_RAISE. But I wonder how future-proof that is. The thing is that various prctl() calls have to be executed in some well-defined order. I.e. let's say you use this to seal the "secure bits" of a process (PR_SET_SECUREBITS), or set PR_SET_NO_NEW_PRIVS, or drop caps from the bounding set: these things need to be scheduled in a precise order, so that you don't end up dropping the permissions necessary to execute the next prctl operation. Now, it's probably fine to outsource this ordering to the caller. However, this becomes a complete mess once pidfd_spawn/posix_spawn is extended in the future to cover more stuff, for example setresuid() or so. Suddenly, the order in which the prctls are executed also has to be matched up with the point where setresuid() is invoked: if you use prctl to drop privs before changing the uid, the uid change will fail. Hence, yes, this would certainly cover our specific use case, and I guess it would be simple to add to glibc, but I sense this is going to become a mess later on if you do this, since ordering these operations against what glibc itself wants to do is going to be problematic.
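To make the "staged prctl calls in caller-defined order" idea concrete, here is a minimal user-space sketch. Everything in it (`struct staged_prctl`, `spawn_with_staged_prctls()`) is hypothetical and built on plain fork()/execve(); it is not an existing or proposed glibc interface, only a model of the semantics discussed: unconditional calls, executed in list order, with any failure aborting the spawn.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical description of one staged prctl() call. */
struct staged_prctl {
    int option;
    unsigned long arg2, arg3, arg4, arg5;
};

/* Forks, runs the staged calls in list order in the child, then execs.
 * Returns the child's PID in the parent, or -1 if fork() failed.
 * The child exits with status 127 if any staged call or the execve()
 * fails, so the parent cannot tell which step went wrong. */
static pid_t spawn_with_staged_prctls(const char *path,
                                      char *const argv[],
                                      char *const envp[],
                                      const struct staged_prctl *calls,
                                      size_t n_calls) {
    pid_t pid = fork();
    if (pid != 0)
        return pid; /* parent, or fork() error (-1) */

    /* Child: strictly sequential, unconditional execution. */
    for (size_t i = 0; i < n_calls; i++)
        if (prctl(calls[i].option, calls[i].arg2, calls[i].arg3,
                  calls[i].arg4, calls[i].arg5) < 0)
            _exit(127);

    execve(path, argv, envp);
    _exit(127);
}
```

With such a list the ordering burden is entirely on the caller: for example, a PR_CAP_AMBIENT_RAISE entry has to come before a PR_SET_SECUREBITS entry that sets SECBIT_NO_CAP_AMBIENT_RAISE, because the raise is refused once that bit is set.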
We are side-stepping this issue for now, by adding things to the file actions list instead of the spawn attributes. File actions are sequenced. If we add setresuid, it would be a file action in the current model, just like chdir. In both cases, subsequent open file actions depend on the previous non-file actions. There are certainly some inconsistencies (chdir/open are expected to copy the paths, but we can't copy arbitrary prctl arguments). But the real challenge will be data flow (do this first and then use the result in another action), conditional execution, and error handling. I don't think it makes sense to turn the file action list into some sort of programming language. We need to find a way to expose a fork-style or vfork-style clone more directly, with clear guidance to programmers on how to use it safely, and without the need for managing stacks.
There are projects like rsyscall [1], which essentially exposes clone as a first-order interface and makes process creation a server-like interface where the caller issues the syscall commands to be executed in the child. This allows for great flexibility (since it bypasses libc for process creation), but it is somewhat complex and I am not sure of the implications of trying to adapt the idea to a libc-like interface. Another approach is a fork-like interface (where you can add some flags to define what is shared or to gate the clone flags) along with a callback. It has the issue of what kind of functions the callback can safely run (and historically users have abused such callbacks to work around their limitations). [1] http://catern.com/rsys21.pdf
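A rough sketch of the second approach, the fork-like interface with a callback, might look as follows. The names `child_setup_fn`, `spawn_with_callback()` and `lower_ambient_if_set()` are hypothetical, and the sketch uses plain fork() rather than a configurable clone(); it is only meant to show why a callback can express conditional steps, and why its contents have to be restricted.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback type. It runs in the child between fork()
 * and execve() and should restrict itself to async-signal-safe
 * operations. Return 0 to continue, non-zero to abort the spawn. */
typedef int (*child_setup_fn)(void *userdata);

/* Returns the child's PID in the parent, or -1 if fork() failed.
 * The child exits with status 127 if the callback or execve() fails. */
static pid_t spawn_with_callback(const char *path,
                                 char *const argv[],
                                 char *const envp[],
                                 child_setup_fn setup,
                                 void *userdata) {
    pid_t pid = fork();
    if (pid != 0)
        return pid; /* parent, or fork() error (-1) */

    if (setup != NULL && setup(userdata) != 0)
        _exit(127);

    execve(path, argv, envp);
    _exit(127);
}

/* Example callback: lower one ambient capability, but only if it is
 * currently raised. This is the kind of conditional step that a
 * linear action list cannot express. */
static int lower_ambient_if_set(void *userdata) {
    int cap = *(const int *) userdata;

    int r = prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, cap, 0, 0);
    if (r < 0)
        return -1;
    if (r > 0 &&
        prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_LOWER, cap, 0, 0) < 0)
        return -1;
    return 0;
}
```

Nothing in such an interface stops a caller from passing a callback that allocates memory, takes locks, or otherwise does things that are unsafe in a freshly forked child of a multi-threaded process, which is the concern raised above.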
(I figure io_uring is exactly that: a buffer of interconnected, ordered syscalls to execute.)