31784 – RFE: ability to control ambient caps in posix_spawn/pidfd_spawn

Status:	UNCONFIRMED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	libc
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-05-22 08:46 UTC by Lennart Poettering
Modified:	2024-05-23 13:52 UTC
CC List:	4 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Description Lennart Poettering 2024-05-22 08:46:06 UTC

systemd uses pidfd_spawn/posix_spawn these days for process invocation. But there's one more thing we are missing: we want to pass ambient capabilities to invoked processes. This is a bit messy right now, because we have to raise the ambient caps *before* we invoke pidfd_spawn/posix_spawn, where they will affect the original process, even if we really don't want that. We'd much rather have them only affect the invoked process, i.e. raise them between the clone() + execve() in the child.

Related to this: https://github.com/systemd/systemd/pull/32937
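For illustration, the workaround described above can be sketched roughly as follows. This is not systemd's actual code; `spawn_with_ambient_cap` is a hypothetical helper name, and the sketch assumes the capability is already in the caller's permitted and inheritable sets (otherwise the raise fails with EPERM).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <spawn.h>
#include <sys/prctl.h>
#include <sys/types.h>

extern char **environ;

/* Hypothetical helper: raise one ambient capability in the *calling* thread,
   spawn, then undo the raise. Returns 0 or an errno value. The parent carries
   the capability in its ambient set for the duration of the spawn, which is
   exactly the side effect this RFE wants to avoid. */
int spawn_with_ambient_cap(unsigned long cap, const char *path,
                           char *const argv[], pid_t *pid)
{
    /* Remember whether the capability was already ambient. */
    int was_set = prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, cap, 0, 0);
    if (was_set < 0)
        return errno;

    if (!was_set &&
        prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0) < 0)
        return errno;

    /* The child inherits the ambient set of the calling thread. */
    int r = posix_spawn(pid, path, NULL, NULL, argv, environ);

    /* Restore the parent's previous state; only lower what we raised. */
    if (!was_set)
        (void) prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_LOWER, cap, 0, 0);

    return r;
}
```

A caller would pass a capability number such as CAP_NET_BIND_SERVICE from `<linux/capability.h>`.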

Comment 1 Florian Weimer 2024-05-22 09:43:48 UTC

What are the relevant interfaces you need?

I'm not sure if the systemd PR shows all of them. I just see capability_ambient_set_apply. That is implemented using:

  PR_CAPBSET_READ
  PR_CAP_AMBIENT/PR_CAP_AMBIENT_RAISE
  PR_CAP_AMBIENT/PR_CAP_AMBIENT_IS_SET

and involves conditional execution. This is way beyond the limits we have in the current spawn file actions.
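To make the "conditional execution" point concrete, here is a rough sketch of that kind of logic. It is an approximation written for this thread, not the systemd function; the name `ambient_set_apply`, the 64-bit mask representation, and the use of PR_CAPBSET_READ to detect the last capability known to the kernel are all assumptions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/prctl.h>

/* Hypothetical helper: make the ambient set of the calling thread match the
   bit mask `wanted`. Returns 0 on success or a negative errno value. */
int ambient_set_apply(uint64_t wanted)
{
    for (unsigned long cap = 0; cap < 64; cap++) {
        /* Stop at the first capability number the kernel does not know. */
        if (prctl(PR_CAPBSET_READ, cap, 0, 0, 0) < 0)
            break;

        int is_set = prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, cap, 0, 0);
        if (is_set < 0)
            return -errno;

        bool want = (wanted >> cap) & 1;

        /* Each step depends on the result of the previous query: this is
           the data flow that a flat list of spawn actions cannot express. */
        if (want && !is_set) {
            if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0) < 0)
                return -errno;
        } else if (!want && is_set) {
            if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_LOWER, cap, 0, 0) < 0)
                return -errno;
        }
    }
    return 0;
}
```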

Comment 2 Lennart Poettering 2024-05-22 09:55:10 UTC

Yeah, it's mostly about PR_CAP_AMBIENT_RAISE for us (the other related calls we can probably do ahead of time). I guess glibc probably would have to be a bit more generic, and probably needs to be able to lower/clear flags in the ambient set too...

And yes, I am fully aware that glibc doesn't really do caps stuff at all right now, it's all in libcap/libcap-ng. But I figure the correct place for this really has to be glibc if we live in a pidfd_spawn()/posix_spawn() world, given it has to be done between clone() and execve() if the parent shall not be affected by these things.

Comment 3 Florian Weimer 2024-05-22 17:23:14 UTC

We could add a way to do generic prctl calls in the new process. Would that help? Obviously, the call would be unconditional. If the staged prctl fails, the whole posix_spawn operation fails, without a good way to identify whether it was due to that or some other Linux extension (but that's not a new problem).
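As a sketch of the semantics being proposed (not an existing or planned glibc interface), the idea can be emulated with fork(). `struct staged_prctl` and `spawn_with_prctls` are hypothetical names. The caller only gets back an errno value, so it cannot tell which staged call, or the exec itself, failed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: one staged prctl() call. Not a glibc type. */
struct staged_prctl {
    int option;
    unsigned long arg2, arg3, arg4, arg5;
};

/* Hypothetical helper: run the staged calls unconditionally and in order in
   the child, then exec. Returns 0 or an errno value, like posix_spawn. */
int spawn_with_prctls(pid_t *pid, const char *path, char *const argv[],
                      char *const envp[],
                      const struct staged_prctl *calls, size_t n)
{
    int errpipe[2];
    if (pipe2(errpipe, O_CLOEXEC) < 0)
        return errno;

    pid_t child = fork();
    if (child < 0) {
        int e = errno;
        close(errpipe[0]);
        close(errpipe[1]);
        return e;
    }

    if (child == 0) {
        int e = 0;
        for (size_t i = 0; i < n && e == 0; i++)
            if (prctl(calls[i].option, calls[i].arg2, calls[i].arg3,
                      calls[i].arg4, calls[i].arg5) < 0)
                e = errno;
        if (e == 0) {
            execve(path, argv, envp);
            e = errno; /* only reached if execve failed */
        }
        /* Report the failure to the parent; nothing else we can do here. */
        if (write(errpipe[1], &e, sizeof e) < 0) { /* ignore */ }
        _exit(127);
    }

    close(errpipe[1]);
    int e = 0;
    ssize_t got;
    do
        got = read(errpipe[0], &e, sizeof e);
    while (got < 0 && errno == EINTR);
    close(errpipe[0]);

    if (got == (ssize_t) sizeof e) {
        /* A staged prctl or the exec failed: reap the child, fail the spawn. */
        waitpid(child, NULL, 0);
        return e;
    }

    /* EOF: the pipe was closed by a successful exec (O_CLOEXEC). */
    *pid = child;
    return 0;
}
```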

Comment 4 Lennart Poettering 2024-05-22 21:12:05 UTC

Hmm, yeah, being able to schedule prctl calls would probably suffice for our primary usecase, i.e. PR_CAP_AMBIENT_RAISE.

But I wonder how future proof that is. The thing is that various prctl() calls have to be executed in some well-defined order. I.e. let's say you use this to seal the "secure bits" of a process (PR_SET_SECUREBITS), or set PR_SET_NO_NEW_PRIVS or drop caps from the bounding set: these things need to be scheduled in a precise order, so that you don't end up dropping the permissions necessary to execute the next prctl operation. Now, it's probably fine to outsource this ordering to the caller. However, this becomes a complete mess once in the future pidfd_spawn/posix_spawn is extended to cover more stuff, for example setresuid() or so. Suddenly, the order in which prctls are executed also must be matched up with the time when setresuid() is invoked: if you use prctl to drop privs before changing uid, the uid change will fail.

Hence, yes, would certainly cover our specific usecase, and I guess would be simple to add to glibc, but I sense this is going to become a mess later on if you do this, since ordering these operations against what glibc itself wants to do is going to be problematic.

Comment 5 Florian Weimer 2024-05-22 21:24:08 UTC

We are side-stepping this issue for now, by adding things to the file actions list instead of spawn attributes. File actions are sequenced. If we add setresuid, it would be a file action in the current model, just like chdir. In both cases, subsequent open file actions depend on the previous non-file actions. There are certainly some inconsistencies (chdir/open are expected to copy the paths, but we can't copy arbitrary prctl arguments). But the real challenge will be data flow (do this first and then use the result in another action), conditional execution, and error handling.

I don't think it makes sense to turn the file action list into some sort of programming language. We need to find a way to expose a fork-style or vfork-style clone more directly, with clear guidance to programmers on how to use it safely, and without the need for managing stacks.

Comment 6 Adhemerval Zanella 2024-05-23 12:31:42 UTC

There are projects like rsyscall [1], which essentially exposes clone as a first-order interface and makes process creation a server-like interface where the caller issues the syscall commands to be executed in the child.  This allows for great flexibility (since it bypasses libc for process creation), but it is somewhat complex and I am not sure of the implications of trying to adapt the idea to a libc-like interface.

Another approach is a fork-like interface (where you can add some flags to define what is shared or to gate the clone flags) along with a callback.  It has the issue of what kind of functions the callback would be allowed to run (and historically users have abused it to work around its limitations).

[1] http://catern.com/rsys21.pdf

Comment 7 Lennart Poettering 2024-05-23 13:52:09 UTC

(I figure io_uring is exactly that: a buffer of interconnected, ordered syscalls to execute.)