Bug 29115 - vfork()-based posix_spawn() has more failure modes than fork()-based one
Summary: vfork()-based posix_spawn() has more failure modes than fork()-based one
Status: RESOLVED WONTFIX
Alias: None
Product: glibc
Classification: Unclassified
Component: libc (show other bugs)
Version: 2.35
: P2 normal
Target Milestone: ---
Assignee: Adhemerval Zanella
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-02 12:08 UTC by Alexey Izbyshev
Modified: 2022-08-08 15:37 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Izbyshev 2022-05-02 12:08:51 UTC
Modern vfork()-based posix_spawn() can be used as an efficient alternative to fork()/exec() to avoid performance and overcommit issues. A common expectation is that whenever posix_spawn() feature set is sufficient for application needs of tweaking the child attributes, it can be used instead of fork()/exec().

However, it turns out that vfork() can have failure modes than fork() doesn't have. One such case is due to Linux not allowing processes in different time namespaces to share address space.

$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <spawn.h>
#include <unistd.h>

int main(int argc, char *argv[], char *envp[]) {
  if (getenv("TEST_FORK")) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 127;
    }
    if (pid == 0) {
        execve(argv[1], argv + 1, envp);
        perror("execve");
        return 127;
    }
  } else {
      int err = posix_spawn(0, argv[1], 0, 0, argv + 1, envp);
      if (err) {
        printf("posix_spawn: %s\n", strerror(err));
        return 127;
      }
  }
  printf("OK\n");
  return 0;
}

$ gcc test.c

$ unshare -UrT ./a.out /bin/true
posix_spawn: Operation not permitted

(The actual clone() error is EINVAL, but it's reported incorrectly due to bug 29109).

$ TEST_FORK=1 unshare -UrT ./a.out /bin/true
OK

I'm not aware of other failure modes, but more might appear in the future.

Does this qualify as a glibc bug? Should glibc's posix_spawn() implementation, for example, retry with fork() on vfork() failure (which would require a redesign of error reporting from the child process because it currently relies on address space sharing)?

Or do applications are expected to deal with that somehow? In this case, what is the recommended way to do that, given that it's not possible to reliably detect "retriable" posix_spawn() failures?
Comment 1 Adhemerval Zanella 2022-05-02 16:17:22 UTC
(In reply to Alexey Izbyshev from comment #0)
> Modern vfork()-based posix_spawn() can be used as an efficient alternative
> to fork()/exec() to avoid performance and overcommit issues. A common
> expectation is that whenever posix_spawn() feature set is sufficient for
> application needs of tweaking the child attributes, it can be used instead
> of fork()/exec().
> 
> However, it turns out that vfork() can have failure modes than fork()
> doesn't have. One such case is due to Linux not allowing processes in
> different time namespaces to share address space.
> 
> $ cat test.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <spawn.h>
> #include <unistd.h>
> 
> int main(int argc, char *argv[], char *envp[]) {
>   if (getenv("TEST_FORK")) {
>     pid_t pid = fork();
>     if (pid < 0) {
>         perror("fork");
>         return 127;
>     }
>     if (pid == 0) {
>         execve(argv[1], argv + 1, envp);
>         perror("execve");
>         return 127;
>     }
>   } else {
>       int err = posix_spawn(0, argv[1], 0, 0, argv + 1, envp);
>       if (err) {
>         printf("posix_spawn: %s\n", strerror(err));
>         return 127;
>       }
>   }
>   printf("OK\n");
>   return 0;
> }
> 
> $ gcc test.c
> 
> $ unshare -UrT ./a.out /bin/true
> posix_spawn: Operation not permitted
> 
> (The actual clone() error is EINVAL, but it's reported incorrectly due to
> bug 29109).
> 
> $ TEST_FORK=1 unshare -UrT ./a.out /bin/true
> OK
> 
> I'm not aware of other failure modes, but more might appear in the future.
> 
> Does this qualify as a glibc bug? Should glibc's posix_spawn()
> implementation, for example, retry with fork() on vfork() failure (which
> would require a redesign of error reporting from the child process because
> it currently relies on address space sharing)?
> 
> Or do applications are expected to deal with that somehow? In this case,
> what is the recommended way to do that, given that it's not possible to
> reliably detect "retriable" posix_spawn() failures?

It is really annoying that kernel does not allow clone (CLONE_VM | CLONE_VFORK)
with time namespace, however I am not the implications of allowing it (neither
if this is feasible on current kernel architecture).  

In any case, adding fork+exec fallback seems feasible, the only annoying case is
if glibc should detect a clone transient failure (for instance due some resource
exhaustion) from a namespace filtering. We can always retry in case of clone
failure, it should be really an exception and retrying will most likely succeed
in both cases.
Comment 2 Adhemerval Zanella 2022-05-02 16:26:49 UTC
Another issue is with fork+exec fallback it would require additional resources to communicate the possible error code from the helper process while running the prepare phase (as covered by tst-spawn3.c).
Comment 3 Alexey Izbyshev 2022-05-02 16:55:59 UTC
> It is really annoying that kernel does not allow clone (CLONE_VM |
> CLONE_VFORK)
> with time namespace, however I am not the implications of allowing it
> (neither
> if this is feasible on current kernel architecture).  
> 
I suspect this restriction is due to the conflict of the shared address space and the need to provide different VDSOs (for clock_gettime()) for processes in separate time namespaces, but I haven't looked closely.

> In any case, adding fork+exec fallback seems feasible, the only annoying
> case is
> if glibc should detect a clone transient failure (for instance due some
> resource
> exhaustion) from a namespace filtering. We can always retry in case of clone
> failure, it should be really an exception and retrying will most likely
> succeed
> in both cases.

I think it would be great if glibc provided such a fallback. I agree that retrying once with fork() in case of *any* clone(CLONE_VM | CLONE_VFORK) failure shouldn't hurt, but it should probably also be OK to skip retry on ENOMEM and (paradoxically) EAGAIN because the caller has to deal with them in any case.

> Another issue is with fork+exec fallback it would require additional
> resources to communicate the possible error code from the helper process
> while running the prepare phase (as covered by tst-spawn3.c).

Yes, I'm aware that glibc currently relies on address space sharing to pass the error code, so adding an alternative error reporting would constitute most of the fix.

One benefit of the alternative error reporting is that it would also work correctly in environments where vfork() system call acts as fork() (i.e. doesn't provide address space sharing), such as qemu-user. So if it's in place, glibc could add some knob to always enable it for users that need good posix_spawn() error reporting in such environments.
Comment 4 Adhemerval Zanella 2022-05-02 17:17:00 UTC
(In reply to Alexey Izbyshev from comment #3)
> > It is really annoying that kernel does not allow clone (CLONE_VM |
> > CLONE_VFORK)
> > with time namespace, however I am not the implications of allowing it
> > (neither
> > if this is feasible on current kernel architecture).  
> > 
> I suspect this restriction is due to the conflict of the shared address
> space and the need to provide different VDSOs (for clock_gettime()) for
> processes in separate time namespaces, but I haven't looked closely.
> 
> > In any case, adding fork+exec fallback seems feasible, the only annoying
> > case is
> > if glibc should detect a clone transient failure (for instance due some
> > resource
> > exhaustion) from a namespace filtering. We can always retry in case of clone
> > failure, it should be really an exception and retrying will most likely
> > succeed
> > in both cases.
> 
> I think it would be great if glibc provided such a fallback. I agree that
> retrying once with fork() in case of *any* clone(CLONE_VM | CLONE_VFORK)
> failure shouldn't hurt, but it should probably also be OK to skip retry on
> ENOMEM and (paradoxically) EAGAIN because the caller has to deal with them
> in any case.

It makes sense indeed.

> 
> > Another issue is with fork+exec fallback it would require additional
> > resources to communicate the possible error code from the helper process
> > while running the prepare phase (as covered by tst-spawn3.c).
> 
> Yes, I'm aware that glibc currently relies on address space sharing to pass
> the error code, so adding an alternative error reporting would constitute
> most of the fix.
> 
> One benefit of the alternative error reporting is that it would also work
> correctly in environments where vfork() system call acts as fork() (i.e.
> doesn't provide address space sharing), such as qemu-user. So if it's in
> place, glibc could add some knob to always enable it for users that need
> good posix_spawn() error reporting in such environments.

We discussed this before and we moved to use shared memory to report errors exactly to avoid requiring another resource allocation (which adds even more failure paths).  I am not very found in restoring it.
Comment 5 Carlos O'Donell 2022-05-02 20:38:04 UTC
Either the kernel supports vfork or it doesn't.

A time namespace, or a seccomp filter are all the same problems, and we should return the error the userspace.

Adding code which will only be exercised in the event that a time namespace is in use is going to result in increased long-term maintenance costs.

It also results in unexpected surprise behaviour when the developer runs under a time namespace e.g. more memory usage, different code paths taken etc.

Rather than add long-term maintenance and surprise developers my suggestion is to fail the posix_spawn.

Users can take this up with the kernel to add support for vfork in time namespaces, or with their sandboxing technology provider.

There might be exceptional cases where we need to add fallbacks, but I can't see that this is one of those cases. For example clone3 to clone2 fallback is sensible.
Comment 6 Florian Weimer 2022-05-02 20:43:08 UTC
CLONE_NEWTIME is as specified today fundamentally incompatible with real vfork and the vDSO. It just does not work. Entering the new namespace requires a new vDSO data mapping, and that conflicts with vfork using the same address space.

CLONE_NEWTIME should have been specified to apply only after execve (which remaps the vDSO anyway), but it it's what we've got.
Comment 7 Alexey Izbyshev 2022-05-02 20:56:47 UTC
(In reply to Carlos O'Donell from comment #5)
> Either the kernel supports vfork or it doesn't.
> 
> A time namespace, or a seccomp filter are all the same problems, and we
> should return the error the userspace.
> 
> Adding code which will only be exercised in the event that a time namespace
> is in use is going to result in increased long-term maintenance costs.
> 
> It also results in unexpected surprise behaviour when the developer runs
> under a time namespace e.g. more memory usage, different code paths taken
> etc.
> 
> Rather than add long-term maintenance and surprise developers my suggestion
> is to fail the posix_spawn.
> 
posix_spawn() failing and fork()/exec() not failing is also a surprise for developers. Note that if users are expected to deal with this posix_spawn() failure, all language frameworks/libraries providing high level process creation APIs will have to implement knobs to opt-out from posix_spawn(). It's not clear to me that it's better than a potential performance problem due to fork() when time namespaces are used.

We also don't know what other vfork() failure modes that fork() doesn't have may appear in the future. A fallback would cover them.
Comment 8 Carlos O'Donell 2022-05-02 21:02:09 UTC
(In reply to Florian Weimer from comment #6)
> CLONE_NEWTIME is as specified today fundamentally incompatible with real
> vfork and the vDSO. It just does not work. Entering the new namespace
> requires a new vDSO data mapping, and that conflicts with vfork using the
> same address space.

The kernel already has per-cpu data in the vDSO.

The vDSO doesn't follow any concept of a single address space for the process.

The vDSO is not a part of POSIX and so doesn't have to follow any vfork semantic requirements.

What prevents the kernel from making a new vDSO data mapping?
 
> CLONE_NEWTIME should have been specified to apply only after execve (which
> remaps the vDSO anyway), but it it's what we've got.

It can remain that way, we just have to remap the vDSO at vfork time?
Comment 9 Florian Weimer 2022-05-02 21:06:11 UTC
(In reply to Carlos O'Donell from comment #8)
> (In reply to Florian Weimer from comment #6)
> > CLONE_NEWTIME is as specified today fundamentally incompatible with real
> > vfork and the vDSO. It just does not work. Entering the new namespace
> > requires a new vDSO data mapping, and that conflicts with vfork using the
> > same address space.
> 
> The kernel already has per-cpu data in the vDSO.

Uh, since when? I thought that Linux didn't do per-CPU page tables.

> The vDSO doesn't follow any concept of a single address space for the
> process.
> 
> The vDSO is not a part of POSIX and so doesn't have to follow any vfork
> semantic requirements.
> 
> What prevents the kernel from making a new vDSO data mapping?

It requires creating a new VM for the vfork process, while preserving existing shared VM semantics in other regards. That seems difficult?
Comment 10 Carlos O'Donell 2022-05-02 21:15:24 UTC
(In reply to Alexey Izbyshev from comment #7)
> (In reply to Carlos O'Donell from comment #5)
> > Either the kernel supports vfork or it doesn't.
> > 
> > A time namespace, or a seccomp filter are all the same problems, and we
> > should return the error the userspace.
> > 
> > Adding code which will only be exercised in the event that a time namespace
> > is in use is going to result in increased long-term maintenance costs.
> > 
> > It also results in unexpected surprise behaviour when the developer runs
> > under a time namespace e.g. more memory usage, different code paths taken
> > etc.
> > 
> > Rather than add long-term maintenance and surprise developers my suggestion
> > is to fail the posix_spawn.
> > 
> posix_spawn() failing and fork()/exec() not failing is also a surprise for
> developers. Note that if users are expected to deal with this posix_spawn()
> failure, all language frameworks/libraries providing high level process
> creation APIs will have to implement knobs to opt-out from posix_spawn().
> It's not clear to me that it's better than a potential performance problem
> due to fork() when time namespaces are used.
> 
> We also don't know what other vfork() failure modes that fork() doesn't have
> may appear in the future. A fallback would cover them.

That is a slipper slope fallacy. Those other failure modes haven't materialized and so they do not matter to the conversation at hand. When we have other failure modes, and fork() can fail badly also as it consumes more memmory, maybe triggering OOM, we have other problems.

Performance and expected semantics are an important part of an interface. Library and applications authors would not only have to change posix_spawn() as a choice but also system() which may use vfork(), and maybe even clone (if used with the right flags).

All of this makes me suspect that blocking vfork is the wrong semantic. It needs to be enabled in the kernel otherwise the CRIU use case is *not met*.

We can't add CLONE_NEWTIME and yet require all of userspace to move away from vfork/clone which is the fastest and least-memory intensive way to clone a process.

This change adds significant code to the implementation. Please involve the CRIU developers and see if this can't be solved in the kernel first. I haven't seen any justification that there are blockers to this in the kernel.
Comment 11 Carlos O'Donell 2022-05-02 21:24:00 UTC
(In reply to Florian Weimer from comment #9)
> (In reply to Carlos O'Donell from comment #8)
> > (In reply to Florian Weimer from comment #6)
> > > CLONE_NEWTIME is as specified today fundamentally incompatible with real
> > > vfork and the vDSO. It just does not work. Entering the new namespace
> > > requires a new vDSO data mapping, and that conflicts with vfork using the
> > > same address space.
> > 
> > The kernel already has per-cpu data in the vDSO.
> 
> Uh, since when? I thought that Linux didn't do per-CPU page tables.

So, this is a stretch, but on x86 you use GDT to get the per-CPU data.

Is this not what we could call per-cpu data in a distinct address space?

> > The vDSO doesn't follow any concept of a single address space for the
> > process.
> > 
> > The vDSO is not a part of POSIX and so doesn't have to follow any vfork
> > semantic requirements.
> > 
> > What prevents the kernel from making a new vDSO data mapping?
> 
> It requires creating a new VM for the vfork process, while preserving
> existing shared VM semantics in other regards. That seems difficult?

I don't know until a kernel developer tells me this is difficult :-)
Comment 12 Adhemerval Zanella 2022-05-02 21:51:01 UTC
(In reply to Carlos O'Donell from comment #10)
> All of this makes me suspect that blocking vfork is the wrong semantic. It
> needs to be enabled in the kernel otherwise the CRIU use case is *not met*.
> 

We need the CLONE_VFORK semantic as a QoI. Otherwise, it would require synchronizing with a pipe or similar facility and thus require additional resources (with might fail under some constraint environments).

> We can't add CLONE_NEWTIME and yet require all of userspace to move away
> from vfork/clone which is the fastest and least-memory intensive way to
> clone a process.
> 
> This change adds significant code to the implementation. Please involve the
> CRIU developers and see if this can't be solved in the kernel first. I
> haven't seen any justification that there are blockers to this in the kernel.

I tend to agree it adds maintainability, but I think since we have some kernel with timestamp support not having a fallback or if kernel developers decide to not fix it, it will make posix_spawn an unappealing API.
Comment 13 Adhemerval Zanella 2022-08-08 14:08:33 UTC
It has now been fixed on Linux kernel (133e2d3e81de5d9706cab2dd1d52d231c27382e5), and based on previous discussions where only fixing glibc does not help direct user cases of clone (CLONE_VM | CLONE_FORK), I will drop my fix [1].  We will need to ask for kernel backport of kernel fix to support this usage.

[1] https://patchwork.sourceware.org/project/glibc/list/?series=9797
Comment 14 Alexey Izbyshev 2022-08-08 15:37:58 UTC
Great to see it fixed in the kernel, thanks for everybody involved!