This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 1/3] Use reliable sem_wait interruption in nptl/tst-sem6.
- From: Torvald Riegel <triegel at redhat dot com>
- To: OndÅej BÃlka <neleai at seznam dot cz>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>
- Date: Tue, 09 Dec 2014 19:24:49 +0100
- Subject: Re: [PATCH 1/3] Use reliable sem_wait interruption in nptl/tst-sem6.
- Authentication-results: sourceware.org; auth=none
- References: <1417804668 dot 22797 dot 108 dot camel at triegel dot csb> <1417805577 dot 25868 dot 4 dot camel at triegel dot csb> <20141206135040 dot GA16212 at domone> <1418038997 dot 25868 dot 34 dot camel at triegel dot csb> <20141208222857 dot GB13499 at domone> <1418120160 dot 25868 dot 132 dot camel at triegel dot csb> <20141209165033 dot GA20499 at domone>
On Tue, 2014-12-09 at 17:50 +0100, OndÅej BÃlka wrote:
> On Tue, Dec 09, 2014 at 11:16:00AM +0100, Torvald Riegel wrote:
> > On Mon, 2014-12-08 at 23:28 +0100, OndÅej BÃlka wrote:
> > > On Mon, Dec 08, 2014 at 12:43:17PM +0100, Torvald Riegel wrote:
> > > > On Sat, 2014-12-06 at 14:50 +0100, OndÅej BÃlka wrote:
> > > > > On Fri, Dec 05, 2014 at 07:52:57PM +0100, Torvald Riegel wrote:
> > > > > > alarm (1);
> > > > > >
> > > > > > int res = sem_wait (&s);
> > > > > > - if (res == 0)
> > > > > > - {
> > > > > > - puts ("wait succeeded");
> > > > > > - return 1;
> > > > > > - }
> > > > > > - if (res != -1 || errno != EINTR)
> > > > > > + /* We accept all allowed behavior: Implementations that return EINTR and
> > > > > > + those that rely on correct interruption through signals to use sem_post
> > > > > > + in the signal handler. */
> > > > > > + if (res != 0 && !(res == -1 && errno == EINTR))
> > > > > > {
> > > > > > - puts ("wait didn't fail with EINTR");
> > > > > > + puts ("wait neiter succeeded nor failed with EINTR");
> > > > > > return 1;
> > > > > > }
> > > > > >
> > > > >
> > > > > That does change test logic as it originally failed when wait succeeded.
> > > >
> > > > Yes, but why do you think that this is inconsistent? The previous test
> > > > didn't add a token in the signal handler, so if wait succeeded, then the
> > > > test should fail.
> > > >
> > > > However, the correct way to interrupt the semaphore with a signal is to
> > > > add a token. My patch does that. Second, if we do not want to make
> > > > timing assumptions (which the existing test would do if we add a token
> > > > to the semaphore in the signal handler), then we need to accept that the
> > > > (first) signal handler execution might happen before sem_wait actually
> > > > executes. Therefore, we must not fail in this case.
> > > >
> > > > We have to correctly interrupt with signals because as the futex
> > > > documentation stands (allowing return of EINTR on spurious wake-ups),
> > > > there's no way to implement the specific behavior of the existing
> > > > implementation (which assumes EINTR is returned *only* on signals).
> > > >
> > > > IOW, the existing test does white-box testing with timing assumptions;
> > > > with this patch, we do make a slightly different black-box test with no
> > > > timing assumptions.
> > >
> > > Which misses point of test which is to find bugs.
> >
> > Please read the POSIX spec. It allows both outcomes, and without timing
> > assumptions etc., we can't drive executions towards just one of the
> > outcomes.
> >
> Which does not answer my objection. What extra bugs could this test catch,
> compared to say tst-sem2? If there is no such bug you could just delete
> that file.
tst-sem2 tests that spurious wake-ups and such don't return anything but
-1 and errno==EINTR, in particular that 0 isn't returned.
After the patch, tst-sem6 tests that a signal handler that posts a token
will make sem_wait return. It *also* allows for sem_wait to return -1
and errno==EINTR in that case.
Thus, one possible error that the patched tst-sem6 will catch is if the
sem_wait itself just retries the futex_wait after the futex_wait
returned EINTR, instead of looking for whether there is an available
token.
>
> > > What if in new fooarch
> > > assembly one forget to return -1 on interrupt which breaks user application
> > > which will assume that semaphore is locked,
> >
> > I can strengthen the test on res==0, checking whether there is no token
> > left. I don't think it buys us much though.
> >
> > Another option would be to disallow failure; however, this only works if
> > sem_assume_only_signals_cause_futex_EINTR remains 0 and not set to a
> > different value by a distribution (see Patch 3/3).
> >
> > > one can deal with rare false
> > > positives in tests.
> >
> > I think you'd be creating false negatives with what you have in mind.
> > The false positive would be not failing when 0 is returned, incorrectly,
> > ie your example.
> >
> Depends whats your default. Usually false positive means that test shows
> there is disease/bug but not in reality.
I disagree regarding "usually", and would say that a negative result for
a test is if the test fails, so false negative being a reportedly failed
test that was not caused by an actual fault.
Nonetheless, we seem to be on the same page regarding what we want to
avoid.
> > False negatives are a pain. You can deal with them of course, but it
> > stands in the way of doing continuous integration and such. what we de
> > facto do is just ignore all tests that we know can have false negatives,
> > which doesn't make the test useful at all.
> >
> > > One does not need justify this fact by forward progress/fairness
> > > assumptions. Just assumption that OS which keeps all CPU idle for one
> > > second while there is suitable task is simply broken.
> >
> > Well, that *is* a timing assumption. There is nothing broken about this
> > in general. Remember that we do have tests failing now and then just
> > because of that. Which is awful for testing.
> >
> Without that assumption there is no guarantee that it will take more
> than week to run test suite. That would make it useless.
There are timing assumptions that state something like "after a second,
the thread will have been scheduled". Those are bad, because they state
a concrete bound (1s in this example).
The forward progress guarantees that make sense for us, in particular in
non-real-time systems, are instead of the form "something good will
*eventually* happen". In other words, there's no concrete bound
promised, but that there is a (finite) bound.
That's an important difference.
> > Also, please review the actual background for this change. See patch
> > 3/3 for that. The change to the test is not ideal, but please see the
> > trade-offs.
>
> The rarity of problem bugged me, as to trigger that behaviour one would
> need run also highly parallel program so it looked unlikely that anybody
> would report that. As it couldn't detect some bugs it previously could
> its hard to see what is better.
Let me try to summarize the background behind this change again:
1) Linux documents futex_wait to return EINTR on signals *or* on
spurious wake-ups.
2) If we treat 1) as true -- which we should to unless getting
confirmation otherwise -- sem_wait must not return EINTR to the caller
anymore if futex_wait returned EINTR.
3) Because of 2), the behavior that is tested in tst-sem6 before my
patch cannot be implemented anymore.
Thus, we need to do *something*. I proposed this patch, and variations.
If you don't see other alternatives, then I guess we'll have to pick
from the options I gave.
If you disagree with 2), then please comment on Patch 3/3, because
that's where this belongs.