This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: memcmp-sse4.S EqualHappy bug


On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > I agree with aborting, but only as long as the hot path's performance
> > > is not impacted and I haven't thought about how to do that.
> > > 
> > > Per "Bugs in the user program" convention:
> > > https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program
> > > 
> > > ---
> > > 9.2. Bugs in the user program
> > > 
> > > If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common than library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug.
> > > ---
> > 
> > Aborting should be a satisfactory measure too, and I think this issue
> > fits the worthwhile tradeoff described by the above document, as the
> > cost should be negligible. It's just a matter of propagating rdx in
> > addition to rdi/rsi and it won't affect the sse4 unrolled loops' fast
> > paths.
> > 
> > If the testcase reproduces the bug frequently the abort would give a
> > better chance to fix the bug so the app becomes fully C compliant.
> > 
> > Aborting memcmp should still leave memcmp usable for security purposes
> > if the app can safely detect such a failure and shut down (failing
> > early and catastrophically sounds good to me, but then not everyone may
> > agree; if it's a canary value that is being checked, it may already have
> > a proper failure-notification path that may not get invoked). Overall,
> > as long as the current false negative is not returned, it should be
> > enough as far as security is concerned.
> > 
> > If the race condition is very rarely triggered and it ends up
> > triggering in production, an abort wouldn't be welcome, but it's still
> > preferable and safer than the current behavior.
> > 
> > The cost of detecting the UB and aborting is practically identical to
> > the cost of making the memcmp not return 0. The only difference is
> > that instead of jumping back to the start of the function and
> > continuing to evaluate the rest of the buffer, we'd abort.
> 
> My 2c,
> although it is true that technically the spec says memcmp can return
> arbitrary values if the tested areas change concurrently, it would be
> really hostile to abort() in such a case.
> The only real problem is returning an equal result when the underlying
> memory never was, nor could be equal.

This last sentence you wrote makes it look simple, but defining this
properly is more complicated.  The condition Andrea gave is simpler,
because it requires that there are no data races on some parts of the
checked memory, and that those parts are guaranteed to be different.

In contrast, "never was nor could be equal" is not simple to define in
weak memory models; what I mean is that an intuitive understanding of
this can easily differ from what is actually possible for executions on
particular hardware.

> I do not think it matters what value is returned in that case, as long
> as it is not 0.
> If returning anything but 0 can be accomplished w/o sacrificing speed
> then that is something that should be done.
> 
> There are many programs that use memory checks to handle lock-less
> barriers.

What kind of barriers are you referring to?  Fences like an "mfence"
instruction on x86?  If so, what do you mean by "handle"?

> For those programs it doesn't matter at all what (or when)
> changed. All that matters is that the memory regions were different
> during the test, and then they'll go about their business. If you were
> to require locking in order to do lockless checks, then the whole
> purpose would be defeated.

The discussion is about whether we can indeed rely on data race freedom,
or whether we want to handle data races gracefully in some situations.
That applies to both code using locks and code synchronizing
differently.

> Clearly people are better off using atomic comparisons on canary values
> instead, but it seems easy to avoid false positives (returning 0 when
> memory is clearly different) and keep these things working, so why not
> do it?

I see two separate issues here.  First, where do we draw the line, and
what do we guarantee?  I strongly believe that programs must not have
data races, and that they should use atomics or other synchronization
properly (this doesn't necessarily mean locks; relaxed memory order
atomics can be enough, for example).
Second, do we try to keep buggy programs working?  If it has no cost to
do so (as might be true in this case), then doing that can help trigger
fewer errors.  But that doesn't mean the buggy programs shouldn't get
fixed eventually.
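
On the first point, here is a minimal sketch of what a race-free canary
check could look like with C11 relaxed atomics (the canary layout and
the names are made up for illustration; this is not code from any
existing project):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CANARY_WORDS 4

/* Storing the canary as atomic words makes concurrent updates and
   checks free of data races.  */
static _Atomic unsigned long canary[CANARY_WORDS];

/* Compare the canary word by word using relaxed atomic loads; relaxed
   ordering is enough to avoid the data race without adding fences to
   the check.  */
static bool
canary_matches (const unsigned long expected[CANARY_WORDS])
{
  for (size_t i = 0; i < CANARY_WORDS; i++)
    if (atomic_load_explicit (&canary[i], memory_order_relaxed)
        != expected[i])
      return false;
  return true;
}

int
main (void)
{
  unsigned long expect[CANARY_WORDS] = { 0xdeadbeefUL, 1, 2, 3 };
  for (size_t i = 0; i < CANARY_WORDS; i++)
    atomic_store_explicit (&canary[i], expect[i], memory_order_relaxed);
  return canary_matches (expect) ? 0 : 1;
}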

