This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: memcmp-sse4.S EqualHappy bug


* Torvald Riegel (triegel@redhat.com) wrote:
> On Mon, 2015-06-22 at 18:43 +0100, Dr. David Alan Gilbert wrote:
> > * Torvald Riegel (triegel@redhat.com) wrote:
> > > On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote:
> > > > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote:
> > > > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote:
> > > > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote:
> > > > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote:
> > > > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> > > > > > > > > * Torvald Riegel (triegel@redhat.com) wrote:
> > > > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > > > > > > > > > is not impacted and I haven't thought about how to do that.
> > > > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > > Clearly people are better off using atomic comparisons on canary values
> > > > > > > > > > > instead, but it seems easy to avoid false positives (returning 0 when
> > > > > > > > > > > memory is clearly different) and keep these things working, so why not
> > > > > > > > > > > do it?
> > > > > > > > > > 
> > > > > > > > > > I see two separate issues here.  First, where do we draw the line, and
> > > > > > > > > > what do we guarantee.  I strongly believe that programs must not have
> > > > > > > > > > data races, and that they should use atomics or other synchronization
> > > > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > > > > > > > > > example).
> > > > > > > > > > Second, do we try to keep buggy programs working?  If it has no cost to
> > > > > > > > > > do so (as might be true in this case), then doing that can
> > > > > > > > > > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > > > > > > > > > shouldn't get fixed eventually.
> > > > > > > > > 
> > > > > > > > > I find it difficult to understand the boundaries of what the C library
> > > > > > > > > is allowed to do in this type of optimisation.
> > > > > > > > > 
> > > > > > > > > For example, consider the following:
> > > > > > > > > 
> > > > > > > > >     char a[128];
> > > > > > > > >     char b[128];
> > > > > > > > > 
> > > > > > > > >     put some static data in a[0-63]
> > > > > > > > >     put some static data in b[0-63]
> > > > > > > > >     a[64]=0;
> > > > > > > > >     b[64]=0;
> > > > > > > > >     start a thread doing stuff in a[65..]
> > > > > > > > >     start a thread doing stuff in b[65..]
> > > > > > > > >     
> > > > > > > > >     if (!strcmp(a,b)) {
> > > > > > > > >       /* Do something */
> > > > > > > > >     }
> > > > > > > > > 
> > > > > > > > >    a) Is that behaviour defined?
> > > > > > > > 
> > > > > > > > Good question.  I think it should be.  This depends on both the data
> > > > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using
> > > > > > > > the abstract machine.
> > > > > > > > 
> > > > > > > > The data race definition uses memory locations as granularity, which is
> > > > > > > > in 3.14 in C11.  Separate characters in an array should be separate
> > > > > > > > memory locations.
> > > > > > > > 
> > > > > > > > C11 isn't very specific regarding strcmp, and just says that it
> > > > > > > > "compares the string[s]" (7.24.4.2).  C++14 is a bit more specific
> > > > > > > > regarding basic_string::compare (21.4.7.9), saying that first the lengths
> > > > > > > > of the strings are determined, and then a strncmp is run using the
> > > > > > > > smaller of the two lengths.
> > > > > > > > 
> > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be
> > > > > > > > accessed by the abstract machine.  So no data race with the stuff going
> > > > > > > > on in [65..).
> > > > > > > > 
> > > > > > > > >    b) Is it defined with strncmp(a,b,64) ?
> > > > > > > > 
> > > > > > > > Yes.
> > > > > > > > 
> > > > > > > > >    c) Is it defined with strncmp(a,b,128)?
> > > > > > > > 
> > > > > > > > Not sure.  C11 says that not more than "n characters" are compared, and
> > > > > > > > characters that follow a null character aren't compared either.
> > > > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in this
> > > > > > > > particular case.
> > > > > > > > Regarding C++11, I'm not sure.  The closest match copies a substring
> > > > > > > > (conceptually) and then compares, but the substring copying has to
> > > > > > > > determine the length of the string and then clamp it to the max length.
> > > > > > > > This would do a strlen first, which wouldn't access past index 64.
> > > > > > > > Thus, it should be fine too.
> > > > > > > > 
> > > > > > > > >    d) Is it defined with memcmp(a,b,64)?
> > > > > > > > 
> > > > > > > > No data race, IMO.
> > > > > > > > 
> > > > > > > > >    e) Is it defined with memcmp(a,b,128)?
> > > > > > > > 
> > > > > > > > Data race.  Undefined behavior.
> > > > > > > > 
> > > > > > > > >    f) If I moved that boundary off a nice round % 8 mark would
> > > > > > > > >       it matter?
> > > > > > > > 
> > > > > > > > No difference as far as the standard is concerned.
> > > > > > > > 
> > > > > > > > > I can imagine there may be lots of things that terminate
> > > > > > > > > a string and let other stuff happen in the allocated space
> > > > > > > > > after the end of the string, in the belief that everything
> > > > > > > > > past the end of that string is unknown.   Andrea's case is a bit different
> > > > > > > > > in that it's the later data that's static, but that doesn't
> > > > > > > > > sound like it should change the answer as to what's allowed.
> > > > > > > > 
> > > > > > > > I think it does, because the question is whether there is a data race on
> > > > > > > > the memory locations that the abstract machine would access.
> > > > > > > 
> > > > > > > Well, given that we are making examples:
> > > > > > > 
> > > > > > > Assume 2 structures like this:
> > > > > > > 
> > > > > > > struct test {
> > > > > > >     void *pointer;
> > > > > > >     char start[16];
> > > > > > >     char end[240];
> > > > > > > };
> > > > > > > 
> > > > > > > and
> > > > > > > 
> > > > > > > struct test a;
> > > > > > > struct test b;
> > > > > > > 
> > > > > > > memset(a.end, 1, 240);
> > > > > > > memset(b.end, 2, 240);
> > > > > > > 
> > > > > > > In what case would it be expected/legal for
> > > > > > > memcmp(a.start, b.start, 256) to ever return 0?
> > > > > > 
> > > > > > Just to be clear, if there are data races in [ab].pointer or [ab].start
> > > > > > I see absolutely no problem in stopping short and returning a
> > > > > > positive/negative value.  I just do not see how returning 0 would ever
> > > > > > make sense.
> > > > > 
> > > > > It would be legal if the invocation of memcmp and at least one of the
> > > > > invocations of memset were concurrent (i.e., not ordered by
> > > > > happens-before).
> > > > 
> > > > Nope, sorry, there is a misunderstanding: consider the memsets part of
> > > > initialization, *never* run in parallel with memcmp.
> > > 
> > > Ah, I see.  I'm not really sure what the rules are in this case; my
> > > guess would be that this is not UB, because the access is not beyond the
> > > object that is struct test.
> > > 
> > > Perhaps this survey is interesting for you:  https://goo.gl/2TY0H5
> > > 
> > > > > I do see that this wouldn't be an intuitive outcome if one reasons based
> > > > > on assuming some simplified (or straightforward, or natural, or whatever
> > > > > the personal opinion might be...) implementation.  But a data race is
> > > > > undefined behavior, and that's where we're drawing the line.
> > > > 
> > > > Sure, I can understand that if the *whole* area under consideration is
> > > > racing.  But here we are talking about a large chunk of the area that is
> > > > different and *never* raced upon.
> > > 
> > > Well, undefined behavior is not constrained to some "neighborhood" or
> > > such around the cause of the undefined behavior.
> > > It makes sense that the standard doesn't constrain it, because then it
> > > would have to guarantee the constraints and those would require
> > > reasoning about a particular implementation, which a language standard
> > > is meant to prevent :)
> > > Consider a hypothetical GPU that wants to run C code.  It accepts code
> > > in a virtual ISA that requires using special atomic operations for all
> > > memory accesses that would be data races otherwise.  The implementation
> > > of this GPU uses a fast JIT compiler to transform to the native ISA of
> > > the GPU.  Now this JIT compiler detects in a program that a whole page
> > > is compared using memcmp (or a custom memcmp using non-atomics).  It
> > > decides to move this page into local memory, assuming because of the
> > > data-race-freedom requirement that no other thread will write to the
> > > page.  After the memcmp, it will move it back.  But your program isn't
> > > data-race-free, and another thread writes to one byte in the page.  This
> > > byte is then lost when the page is moved back from local memory.
> > > 
> > > This is a made-up example (although such a virtual ISA is not
> > > unrealistic) and maybe this implementation would try to fail gracefully
> > > instead of just losing the update.  But I think it helps show that it
> > > isn't quite clear in general which constraints would be realistic for
> > > undefined behavior.
> > 
> > Isn't that illegal, since the parameters to memcmp are 'const void *',
> > so the memory shouldn't be written?
> 
> That's what your program guarantees, but it's not automatically a
> constraint on what the implementation does.  If the compiler can detect
> that the memory is completely managed and provided by the implementation
> (e.g., never accessed through volatile accesses, never escaping to
> functions the compiler isn't sure won't access it through
> volatile, ...), then the additional writes aren't visible in terms of
> observable behavior of the abstract machine.

However, this is a library interface; the implementation of memcmp in the
library can't get that much information from the compiler, so the standard
memcmp implementation can't do that write: given that its parameters are
const, the address space it's comparing may be read-only.
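
(As a concrete illustration - a minimal POSIX sketch of my own, not code
from this thread, with a made-up helper name - the buffers handed to memcmp
may legitimately live in a read-only mapping, so any store done inside
memcmp would simply fault:)

    #include <string.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Compare 'len' bytes of a file against 'buf'.  The file is mapped
       PROT_READ, so a memcmp implementation that wrote through its
       'const void *' arguments would take a SIGSEGV here. */
    int compare_to_file(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
            return -1;
        int r = memcmp(p, buf, len);
        munmap(p, len);
        return r;
    }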

> > Going back to an earlier part of the thread; I came up with a reason
> > why I really wouldn't want these routines to 'assert()'.
> 
> So, put generally, you don't want to fail fast because there is code
> that is not data-race-free but seems to work right now?
> I think that this would be a convincing reason.

:-)  It's not a 'seems to work' - the design of both of these algorithms
specifically allows data races to happen, with the intent of containing
the effect of the races.  The racy reads are expected to produce indeterminate
data, but the algorithm knows that if that happens the data will always be read
again in the future, at a time when a race can't happen.  This is significantly
faster than trying to stop all the other threads whenever the read is
taking place.
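
(For readers following along: a classic instance of this "contain the race,
re-read later" pattern is a seqlock-style reader.  The sketch below is my
own illustration in C11 atomics, not the code under discussion, and the
names seq and payload are made up.  A torn read of payload is possible, but
a changed or odd sequence number forces the data to be read again:)

    #include <stdatomic.h>

    extern atomic_uint seq;           /* even: stable, odd: write in progress */
    extern _Atomic int payload[16];

    void read_payload(int out[16])
    {
        unsigned s1, s2;
        do {
            s1 = atomic_load_explicit(&seq, memory_order_acquire);
            for (int i = 0; i < 16; i++)   /* may race with a writer */
                out[i] = atomic_load_explicit(&payload[i],
                                              memory_order_relaxed);
            atomic_thread_fence(memory_order_acquire);
            s2 = atomic_load_explicit(&seq, memory_order_relaxed);
        } while (s1 != s2 || (s1 & 1));    /* torn or mid-write: retry */
    }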

> > In emulation/virtualisation we generally haven't got a clue what our
> > guests are doing, and we don't want to stop them most of the time.
> > So there are a couple of failry common types of structure that
> > break these rules:
> > 
> > 1) (e.g. framebuffer emulation)
> >   (Many untrusted threads writing to memory)
> >   One control thread
> >      do {
> >        read memory -> display it
> >      } while (!end)
> >      barrier;
> >      display it once more
> > 
> >   That 'display it' can be quite reasonably expected to get
> > inconsistent data, but at the higher level it will be safe because
> > we know we're always going to do one more 'display it' at the end
> > when things settle down.
> 
> Using atomic (memory_order_relaxed) accesses for reads of "memory" (and
> "end", I suppose) would clean this up.

  1a) Note that this data isn't necessarily word aligned etc. - so I'm not
      sure why using atomic accesses would help here.  Also, my problem here
      is that I have no efficient C library call to use that is legal,
      because by your definition of 'undefined' any library call might
      assert or do anything else - even if I don't care about the data
      it produces.
>  The barrier is a
> memory_order_acquire fence, I suppose?
  1b) Actually I've not found the barrier yet; I was just boiling these
      examples down from code I know we've got - I'm more familiar with
      the 2nd case.
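
(If I've understood the atomic-accesses suggestion for case 1, it would
look something like the sketch below - my own illustration in C11 atomics,
with made-up names (fb, end, display), not our actual code.  Byte-sized
relaxed loads need no alignment beyond a byte, and each racy read becomes
defined behavior under the model:)

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    extern _Atomic unsigned char fb[640 * 480];            /* "memory" */
    extern atomic_bool end;
    void display(const unsigned char *frame, size_t len);  /* "display it" */

    void display_loop(void)
    {
        static unsigned char frame[640 * 480];
        do {
            /* Racy but defined: each byte load is a relaxed atomic. */
            for (size_t i = 0; i < sizeof frame; i++)
                frame[i] = atomic_load_explicit(&fb[i], memory_order_relaxed);
            display(frame, sizeof frame);
        } while (!atomic_load_explicit(&end, memory_order_acquire));
        /* Writers have stopped: one more pass gives a consistent frame. */
        for (size_t i = 0; i < sizeof frame; i++)
            frame[i] = atomic_load_explicit(&fb[i], memory_order_relaxed);
        display(frame, sizeof frame);
    }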

> > similarly.
> > 2) (VM migration)
> > 
> >    Many untrusted threads
> >       do {
> >         n=some-value;
> >         change some data[n];
> >         write_barrier();
> >         atomically_set(dirty[n]);
> 
> Same as above, using relaxed atomic accesses for data and a
> memory_order_release store for setting dirty (or use relaxed MO and keep
> the barrier / release fence) would clean this up.

   Note the thing changing data[n] is totally out of the control of our
code; it's just a normal application, userland code, that we've got no control
over at all - it's using normal writes, and it could be malicious.
 The marking of dirty[n] is something that the kernel does for us, and we
do get control over that.

> >       }
> > 
> >     One control thread
> >       do {
> >         for (all i) {
> >           read_barrier();
> 
> Hmm, why is the barrier here, not between the loads from dirty and data?

Yes, that probably would be best.
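
(Something like this, I suppose - a minimal sketch of my own in C11
atomics, with made-up names (dirty, data, send_page), not the real
migration code.  The acquire load of dirty[i] pairs with the writer's
release store, so the reads of data[i] that follow see at least the writes
made before the flag was set; the writer may of course dirty the page again
while we send it, which is exactly what the final pass catches:)

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define NPAGES 1024
    extern atomic_bool dirty[NPAGES];
    extern unsigned char data[NPAGES][4096];
    void send_page(const unsigned char *page, size_t len);  /* the 'send' */

    void scan_once(void)
    {
        for (size_t i = 0; i < NPAGES; i++) {
            /* Acquire pairs with the release store that set dirty[i]. */
            if (atomic_exchange_explicit(&dirty[i], false,
                                         memory_order_acquire))
                send_page(data[i], sizeof data[i]);
        }
    }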

> >           if (atomic_read(dirty[i])) {
> >               send(data[n]);
> >           }
> >         }
> >       } while (!end)
> >       barrier();
> >       for (all i) {
> >         if (atomic_read(dirty[i])) {
> >             send(data[n]);
> >         }
> >       }
> > 
> >    and often that 'send' process involves looking for compressible data;
> > I had a look and I don't think we're currently using libc routines
> > in there on the data for the comparison in (2), but I'm not 100% sure
> > about (1).
> > 
> > In these cases we expect the data to be inconsistent for some period of
> > time, but we can guarantee consistency since we know we're going to read it
> > again at some better time.  We have no sensible way to stop all the
> > guest threads writing to memory.
> 
> The uncontrolled writes of the guest threads are not really covered by
> the memory model, but if we assume that they are effectively a
> collection of memory_order_relaxed atomic stores of some granularity,
> then you can "avoid" the data race if using relaxed atomic loads at the
> consumer side too.  And then it's just using compatible fences and
> atomic accesses for "end" and "dirty" to make this work cleanly.

Again my problem, and why I chimed in on this thread, is the use of C library
routines; if I want to copy that data[n] as part of the send, I'd expect
to be able to use a standard memcpy() call, and if I want to see whether it's
the same as the previous version I sent, I'd want memcmp() - hence the reason
I mention it.  (Actually, the way our compression of data[n] works, the
comparison in this case doesn't quite fit a memcmp or memchr.)
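
(By way of illustration only - a race-free byte-wise copy is easy to write
under the C11 model, it's just not the optimised routine I actually want.
This sketch is mine, not proposed glibc code:)

    #include <stdatomic.h>
    #include <stddef.h>

    /* Copy n bytes that other threads may be writing concurrently.
       Each load is a relaxed atomic, so there is no data race; the
       bytes may still be a mix of old and new values, which the caller
       tolerates because it will read them again later. */
    void racy_copy(unsigned char *dst, _Atomic unsigned char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = atomic_load_explicit(&src[i], memory_order_relaxed);
    }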

In short, I want the benefit of the highly optimised library routines,
knowing that the actual results they return for data that's changing under
them might be wrong, but with the knowledge that they're not going to blow up.
Hence why I changed my mind and now say assert() is unreasonable.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

