Bad codegen in pthread_mutex causing 100% cpu spin loop related to inline asm ilockcmpexchg

Fri May 29 01:56:00 GMT 2009

Christopher Faylor wrote:
> 
> I got these functions from some well-known source like linux, glibc,
> uclibc or freebsd.  We can just go back to the same well and be assured
> that we are getting something that is tested.
> 
> cgf

  I did do a bit of looking around before I posted.  No two sources appear to
quite agree on how to represent cmpxchgl's constraints!

  Here's glibc:

http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=sysdeps/i386/i486/bits/atomic.h

  79 # define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
  80   ({ __typeof (*mem) ret;
     \
  81      __asm __volatile (LOCK_PREFIX "cmpxchgl %2, %1"
     \
  82                        : "=a" (ret), "=m" (*mem)
     \
  83                        : "r" (newval), "m" (*mem), "0" (oldval));
     \
  84      ret; })
  85 #endif

  This is from Hans Boehm's "atomic_ops" library; note that because the
operation is composed with a setz here, it ducks out of the problem of how to
use %eax both coming and going.

http://www.hpl.hp.com/research/linux/qprof/download/atomic_ops-0.5.tar.gz

/* Returns nonzero if the comparison succeeded. */
AO_INLINE int
AO_compare_and_swap_full(volatile AO_t *addr,
		  	     AO_t old, AO_t new_val)
{
  char result;
  __asm__ __volatile__("lock; cmpxchgl %2, %0; setz %1"
	    	       : "+m"(*(addr)), "=q"(result)
		       : "r" (new_val), "a"(old) : "memory");
  return (int) result;
}

  Note however that they both agree that the output operand should be the
dereferenced pointer, not the pointer itself, and that they both use 'm'
constraints.  One passes the mem as an output r-w operand; the other passes it
as an output r-o operand and also as an input operand separtely.  This should
be the same thing, according to
http://gcc.gnu.org/ml/java/2002-01/msg00249.html, but according to the
follow-up at http://gcc.gnu.org/ml/java/2002-01/msg00251.html, that's not a
good idea.  Also, one clobbers memory, the other doesn't.

  This is from linux-2.6.8-1:

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
				      unsigned long new, int size)
{
	unsigned long prev;
	switch (size) {
              .  .  .  .  .
	case 4:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	}
	return old;
}

  It treats the mem as an input but not an output, and clobbers memory.

  Here's uClibc-0.9.28:

#define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
  ({ __typeof (*mem) ret;						      \
     __asm __volatile (LOCK_PREFIX "cmpxchgl %2, %1"			      \
		       : "=a" (ret), "=m" (*mem)			      \
		       : "r" (newval), "m" (*mem), "0" (oldval));	      \
     ret; })

  That one seems to agree with glibc.  Here's freebsd head:

static __inline int
atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src)
{
	u_char res;

	__asm __volatile(
	"	" MPLOCKED "		"
	"	cmpxchgl %2,%1 ;	"
	"       sete	%0 ;		"
	"1:				"
	"# atomic_cmpset_int"
	: "=a" (res),			/* 0 */
	  "=m" (*dst)			/* 1 */
	: "r" (src),			/* 2 */
	  "a" (exp),			/* 3 */
	  "m" (*dst)			/* 4 */
	: "memory");

	return (res);
}

#define	ATOMIC_STORE_LOAD(TYPE, LOP, SOP)		\
static __inline u_##TYPE				\
atomic_load_acq_##TYPE(volatile u_##TYPE *p)		\
{							\
	u_##TYPE res;					\
							\
	__asm __volatile(MPLOCKED LOP			\
	: "=a" (res),			/* 0 */		\
	  "=m" (*p)			/* 1 */		\
	: "m" (*p)			/* 2 */		\
	: "memory");					\
							\
	return (res);					\
}							\
							\
/*							\
 * The XCHG instruction asserts LOCK automagically.	\
 */							\
static __inline void					\
atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
{							\
	__asm __volatile(SOP				\
	: "=m" (*p),			/* 0 */		\
	  "+r" (v)			/* 1 */		\
	: "m" (*p));			/* 2 */		\
}							\

ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1",  "xchgl %1,%0");
ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1",  "xchgl %1,%0");

  Hmm, the changelog suggests a reason for using separate input and output
operands rather than a single r/w operand for the dereferenced pointer.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/i386/include/atomic.h

" Stop using the '+' constraint modifier with inline assembly.  The '+'
constraint is actually only allowed for register operands.  Instead, use
separate input and output memory constraints. "

and http://marc.info/?l=linux-kernel&m=107475162200773&w=2 backs it up:

"if you have matching constraints (or read-write
constrants, which are exactly short-hand for matching constraints),
then you *must* have a register alternative."

and

"Finally, the whole thing is pointless.  Given the lvalue semantics, you
get *exactly* the same effect from

	"=m"(x) : "m"(x)

Since this works for any version of gcc, at any optimization level,
on any arbitrarily complex expression, we strongly recommend (ahem)
that code be modified to use this form."

  So I guess I'll see how the generated assembly works with two separate
"=m"/"m" out/in operands.  (It would be nice to convince gcc that *head has
changed without having to clobber memory if that's possible.)

    cheers,
      DaveK