This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.



[rx-sim]: add cycle accuracy


This is a rather large but single-directory patch which makes the RX
simulator cycle-accurate.  Well, mostly cycle accurate anyway - it's
within a small fraction of a percent compared to real hardware, on
large benchmarks.  There are some speedups and documentation included,
too.  OK to commit?

I also built it with -Wall -Werror.

There doesn't seem to be an rx-specific sim maintainer.  Since I wrote
it, should I be the maintainer?


	* README.txt: New.
	* config.in (CYCLE_ACCURATE, CYCLE_STATS): New.
	* configure.in (--enable-cycle-accurate, --enable-cycle-stats):
	New.  Default to enabled.
	* configure: Regenerate.

	* cpu.h (regs_type): Add cycle tracking info.
	(reset_pipeline_stats): Declare.
	(halt_pipeline_stats): Declare.
	(pipeline_stats): Declare.
	* main.c (done): Call pipeline_stats().
	* mem.h (rx_mem_ptr): Moved to here ...
	* mem.c (mem_ptr): ... from here.  Rename throughout.
	(mem_put_byte): Move LEDs to Port A.  Add Port B to control cycle
	statistics.  Move UART to SCI4.
	(mem_put_hi): Add TPU 1-2.  TPU 1 and 2 count CPU cycles.
	* reg.c (init_regs): Set Rt reg to -1 (no reg).
	* rx.c: Add cycle counting and statistics throughout.
	(rx_get_byte): Optimize for speed.
	(decode_opcode): Likewise.
	(reset_pipeline_stats): New.
	(halt_pipeline_stats): New.
	(pipeline_stats): New.
	* trace.c (sim_disasm_one): Print cycle count.
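
For reviewers who want the branch timing at a glance, here is a minimal
sketch of the rules the rx.c changes apply to RXO_branch and
RXO_branchrel.  The helper names (branch_cycles, spans_8_byte_boundary)
are hypothetical and exist only for illustration; in the patch itself
the logic is inline in decode_opcode, and the alignment flag is cleared
when the next insn is decoded rather than by the branch.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles charged for a conditional branch.  A branch that is not
   taken costs 1 cycle.  A taken branch that goes forward by fewer
   than 16 bytes, with an opcode longer than one byte, costs 2.  Any
   other taken branch costs 3 and arms the alignment penalty for the
   next insn.  */
static int
branch_cycles (int taken, int32_t delta, int opcode_size,
	       int *alignment_penalty)
{
  assert (alignment_penalty != 0);
  *alignment_penalty = 0;
  if (!taken)
    return 1;
  if (delta >= 0 && delta < 16 && opcode_size > 1)
    return 2;
  *alignment_penalty = 1;
  return 3;
}

/* Nonzero if the insn at PC, OPCODE_SIZE bytes long, crosses an
   8-byte boundary.  This is the condition under which an armed
   alignment penalty is actually charged.  */
static int
spans_8_byte_boundary (uint32_t pc, int opcode_size)
{
  uint32_t last = pc + (uint32_t) opcode_size - 1;
  return ((pc ^ last) & ~(uint32_t) 7) != 0;
}
```

A 3-byte insn at 0x1006 ends at 0x1008 and so crosses a boundary; an
8-byte insn at 0x1000 does not.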

Index: README.txt
===================================================================
RCS file: README.txt
diff -N README.txt
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ README.txt	28 Jul 2010 02:00:19 -0000
@@ -0,0 +1,121 @@
+The RX simulator offers two rx-specific configure options:
+
+--enable-cycle-accurate  (default)
+--disable-cycle-accurate
+
+If enabled, the simulator will keep track of how many cycles each
+instruction takes.  While not 100% accurate, it is very close,
+including modelling fetch stalls and register latency.
+
+--enable-cycle-stats  (default)
+--disable-cycle-stats
+
+If enabled, specifying "-v" twice on the simulator command line causes
+the simulator to print statistics on how much time was used by each
+type of opcode, and what pairs of opcodes tend to happen most
+frequently, as well as how many times various pipeline stalls
+happened.
+
+
+
+The RX simulator offers many command line options:
+
+-v - verbose output.  This prints some information about where the
+program is being loaded and its starting address, as well as
+information about how much memory was used and how many instructions
+were executed during the run.  If specified twice, pipeline and cycle
+information are added to the report.
+
+-d - disassemble output.  Each instruction executed is printed.
+
+-t - trace output.  Causes a *lot* of printed information about what
+  every instruction is doing, from math results down to register
+  changes.
+
+--ignore-*
+--warn-*
+--error-*
+
+  The RX simulator can detect certain types of memory corruption, and
+  either ignore them, warn the user about them, or error and exit.
+  Note that valid GCC code may trigger some of these, for example,
+  writing a bitfield involves reading the existing value, which may
+  not have been set yet.  The options for * are:
+
+    null-deref - memory access to address zero.  You must modify your
+      linker script to avoid putting anything at location zero, of
+      course.
+
+    unwritten-pages - attempts to read a page of memory (see below)
+      before it is written.  This is much faster than the next option.
+
+    unwritten-bytes - attempts to read individual bytes before they're
+      written.
+
+    corrupt-stack - On return from a subroutine, the memory location
+      where $pc was stored is checked to see if anything other than
+      $pc had been written to it most recently.
+
+-i -w -e - these three options change the settings for all of the
+  above.  For example, "-i" tells the simulator to ignore all memory
+  corruption.
+
+-E - end of options.  Any remaining options (after the program name)
+  are considered to be options for the simulated program, although
+  such functionality is not supported.
+
+
+
+The RX simulator simulates a small number of peripherals, mostly in
+order to provide I/O capabilities for testing and such.  The supported
+peripherals, and their limitations, are documented here.
+
+Memory
+
+Memory for the simulator is stored in a hierarchical tree, much like
+the i386's page directory and page tables.  The simulator can allocate
+memory to individual pages as needed, allowing the simulated program
+to act as if it had a full 4 GB of RAM at its disposal, without
+actually allocating more memory from the host operating system than
+the simulated program actually uses.  Note that for each page of
+memory, there's a corresponding page of memory *types* (for tracking
+memory corruption).  Memory is initially filled with all zeros.
+
+GPIO Port A
+
+PA.DR is configured as an output-only port (regardless of PA.DDR).
+When written to, a row of colored @ and * symbols is printed,
+reflecting a row of eight LEDs being either on or off.
+
+GPIO Port B
+
+PB.DR controls the pipeline statistics.  Writing a 0 to PB.DR disables
+statistics gathering.  Writing a non-0 to PB.DR resets all counters
+and enables (even if already enabled) statistics gathering.  The
+simulator starts with statistics enabled, so writing to PB.DR is not
+needed if you want statistics on the entire program's run.
+
+SCI4
+
+SCI4.TDR is connected to the simulator's stdout.  Any byte written to
+SCI4.TDR is written to stdout.  If the simulated program writes the
+bytes 3, 3, and N in sequence, the simulator exits with an exit value
+of N.
+
+SCI4.SSR always returns "transmitter empty".
+
+
+TPU1.TCNT
+TPU2.TCNT
+
+TPU1 and TPU2 are configured as a chained 32-bit counter which counts
+machine cycles.  It always runs at "ICLK speed", regardless of the
+clock control settings.  Writing to either of these 16-bit registers
+zeros the counter, regardless of the value written.  Reading from
+these registers returns the elapsed cycle count, with TPU1 holding the
+most significant word and TPU2 holding the least significant word.
+
+Note that, much like the hardware, these values may (TPU2.TCNT *will*)
+change between reads, so you must read TPU1.TCNT, then TPU2.TCNT, and
+then TPU1.TCNT again, and only trust the values if both reads of
+TPU1.TCNT were the same.
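
The SCI4 exit convention described in the README (bytes 3, 3, then N)
amounts to a three-state machine.  The following is a minimal sketch of
it, not the simulator code: struct sci4_state and sci4_tdr_write are
hypothetical names, and the sketch leaves out the echo to stdout that
the real SCI4.TDR handler performs.

```c
#include <assert.h>

/* Number of consecutive 3 bytes seen so far (0, 1 or 2).  */
struct sci4_state
{
  int pending_exit;
};

/* Feed one byte written to SCI4.TDR.  Returns 1 and stores the exit
   value in *EXIT_CODE when the byte completes a 3, 3, N sequence;
   returns 0 otherwise.  Any byte other than 3 resets the sequence.  */
static int
sci4_tdr_write (struct sci4_state *s, unsigned char value, int *exit_code)
{
  assert (s != 0 && exit_code != 0);
  if (s->pending_exit == 2)
    {
      *exit_code = value;
      return 1;
    }
  if (value == 3)
    s->pending_exit++;
  else
    s->pending_exit = 0;
  return 0;
}
```

Note that a lone 3 followed by ordinary output does not arm the exit;
the two 3 bytes must be adjacent.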
Index: config.in
===================================================================
RCS file: /cvs/src/src/sim/rx/config.in,v
retrieving revision 1.2
diff -p -U3 -r1.2 config.in
--- config.in	14 Feb 2010 07:37:11 -0000	1.2
+++ config.in	28 Jul 2010 02:00:19 -0000
@@ -105,3 +105,9 @@
 
 /* Define to 1 if you have the ANSI C header files. */
 #undef STDC_HEADERS
+
+/* --enable-cycle-accurate */
+#undef CYCLE_ACCURATE
+
+/* --enable-cycle-stats */
+#undef CYCLE_STATS
Index: configure.in
===================================================================
RCS file: /cvs/src/src/sim/rx/configure.in,v
retrieving revision 1.3
diff -p -U3 -r1.3 configure.in
--- configure.in	14 Feb 2010 07:37:11 -0000	1.3
+++ configure.in	28 Jul 2010 02:00:19 -0000
@@ -25,6 +25,36 @@ AC_CHECK_HEADERS(getopt.h)
 
 sinclude(../common/aclocal.m4)
 
+AC_ARG_ENABLE(cycle-accurate,
+[  --disable-cycle-accurate ],
+[case "${enableval}" in
+yes | no) ;;
+*)	AC_MSG_ERROR(bad value ${enableval} given for --enable-cycle-accurate option) ;;
+esac])
+
+AC_ARG_ENABLE(cycle-stats,
+[  --disable-cycle-stats ],
+[case "${enableval}" in
+yes | no) ;;
+*)	AC_MSG_ERROR(bad value ${enableval} given for --enable-cycle-stats option) ;;
+esac])
+
+echo enable_cycle_accurate is $enable_cycle_accurate
+echo enable_cycle_stats is $enable_cycle_stats
+
+if test "x${enable_cycle_accurate}" != xno; then
+AC_DEFINE([CYCLE_ACCURATE])
+
+  if test "x${enable_cycle_stats}" != xno; then
+  AC_DEFINE([CYCLE_STATS])
+  fi
+else
+  if test "x${enable_cycle_stats}" != xno; then
+  AC_MSG_ERROR([cycle-stats not available without cycle-accurate])
+  fi
+fi
+
+
 # Bugs in autoconf 2.59 break the call to SIM_AC_COMMON, hack around
 # it by inlining the macro's contents.
 sinclude(../common/common.m4)
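
The configure logic above only allows CYCLE_STATS when CYCLE_ACCURATE
is also defined.  As a rough illustration of what that means on the C
side, here is a small sketch of the same constraint expressed in the
preprocessor, together with the STATS() wrapper pattern that rx.c uses.
The two #defines at the top stand in for config.h with both options
enabled, and charge_register_stall is a hypothetical helper, not a
function from the patch.

```c
#include <assert.h>

/* Assumption: stands in for config.h with both options enabled,
   which is the configure default.  */
#define CYCLE_ACCURATE 1
#define CYCLE_STATS 1

/* Same rule configure enforces: statistics need cycle counting.  */
#if defined (CYCLE_STATS) && !defined (CYCLE_ACCURATE)
#error "cycle-stats not available without cycle-accurate"
#endif

/* STATS(x) expands to x when statistics are built in, and to
   nothing otherwise, so call sites need no #ifdef of their own.  */
#ifdef CYCLE_STATS
#define STATS(x) x
#else
#define STATS(x)
#endif

static unsigned long long cycle_count;
static unsigned long long register_stalls;

/* Charge one stall cycle and, if statistics are compiled in,
   count the stall as well.  */
static void
charge_register_stall (void)
{
#ifdef CYCLE_ACCURATE
  cycle_count++;
  STATS (register_stalls++);
#endif
}
```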
Index: cpu.h
===================================================================
RCS file: /cvs/src/src/sim/rx/cpu.h,v
retrieving revision 1.2
diff -p -U3 -r1.2 cpu.h
--- cpu.h	1 Jan 2010 10:03:33 -0000	1.2
+++ cpu.h	28 Jul 2010 02:00:19 -0000
@@ -76,8 +76,24 @@ typedef struct
   SI r_temp;
 
   DI r_acc;
+
+#ifdef CYCLE_ACCURATE
+  /* If set, RTS/RTSD take 2 fewer cycles.  */
+  char fast_return;
+  SI link_register;
+
+  unsigned long long cycle_count;
+  /* Bits saying what kind of memory operands the previous insn had.  */
+  int m2m;
+  /* Target register for load. */
+  int rt;
+#endif
 } regs_type;
 
+#define M2M_SRC		0x01
+#define M2M_DST		0x02
+#define M2M_BOTH	0x03
+
 #define sp	0
 #define psw	16
 #define	pc	17
@@ -219,6 +235,9 @@ extern unsigned int heaptop;
 extern unsigned int heapbottom;
 
 extern int decode_opcode (void);
+extern void reset_pipeline_stats (void);
+extern void halt_pipeline_stats (void);
+extern void pipeline_stats (void);
 
 extern void trace_register_changes ();
 extern void generate_access_exception (void);
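
To make the purpose of the new m2m field and the M2M_* bits a little
more concrete, here is a hedged sketch of the memory-to-memory stall as
I read it from the get_op/put_op changes.  struct pipe and both
function names are hypothetical.  The retire_insn step is an
assumption: the code that records the previous insn's operand kinds is
not shown in this part of the patch.

```c
#include <assert.h>

#define M2M_SRC  0x01
#define M2M_DST  0x02
#define M2M_BOTH 0x03

struct pipe
{
  int m2m;			/* memory operand kinds of previous insn */
  unsigned long long cycle_count;
  unsigned long long memory_stalls;
};

/* Called when the current insn touches a memory operand.  If the
   previous insn had both a memory source and a memory destination,
   charge one stall cycle, once.  */
static void
memory_operand_access (struct pipe *p)
{
  assert (p != 0);
  if (p->m2m == M2M_BOTH)
    {
      p->cycle_count++;
      p->memory_stalls++;
      p->m2m = 0;
    }
}

/* Assumption: when an insn finishes, remember which kinds of memory
   operands it used so the next insn can check them.  */
static void
retire_insn (struct pipe *p, int memory_source, int memory_dest)
{
  assert (p != 0);
  p->m2m = (memory_source ? M2M_SRC : 0) | (memory_dest ? M2M_DST : 0);
}
```

Clearing m2m after the first stall keeps an insn with two memory
operands from being charged twice.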
Index: main.c
===================================================================
RCS file: /cvs/src/src/sim/rx/main.c,v
retrieving revision 1.3
diff -p -U3 -r1.3 main.c
--- main.c	14 Feb 2010 07:37:11 -0000	1.3
+++ main.c	28 Jul 2010 02:00:19 -0000
@@ -82,6 +82,8 @@ done (int exit_code)
 	printf ("insns: %14s\n", comma (rx_cycles));
       else
 	printf ("insns: %u\n", rx_cycles);
+
+      pipeline_stats ();
     }
   exit (exit_code);
 }
Index: mem.c
===================================================================
RCS file: /cvs/src/src/sim/rx/mem.c,v
retrieving revision 1.2
diff -p -U3 -r1.2 mem.c
--- mem.c	1 Jan 2010 10:03:33 -0000	1.2
+++ mem.c	28 Jul 2010 02:00:19 -0000
@@ -25,6 +25,7 @@ along with this program.  If not, see <h
    1.  */
 #define RDCHECK 0
 
+#include "config.h"
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -37,7 +38,7 @@ along with this program.  If not, see <h
 
 #define L1_BITS  (10)
 #define L2_BITS  (10)
-#define OFF_BITS (12)
+#define OFF_BITS PAGE_BITS
 
 #define L1_LEN  (1 << L1_BITS)
 #define L2_LEN  (1 << L2_BITS)
@@ -70,15 +71,8 @@ init_mem (void)
   memset (mem_counters, 0, sizeof (mem_counters));
 }
 
-enum mem_ptr_action
-{
-  MPA_WRITING,
-  MPA_READING,
-  MPA_CONTENT_TYPE
-};
-
-static unsigned char *
-mem_ptr (unsigned long address, enum mem_ptr_action action)
+unsigned char *
+rx_mem_ptr (unsigned long address, enum mem_ptr_action action)
 {
   int pt1 = (address >> (L2_BITS + OFF_BITS)) & ((1 << L1_BITS) - 1);
   int pt2 = (address >> OFF_BITS) & ((1 << L2_BITS) - 1);
@@ -240,7 +234,7 @@ e ()
 static char
 mtypec (int address)
 {
-  unsigned char *cp = mem_ptr (address, MPA_CONTENT_TYPE);
+  unsigned char *cp = rx_mem_ptr (address, MPA_CONTENT_TYPE);
   return "udp"[*cp];
 }
 
@@ -254,48 +248,75 @@ mem_put_byte (unsigned int address, unsi
 
   if (trace)
     tc = mtypec (address);
-  m = mem_ptr (address, MPA_WRITING);
+  m = rx_mem_ptr (address, MPA_WRITING);
   if (trace)
     printf (" %02x%c", value, tc);
   *m = value;
   switch (address)
     {
-    case 0x00e1:
-      {
+    case 0x0008c02a: /* PA.DR */
+     {
 	static int old_led = -1;
-	static char *led_on[] =
-	  { "\033[31m O ", "\033[32m O ", "\033[34m O " };
-	static char *led_off[] = { "\033[0m · ", "\033[0m · ", "\033[0m · " };
+	int red_on = 0;
 	int i;
+
 	if (old_led != value)
 	  {
-	    fputs ("  ", stdout);
-	    for (i = 0; i < 3; i++)
+	    fputs (" ", stdout);
+	    for (i = 0; i < 8; i++)
 	      if (value & (1 << i))
-		fputs (led_off[i], stdout);
+		{
+		  if (! red_on)
+		    {
+		      fputs ("\033[31m", stdout);
+		      red_on = 1;
+		    }
+		  fputs (" @", stdout);
+		}
 	      else
-		fputs (led_on[i], stdout);
-	    fputs ("\033[0m\r", stdout);
+		{
+		  if (red_on)
+		    {
+		      fputs ("\033[0m", stdout);
+		      red_on = 0;
+		    }
+		  fputs (" *", stdout);
+		}
+
+	    if (red_on)
+	      fputs ("\033[0m", stdout);
+
+	    fputs ("\r", stdout);
 	    fflush (stdout);
 	    old_led = value;
 	  }
       }
       break;
 
-    case 0x3aa: /* uart1tx */
+#ifdef CYCLE_STATS
+    case 0x0008c02b: /* PB.DR */
       {
-	static int pending_exit = 0;
 	if (value == 0)
+	  halt_pipeline_stats ();
+	else
+	  reset_pipeline_stats ();
+      }
+#endif
+
+    case 0x00088263: /* SCI4.TDR */
+      {
+	static int pending_exit = 0;
+	if (pending_exit == 2)
 	  {
-	    if (pending_exit)
-	      {
-		step_result = RX_MAKE_EXITED(value);
-		return;
-	      }
-	    pending_exit = 1;
+	    step_result = RX_MAKE_EXITED(value);
+	    longjmp (decode_jmp_buf, 1);
 	  }
+	else if (value == 3)
+	  pending_exit ++;
 	else
-	  putchar(value);
+	  pending_exit = 0;
+
+	putchar(value);
       }
       break;
 
@@ -314,19 +335,33 @@ mem_put_qi (int address, unsigned char v
   COUNT (1, 1);
 }
 
+static int tpu_base;
+
 void
 mem_put_hi (int address, unsigned short value)
 {
   S ("<=");
-  if (rx_big_endian)
-    {
-      mem_put_byte (address, value >> 8);
-      mem_put_byte (address + 1, value & 0xff);
-    }
-  else
+  switch (address)
     {
-      mem_put_byte (address, value & 0xff);
-      mem_put_byte (address + 1, value >> 8);
+#ifdef CYCLE_ACCURATE
+    case 0x00088126: /* TPU1.TCNT */
+      tpu_base = regs.cycle_count;
+      break;
+    case 0x00088136: /* TPU2.TCNT */
+      tpu_base = regs.cycle_count;
+      break;
+#endif
+    default:
+      if (rx_big_endian)
+	{
+	  mem_put_byte (address, value >> 8);
+	  mem_put_byte (address + 1, value & 0xff);
+	}
+      else
+	{
+	  mem_put_byte (address, value & 0xff);
+	  mem_put_byte (address + 1, value >> 8);
+	}
     }
   E ();
   COUNT (1, 2);
@@ -388,7 +423,7 @@ mem_put_blk (int address, void *bufptr, 
 unsigned char
 mem_get_pc (int address)
 {
-  unsigned char *m = mem_ptr (address, MPA_READING);
+  unsigned char *m = rx_mem_ptr (address, MPA_READING);
   COUNT (0, 0);
   return *m;
 }
@@ -399,12 +434,12 @@ mem_get_byte (unsigned int address)
   unsigned char *m;
 
   S ("=>");
-  m = mem_ptr (address, MPA_READING);
+  m = rx_mem_ptr (address, MPA_READING);
   switch (address)
     {
-    case 0x3ad: /* uart1c1 */
+    case 0x00088264: /* SCI4.SSR */
       E();
-      return 2; /* transmitter empty */
+      return 0x04; /* transmitter empty */
       break;
     default: 
       if (trace)
@@ -433,15 +468,28 @@ mem_get_hi (int address)
 {
   unsigned short rv;
   S ("=>");
-  if (rx_big_endian)
-    {
-      rv = mem_get_byte (address) << 8;
-      rv |= mem_get_byte (address + 1);
-    }
-  else
+  switch (address)
     {
-      rv = mem_get_byte (address);
-      rv |= mem_get_byte (address + 1) << 8;
+#ifdef CYCLE_ACCURATE
+    case 0x00088126: /* TPU1.TCNT */
+      rv = (regs.cycle_count - tpu_base) >> 16;
+      break;
+    case 0x00088136: /* TPU2.TCNT */
+      rv = (regs.cycle_count - tpu_base) >> 0;
+      break;
+#endif
+
+    default:
+      if (rx_big_endian)
+	{
+	  rv = mem_get_byte (address) << 8;
+	  rv |= mem_get_byte (address + 1);
+	}
+      else
+	{
+	  rv = mem_get_byte (address);
+	  rv |= mem_get_byte (address + 1) << 8;
+	}
     }
   COUNT (0, 2);
   E ();
@@ -520,7 +568,7 @@ sign_ext (int v, int bits)
 void
 mem_set_content_type (int address, enum mem_content_type type)
 {
-  unsigned char *mt = mem_ptr (address, MPA_CONTENT_TYPE);
+  unsigned char *mt = rx_mem_ptr (address, MPA_CONTENT_TYPE);
   *mt = type;
 }
 
@@ -537,7 +585,7 @@ mem_set_content_range (int start_address
       if (sz + ofs > L1_LEN)
 	sz = L1_LEN - ofs;
 
-      mt = mem_ptr (start_address, MPA_CONTENT_TYPE);
+      mt = rx_mem_ptr (start_address, MPA_CONTENT_TYPE);
       memset (mt, type, sz);
 
       start_address += sz;
@@ -547,6 +595,6 @@ mem_set_content_range (int start_address
 enum mem_content_type
 mem_get_content_type (int address)
 {
-  unsigned char *mt = mem_ptr (address, MPA_CONTENT_TYPE);
+  unsigned char *mt = rx_mem_ptr (address, MPA_CONTENT_TYPE);
   return *mt;
 }
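
The TPU1/TPU2 handling in mem_put_hi and mem_get_hi boils down to a
base-relative counter.  Below is a minimal sketch of that idea plus the
high/low/high read sequence the README asks simulated programs to use.
All names are hypothetical stand-ins (cycle_count plays the role of
regs.cycle_count), and the sketch keeps the base as a 64-bit value,
which is a choice made here rather than what the patch does.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t cycle_count;	/* stands in for regs.cycle_count */
static uint64_t tpu_base;	/* cycle count at the last TCNT write */

/* Writing either TCNT register zeros the counter, whatever the
   value written.  */
static void
tpu_write (void)
{
  tpu_base = cycle_count;
}

/* TPU1.TCNT holds the most significant 16 bits of the elapsed count.  */
static uint16_t
tpu1_read (void)
{
  return (uint16_t) ((cycle_count - tpu_base) >> 16);
}

/* TPU2.TCNT holds the least significant 16 bits.  */
static uint16_t
tpu2_read (void)
{
  return (uint16_t) (cycle_count - tpu_base);
}

/* Guest-side read: retry until the high word is the same before and
   after the low word was read, so the two halves belong together.  */
static uint32_t
read_elapsed_cycles (void)
{
  uint16_t hi, lo, hi2;

  do
    {
      hi = tpu1_read ();
      lo = tpu2_read ();
      hi2 = tpu1_read ();
    }
  while (hi != hi2);
  return ((uint32_t) hi << 16) | lo;
}
```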
Index: mem.h
===================================================================
RCS file: /cvs/src/src/sim/rx/mem.h,v
retrieving revision 1.2
diff -p -U3 -r1.2 mem.h
--- mem.h	1 Jan 2010 10:03:33 -0000	1.2
+++ mem.h	28 Jul 2010 02:00:19 -0000
@@ -25,10 +25,25 @@ enum mem_content_type {
      MC_NUM_TYPES
 };
 
+enum mem_ptr_action
+{
+  MPA_WRITING,
+  MPA_READING,
+  MPA_CONTENT_TYPE
+};
+
 void init_mem (void);
 void mem_usage_stats (void);
 unsigned long mem_usage_cycles (void);
 
+/* rx_mem_ptr returns a pointer which is valid as long as the address
+   requested remains within the same page.  */
+#define PAGE_BITS 12
+#define PAGE_SIZE (1 << PAGE_BITS)
+#define NONPAGE_MASK (~(PAGE_SIZE-1))
+
+unsigned char *rx_mem_ptr (unsigned long address, enum mem_ptr_action action);
+
 void mem_put_qi (int address, unsigned char value);
 void mem_put_hi (int address, unsigned short value);
 void mem_put_psi (int address, unsigned long value);
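
Since rx_mem_ptr is now exported with a "valid within the same page"
contract, a sketch of the kind of two-level lookup behind it may help.
This is not the mem.c implementation: it omits the content-type pages
and the access counters, the names are hypothetical, and it only shows
the 10/10/12 bit split with pages allocated on first touch.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_L1_BITS   10
#define SK_L2_BITS   10
#define SK_PAGE_BITS 12
#define SK_PAGE_SIZE (1 << SK_PAGE_BITS)

/* Top level: one slot per 4 MB region, each pointing at a table of
   page pointers.  Everything starts out unallocated.  */
static unsigned char **sk_page_dir[1 << SK_L1_BITS];

/* Return a pointer to the byte at ADDRESS, allocating the second
   level table and the zero-filled page on first use.  The pointer
   may be indexed only up to the end of the same page.  */
static unsigned char *
sketch_mem_ptr (unsigned long address)
{
  unsigned pt1 = (address >> (SK_L2_BITS + SK_PAGE_BITS))
		 & ((1u << SK_L1_BITS) - 1);
  unsigned pt2 = (address >> SK_PAGE_BITS) & ((1u << SK_L2_BITS) - 1);
  unsigned off = address & (SK_PAGE_SIZE - 1);

  if (sk_page_dir[pt1] == NULL)
    sk_page_dir[pt1] = calloc (1 << SK_L2_BITS, sizeof (unsigned char *));
  assert (sk_page_dir[pt1] != NULL);

  if (sk_page_dir[pt1][pt2] == NULL)
    sk_page_dir[pt1][pt2] = calloc (SK_PAGE_SIZE, 1);
  assert (sk_page_dir[pt1][pt2] != NULL);

  return sk_page_dir[pt1][pt2] + off;
}
```

Because consecutive bytes in one page are contiguous on the host, a
caller such as an instruction fetcher can keep the page pointer and
skip the lookup until the address leaves the page.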
Index: reg.c
===================================================================
RCS file: /cvs/src/src/sim/rx/reg.c,v
retrieving revision 1.3
diff -p -U3 -r1.3 reg.c
--- reg.c	8 Jun 2010 09:15:17 -0000	1.3
+++ reg.c	28 Jul 2010 02:00:19 -0000
@@ -19,6 +19,7 @@
    along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
 
 
+#include "config.h"
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -67,6 +68,11 @@ init_regs (void)
 {
   memset (&regs, 0, sizeof (regs));
   memset (&oldregs, 0, sizeof (oldregs));
+
+#ifdef CYCLE_ACCURATE
+  regs.rt = -1;
+  oldregs.rt = -1;
+#endif
 }
 
 static unsigned int
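
The reason init_regs sets rt to -1 rather than leaving the memset zero
is that 0 is a valid register number.  The sketch below shows the
load-latency bookkeeping in the spirit of the RL and RLD macros in
rx.c.  The struct and function names are hypothetical, and the
end_of_insn handoff from new_rt to rt is an assumption, since that
code is not part of what is shown here.

```c
#include <assert.h>

struct pipe_state
{
  int rt;			/* register loaded by previous insn, or -1 */
  int new_rt;			/* register loaded by current insn, or -1 */
  unsigned long long cycle_count;
  unsigned long long register_stalls;
};

static void
init_pipe (struct pipe_state *p)
{
  p->rt = -1;			/* -1 means "no register" */
  p->new_rt = -1;
  p->cycle_count = 0;
  p->register_stalls = 0;
}

/* Like RL: reading the register the previous insn loaded from memory
   costs one extra cycle.  */
static void
read_latency (struct pipe_state *p, int r)
{
  if (p->rt == r)
    {
      p->cycle_count++;
      p->register_stalls++;
      p->rt = -1;
    }
}

/* Like RLD: a register written by an insn with a memory source
   becomes the pending load target.  */
static void
load_target (struct pipe_state *p, int r, int memory_source)
{
  if (memory_source)
    p->new_rt = r;
}

/* Assumption: at the end of each insn the pending target becomes the
   one checked by the next insn.  */
static void
end_of_insn (struct pipe_state *p)
{
  p->rt = p->new_rt;
  p->new_rt = -1;
}
```

With rt left at 0, the very first read of r0 would be charged a stall
that never happened.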
Index: rx.c
===================================================================
RCS file: /cvs/src/src/sim/rx/rx.c,v
retrieving revision 1.4
diff -p -U3 -r1.4 rx.c
--- rx.c	1 Jan 2010 10:03:33 -0000	1.4
+++ rx.c	28 Jul 2010 02:00:19 -0000
@@ -18,6 +18,7 @@ GNU General Public License for more deta
 You should have received a copy of the GNU General Public License
 along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
 
+#include "config.h"
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -29,12 +30,254 @@ along with this program.  If not, see <h
 #include "syscalls.h"
 #include "fpu.h"
 #include "err.h"
+#include "misc.h"
 
-#define tprintf if (trace) printf
+#ifdef CYCLE_STATS
+static const char * id_names[] = {
+  "RXO_unknown",
+  "RXO_mov",	/* d = s (signed) */
+  "RXO_movbi",	/* d = [s,s2] (signed) */
+  "RXO_movbir",	/* [s,s2] = d (signed) */
+  "RXO_pushm",	/* s..s2 */
+  "RXO_popm",	/* s..s2 */
+  "RXO_pusha",	/* &s */
+  "RXO_xchg",	/* s <-> d */
+  "RXO_stcc",	/* d = s if cond(s2) */
+  "RXO_rtsd",	/* rtsd, 1=imm, 2-0 = reg if reg type */
+
+  /* These are all either d OP= s or, if s2 is set, d = s OP s2.  Note
+     that d may be "None".  */
+  "RXO_and",
+  "RXO_or",
+  "RXO_xor",
+  "RXO_add",
+  "RXO_sub",
+  "RXO_mul",
+  "RXO_div",
+  "RXO_divu",
+  "RXO_shll",
+  "RXO_shar",
+  "RXO_shlr",
+
+  "RXO_adc",	/* d = d + s + carry */
+  "RXO_sbb",	/* d = d - s - ~carry */
+  "RXO_abs",	/* d = |s| */
+  "RXO_max",	/* d = max(d,s) */
+  "RXO_min",	/* d = min(d,s) */
+  "RXO_emul",	/* d:64 = d:32 * s */
+  "RXO_emulu",	/* d:64 = d:32 * s (unsigned) */
+  "RXO_ediv",	/* d:64 / s; d = quot, d+1 = rem */
+  "RXO_edivu",	/* d:64 / s; d = quot, d+1 = rem */
+
+  "RXO_rolc",	/* d <<= 1 through carry */
+  "RXO_rorc",	/* d >>= 1 through carry*/
+  "RXO_rotl",	/* d <<= #s without carry */
+  "RXO_rotr",	/* d >>= #s without carry*/
+  "RXO_revw",	/* d = revw(s) */
+  "RXO_revl",	/* d = revl(s) */
+  "RXO_branch",	/* pc = d if cond(s) */
+  "RXO_branchrel",/* pc += d if cond(s) */
+  "RXO_jsr",	/* pc = d */
+  "RXO_jsrrel",	/* pc += d */
+  "RXO_rts",
+  "RXO_nop",
+  "RXO_nop2",
+  "RXO_nop3",
+
+  "RXO_scmpu",
+  "RXO_smovu",
+  "RXO_smovb",
+  "RXO_suntil",
+  "RXO_swhile",
+  "RXO_smovf",
+  "RXO_sstr",
+
+  "RXO_rmpa",
+  "RXO_mulhi",
+  "RXO_mullo",
+  "RXO_machi",
+  "RXO_maclo",
+  "RXO_mvtachi",
+  "RXO_mvtaclo",
+  "RXO_mvfachi",
+  "RXO_mvfacmi",
+  "RXO_mvfaclo",
+  "RXO_racw",
+
+  "RXO_sat",	/* sat(d) */
+  "RXO_satr",
+
+  "RXO_fadd",	/* d op= s */
+  "RXO_fcmp",
+  "RXO_fsub",
+  "RXO_ftoi",
+  "RXO_fmul",
+  "RXO_fdiv",
+  "RXO_round",
+  "RXO_itof",
+
+  "RXO_bset",	/* d |= (1<<s) */
+  "RXO_bclr",	/* d &= ~(1<<s) */
+  "RXO_btst",	/* s & (1<<s2) */
+  "RXO_bnot",	/* d ^= (1<<s) */
+  "RXO_bmcc",	/* d<s> = cond(s2) */
+
+  "RXO_clrpsw",	/* flag index in d */
+  "RXO_setpsw",	/* flag index in d */
+  "RXO_mvtipl",	/* new IPL in s */
+
+  "RXO_rtfi",
+  "RXO_rte",
+  "RXO_rtd",	/* undocumented */
+  "RXO_brk",
+  "RXO_dbt",	/* undocumented */
+  "RXO_int",	/* vector id in s */
+  "RXO_stop",
+  "RXO_wait",
+
+  "RXO_sccnd",	/* d = cond(s) ? 1 : 0 */
+};
+
+static const char * optype_names[] = {
+  "    ",
+  "#Imm",	/* #addend */
+  " Rn ",	/* Rn */
+  "[Rn]",	/* [Rn + addend] */
+  "Ps++",	/* [Rn+] */
+  "--Pr",	/* [-Rn] */
+  " cc ",	/* eq, gtu, etc */
+  "Flag"	/* [UIOSZC] */
+};
+
+#define N_RXO (sizeof(id_names)/sizeof(id_names[0]))
+#define N_RXT (sizeof(optype_names)/sizeof(optype_names[0]))
+#define N_MAP 30
+
+static unsigned long long benchmark_start_cycle;
+static unsigned long long benchmark_end_cycle;
+
+static int op_cache[N_RXT][N_RXT][N_RXT];
+static int op_cache_rev[N_MAP];
+static int op_cache_idx = 0;
+
+static int
+op_lookup (int a, int b, int c)
+{
+  if (op_cache[a][b][c])
+    return op_cache[a][b][c];
+  op_cache_idx ++;
+  if (op_cache_idx >= N_MAP)
+    {
+      printf("op_cache_idx exceeds %d\n", N_MAP);
+      exit(1);
+    }
+  op_cache[a][b][c] = op_cache_idx;
+  op_cache_rev[op_cache_idx] = (a<<8) | (b<<4) | c;
+  return op_cache_idx;
+}
+
+static char *
+op_cache_string (int map)
+{
+  static int ci;
+  static char cb[5][20];
+  int a, b, c;
+
+  map = op_cache_rev[map];
+  a = (map >> 8) & 15;
+  b = (map >> 4) & 15;
+  c = (map >> 0) & 15;
+  ci = (ci + 1) % 5;
+  sprintf(cb[ci], "%s %s %s", optype_names[a], optype_names[b], optype_names[c]);
+  return cb[ci];
+}
+
+static unsigned long long cycles_per_id[N_RXO][N_MAP];
+static unsigned long long times_per_id[N_RXO][N_MAP];
+static unsigned long long memory_stalls;
+static unsigned long long register_stalls;
+static unsigned long long branch_stalls;
+static unsigned long long branch_alignment_stalls;
+static unsigned long long fast_returns;
+
+static unsigned long times_per_pair[N_RXO][N_MAP][N_RXO][N_MAP];
+static int prev_opcode_id = RXO_unknown;
+static int po0;
+
+#define STATS(x) x
+
+#else
+#define STATS(x)
+#endif /* CYCLE_STATS */
+
+
+#ifdef CYCLE_ACCURATE
+
+static int new_rt = -1;
+
+/* Number of cycles to add if an insn spans an 8-byte boundary.  */
+static int branch_alignment_penalty = 0;
+
+#endif
+
+static int running_benchmark = 1;
+
+#define tprintf if (trace && running_benchmark) printf
 
 jmp_buf decode_jmp_buf;
 unsigned int rx_cycles = 0;
 
+#ifdef CYCLE_ACCURATE
+/* If nonzero, memory was read at some point and cycle latency might
+   take effect.  */
+static int memory_source = 0;
+/* If nonzero, memory was written and extra cycles might be
+   needed.  */
+static int memory_dest = 0;
+
+static void
+cycles (int throughput)
+{
+  tprintf("%d cycles\n", throughput);
+  regs.cycle_count += throughput;
+}
+
+/* Number of execution (E) cycles the op uses.  For memory sources, we
+   include the load micro-op stall as two extra E cycles.  */
+#define E(c) cycles (memory_source ? c + 2 : c)
+#define E1 cycles (1)
+#define E2 cycles (2)
+#define EBIT cycles (memory_source ? 2 : 1)
+
+/* Check to see if a read latency must be applied for a given register.  */
+#define RL(r) \
+  if (regs.rt == r )							\
+    {									\
+      tprintf("register %d load stall\n", r);				\
+      regs.cycle_count ++;						\
+      STATS(register_stalls ++);					\
+      regs.rt = -1;							\
+    }
+
+#define RLD(r)					\
+  if (memory_source)				\
+    {						\
+      tprintf ("Rt now %d\n", r);		\
+      new_rt = r;				\
+    }
+
+#else /* !CYCLE_ACCURATE */
+
+#define cycles(t)
+#define E(c)
+#define E1
+#define E2
+#define EBIT
+#define RL(r)
+#define RLD(r)
+
+#endif /* else CYCLE_ACCURATE */
+
 static int size2bytes[] = {
   4, 1, 1, 1, 2, 2, 2, 3, 4
 };
@@ -53,24 +296,28 @@ _rx_abort (const char *file, int line)
   abort();
 }
 
+static unsigned char *get_byte_base;
+static SI get_byte_page;
+
+/* This gets called a *lot* so optimize it.  */
 static int
 rx_get_byte (void *vdata)
 {
-  int saved_trace = trace;
-  unsigned char rv;
-
-  if (trace == 1)
-    trace = 0;
-
   RX_Data *rx_data = (RX_Data *)vdata;
+  SI tpc = rx_data->dpc;
+
+  /* See load.c for an explanation of this.  */
   if (rx_big_endian)
-    /* See load.c for an explanation of this.  */
-    rv = mem_get_pc (rx_data->dpc ^ 3);
-  else
-    rv = mem_get_pc (rx_data->dpc);
+    tpc ^= 3;
+
+  if (((tpc ^ get_byte_page) & NONPAGE_MASK) || enable_counting)
+    {
+      get_byte_page = tpc & NONPAGE_MASK;
+      get_byte_base = rx_mem_ptr (get_byte_page, MPA_READING) - get_byte_page;
+    }
+
   rx_data->dpc ++;
-  trace = saved_trace;
-  return rv;
+  return get_byte_base [tpc];
 }
 
 static int
@@ -88,6 +335,7 @@ get_op (RX_Opcode_Decoded *rd, int i)
       return o->addend;
 
     case RX_Operand_Register:	/* Rn */
+      RL (o->reg);
       rv = get_reg (o->reg);
       break;
 
@@ -96,6 +344,21 @@ get_op (RX_Opcode_Decoded *rd, int i)
       /* fall through */
     case RX_Operand_Postinc:	/* [Rn+] */
     case RX_Operand_Indirect:	/* [Rn + addend] */
+#ifdef CYCLE_ACCURATE
+      RL (o->reg);
+      regs.rt = -1;
+      if (regs.m2m == M2M_BOTH)
+	{
+	  tprintf("src memory stall\n");
+#ifdef CYCLE_STATS
+	  memory_stalls ++;
+#endif
+	  regs.cycle_count ++;
+	  regs.m2m = 0;
+	}
+
+      memory_source = 1;
+#endif
 
       addr = get_reg (o->reg) + o->addend;
       switch (o->size)
@@ -234,6 +497,7 @@ put_op (RX_Opcode_Decoded *rd, int i, in
 
     case RX_Operand_Register:	/* Rn */
       put_reg (o->reg, v);
+      RLD (o->reg);
       break;
 
     case RX_Operand_Predec:	/* [-Rn] */
@@ -242,6 +506,19 @@ put_op (RX_Opcode_Decoded *rd, int i, in
     case RX_Operand_Postinc:	/* [Rn+] */
     case RX_Operand_Indirect:	/* [Rn + addend] */
 
+#ifdef CYCLE_ACCURATE
+      if (regs.m2m == M2M_BOTH)
+	{
+	  tprintf("dst memory stall\n");
+	  regs.cycle_count ++;
+#ifdef CYCLE_STATS
+	  memory_stalls ++;
+#endif
+	  regs.m2m = 0;
+	}
+      memory_dest = 1;
+#endif
+
       addr = get_reg (o->reg) + o->addend;
       switch (o->size)
 	{
@@ -345,8 +622,8 @@ poppc()
 
 #define MATH_OP(vop,c)				\
 { \
-  uma = US1(); \
   umb = US2(); \
+  uma = US1(); \
   ll = (unsigned long long) uma vop (unsigned long long) umb vop c; \
   tprintf ("0x%x " #vop " 0x%x " #vop " 0x%x = 0x%llx\n", uma, umb, c, ll); \
   ma = sign_ext (uma, DSZ() * 8);					\
@@ -355,23 +632,25 @@ poppc()
   tprintf ("%d " #vop " %d " #vop " %d = %lld\n", ma, mb, c, sll); \
   set_oszc (sll, DSZ(), (long long) ll > ((1 vop 1) ? (long long) b2mask[DSZ()] : (long long) -1)); \
   PD (sll); \
+  E (1);    \
 }
 
 #define LOGIC_OP(vop) \
 { \
-  ma = US1(); \
   mb = US2(); \
+  ma = US1(); \
   v = ma vop mb; \
   tprintf("0x%x " #vop " 0x%x = 0x%x\n", ma, mb, v); \
   set_sz (v, DSZ()); \
   PD(v); \
+  E (1); \
 }
 
 #define SHIFT_OP(val, type, count, OP, carry_mask)	\
 { \
   int i, c=0; \
-  val = (type)US1();				\
   count = US2(); \
+  val = (type)US1();				\
   tprintf("%lld " #OP " %d\n", val, count); \
   for (i = 0; i < count; i ++) \
     { \
@@ -443,8 +722,8 @@ fop_fsub (fp_t s1, fp_t s2, fp_t *d)
   int do_store;   \
   fp_t fa, fb, fc; \
   FPCLEAR(); \
-  fa = GD (); \
   fb = GS (); \
+  fa = GD (); \
   do_store = fop_##func (fa, fb, &fc); \
   tprintf("%g " #func " %g = %g %08x\n", int2float(fa), int2float(fb), int2float(fc), fc); \
   FPCHECK(); \
@@ -549,6 +828,21 @@ do_fp_exception (unsigned long opcode_pc
   return RX_MAKE_STEPPED ();
 }
 
+static int
+op_is_memory (RX_Opcode_Decoded *rd, int i)
+{
+  switch (rd->op[i].type)
+    {
+    case RX_Operand_Predec:
+    case RX_Operand_Postinc:
+    case RX_Operand_Indirect:
+      return 1;
+    default:
+      return 0;
+    }
+}
+#define OM(i) op_is_memory (&opcode, i)
+
 int
 decode_opcode ()
 {
@@ -561,14 +855,46 @@ decode_opcode ()
   RX_Data rx_data;
   RX_Opcode_Decoded opcode;
   int rv;
+#ifdef CYCLE_STATS
+  unsigned long long prev_cycle_count;
+#endif
+#ifdef CYCLE_ACCURATE
+  int tx;
+#endif
 
   if ((rv = setjmp (decode_jmp_buf)))
     return rv;
 
+#ifdef CYCLE_STATS
+  prev_cycle_count = regs.cycle_count;
+#endif
+
+#ifdef CYCLE_ACCURATE
+  memory_source = 0;
+  memory_dest = 0;
+#endif
+
   rx_cycles ++;
 
   rx_data.dpc = opcode_pc = regs.r_pc;
+  memset (&opcode, 0, sizeof(opcode));
   opcode_size = rx_decode_opcode (opcode_pc, &opcode, rx_get_byte, &rx_data);
+
+#ifdef CYCLE_ACCURATE
+  if (branch_alignment_penalty)
+    {
+      if ((regs.r_pc ^ (regs.r_pc + opcode_size - 1)) & ~7)
+	{
+	  tprintf("1 cycle branch alignment penalty\n");
+	  cycles (branch_alignment_penalty);
+#ifdef CYCLE_STATS
+	  branch_alignment_stalls ++;
+#endif
+	}
+      branch_alignment_penalty = 0;
+    }
+#endif
+
   regs.r_pc += opcode_size;
 
   rx_flagmask = opcode.flags_s;
@@ -585,6 +911,7 @@ decode_opcode ()
       tprintf("%lld\n", sll);
       PD (sll);
       set_osz (sll, 4);
+      E (1);
       break;
 
     case RXO_adc:
@@ -608,6 +935,7 @@ decode_opcode ()
 	mb &= 0x07;
       ma &= ~(1 << mb);
       PD (ma);
+      EBIT;
       break;
 
     case RXO_bmcc:
@@ -622,6 +950,7 @@ decode_opcode ()
       else
 	ma &= ~(1 << mb);
       PD (ma);
+      EBIT;
       break;
 
     case RXO_bnot:
@@ -633,16 +962,71 @@ decode_opcode ()
 	mb &= 0x07;
       ma ^= (1 << mb);
       PD (ma);
+      EBIT;
       break;
 
     case RXO_branch:
       if (GS())
-	regs.r_pc = GD();
+	{
+#ifdef CYCLE_ACCURATE
+	  SI old_pc = regs.r_pc;
+	  int delta;
+#endif
+	  regs.r_pc = GD();
+#ifdef CYCLE_ACCURATE
+	  delta = regs.r_pc - old_pc;
+	  if (delta >= 0 && delta < 16
+	      && opcode_size > 1)
+	    {
+	      tprintf("near forward branch bonus\n");
+	      cycles (2);
+	    }
+	  else
+	    {
+	      cycles (3);
+	      branch_alignment_penalty = 1;
+	    }
+#ifdef CYCLE_STATS
+	  branch_stalls ++;
+	  /* This is just for statistics */
+	  if (opcode.op[1].reg == 14)
+	    opcode.op[1].type = RX_Operand_None;
+#endif
+#endif
+	}
+#ifdef CYCLE_ACCURATE
+      else
+	cycles (1);
+#endif
       break;
 
     case RXO_branchrel:
       if (GS())
-	regs.r_pc += GD();
+	{
+	  int delta = GD();
+	  regs.r_pc += delta;
+#ifdef CYCLE_ACCURATE
+	  /* Note: specs say 3, chip says 2.  */
+	  if (delta >= 0 && delta < 16
+	      && opcode_size > 1)
+	    {
+	      tprintf("near forward branch bonus\n");
+	      cycles (2);
+	    }
+	  else
+	    {
+	      cycles (3);
+	      branch_alignment_penalty = 1;
+	    }
+#ifdef CYCLE_STATS
+	  branch_stalls ++;
+#endif
+#endif
+	}
+#ifdef CYCLE_ACCURATE
+      else
+	cycles (1);
+#endif
       break;
 
     case RXO_brk:
@@ -659,6 +1043,7 @@ decode_opcode ()
 	pushpc (old_psw);
 	pushpc (regs.r_pc);
 	regs.r_pc = mem_get_si (regs.r_intb);
+	cycles(6);
       }
       break;
 
@@ -671,6 +1056,7 @@ decode_opcode ()
 	mb &= 0x07;
       ma |= (1 << mb);
       PD (ma);
+      EBIT;
       break;
 
     case RXO_btst:
@@ -682,6 +1068,7 @@ decode_opcode ()
 	mb &= 0x07;
       umb = ma & (1 << mb);
       set_zc (! umb, umb);
+      EBIT;
       break;
 
     case RXO_clrpsw:
@@ -691,6 +1078,7 @@ decode_opcode ()
 	      || v == FLAGBIT_U))
 	break;
       regs.r_psw &= ~v;
+      cycles (1);
       break;
 
     case RXO_div: /* d = d / s */
@@ -709,6 +1097,8 @@ decode_opcode ()
 	  set_flags (FLAGBIT_O, 0);
 	  PD (v);
 	}
+      /* Note: spec says 3 to 22 cycles, we are pessimistic.  */
+      cycles (22);
       break;
 
     case RXO_divu: /* d = d / s */
@@ -727,6 +1117,8 @@ decode_opcode ()
 	  set_flags (FLAGBIT_O, 0);
 	  PD (v);
 	}
+      /* Note: spec says 2 to 20 cycles, we are pessimistic.  */
+      cycles (20);
       break;
 
     case RXO_ediv:
@@ -748,6 +1140,8 @@ decode_opcode ()
 	  opcode.op[0].reg ++;
 	  PD (mb);
 	}
+      /* Note: spec says 3 to 22 cycles, we are pessimistic.  */
+      cycles (22);
       break;
 
     case RXO_edivu:
@@ -769,6 +1163,8 @@ decode_opcode ()
 	  opcode.op[0].reg ++;
 	  PD (umb);
 	}
+      /* Note: spec says 2 to 20 cycles, we are pessimistic.  */
+      cycles (20);
       break;
 
     case RXO_emul:
@@ -779,6 +1175,7 @@ decode_opcode ()
       PD (sll);
       opcode.op[0].reg ++;
       PD (sll >> 32);
+      E2;
       break;
 
     case RXO_emulu:
@@ -789,10 +1186,12 @@ decode_opcode ()
       PD (ll);
       opcode.op[0].reg ++;
       PD (ll >> 32);
+      E2;
       break;
 
     case RXO_fadd:
       FLOAT_OP (fadd);
+      E (4);
       break;
 
     case RXO_fcmp:
@@ -801,24 +1200,32 @@ decode_opcode ()
       FPCLEAR ();
       rxfp_cmp (ma, mb);
       FPCHECK ();
+      E (1);
       break;
 
     case RXO_fdiv:
       FLOAT_OP (fdiv);
+      E (16);
       break;
 
     case RXO_fmul:
       FLOAT_OP (fmul);
+      E (3);
       break;
 
     case RXO_rtfi:
       PRIVILEDGED ();
       regs.r_psw = regs.r_bpsw;
       regs.r_pc = regs.r_bpc;
+#ifdef CYCLE_ACCURATE
+      regs.fast_return = 0;
+      cycles(3);
+#endif
       break;
 
     case RXO_fsub:
       FLOAT_OP (fsub);
+      E (4);
       break;
 
     case RXO_ftoi:
@@ -829,6 +1236,7 @@ decode_opcode ()
       PD (mb);
       tprintf("(int) %g = %d\n", int2float(ma), mb);
       set_sz (mb, 4);
+      E (2);
       break;
 
     case RXO_int:
@@ -845,6 +1253,7 @@ decode_opcode ()
 	  pushpc (regs.r_pc);
 	  regs.r_pc = mem_get_si (regs.r_intb + 4 * v);
 	}
+      cycles (6);
       break;
 
     case RXO_itof:
@@ -855,49 +1264,87 @@ decode_opcode ()
       tprintf("(float) %d = %x\n", ma, mb);
       PD (mb);
       set_sz (ma, 4);
+      E (2);
       break;
 
     case RXO_jsr:
     case RXO_jsrrel:
-      v = GD ();
-      pushpc (get_reg (pc));
-      if (opcode.id == RXO_jsrrel)
-	v += regs.r_pc;
-      put_reg (pc, v);
+      {
+#ifdef CYCLE_ACCURATE
+	int delta;
+	regs.m2m = 0;
+#endif
+	v = GD ();
+#ifdef CYCLE_ACCURATE
+	regs.link_register = regs.r_pc;
+#endif
+	pushpc (get_reg (pc));
+	if (opcode.id == RXO_jsrrel)
+	  v += regs.r_pc;
+#ifdef CYCLE_ACCURATE
+	delta = v - regs.r_pc;
+#endif
+	put_reg (pc, v);
+#ifdef CYCLE_ACCURATE
+	/* Note: docs say 3, chip says 2.  */
+	if (delta >= 0 && delta < 16)
+	  {
+	    tprintf ("near forward jsr bonus\n");
+	    cycles (2);
+	  }
+	else
+	  {
+	    branch_alignment_penalty = 1;
+	    cycles (3);
+	  }
+	regs.fast_return = 1;
+#endif
+      }
       break;
 
     case RXO_machi:
       ll = (long long)(signed short)(GS() >> 16) * (long long)(signed short)(GS2 () >> 16);
       ll <<= 16;
       put_reg64 (acc64, ll + regs.r_acc);
+      E1;
       break;
 
     case RXO_maclo:
       ll = (long long)(signed short)(GS()) * (long long)(signed short)(GS2 ());
       ll <<= 16;
       put_reg64 (acc64, ll + regs.r_acc);
+      E1;
       break;
 
     case RXO_max:
-      ma = GD();
       mb = GS();
+      ma = GD();
       if (ma > mb)
 	PD (ma);
       else
 	PD (mb);
+      E (1);
+#ifdef CYCLE_STATS
+      if (opcode.op[0].type == RX_Operand_Register
+	  && opcode.op[1].type == RX_Operand_Register
+	  && opcode.op[0].reg == opcode.op[1].reg)
+	opcode.id = RXO_nop3;
+#endif
       break;
 
     case RXO_min:
-      ma = GD();
       mb = GS();
+      ma = GD();
       if (ma < mb)
 	PD (ma);
       else
 	PD (mb);
+      E (1);
       break;
 
     case RXO_mov:
       v = GS ();
+
       if (opcode.op[0].type == RX_Operand_Register
 	  && opcode.op[0].reg == 16 /* PSW */)
 	{
@@ -927,8 +1374,32 @@ decode_opcode ()
 	    /* These are ignored.  */
 	    break;
 	}
+      if (OM(0) && OM(1))
+	cycles (2);
+      else
+	cycles (1);
+
       PD (v);
+
+#ifdef CYCLE_ACCURATE
+      if ((opcode.op[0].type == RX_Operand_Predec
+	   && opcode.op[1].type == RX_Operand_Register)
+	  || (opcode.op[0].type == RX_Operand_Postinc
+	      && opcode.op[1].type == RX_Operand_Register))
+	{
+	  /* Special case: push reg doesn't cause a memory stall.  */
+	  memory_dest = 0;
+	  tprintf("push special case\n");
+	}
+#endif
+
       set_sz (v, DSZ());
+#ifdef CYCLE_STATS
+      if (opcode.op[0].type == RX_Operand_Register
+	  && opcode.op[1].type == RX_Operand_Register
+	  && opcode.op[0].reg == opcode.op[1].reg)
+	opcode.id = RXO_nop2;
+#endif
       break;
 
     case RXO_movbi:
@@ -939,6 +1410,7 @@ decode_opcode ()
       opcode.op[1].type = RX_Operand_Indirect;
       opcode.op[1].addend = 0;
       PD (GS ());
+      cycles (1);
       break;
 
     case RXO_movbir:
@@ -949,51 +1421,65 @@ decode_opcode ()
       opcode.op[1].type = RX_Operand_Indirect;
       opcode.op[1].addend = 0;
       PS (GD ());
+      cycles (1);
       break;
 
     case RXO_mul:
-      ll = (unsigned long long) US1() * (unsigned long long) US2();
+      v = US2 ();
+      ll = (unsigned long long) US1() * (unsigned long long) v;
       PD(ll);
+      E (1);
       break;
 
     case RXO_mulhi:
-      ll = (long long)(signed short)(GS() >> 16) * (long long)(signed short)(GS2 () >> 16);
+      v = GS2 ();
+      ll = (long long)(signed short)(GS() >> 16) * (long long)(signed short)(v >> 16);
       ll <<= 16;
       put_reg64 (acc64, ll);
+      E1;
       break;
 
     case RXO_mullo:
-      ll = (long long)(signed short)(GS()) * (long long)(signed short)(GS2 ());
+      v = GS2 ();
+      ll = (long long)(signed short)(GS()) * (long long)(signed short)(v);
       ll <<= 16;
       put_reg64 (acc64, ll);
+      E1;
       break;
 
     case RXO_mvfachi:
       PD (get_reg (acchi));
+      E1;
       break;
 
     case RXO_mvfaclo:
       PD (get_reg (acclo));
+      E1;
       break;
 
     case RXO_mvfacmi:
       PD (get_reg (accmi));
+      E1;
       break;
 
     case RXO_mvtachi:
       put_reg (acchi, GS ());
+      E1;
       break;
 
     case RXO_mvtaclo:
       put_reg (acclo, GS ());
+      E1;
       break;
 
     case RXO_mvtipl:
       regs.r_psw &= ~ FLAGBITS_IPL;
       regs.r_psw |= (GS () << FLAGSHIFT_IPL) & FLAGBITS_IPL;
+      E1;
       break;
 
     case RXO_nop:
+      E1;
       break;
 
     case RXO_or:
@@ -1010,11 +1496,11 @@ decode_opcode ()
 	  return RX_MAKE_STOPPED (SIGILL);
 	}
       for (v = opcode.op[1].reg; v <= opcode.op[2].reg; v++)
-	put_reg (v, pop ());
-      break;
-
-    case RXO_pusha:
-      push (get_reg (opcode.op[1].reg) + opcode.op[1].addend);
+	{
+	  cycles (1);
+	  RLD (v);
+	  put_reg (v, pop ());
+	}
       break;
 
     case RXO_pushm:
@@ -1027,7 +1513,11 @@ decode_opcode ()
 	  return RX_MAKE_STOPPED (SIGILL);
 	}
       for (v = opcode.op[2].reg; v >= opcode.op[1].reg; v--)
-	push (get_reg (v));
+	{
+	  RL (v);
+	  push (get_reg (v));
+	}
+      cycles (opcode.op[2].reg - opcode.op[1].reg + 1);
       break;
 
     case RXO_racw:
@@ -1040,6 +1530,7 @@ decode_opcode ()
       else
 	ll &= 0xffffffff00000000ULL;
       put_reg64 (acc64, ll);
+      E1;
       break;
 
     case RXO_rte:
@@ -1048,6 +1539,10 @@ decode_opcode ()
       regs.r_psw = poppc ();
       if (FLAG_PM)
 	regs.r_psw |= FLAGBIT_U;
+#ifdef CYCLE_ACCURATE
+      regs.fast_return = 0;
+      cycles (6);
+#endif
       break;
 
     case RXO_revl:
@@ -1057,6 +1552,7 @@ decode_opcode ()
 	     | ((uma << 8) & 0xff0000)
 	     | ((uma << 24) & 0xff000000UL));
       PD (umb);
+      E1;
       break;
 
     case RXO_revw:
@@ -1064,9 +1560,16 @@ decode_opcode ()
       umb = (((uma >> 8) & 0x00ff00ff)
 	     | ((uma << 8) & 0xff00ff00UL));
       PD (umb);
+      E1;
       break;
 
     case RXO_rmpa:
+      RL(4);
+      RL(5);
+#ifdef CYCLE_ACCURATE
+      tx = regs.r[3];
+#endif
+
       while (regs.r[3] != 0)
 	{
 	  long long tmp;
@@ -1124,6 +1627,22 @@ decode_opcode ()
 	set_flags (FLAGBIT_O|FLAGBIT_S, ma | FLAGBIT_O);
       else
 	set_flags (FLAGBIT_O|FLAGBIT_S, ma);
+#ifdef CYCLE_ACCURATE
+      switch (opcode.size)
+	{
+	case RX_Long:
+	  cycles (6 + 4 * tx);
+	  break;
+	case RX_Word:
+	  cycles (6 + 5 * (tx / 2) + 4 * (tx % 2));
+	  break;
+	case RX_Byte:
+	  cycles (6 + 7 * (tx / 4) + 4 * (tx % 4));
+	  break;
+	default:
+	  abort ();
+	}
+#endif
       break;
 
     case RXO_rolc:
@@ -1133,6 +1652,7 @@ decode_opcode ()
       v |= carry;
       set_szc (v, 4, ma);
       PD (v);
+      E1;
       break;
 
     case RXO_rorc:
@@ -1142,6 +1662,7 @@ decode_opcode ()
       uma |= (carry ? 0x80000000UL : 0);
       set_szc (uma, 4, mb);
       PD (uma);
+      E1;
       break;
 
     case RXO_rotl:
@@ -1154,6 +1675,7 @@ decode_opcode ()
 	}
       set_szc (uma, 4, mb);
       PD (uma);
+      E1;
       break;
 
     case RXO_rotr:
@@ -1166,6 +1688,7 @@ decode_opcode ()
 	}
       set_szc (uma, 4, mb);
       PD (uma);
+      E1;
       break;
 
     case RXO_round:
@@ -1176,10 +1699,30 @@ decode_opcode ()
       PD (mb);
       tprintf("(int) %g = %d\n", int2float(ma), mb);
       set_sz (mb, 4);
+      E (2);
       break;
 
     case RXO_rts:
-      regs.r_pc = poppc ();
+      {
+#ifdef CYCLE_ACCURATE
+	int cyc = 5;
+#endif
+	regs.r_pc = poppc ();
+#ifdef CYCLE_ACCURATE
+	/* Note: specs say 5, chip says 3.  */
+	if (regs.fast_return && regs.link_register == regs.r_pc)
+	  {
+#ifdef CYCLE_STATS
+	    fast_returns ++;
+#endif
+	    tprintf("fast return bonus\n");
+	    cyc -= 2;
+	  }
+	cycles (cyc);
+	regs.fast_return = 0;
+	branch_alignment_penalty = 1;
+#endif
+      }
       break;
 
     case RXO_rtsd:
@@ -1190,12 +1733,39 @@ decode_opcode ()
 	  put_reg (0, get_reg (0) + GS() - (opcode.op[0].reg-opcode.op[2].reg+1)*4);
 	  if (opcode.op[2].reg == 0)
 	    EXCEPTION (EX_UNDEFINED);
+#ifdef CYCLE_ACCURATE
+	  tx = opcode.op[0].reg - opcode.op[2].reg + 1;
+#endif
 	  for (i = opcode.op[2].reg; i <= opcode.op[0].reg; i ++)
-	    put_reg (i, pop ());
+	    {
+	      RLD (i);
+	      put_reg (i, pop ());
+	    }
 	}
       else
-	put_reg (0, get_reg (0) + GS());
-      put_reg (pc, poppc ());
+	{
+#ifdef CYCLE_ACCURATE
+	  tx = 0;
+#endif
+	  put_reg (0, get_reg (0) + GS());
+	}
+      put_reg (pc, poppc());
+#ifdef CYCLE_ACCURATE
+      if (regs.fast_return && regs.link_register == regs.r_pc)
+	{
+	  tprintf("fast return bonus\n");
+#ifdef CYCLE_STATS
+	  fast_returns ++;
+#endif
+	  cycles (tx < 3 ? 3 : tx + 1);
+	}
+      else
+	{
+	  cycles (tx < 5 ? 5 : tx + 1);
+	}
+      regs.fast_return = 0;
+      branch_alignment_penalty = 1;
+#endif
       break;
 
     case RXO_sat:
@@ -1203,6 +1773,7 @@ decode_opcode ()
 	PD (0x7fffffffUL);
       else if (FLAG_O && ! FLAG_S)
 	PD (0x80000000UL);
+      E1;
       break;
 
     case RXO_sbb:
@@ -1214,9 +1785,13 @@ decode_opcode ()
 	PD (1);
       else
 	PD (0);
+      E1;
       break;
 
     case RXO_scmpu:
+#ifdef CYCLE_ACCURATE
+      tx = regs.r[3];
+#endif
       while (regs.r[3] != 0)
 	{
 	  uma = mem_get_qi (regs.r[1] ++);
@@ -1229,6 +1804,7 @@ decode_opcode ()
 	set_zc (1, 1);
       else
 	set_zc (0, ((int)uma - (int)umb) >= 0);
+      cycles (2 + 4 * (tx / 4) + 4 * (tx % 4));
       break;
 
     case RXO_setpsw:
@@ -1238,24 +1814,40 @@ decode_opcode ()
 	      || v == FLAGBIT_U))
 	break;
       regs.r_psw |= v;
+      cycles (1);
       break;
 
     case RXO_smovb:
+      RL (3);
+#ifdef CYCLE_ACCURATE
+      tx = regs.r[3];
+#endif
       while (regs.r[3])
 	{
 	  uma = mem_get_qi (regs.r[2] --);
 	  mem_put_qi (regs.r[1]--, uma);
 	  regs.r[3] --;
 	}
+#ifdef CYCLE_ACCURATE
+      if (tx > 3)
+	cycles (6 + 3 * (tx / 4) + 3 * (tx % 4));
+      else
+	cycles (2 + 3 * (tx % 4));
+#endif
       break;
 
     case RXO_smovf:
+      RL (3);
+#ifdef CYCLE_ACCURATE
+      tx = regs.r[3];
+#endif
       while (regs.r[3])
 	{
 	  uma = mem_get_qi (regs.r[2] ++);
 	  mem_put_qi (regs.r[1]++, uma);
 	  regs.r[3] --;
 	}
+      cycles (2 + 3 * (int)(tx / 4) + 3 * (tx % 4));
       break;
 
     case RXO_smovu:
@@ -1271,17 +1863,24 @@ decode_opcode ()
 
     case RXO_shar: /* d = ma >> mb */
       SHIFT_OP (sll, int, mb, >>=, 1);
+      E (1);
       break;
 
     case RXO_shll: /* d = ma << mb */
       SHIFT_OP (ll, int, mb, <<=, 0x80000000UL);
+      E (1);
       break;
 
     case RXO_shlr: /* d = ma >> mb */
       SHIFT_OP (ll, unsigned int, mb, >>=, 1);
+      E (1);
       break;
 
     case RXO_sstr:
+      RL (3);
+#ifdef CYCLE_ACCURATE
+      tx = regs.r[3];
+#endif
       switch (opcode.size)
 	{
 	case RX_Long:
@@ -1291,6 +1890,7 @@ decode_opcode ()
 	      regs.r[1] += 4;
 	      regs.r[3] --;
 	    }
+	  cycles (2 + tx);
 	  break;
 	case RX_Word:
 	  while (regs.r[3] != 0)
@@ -1299,6 +1899,7 @@ decode_opcode ()
 	      regs.r[1] += 2;
 	      regs.r[3] --;
 	    }
+	  cycles (2 + (int)(tx / 2) + tx % 2);
 	  break;
 	case RX_Byte:
 	  while (regs.r[3] != 0)
@@ -1307,6 +1908,7 @@ decode_opcode ()
 	      regs.r[1] ++;
 	      regs.r[3] --;
 	    }
+	  cycles (2 + (int)(tx / 4) + tx % 4);
 	  break;
 	default:
 	  abort ();
@@ -1316,6 +1918,7 @@ decode_opcode ()
     case RXO_stcc:
       if (GS2())
 	PD (GS ());
+      E1;
       break;
 
     case RXO_stop:
@@ -1328,8 +1931,15 @@ decode_opcode ()
       break;
 
     case RXO_suntil:
+      RL(3);
+#ifdef CYCLE_ACCURATE
+      tx = regs.r[3];
+#endif
       if (regs.r[3] == 0)
-	break;
+	{
+	  cycles (3);
+	  break;
+	}
       switch (opcode.size)
 	{
 	case RX_Long:
@@ -1342,6 +1952,7 @@ decode_opcode ()
 	      if (umb == uma)
 		break;
 	    }
+	  cycles (3 + 3 * tx);
 	  break;
 	case RX_Word:
 	  uma = get_reg (2) & 0xffff;
@@ -1353,6 +1964,7 @@ decode_opcode ()
 	      if (umb == uma)
 		break;
 	    }
+	  cycles (3 + 3 * (tx / 2) + 3 * (tx % 2));
 	  break;
 	case RX_Byte:
 	  uma = get_reg (2) & 0xff;
@@ -1364,6 +1976,7 @@ decode_opcode ()
 	      if (umb == uma)
 		break;
 	    }
+	  cycles (3 + 3 * (tx / 4) + 3 * (tx % 4));
 	  break;
 	default:
 	  abort();
@@ -1375,6 +1988,10 @@ decode_opcode ()
       break;
 
     case RXO_swhile:
+      RL(3);
+#ifdef CYCLE_ACCURATE
+      tx = regs.r[3];
+#endif
       if (regs.r[3] == 0)
 	break;
       switch (opcode.size)
@@ -1389,6 +2006,7 @@ decode_opcode ()
 	      if (umb != uma)
 		break;
 	    }
+	  cycles (3 + 3 * tx);
 	  break;
 	case RX_Word:
 	  uma = get_reg (2) & 0xffff;
@@ -1400,6 +2018,7 @@ decode_opcode ()
 	      if (umb != uma)
 		break;
 	    }
+	  cycles (3 + 3 * (tx / 2) + 3 * (tx % 2));
 	  break;
 	case RX_Byte:
 	  uma = get_reg (2) & 0xff;
@@ -1411,6 +2030,7 @@ decode_opcode ()
 	      if (umb != uma)
 		break;
 	    }
+	  cycles (3 + 3 * (tx / 4) + 3 * (tx % 4));
 	  break;
 	default:
 	  abort();
@@ -1427,9 +2047,18 @@ decode_opcode ()
       return RX_MAKE_STOPPED(0);
 
     case RXO_xchg:
+#ifdef CYCLE_ACCURATE
+      regs.m2m = 0;
+#endif
       v = GS (); /* This is the memory operand, if any.  */
       PS (GD ()); /* and this may change the address register.  */
       PD (v);
+      E2;
+#ifdef CYCLE_ACCURATE
+      /* All M cycles happen during xchg's cycles.  */
+      memory_dest = 0;
+      memory_source = 0;
+#endif
       break;
 
     case RXO_xor:
@@ -1440,5 +2069,122 @@ decode_opcode ()
       EXCEPTION (EX_UNDEFINED);
     }
 
+#ifdef CYCLE_ACCURATE
+  regs.m2m = 0;
+  if (memory_source)
+    regs.m2m |= M2M_SRC;
+  if (memory_dest)
+    regs.m2m |= M2M_DST;
+
+  regs.rt = new_rt;
+  new_rt = -1;
+#endif
+
+#ifdef CYCLE_STATS
+  if (prev_cycle_count == regs.cycle_count)
+    {
+      printf("Cycle count not updated! id %s\n", id_names[opcode.id]);
+      abort ();
+    }
+#endif
+
+#ifdef CYCLE_STATS
+  if (running_benchmark)
+    {
+      int omap = op_lookup (opcode.op[0].type, opcode.op[1].type, opcode.op[2].type);
+
+
+      cycles_per_id[opcode.id][omap] += regs.cycle_count - prev_cycle_count;
+      times_per_id[opcode.id][omap] ++;
+
+      times_per_pair[prev_opcode_id][po0][opcode.id][omap] ++;
+
+      prev_opcode_id = opcode.id;
+      po0 = omap;
+    }
+#endif
+
   return RX_MAKE_STEPPED ();
 }
+
+#ifdef CYCLE_STATS
+void
+reset_pipeline_stats (void)
+{
+  memset (cycles_per_id, 0, sizeof(cycles_per_id));
+  memset (times_per_id, 0, sizeof(times_per_id));
+  memory_stalls = 0;
+  register_stalls = 0;
+  branch_stalls = 0;
+  branch_alignment_stalls = 0;
+  fast_returns = 0;
+  memset (times_per_pair, 0, sizeof(times_per_pair));
+  running_benchmark = 1;
+
+  benchmark_start_cycle = regs.cycle_count;
+}
+
+void
+halt_pipeline_stats (void)
+{
+  running_benchmark = 0;
+  benchmark_end_cycle = regs.cycle_count;
+}
+#endif
+
+void
+pipeline_stats (void)
+{
+#ifdef CYCLE_STATS
+  int i, o1;
+  int p, p1;
+#endif
+
+#ifdef CYCLE_ACCURATE
+  if (verbose == 1)
+    {
+      printf ("cycles: %llu\n", regs.cycle_count);
+      return;
+    }
+
+  printf ("cycles: %13s\n", comma (regs.cycle_count));
+#endif
+
+#ifdef CYCLE_STATS
+  if (benchmark_start_cycle)
+    printf ("bmark:  %13s\n", comma (benchmark_end_cycle - benchmark_start_cycle));
+
+  printf("\n");
+  for (i = 0; i < N_RXO; i++)
+    for (o1 = 0; o1 < N_MAP; o1 ++)
+      if (times_per_id[i][o1])
+	printf("%13s %13s %7.2f  %s %s\n",
+	       comma (cycles_per_id[i][o1]),
+	       comma (times_per_id[i][o1]),
+	       (double)cycles_per_id[i][o1] / times_per_id[i][o1],
+	       op_cache_string(o1),
+	       id_names[i]+4);
+
+  printf("\n");
+  for (p = 0; p < N_RXO; p ++)
+    for (p1 = 0; p1 < N_MAP; p1 ++)
+      for (i = 0; i < N_RXO; i ++)
+	for (o1 = 0; o1 < N_MAP; o1 ++)
+	  if (times_per_pair[p][p1][i][o1])
+	    {
+	      printf("%13s   %s %-9s  ->  %s %s\n",
+		     comma (times_per_pair[p][p1][i][o1]),
+		     op_cache_string(p1),
+		     id_names[p]+4,
+		     op_cache_string(o1),
+		     id_names[i]+4);
+	    }
+
+  printf("\n");
+  printf("%13s memory stalls\n", comma (memory_stalls));
+  printf("%13s register stalls\n", comma (register_stalls));
+  printf("%13s branches taken (non-return)\n", comma (branch_stalls));
+  printf("%13s branch alignment stalls\n", comma (branch_alignment_stalls));
+  printf("%13s fast returns\n", comma (fast_returns));
+#endif
+}
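
Note on the taken-branch costing in the RXO_branch and RXO_branchrel
hunks above: the rule can be read in isolation as the small function
below.  This is only a sketch for readers of the patch, not code from
rx.c.  The name branch_cost and its parameters are hypothetical; delta
and opcode_size stand for the values the patch computes, and the output
flag mirrors the assignment to branch_alignment_penalty.

```c
#include <assert.h>

/* Hypothetical helper, not part of the patch.  Returns the number of
   cycles charged for a conditional branch and reports through
   ALIGNMENT_PENALTY whether the alignment penalty flag would be set.  */
static int
branch_cost (int taken, int delta, int opcode_size, int *alignment_penalty)
{
  assert (alignment_penalty != 0);
  *alignment_penalty = 0;

  /* A branch that is not taken costs a single cycle.  */
  if (!taken)
    return 1;

  /* "Near forward branch bonus": a short forward hop by a multi-byte
     opcode costs two cycles and sets no alignment penalty.  */
  if (delta >= 0 && delta < 16 && opcode_size > 1)
    return 2;

  /* Every other taken branch costs three cycles and flags the
     alignment penalty.  */
  *alignment_penalty = 1;
  return 3;
}
```

Under this reading, a taken branch with delta 4 and a two-byte opcode
costs 2 cycles, while the same branch with delta 16, a negative delta,
or a one-byte opcode costs 3 and sets the penalty flag.
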
Index: trace.c
===================================================================
RCS file: /cvs/src/src/sim/rx/trace.c,v
retrieving revision 1.2
diff -p -U3 -r1.2 trace.c
--- trace.c	1 Jan 2010 10:03:33 -0000	1.2
+++ trace.c	28 Jul 2010 02:00:19 -0000
@@ -19,6 +19,7 @@ You should have received a copy of the G
 along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
 
 
+#include "config.h"
 #include <stdio.h>
 #include <stdarg.h>
 #include <string.h>
@@ -321,7 +322,13 @@ sim_disasm_one (void)
     }
 
   opbuf[0] = 0;
-  printf ("\033[33m%06x: ", mypc);
+#ifdef CYCLE_ACCURATE
+  printf ("\033[33m %04d %06x: ", (int)(regs.cycle_count % 10000), mypc);
+#else
+  printf ("\033[33m %06x: ", mypc);
+
+#endif
+
   max = print_insn_rx (mypc, & info);
 
   for (i = 0; i < max; i++)
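
Note on the block-operation costs in the rx.c part of the patch: SSTR
and RMPA charge cycles from the element count held in r3 before the
loop runs (tx in the patch), with a different formula per operand
size.  The sketch below restates those two formulas so they can be
checked by hand.  The enum and function names are hypothetical
stand-ins, not identifiers from the simulator.

```c
#include <assert.h>

/* Hypothetical stand-ins for the simulator's RX_Byte, RX_Word and
   RX_Long operand sizes.  */
enum op_size { SIZE_BYTE, SIZE_WORD, SIZE_LONG };

/* Cycles charged for SSTR given the initial element count TX.  */
static unsigned int
sstr_cost (enum op_size size, unsigned int tx)
{
  switch (size)
    {
    case SIZE_LONG:
      return 2 + tx;
    case SIZE_WORD:
      return 2 + tx / 2 + tx % 2;
    case SIZE_BYTE:
      return 2 + tx / 4 + tx % 4;
    }
  assert (0);			/* Unknown size; the patch calls abort ().  */
  return 0;
}

/* Cycles charged for RMPA given the initial element count TX.  */
static unsigned int
rmpa_cost (enum op_size size, unsigned int tx)
{
  switch (size)
    {
    case SIZE_LONG:
      return 6 + 4 * tx;
    case SIZE_WORD:
      return 6 + 5 * (tx / 2) + 4 * (tx % 2);
    case SIZE_BYTE:
      return 6 + 7 * (tx / 4) + 4 * (tx % 4);
    }
  assert (0);			/* Unknown size; the patch calls abort ().  */
  return 0;
}
```

For example, a byte-sized SSTR over 10 elements comes to
2 + 10/4 + 10%4 = 2 + 2 + 2 = 6 cycles, and a byte-sized RMPA over the
same count comes to 6 + 7*2 + 4*2 = 28 cycles.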


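Note on the return costs in the RXO_rts and RXO_rtsd hunks: both give a
"fast return bonus" when regs.fast_return is set and the popped PC
matches regs.link_register.  The sketch below restates the resulting
cycle counts.  The function names are hypothetical, fast_return stands
for that combined condition, and popped stands for tx in the patch
(the number of registers RTSD pops, or 0 when it pops none).

```c
#include <assert.h>

/* Cycles charged for RTS: 5 normally, 3 with the fast return bonus.  */
static int
rts_cost (int fast_return)
{
  return fast_return ? 3 : 5;
}

/* Cycles charged for RTSD.  The cost is POPPED + 1, but never less
   than the matching RTS cost.  */
static int
rtsd_cost (int fast_return, int popped)
{
  int min_cycles = fast_return ? 3 : 5;

  assert (popped >= 0);
  return popped < min_cycles ? min_cycles : popped + 1;
}
```

So an RTSD that pops 3 registers costs 4 cycles on a fast return and 5
otherwise; once 5 or more registers are popped, both cases cost
popped + 1.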