This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [RFC][BZ #17943] Use long for int_fast8_t

From: "Maciej W. Rozycki" <macro at linux-mips dot org>
To: Ondřej Bílka <neleai at seznam dot cz>
Cc: libc-alpha at sourceware dot org
Date: Mon, 9 Feb 2015 13:36:48 +0000 (GMT)
Subject: Re: [RFC][BZ #17943] Use long for int_fast8_t
Authentication-results: sourceware.org; auth=none
References: <20150208110426 dot GA28729 at domone>

On Sun, 8 Feb 2015, Ondřej Bílka wrote:

> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
> 
> It is definitely slower with division, following code is 25% slower on
> haswell with char than when you use long.

 It may boil down to the choice of instructions made by the compiler.  I 
can hardly imagine 8-bit division being slower than 64-bit division on a 
processor that implements subword integer arithmetic.

> There is question what about other architectures and atomic operations,
> are byte ones better than int?
> 
> int main ()
> {
>   int i;
>   char x = 32;
>   for (i=0; i<1000000000; i++)
>     x = 11 * x + 5 + x / 3;
>   return x;
> }

 On Intel Atom, for example, division latencies and throughputs, in clock 
cycles, are as follows[1]:

		latency	throughput
IDIV r/m8	   33	    32
IDIV r/m16	   42	    41
IDIV r/m32	   57	    56
IDIV r/m64	  197	   196

I'd expect a manual division algorithm, as used with processors that have 
no hardware divider, to show a similar ratio of elapsed times across the 
corresponding data widths.  AFAICT there is no latency difference between 
individual data widths for multiplication or general ALU operations.
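
 To illustrate why a manual algorithm would scale with the data width, 
here is a minimal sketch of restoring (shift-and-subtract) division.  The 
function name `udiv_bits' and its interface are made up for this example, 
and real software dividers are usually more elaborate; the point is only 
that one loop iteration is spent per dividend bit.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned restoring division that processes the
   low BITS bits of N, one bit per iteration.  Assumes D is nonzero and
   below 2^63, and that N fits in BITS bits.  Stores the remainder in
   *REM and the number of loop iterations in *ITERATIONS.  */
static uint64_t
udiv_bits (uint64_t n, uint64_t d, unsigned bits, uint64_t *rem,
           unsigned *iterations)
{
  uint64_t q = 0, r = 0;
  unsigned count = 0;

  for (unsigned i = bits; i-- > 0; )
    {
      /* Bring down the next dividend bit, most significant first.  */
      r = (r << 1) | ((n >> i) & 1);
      q <<= 1;
      if (r >= d)
        {
          r -= d;
          q |= 1;
        }
      count++;
    }

  *rem = r;
  *iterations = count;
  return q;
}
```

Calling it with BITS set to 8 runs 8 iterations and with 64 runs 64, 
whatever the operand values are.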

 For processors that do have a hardware divider implementing word 
calculation only I'd expect either a constant latency or again, a decrease 
in operation time depending on the actual width of significant data 
contained in operands.

 For example the M14Kc MIPS32 processor has an RTL configuration option to 
include either an area-efficient or a high-performance MDU 
(multiply/divide unit).  The area-efficient MDU has a latency of 33 clocks 
for unsigned division (signed division adds up to 2 clocks for sign 
reversal).  The high-performance MDU reduces the latency as follows[2]:

"Divide operations are implemented with a simple 1-bit-per-clock iterative 
algorithm.  An early-in detection checks the sign extension of the 
dividend (rs) operand.  If rs is 8 bits wide, 23 iterations are skipped. 
For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs, 
7 iterations are skipped.  Any attempt to issue a subsequent MDU 
instruction while a divide is still active causes an IU pipeline stall 
until the divide operation has completed."

As this happens automatically, there is no benefit from using a narrower 
data type, and the lack of subword arithmetic operations means that such 
a type will need a truncation operation from time to time for 
multiplication or general ALU operations.
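
 The early-in rule quoted above can be modelled roughly as below.  This 
is only a sketch: the function names are invented, and reading "rs is 8 
bits wide" as "rs equals the sign extension of its low 8 bits" is my 
assumption, not something the datasheet excerpt spells out.

```c
#include <stdint.h>

/* Return nonzero if V is unchanged by sign extension from BITS bits
   (assumed interpretation of "BITS bits wide").  BITS must be in the
   range 1 to 31.  */
static int
fits_signed (int32_t v, unsigned bits)
{
  int32_t lo = -(INT32_C (1) << (bits - 1));
  int32_t hi = (INT32_C (1) << (bits - 1)) - 1;
  return v >= lo && v <= hi;
}

/* Hypothetical model: number of divide iterations skipped for the
   dividend RS, using the figures from the quoted datasheet text.  */
static unsigned
mdu_skipped_iterations (int32_t rs)
{
  if (fits_signed (rs, 8))
    return 23;
  if (fits_signed (rs, 16))
    return 15;
  if (fits_signed (rs, 24))
    return 7;
  return 0;
}
```

The result depends only on the value held in the register, not on the C 
type it was declared with, which is why a narrower type buys nothing here.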

> --- a/sysdeps/generic/stdint.h
> +++ b/sysdeps/generic/stdint.h
> @@ -87,12 +87,13 @@ typedef unsigned long long int	uint_least64_t;
>  /* Fast types.  */
>  
>  /* Signed.  */
> -typedef signed char		int_fast8_t;
>  #if __WORDSIZE == 64
> +typedef long int		int_fast8_t;
>  typedef long int		int_fast16_t;
>  typedef long int		int_fast32_t;
>  typedef long int		int_fast64_t;
>  #else
> +typedef int			int_fast8_t;
>  typedef int			int_fast16_t;
>  typedef int			int_fast32_t;
>  __extension__

 So I find the choice of types above to be already questionable for a 
generic header.  By default I'd expect fast data types to have the same 
width as their fixed-width counterparts: that gives a large benefit with 
most architectures that do implement subword arithmetic, weighed against 
the small loss likely incurred with architectures that only implement 
word arithmetic.  Then individual ports could override the defaults as 
they see fit.
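
 To see what a given installation actually chose, something along these 
lines could be used.  The function name `print_fast_widths' is invented 
for this sketch, and it assumes the optional fixed-width types int8_t 
through int64_t exist, which they do on glibc targets.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: print the width in bits of each signed fast type
   next to the width of its fixed-width counterpart.  */
static void
print_fast_widths (FILE *out)
{
  fprintf (out, "int_fast8_t:  %zu bits (int8_t:  %zu)\n",
           sizeof (int_fast8_t) * CHAR_BIT, sizeof (int8_t) * CHAR_BIT);
  fprintf (out, "int_fast16_t: %zu bits (int16_t: %zu)\n",
           sizeof (int_fast16_t) * CHAR_BIT, sizeof (int16_t) * CHAR_BIT);
  fprintf (out, "int_fast32_t: %zu bits (int32_t: %zu)\n",
           sizeof (int_fast32_t) * CHAR_BIT, sizeof (int32_t) * CHAR_BIT);
  fprintf (out, "int_fast64_t: %zu bits (int64_t: %zu)\n",
           sizeof (int_fast64_t) * CHAR_BIT, sizeof (int64_t) * CHAR_BIT);
}
```

Calling print_fast_widths (stdout) lists the widths in effect.  The output 
depends on the port; the only thing the C standard promises is that each 
fast type is at least as wide as its nominal width.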

 At this point I believe the discussion is moot, however: relocatable 
objects with data embedded using these types will already exist out 
there, so the ABI has been set and I don't see a way of changing it 
without breaking binary compatibility.
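
 A small sketch of the compatibility problem follows.  The typedef and 
structure names are stand-ins invented for the example; they mimic the 
current and the proposed definition of int_fast8_t side by side, so that 
the two layouts can be compared in a single program.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the current and the proposed typedef.  */
typedef signed char fast8_current_t;
typedef long int fast8_proposed_t;

/* The same source-level structure, as an object built against each
   definition would lay it out.  */
struct pair_current
{
  fast8_current_t a;
  fast8_current_t b;
};

struct pair_proposed
{
  fast8_proposed_t a;
  fast8_proposed_t b;
};

/* Return nonzero if the two layouts are incompatible.  */
static int
layouts_differ (void)
{
  return (sizeof (struct pair_current) != sizeof (struct pair_proposed)
          || (offsetof (struct pair_current, b)
              != offsetof (struct pair_proposed, b)));
}
```

An object compiled against the old header would look for member `b' at 
offset 1, while one compiled against the new header would look further 
into the structure, so the two could not safely share such data.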

 References:

[1] "Intel 64 and IA-32 Architectures Optimization Reference Manual", 
    Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2 
    "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21

[2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS 
    Technologies, Inc., Document Number: MD00672, November 2, 2009, 
    Subsection "High-Performance MDU", p. 6

  Maciej

Follow-Ups:
- Re: [RFC][BZ #17943] Use long for int_fast8_t
  - From: H.J. Lu
- Re: [RFC][BZ #17943] Use long for int_fast8_t
  - From: Richard Earnshaw

References:
- [RFC][BZ #17943] Use long for int_fast8_t
  - From: Ondřej Bílka
