This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC][BZ #17943] Use long for int_fast8_t
- From: "Maciej W. Rozycki" <macro at linux-mips dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 9 Feb 2015 13:36:48 +0000 (GMT)
- Subject: Re: [RFC][BZ #17943] Use long for int_fast8_t
- References: <20150208110426 dot GA28729 at domone>
On Sun, 8 Feb 2015, Ondřej Bílka wrote:
> Hi, as in bugzilla entry what is rationale of using char as int_fast8_t?
>
> It is definitely slower with division, following code is 25% slower on
> haswell with char than when you use long.
It may boil down to the choice of instructions made by the compiler. I
can hardly imagine 8-bit division being slower than 64-bit division on a
processor that implements subword integer arithmetic.
> There is question what about other architectures and atomic operations,
> are byte ones better than int?
>
> int main ()
> {
>   int i;
>   char x = 32;
>   for (i=0; i<1000000000; i++)
>     x = 11 * x + 5 + x / 3;
>   return x;
> }
On Intel Atom, for example, division latencies are as follows[1]:

Instruction   Latency (clocks)   Throughput (clocks)
IDIV r/m8            33                  32
IDIV r/m16           42                  41
IDIV r/m32           57                  56
IDIV r/m64          197                 196

I'd expect the ratio of elapsed times across these data widths to be
similar for a software division algorithm on processors that have no
hardware divider. AFAICT there is no latency difference between
individual data widths for multiplication or general ALU operations.
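
To check such figures on a given machine, the quoted loop can be timed once per data width. What follows is only a minimal sketch under my own assumptions: the names run_u8, run_ulong and time_run are made up for illustration, and unsigned types are used so that wraparound is well defined (with plain long the recurrence would overflow after a few iterations). Note that with optimization enabled a compiler will typically replace the division by the constant 3 with a multiply sequence, so this may well measure instruction selection rather than the divider.

```c
#include <time.h>

/* The recurrence from the quoted loop, kept in an 8-bit unsigned type.
   The arithmetic is done in int after promotion and truncated on store. */
unsigned long run_u8(unsigned long n)
{
    unsigned char x = 32;
    for (unsigned long i = 0; i < n; i++)
        x = 11 * x + 5 + x / 3;
    return x;
}

/* The same recurrence in a full-width unsigned long (wraps modulo 2^N). */
unsigned long run_ulong(unsigned long n)
{
    unsigned long x = 32;
    for (unsigned long i = 0; i < n; i++)
        x = 11 * x + 5 + x / 3;
    return x;
}

/* Run fn(n), store its result so the call cannot be discarded, and
   return the CPU time used in seconds. */
double time_run(unsigned long (*fn)(unsigned long), unsigned long n,
                unsigned long *result)
{
    clock_t start = clock();
    *result = fn(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_run(run_u8, 1000000000UL, &r) and time_run(run_ulong, 1000000000UL, &r) from main and printing both times gives a rough comparison; the two variants compute different values, so only the timings are comparable.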
For processors whose hardware divider implements word-width calculation
only, I'd expect either a constant latency or, again, an operation time
that decreases with the actual width of the significant data contained in
the operands.
For example the M14Kc MIPS32 processor has an RTL configuration option to
include either an area-efficient or a high-performance MDU
(multiply/divide unit). The area-efficient MDU has a latency of 33 clocks
for unsigned division (signed division adds up to 2 clocks for sign
reversal). The high-performance MDU reduces the latency as follows[2]:
"Divide operations are implemented with a simple 1-bit-per-clock iterative
algorithm. An early-in detection checks the sign extension of the
dividend (rs) operand. If rs is 8 bits wide, 23 iterations are skipped.
For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs,
7 iterations are skipped. Any attempt to issue a subsequent MDU
instruction while a divide is still active causes an IU pipeline stall
until the divide operation has completed."
As this happens automatically, there is no benefit from using a narrower
data type, and the lack of subword arithmetic operations means that using
such a type will require a truncation operation from time to time for
multiplication or general ALU operations.
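
The early-in rule quoted above can be modelled in a few lines. This is only a sketch of my reading of the datasheet text, not of the actual RTL: the function name skipped_iterations is made up, and I am assuming that "rs is N bits wide" means rs is a correct sign extension of an N-bit value.

```c
#include <stdint.h>

/* Number of divide iterations skipped for a given dividend, following
   the figures in the quoted datasheet paragraph.  Assumption: "N bits
   wide" means the value fits in an N-bit signed integer. */
int skipped_iterations(int32_t rs)
{
    if (rs >= INT8_MIN && rs <= INT8_MAX)
        return 23;                      /* 8-bit-wide dividend */
    if (rs >= INT16_MIN && rs <= INT16_MAX)
        return 15;                      /* 16-bit-wide dividend */
    if (rs >= -(INT32_C(1) << 23) && rs <= (INT32_C(1) << 23) - 1)
        return 7;                       /* 24-bit-wide dividend */
    return 0;                           /* full-width dividend */
}
```

The point of the model is that the saving depends on the run-time value of the dividend, not on the declared C type, which is why a narrower type buys nothing on such a divider.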
> --- a/sysdeps/generic/stdint.h
> +++ b/sysdeps/generic/stdint.h
> @@ -87,12 +87,13 @@ typedef unsigned long long int uint_least64_t;
> /* Fast types. */
>
> /* Signed. */
> -typedef signed char int_fast8_t;
> #if __WORDSIZE == 64
> +typedef long int int_fast8_t;
> typedef long int int_fast16_t;
> typedef long int int_fast32_t;
> typedef long int int_fast64_t;
> #else
> +typedef int int_fast8_t;
> typedef int int_fast16_t;
> typedef int int_fast32_t;
> __extension__
So I find the choice of types above to be already questionable for a
generic header. By default I'd expect fast data types to have the same
width as their fixed-width counterparts: that gives a large benefit on
most architectures that do implement subword arithmetic, at the cost of
a small loss on architectures that only implement word arithmetic.
Individual ports could then override the defaults as they see fit.
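
Such an arrangement could look roughly like the following. This is purely illustrative: PORT_FAST8_TYPE and sketch_int_fast8_t are hypothetical names of my own, not anything that exists in glibc, and the actual sysdeps override mechanism works by replacing header files rather than by a macro like this.

```c
#include <stdint.h>

/* Hypothetical hook: a port that only has word arithmetic would define
   PORT_FAST8_TYPE (e.g. to int) before this point. */
#ifndef PORT_FAST8_TYPE
# define PORT_FAST8_TYPE int8_t   /* generic default: same width as int8_t */
#endif

/* Hypothetical stand-in for int_fast8_t under this scheme. */
typedef PORT_FAST8_TYPE sketch_int_fast8_t;
```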
At this point, however, I believe the discussion is moot: there will
already be relocatable objects out there with data embedded using these
types, so the ABI has been set and I don't see a way of changing it
without breaking binary compatibility.
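
To make the compatibility problem concrete, here is a small sketch of how the layout of a structure changes with the width of the type. The typedef and struct names are made up for illustration; old_fast8_t stands for the current definition and new_fast8_t for the one proposed for 64-bit targets.

```c
#include <stddef.h>

typedef signed char old_fast8_t;   /* stand-in for the current int_fast8_t */
typedef long int new_fast8_t;      /* stand-in for the proposed int_fast8_t */

struct old_pair { old_fast8_t a; old_fast8_t b; };
struct new_pair { new_fast8_t a; new_fast8_t b; };

/* Returns nonzero if an object built against one definition would place
   member b, or the end of the struct, differently from the other. */
int layouts_differ(void)
{
    return offsetof(struct old_pair, b) != offsetof(struct new_pair, b)
           || sizeof(struct old_pair) != sizeof(struct new_pair);
}
```

An object file compiled against the old header and one compiled against the new header would disagree on both the member offset and the structure size, so passing such a structure between them would break.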
References:
[1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
"Intel Atom Microarchitecture Instructions Latency Data", p. 12-21
[2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS
Technologies, Inc., Document Number: MD00672, November 2, 2009,
Subsection "High-Performance MDU", p. 6
Maciej