This is the mail archive of the
mailing list for the glibc project.
Re: [RFC][BZ #17943] Use long for int_fast8_t
- From: Richard Earnshaw <Richard dot Earnshaw at foss dot arm dot com>
- To: "Maciej W. Rozycki" <macro at linux-mips dot org>, Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 09 Feb 2015 18:28:33 +0000
- Subject: Re: [RFC][BZ #17943] Use long for int_fast8_t
- References: <20150208110426 dot GA28729 at domone> <alpine dot LFD dot 2 dot 11 dot 1502091225590 dot 22715 at eddie dot linux-mips dot org>
On 09/02/15 13:36, Maciej W. Rozycki wrote:
> On Sun, 8 Feb 2015, Ondřej Bílka wrote:
>> Hi, as in the bugzilla entry, what is the rationale for using char as
>> int_fast8_t? It is definitely slower with division: the following code
>> is 25% slower on Haswell with char than with long.
> It may boil down to the choice of instructions made by the compiler.
> I can hardly imagine 8-bit division being slower than 64-bit division
> on a processor that implements subword integer arithmetic.
>> There is also the question of other architectures and atomic operations:
>> are byte ones better than int?
>> int main ()
>> {
>>   int i;
>>   char x = 32;
>>   for (i=0; i<1000000000; i++)
>>     x = 11 * x + 5 + x / 3;
>>   return x;
>> }
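
The comparison described above can be sketched as two versions of the same recurrence, one on an 8-bit value and one on a word-sized value. This is only a sketch: the names `step_u8`, `step_ul`, `time_u8` and `time_ul` are hypothetical, and unsigned types are used (an assumption, not what the quoted test does) so that wraparound is well defined at both widths. The `volatile` sinks force a load and store per iteration, so any timings are rough and depend heavily on compiler and CPU.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* One step of the quoted recurrence on an 8-bit value; the cast makes the
   truncation back to 8 bits explicit. */
static uint8_t step_u8(uint8_t x)
{
    return (uint8_t)(11u * x + 5u + x / 3u);
}

/* The same step on a word-sized value; unsigned arithmetic wraps silently. */
static unsigned long step_ul(unsigned long x)
{
    return 11ul * x + 5ul + x / 3ul;
}

/* Hypothetical helper: CPU seconds spent on n steps of the 8-bit version.
   The volatile sink keeps the compiler from deleting the loop. */
static double time_u8(long n)
{
    volatile uint8_t sink = 32;
    clock_t start = clock();
    assert(n >= 0);
    for (long i = 0; i < n; i++)
        sink = step_u8(sink);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: the same measurement for the word-sized version. */
static double time_ul(long n)
{
    volatile unsigned long sink = 32;
    clock_t start = clock();
    assert(n >= 0);
    for (long i = 0; i < n; i++)
        sink = step_ul(sink);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_u8(n)` and `time_ul(n)` with the same large `n` and comparing the two results gives a crude indication of which width the compiler and processor handle better.
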
> On Intel Atom for example division latencies are as follows:
>              latency  throughput
> IDIV r/m8       33        32
> IDIV r/m16      42        41
> IDIV r/m32      57        56
> IDIV r/m64     197       196
> I'd expect the ratio of elapsed times for the corresponding data widths
> and a manual division algorithm used with processors that have no hardware
> divider to be similar. There is no latency difference between individual
> data widths AFAICT for multiplication or general ALU operations.
> For processors that do have a hardware divider implementing word
> calculation only I'd expect either a constant latency or again, a decrease
> in operation time depending on the actual width of significant data
> contained in operands.
> For example the M14Kc MIPS32 processor has an RTL configuration option to
> include either an area-efficient or a high-performance MDU
> (multiply/divide unit). The area-efficient MDU has a latency of 33 clocks
> for unsigned division (signed division adds up to 2 clocks for sign
> reversal). The high-performance MDU reduces the latency as follows:
> "Divide operations are implemented with a simple 1-bit-per-clock iterative
> algorithm. An early-in detection checks the sign extension of the
> dividend (rs) operand. If rs is 8 bits wide, 23 iterations are skipped.
> For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs,
> 7 iterations are skipped. Any attempt to issue a subsequent MDU
> instruction while a divide is still active causes an IU pipeline stall
> until the divide operation has completed."
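
The early-in rule in the quoted datasheet text can be written down as a small lookup. This is a sketch, not a model of the actual hardware: the function name `mdu_skipped_iterations` is hypothetical, and it assumes that "N bits wide" means the dividend equals the sign extension of its low N bits.

```c
#include <assert.h>
#include <stdint.h>

/* Iterations skipped by the early-in detection, per the quoted text:
   23 for an 8-bit dividend, 15 for 16-bit, 7 for 24-bit, otherwise none.
   Assumption: the width test is a signed range check on rs. */
static int mdu_skipped_iterations(int32_t rs)
{
    if (rs >= -128 && rs <= 127)
        return 23;
    if (rs >= -32768 && rs <= 32767)
        return 15;
    if (rs >= -8388608 && rs <= 8388607)
        return 7;
    return 0;
}
```

The point of the sketch is that the saving depends on the value held in the register, not on the declared type of the variable.
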
> As this happens automatically, there is no benefit from using a narrower
> data type, and the lack of subword arithmetic operations means that using
> such a type will require a truncation operation from time to time for
> multiplication or general ALU operations.
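
The truncation cost mentioned above can be illustrated by emulating what a word-only machine has to do to keep a signed 8-bit value in a 32-bit register. This is a sketch under that assumption; `truncate_to_s8` and `step_s8_on_word_machine` are hypothetical names, and the recurrence is the one from the quoted test.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 8 bits of a 32-bit word, as a word-only machine
   must do to bring a result back into signed 8-bit range. */
static int32_t truncate_to_s8(int32_t word)
{
    uint32_t low = (uint32_t)word & 0xffu;
    return (int32_t)(low ^ 0x80u) - 0x80;
}

/* One step of the recurrence: full-width ALU operations followed by the
   extra truncation that a narrower type forces. */
static int32_t step_s8_on_word_machine(int32_t x)
{
    int32_t wide = 11 * x + 5 + x / 3;
    return truncate_to_s8(wide);
}
```

With a word-sized type the call to `truncate_to_s8` disappears, which is the small loss referred to for architectures that only implement word arithmetic.
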
>> --- a/sysdeps/generic/stdint.h
>> +++ b/sysdeps/generic/stdint.h
>> @@ -87,12 +87,13 @@ typedef unsigned long long int uint_least64_t;
>> /* Fast types. */
>> /* Signed. */
>> -typedef signed char int_fast8_t;
>> #if __WORDSIZE == 64
>> +typedef long int int_fast8_t;
>> typedef long int int_fast16_t;
>> typedef long int int_fast32_t;
>> typedef long int int_fast64_t;
>> #else
>> +typedef int int_fast8_t;
>> typedef int int_fast16_t;
>> typedef int int_fast32_t;
On AArch64 there's nothing to be gained in terms of performance from
using a 64-bit type over a 32-bit type when both can hold the required
range of values. In fact, it's likely to make things slower, since
multiply and divide operations will most likely take longer.
So on AArch64 int_fast8_t, int_fast16_t and int_fast32_t should all map
to int, not long.
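
As a rough illustration of that mapping, a port-specific header could look something like the sketch below. The `port_` prefix is hypothetical and only there to avoid clashing with the real `<stdint.h>` names; this is not the layout of any existing glibc header.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical per-port choice: the 8-, 16- and 32-bit fast types all
   map to int rather than long. */
typedef int port_int_fast8_t;
typedef int port_int_fast16_t;
typedef int port_int_fast32_t;

/* ISO C only guarantees 16 bits for int, so the 32-bit mapping needs a
   compile-time check on the target. */
_Static_assert(sizeof(port_int_fast32_t) * CHAR_BIT >= 32,
               "int must be at least 32 bits wide for this mapping");
```
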
> So I find the choice of types above to be already questionable for a
> generic header. By default I'd expect fast data types to have the same
> width as their fixed-width counterparts for the large benefit they provide
> with most architectures that do implement subword arithmetic weighed
> against the small loss they will likely incur with architectures that only
> implement word arithmetic. Then individual ports could override the
> defaults as they see fit.
> At this point, however, I believe the discussion is moot: there will
> already be relocatable objects out there with data embedded using these
> types, so the ABI has been set and I don't see a way of changing it
> without breaking binary compatibility.
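
The ABI concern can be made concrete with two structure layouts that differ only in the type standing in for int_fast8_t. This is a sketch; `pair_char` and `pair_long` are hypothetical names for the "before" and "after" views of the same structure.

```c
#include <assert.h>
#include <stddef.h>

/* Layout as seen by objects built while int_fast8_t is signed char. */
struct pair_char {
    signed char a;
    signed char b;
};

/* Layout as seen by objects built after a switch to long. */
struct pair_long {
    long a;
    long b;
};

/* The second member moves, so code built against one layout reads the
   wrong bytes when handed data laid out the other way. */
static int layouts_differ(void)
{
    return offsetof(struct pair_long, b) != offsetof(struct pair_char, b);
}
```

Any object file or shared library that already embeds such a structure would disagree with newly compiled code about where `b` lives.
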
> [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
>     Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
>     "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21
> [2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS
>     Technologies, Inc., Document Number: MD00672, November 2, 2009,
>     Subsection "High-Performance MDU", p. 6