This is the mail archive of the
mailing list for the glibc project.
Re: [RFC][BZ #17943] Use long for int_fast8_t
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: "Maciej W. Rozycki" <macro at linux-mips dot org>
- Cc: Ondřej Bílka <neleai at seznam dot cz>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Mon, 9 Feb 2015 05:41:26 -0800
- Subject: Re: [RFC][BZ #17943] Use long for int_fast8_t
- References: <20150208110426 dot GA28729 at domone> <alpine dot LFD dot 2 dot 11 dot 1502091225590 dot 22715 at eddie dot linux-mips dot org>
On Mon, Feb 9, 2015 at 5:36 AM, Maciej W. Rozycki <email@example.com> wrote:
> On Sun, 8 Feb 2015, Ondřej Bílka wrote:
>> Hi, as in the bugzilla entry: what is the rationale for using char as
>> int_fast8_t? It is definitely slower with division; the following code
>> is 25% slower on Haswell with char than with long.
> It may boil down to the choice of instructions made by the
> compiler. I can hardly imagine 8-bit division to be slower than 64-bit
> one on a processor that implements subword integer arithmetic.
>> There is also the question of other architectures and atomic
>> operations: are byte ones better than int?
>> int main ()
>> {
>>   int i;
>>   char x = 32;
>>   for (i=0; i<1000000000; i++)
>>     x = 11 * x + 5 + x / 3;
>>   return x;
>> }
> On Intel Atom for example division latencies are as follows:
>               latency  throughput
> IDIV r/m8        33        32
> IDIV r/m16       42        41
> IDIV r/m32       57        56
> IDIV r/m64      197       196
> I'd expect the ratio of elapsed times for the corresponding data widths
> and a manual division algorithm used with processors that have no hardware
> divider to be similar. There is no latency difference between individual
> data widths AFAICT for multiplication or general ALU operations.
> For processors that do have a hardware divider implementing word
> calculation only I'd expect either a constant latency or again, a decrease
> in operation time depending on the actual width of significant data
> contained in operands.
> For example the M14Kc MIPS32 processor has an RTL configuration option to
> include either an area-efficient or a high-performance MDU
> (multiply/divide unit). The area-efficient MDU has a latency of 33 clocks
> for unsigned division (signed division adds up to 2 clocks for sign
> reversal). The high-performance MDU reduces the latency as follows:
> "Divide operations are implemented with a simple 1-bit-per-clock iterative
> algorithm. An early-in detection checks the sign extension of the
> dividend (rs) operand. If rs is 8 bits wide, 23 iterations are skipped.
> For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs,
> 7 iterations are skipped. Any attempt to issue a subsequent MDU
> instruction while a divide is still active causes an IU pipeline stall
> until the divide operation has completed."
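The early-in detection quoted above can be sketched as a small model. This is only an illustration: the function name `mdu_div_iterations` is hypothetical, and the baseline of 32 iterations for a full-width dividend is an assumption drawn from the "1-bit-per-clock" wording, not a figure from the datasheet.

```c
#include <stdint.h>

/* Hypothetical model of the early-in check: returns how many divide
   iterations remain for a 32-bit dividend rs.  The baseline of 32
   iterations is an assumption (one bit per clock over 32 bits).  */
static int mdu_div_iterations(int32_t rs)
{
    /* rs is a sign-extended 8-bit value: 23 iterations skipped.  */
    if (rs >= INT8_MIN && rs <= INT8_MAX)
        return 32 - 23;
    /* rs is a sign-extended 16-bit value: 15 iterations skipped.  */
    if (rs >= INT16_MIN && rs <= INT16_MAX)
        return 32 - 15;
    /* rs is a sign-extended 24-bit value: 7 iterations skipped.  */
    if (rs >= -(1 << 23) && rs < (1 << 23))
        return 32 - 7;
    /* Full-width dividend: nothing is skipped.  */
    return 32;
}
```

Under this model the count depends only on the value held in the register, not on the declared C type, so a small value stored in a long would take the same short path as one stored in a char.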
> As this happens automatically, there is no benefit from using a narrower
> data type, and the lack of subword arithmetic operations means that using
> such a type will require a truncation operation from time to time for
> multiplication or general ALU operations.
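To make the truncation cost concrete, here is a minimal sketch of one step of the benchmark recurrence computed in a full word and in 8 bits. The helper names `truncate8`, `step_word` and `step_byte` are hypothetical. The explicit `truncate8` stands in for the sign-extension a word-only machine would have to perform, written so that it does not rely on implementation-defined narrowing.

```c
#include <stdint.h>

/* Reduce a word-sized result to a signed 8-bit value: keep the low
   8 bits, then reinterpret them as two's complement.  */
static int8_t truncate8(long w)
{
    uint8_t low = (uint8_t)w;            /* well-defined: modulo 256 */
    return low < 128 ? (int8_t)low : (int8_t)(low - 256);
}

/* One step of the recurrence kept in a full word: no truncation.  */
static long step_word(long x)
{
    return 11 * x + 5 + x / 3;
}

/* The same step with an 8-bit variable: the arithmetic is still done
   in a wider type, then truncated back to 8 bits on every store.  */
static int8_t step_byte(int8_t x)
{
    return truncate8(11L * x + 5 + x / 3);
}
```

Starting from 32, the word version yields 367 while the byte version wraps to 111, so the two variants also compute different sequences, which is worth keeping in mind when comparing their timings.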
>> --- a/sysdeps/generic/stdint.h
>> +++ b/sysdeps/generic/stdint.h
>> @@ -87,12 +87,13 @@ typedef unsigned long long int uint_least64_t;
>> /* Fast types. */
>> /* Signed. */
>> -typedef signed char int_fast8_t;
>> #if __WORDSIZE == 64
>> +typedef long int int_fast8_t;
>> typedef long int int_fast16_t;
>> typedef long int int_fast32_t;
>> typedef long int int_fast64_t;
>> #else
>> +typedef int int_fast8_t;
>> typedef int int_fast16_t;
>> typedef int int_fast32_t;
> So I find the choice of types above to be already questionable for a
> generic header. By default I'd expect fast data types to have the same
> width as their fixed-width counterparts for the large benefit they provide
> with most architectures that do implement subword arithmetic weighed
> against the small loss they will likely incur with architectures that only
> implement word arithmetic. Then individual ports could override the
> defaults as they see fit.
> At this point, however, I believe the discussion is moot -- there
> will have been relocatable objects out there with data embedded using
> these types already so the ABI has been set and I don't see a way of
> changing it without breaking binary compatibility.
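The ABI concern can be illustrated with a minimal sketch. The struct names `rec_old` and `rec_new` are hypothetical and simply stand for the same record compiled against the current header (int_fast8_t as signed char) and against the proposed one (int_fast8_t as long on 64-bit targets).

```c
#include <stddef.h>

/* Record as laid out with the current definition of int_fast8_t.  */
struct rec_old {
    signed char tag;
};

/* The same record as laid out if int_fast8_t became long.  */
struct rec_new {
    long tag;
};

/* Returns nonzero when the two layouts disagree in size, i.e. when
   objects built against the two headers could not share this data.  */
static int layout_changes(void)
{
    return sizeof(struct rec_new) != sizeof(struct rec_old);
}
```

Any object file that already embeds such a record would disagree with newly compiled code about its size and member offsets, which is the incompatibility described above.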
> [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
> Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
> "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21
This manual is very old and Intel Atom Microarchitecture has been
replaced by Silvermont Microarchitecture.
> [2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS
> Technologies, Inc., Document Number: MD00672, November 2, 2009,
> Subsection "High-Performance MDU", p. 6