This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC][BZ #17943] Use long for int_fast8_t


On Mon, Feb 9, 2015 at 5:36 AM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Sun, 8 Feb 2015, Ondřej Bílka wrote:
>
>> Hi, as in the bugzilla entry, what is the rationale for using char as
>> int_fast8_t?
>>
>> It is definitely slower with division; the following code is 25% slower
>> on Haswell with char than with long.
>
>  It may boil down to the choice of instructions made by the
> compiler.  I can hardly imagine 8-bit division being slower than 64-bit
> division on a processor that implements subword integer arithmetic.
>
>> There is also the question of other architectures and atomic operations:
>> are byte ones better than int ones?
>>
>> int main ()
>> {
>>   int i;
>>   char x = 32;
>>   for (i=0; i<1000000000; i++)
>>     x = 11 * x + 5 + x / 3;
>>   return x;
>> }
>
>  On Intel Atom for example division latencies are as follows[1]:
>
>                 latency throughput
> IDIV r/m8          33       32
> IDIV r/m16         42       41
> IDIV r/m32         57       56
> IDIV r/m64        197      196
>
> I'd expect the ratio of elapsed times for the corresponding data widths
> and a manual division algorithm used with processors that have no hardware
> divider to be similar.  There is no latency difference between individual
> data widths AFAICT for multiplication or general ALU operations.
>
>  For processors that do have a hardware divider implementing word
> calculation only I'd expect either a constant latency or again, a decrease
> in operation time depending on the actual width of significant data
> contained in operands.
>
>  For example the M14Kc MIPS32 processor has an RTL configuration option to
> include either an area-efficient or a high-performance MDU
> (multiply/divide unit).  The area-efficient MDU has a latency of 33 clocks
> for unsigned division (signed division adds up to 2 clocks for sign
> reversal).  The high-performance MDU reduces the latency as follows[2]:
>
> "Divide operations are implemented with a simple 1-bit-per-clock iterative
> algorithm.  An early-in detection checks the sign extension of the
> dividend (rs) operand.  If rs is 8 bits wide, 23 iterations are skipped.
> For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs,
> 7 iterations are skipped.  Any attempt to issue a subsequent MDU
> instruction while a divide is still active causes an IU pipeline stall
> until the divide operation has completed."
>
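The early-in detection quoted above can be modelled in a few lines of C.  This is only a sketch of the iteration counts stated in the datasheet excerpt, not of the actual hardware: the function names are hypothetical, and it assumes that "N bits wide" means the dividend is unchanged by sign extension from N bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if v is unchanged by sign extension
   from `bits` bits, i.e. it fits in a signed `bits`-bit integer.  */
static int fits_signed(int32_t v, int bits)
{
  int32_t lo = -((int32_t) 1 << (bits - 1));
  int32_t hi = ((int32_t) 1 << (bits - 1)) - 1;
  return v >= lo && v <= hi;
}

/* Iterations skipped for dividend rs, using the figures from the
   quoted datasheet text (23, 15 and 7).  The interpretation of
   "bits wide" is an assumption of this sketch.  */
int mdu_skipped_iterations(int32_t rs)
{
  if (fits_signed(rs, 8))
    return 23;
  if (fits_signed(rs, 16))
    return 15;
  if (fits_signed(rs, 24))
    return 7;
  return 0;
}

/* Usage sketch: the operand's value, not its declared type, decides.  */
void mdu_demo(void)
{
  assert(mdu_skipped_iterations(32) == 23);
  assert(mdu_skipped_iterations(70000) == 7);
}
```

In this model a small value held in a 32-bit register skips as many iterations as one declared with a narrow type would.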
> As this happens automatically, there is no benefit from using a narrower
> data type, and the lack of subword arithmetic operations means that using
> such a type will require a truncation operation from time to time for
> multiplication or general ALU operations.
>
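To illustrate the truncation mentioned above, here is a minimal sketch of what a word-only machine has to do to keep a value within 8-bit range.  The helper names are hypothetical; the recurrence is the one from the benchmark earlier in the thread, and two's complement representation is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a word-sized value to the signed 8-bit range using only
   word-sized operations: mask the low byte, then sign-extend it.
   Assumes two's complement representation.  */
int32_t wrap_to_int8(int32_t v)
{
  return ((v & 0xff) ^ 0x80) - 0x80;
}

/* One step of the benchmark recurrence done in word arithmetic,
   followed by the explicit truncation a narrow type would imply.  */
int32_t step8(int32_t x)
{
  return wrap_to_int8(11 * x + 5 + x / 3);
}

/* Usage sketch: 11*32 + 5 + 32/3 = 367, which wraps to 111.  */
void step8_demo(void)
{
  assert(step8(32) == 111);
}
```

With a word-sized type the wrap_to_int8() step disappears, which is the small cost referred to above for architectures without subword arithmetic.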
>> --- a/sysdeps/generic/stdint.h
>> +++ b/sysdeps/generic/stdint.h
>> @@ -87,12 +87,13 @@ typedef unsigned long long int    uint_least64_t;
>>  /* Fast types.  */
>>
>>  /* Signed.  */
>> -typedef signed char          int_fast8_t;
>>  #if __WORDSIZE == 64
>> +typedef long int             int_fast8_t;
>>  typedef long int             int_fast16_t;
>>  typedef long int             int_fast32_t;
>>  typedef long int             int_fast64_t;
>>  #else
>> +typedef int                  int_fast8_t;
>>  typedef int                  int_fast16_t;
>>  typedef int                  int_fast32_t;
>>  __extension__
>
>  So I find the choice of types above to be already questionable for a
> generic header.  By default I'd expect fast data types to have the same
> width as their fixed-width counterparts for the large benefit they provide
> with most architectures that do implement subword arithmetic weighed
> against the small loss they will likely incur with architectures that only
> implement word arithmetic.  Then individual ports could override the
> defaults as they see fit.
>
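A quick way to see which choice a given port has made is to print the widths of the fast types next to their fixed-width counterparts.  This is a small sketch; print_fast_widths is a hypothetical name, and it assumes the exact-width types int8_t, int16_t and int32_t exist on the target.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Print the size in bytes of each fast type next to the matching
   fixed-width type.  The values depend on the platform's stdint.h.  */
void print_fast_widths(void)
{
  /* A fast type must have at least the requested width.  */
  assert(sizeof(int_fast8_t) >= sizeof(int8_t));

  printf("int_fast8_t:  %zu (int8_t:  %zu)\n",
         sizeof(int_fast8_t), sizeof(int8_t));
  printf("int_fast16_t: %zu (int16_t: %zu)\n",
         sizeof(int_fast16_t), sizeof(int16_t));
  printf("int_fast32_t: %zu (int32_t: %zu)\n",
         sizeof(int_fast32_t), sizeof(int32_t));
}
```

Calling print_fast_widths() from main() on different targets shows where a port departs from the same-width default described above.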
>  At this point, however, I believe the discussion is moot -- there
> will already be relocatable objects out there with data embedded using
> these types, so the ABI has been set and I don't see a way of
> changing it without breaking binary compatibility.
>
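The ABI concern can be made concrete with a sketch of one record as it would be laid out before and after the proposed change.  Both struct names are hypothetical, and the field types are written out by hand to stand in for int_fast8_t under the current header (signed char) and the proposed 64-bit one (long int).

```c
#include <assert.h>
#include <stddef.h>

/* Layout seen by objects built against the current header, where
   int_fast8_t is signed char.  */
struct rec_current
{
  signed char flag;
  char tag;
};

/* Layout seen by objects built against the proposed header on a
   64-bit target, where int_fast8_t would be long int.  */
struct rec_proposed
{
  long int flag;
  char tag;
};

/* The two layouts disagree on size, so objects built against
   different headers cannot share this record.  */
static_assert(sizeof(struct rec_current) != sizeof(struct rec_proposed),
              "layouts differ");

size_t tag_offset_current(void)
{
  return offsetof(struct rec_current, tag);
}

size_t tag_offset_proposed(void)
{
  return offsetof(struct rec_proposed, tag);
}
```

An object file that reads tag at the old offset would get the wrong byte from a record written by code compiled against the new header, and the same applies to function arguments and return values of these types.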
>  References:
>
> [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
>     Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
>     "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21

This manual is very old, and the Intel Atom microarchitecture has since
been replaced by the Silvermont microarchitecture.

> [2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS
>     Technologies, Inc., Document Number: MD00672, November 2, 2009,
>     Subsection "High-Performance MDU", p. 6
>
>   Maciej



-- 
H.J.

