This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: IEEE128 binary float to decimal float conversion routines
- From: Steven Munroe <munroesj@linux.vnet.ibm.com>
- To: Joseph Myers <joseph@codesourcery.com>
- Cc: Steve Munroe <sjmunroe@us.ibm.com>, libc-alpha@sourceware.org, Michael R Meissner <mrmeissn@us.ibm.com>, "Paul E. Murphy" <murphyp@linux.vnet.ibm.com>, Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
- Date: Tue, 15 Dec 2015 15:18:46 -0600
- Subject: Re: IEEE128 binary float to decimal float conversion routines
- Authentication-results: sourceware.org; auth=none
- References: <564A16D5.3020105@linux.vnet.ibm.com> <alpine.DEB.2.10.1511161803500.30498@digraph.polyomino.org.uk> <564A6A90.40607@linux.vnet.ibm.com> <alpine.DEB.2.10.1511162356020.32387@digraph.polyomino.org.uk> <201511180131.tAI1Vs2L023118@d03av01.boulder.ibm.com> <alpine.DEB.2.10.1511180144150.2302@digraph.polyomino.org.uk> <201511182301.tAIN1Igc011083@d03av02.boulder.ibm.com> <alpine.DEB.2.10.1511182322260.26547@digraph.polyomino.org.uk> <1449594999.9274.45.camel@oc7878010663> <alpine.DEB.2.10.1512081737230.19569@digraph.polyomino.org.uk>
- Reply-to: munroesj@linux.vnet.ibm.com
On Tue, 2015-12-08 at 18:25 +0000, Joseph Myers wrote:
> On Tue, 8 Dec 2015, Steven Munroe wrote:
>
> > The PowerISA (2.05 and later) Decimal Floating-point "Round to Prepare
> > for Shorter Precision" mode would not address the Decimal128
> > convert/truncate to shorter binary floating-point (double or float).
> >
> > But it will address the _Float128 convert/truncate to shorter decimal
> > floating-point (_Decimal64 and _Decimal32).
>
> Yes, if you have a conversion from _Float128 to _Decimal128 that works for
> Round to Prepare for Shorter Precision then you could use that as an
> intermediate step in converting to _Decimal64 and _Decimal32 (it's not the
> most efficient approach, but it's certainly simpler than having multiple
> variants of the full conversion code).
>
> The hardest part is converting from _Float128 to _Decimal128. Once you
> can do that (for all rounding modes and with correct exceptions),
> converting to the narrower types is easy, whether you have multiple
> variants of the same code or use Round to Prepare for Shorter Precision.
> Likewise for conversions in the other direction - _Decimal128 to _Float128
> is the hardest part, if you can do that then converting to narrower types
> is straightforward.
>
> > So in the case of TIMode or KFmode conversion to _Decimal64/_Decimal32
> > we can save the current rounding mode (fe_dec_getround()) then use
> > fe_dec_setround (DEC_ROUND_05UP) to set the "Round to Prepare for
> > Shorter Precision" before the multiply that converts the mantissa to the
> > target radix. Then just before the instruction that rounds to the
> > final (_Decimal64 or _Decimal32) type, we restore the caller's rounding
> > mode and execute the final conversion in the correct rounding mode.
> >
> > I believe that addresses your double-rounding concern for these
> > conversions.
>
> For TImode it's not hard to avoid double rounding this way, by splitting
> the TImode number into two numbers that are exactly convertible to
> _Decimal128, so the only inexact operation is a single addition, which can
> be done in the Round to Prepare for Shorter Precision mode (and then you
> can convert to _Decimal64 / _Decimal32 in the original mode). [In all
> cases, getting the preferred quantum for decimal results is a minor matter
> to deal with at the end.]
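(The split described above is easy to check numerically. A Python sketch, not the proposed glibc code; the choice of peeling off the low 5 digits is mine: a TImode value has at most 39 decimal digits, so the high part has at most 34 digits, and 34 digits is exactly the _Decimal128 significand.)

```python
def split_for_decimal128(x):
    """Split a non-negative 128-bit integer into hi * 10**5 + lo so that
    each part has at most 34 significant decimal digits and is therefore
    exactly representable in _Decimal128."""
    hi, lo = divmod(x, 10**5)
    assert len(str(hi)) <= 34 and len(str(lo)) <= 34
    return hi, lo

# Worst case: the largest-magnitude TImode value (39 decimal digits).
hi, lo = split_for_decimal128(2**127 - 1)
assert hi * 10**5 + lo == 2**127 - 1   # only this final add can be inexact
```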
>
> For _Float128, this only reduces the problem to doing a conversion of
> _Float128 to _Decimal128 in that mode. Which is not simply a single
> multiply. Not all mantissa values for _Float128 can be represented in
> _Decimal128 (2**113 > 10**34). And nor can all powers of 2 that you need
> to multiply / divide by be represented in _Decimal128. And when you have
> more than one inexact operation, the final result is generally not
> correctly rounded for any rounding mode. And so the complexity goes
> massively up (compare the fmaf implementation with round-to-odd on double
> - a single inexact addition on double done in round-to-odd followed by
> converting back to float in the original rounding mode - with the
> sysdeps/ieee754/dbl-64/s_fma.c code, which also uses round-to-odd, but
> with far more complexity in order to achieve the precision extension
> required for intermediate computations).
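(The double-rounding-avoidance property of round-to-odd can be checked with exact rationals. A Python sketch, with `round_sig` as my own helper; 113 and 53 bits stand in for the extended and final precisions, and the property requires the intermediate precision to exceed the final one by at least 2 bits.)

```python
from fractions import Fraction
import random

def round_sig(x, p, odd=False):
    """Round a positive Fraction to p significant bits, using either
    round-to-nearest-even (default) or round-to-odd."""
    e = x.numerator.bit_length() - x.denominator.bit_length() - p
    ulp = Fraction(2)**e
    while x >= ulp * 2**p:          # normalize so the quotient has p bits
        ulp *= 2
    while x < ulp * 2**(p - 1):
        ulp /= 2
    q = x // ulp
    r = x - q * ulp
    if r == 0:
        return q * ulp
    if odd:
        q |= 1                      # round-to-odd: force the last bit on
    elif r > ulp / 2 or (r == ulp / 2 and q & 1):
        q += 1                      # round-to-nearest, ties to even
    return q * ulp

# Rounding first to 113 bits with round-to-odd, then to 53 bits in the
# normal mode, matches rounding directly to 53 bits (113 >= 53 + 2).
rng = random.Random(1)
for _ in range(200):
    x = Fraction(rng.getrandbits(160) + 1, rng.getrandbits(160) + 1)
    assert round_sig(round_sig(x, 113, odd=True), 53) == round_sig(x, 53)
```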
>
> You may well be able to use precision-extension techniques - so doing a
> conversion that produces a sum of two or three _Decimal128 values (the
> exact number needed being determined by a continued fraction analysis) and
> then adding up those values in the Round to Prepare for Shorter Precision
> mode. But I'd be surprised if there is a simple and correct
> implementation of the conversion that doesn't involve extending
> intermediate precision to have about 128 extra bits, given the complexity
> and extra precision described in the papers on this subject such as the
> one referenced in this thread.
>
> > My observation is that a common element of these conversion is a large
> > precision multiply (to convert the radix of the mantissa) then a
> > possible truncation (with rounding) to the final precision in the new
> > radix.
>
> Where large precision means about 256 bits (not simply 128 * 128 -> 256
> multiplication, but also having the powers of 2 or 10 to that precision,
> so more like 128 * 256 -> 384 which can be truncated to about 256).
> Again, exact precisions to be determined by continued fraction analysis.
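(A toy fixed-point version of that wide multiply, in Python; the 256-bit scaling and the 10**5 constant are mine and only illustrate the operand widths, not the precisions a real continued fraction analysis would pick.)

```python
# Scale a reciprocal power of ten to 256 fraction bits, multiply by a
# worst-case 113-bit significand (113 x 256 -> at most a 369-bit
# product), then truncate the product back down.
FRAC = 256
recip = (1 << FRAC) // 10**5        # fixed-point approximation of 1/10**5
m = 2**113 - 1                      # worst-case binary significand
prod = m * recip
approx = prod >> FRAC               # truncated approximation of m / 10**5

assert prod.bit_length() <= 113 + 256
assert abs(approx - m // 10**5) <= 1   # truncation error stays under 1 unit
```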
>
Ok, let me try with the simpler case of _Decimal128 to _Float128, where the
significand conversion is exact (log2(10^34) ~= 112.9, so <= 113 bits).
You mention "continued fraction analysis", which was not part of my
formal education (40+ years ago), but I will try.
The question is how many significant bits it takes to represent a
power of 10. This is interesting because my implementation of trunctfkf
involves a multiply of the converted (to _Float128) mantissa by 10^N, where N
is the exponent of the original _Decimal128. So which powers of 10 can be
represented exactly as a _Float128?
The required significant bits would be log2(10^N), but since 10^N = 2^N * 5^N,
the binary form of an exact power of 10 has one trailing zero bit for each
factor of 10 (1000 has 3 trailing zeros, 1000000 has 6, ...).
So the number of significant bits is log2(10^N) - N. A quick binary search
shows that powers up to 10^48 require no more than 113 bits and so can
be represented exactly in _Float128.
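That claim is easy to check mechanically; a quick Python sanity check (since 10^N = 2^N * 5^N, the significant bits are just the bit length of 5^N):

```python
def sig_bits(n):
    """Significant bits of 10**n: 10**n = 2**n * 5**n, so the n factors
    of two contribute only trailing zeros and 5**n carries the rest."""
    return (5**n).bit_length()

# 10**n really does have exactly n trailing zero bits ...
assert all(10**n % 2**n == 0 and (10**n >> n) & 1 for n in range(1, 60))
# ... and 10**0 .. 10**48 fit in a 113-bit significand; 10**49 does not.
assert max(sig_bits(n) for n in range(49)) <= 113
assert sig_bits(49) > 113
```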
So any _Decimal128 < 9999999999999999999999999999999999e48 (about 1.0e82) can
be converted with one _Float128 multiply of 2 exact values, giving a
result rounded to within 1 ulp.
This does not require conversion to string and back, or carrying more
precision than is naturally available in _Float128.
Now, as the exponent of the _Decimal128 input exceeds 48, the table of
_Float128 powers of 10 will contain values that have been rounded. I
assume that some additional exponent range can be covered by ensuring
that the table of _Float128 powers_of_10 has been pre-rounded to odd?
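As a sanity check of the round-to-odd idea on a table entry itself (a Python sketch on integers; `round_bits` is my own helper, and this only exercises the table-entry rounding step, not the subsequent inexact multiply):

```python
def round_bits(x, p, odd=False):
    """Round a positive integer to p significant bits, keeping the same
    magnitude: round-to-nearest-even by default, or round-to-odd."""
    s = x.bit_length() - p
    if s <= 0:
        return x
    q, r = x >> s, x & ((1 << s) - 1)
    if r:
        if odd:
            q |= 1                  # round-to-odd: force the last kept bit on
        elif r > 1 << (s - 1) or (r == 1 << (s - 1) and q & 1):
            q += 1                  # round-to-nearest, ties to even
    return q << s

# Pre-rounding a power of ten to 113 bits with round-to-odd, then
# rounding that to 53 bits nearest-even, matches rounding the exact
# power of ten directly to 53 bits: no double-rounding error.
for n in range(49, 200):
    assert round_bits(round_bits(10**n, 113, odd=True), 53) == round_bits(10**n, 53)
```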
Do you agree with this analysis?