This is the mail archive of the
mailing list for the glibc project.
Re: IEEE128 binary float to decimal float conversion routines
- From: Steven Munroe <munroesj at linux dot vnet dot ibm dot com>
- To: Joseph Myers <joseph at codesourcery dot com>
- Cc: Steve Munroe <sjmunroe at us dot ibm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, Michael R Meissner <mrmeissn at us dot ibm dot com>, "Paul E. Murphy" <murphyp at linux dot vnet dot ibm dot com>, Tulio Magno Quites Machado Filho <tuliom at linux dot vnet dot ibm dot com>
- Date: Tue, 08 Dec 2015 11:16:39 -0600
- Subject: Re: IEEE128 binary float to decimal float conversion routines
- Authentication-results: sourceware.org; auth=none
- References: <564A16D5 dot 3020105 at linux dot vnet dot ibm dot com> <alpine dot DEB dot 2 dot 10 dot 1511161803500 dot 30498 at digraph dot polyomino dot org dot uk> <564A6A90 dot 40607 at linux dot vnet dot ibm dot com> <alpine dot DEB dot 2 dot 10 dot 1511162356020 dot 32387 at digraph dot polyomino dot org dot uk> <201511180131 dot tAI1Vs2L023118 at d03av01 dot boulder dot ibm dot com> <alpine dot DEB dot 2 dot 10 dot 1511180144150 dot 2302 at digraph dot polyomino dot org dot uk> <201511182301 dot tAIN1Igc011083 at d03av02 dot boulder dot ibm dot com> <alpine dot DEB dot 2 dot 10 dot 1511182322260 dot 26547 at digraph dot polyomino dot org dot uk>
- Reply-to: munroesj at linux dot vnet dot ibm dot com
On Wed, 2015-11-18 at 23:53 +0000, Joseph Myers wrote:
> On Wed, 18 Nov 2015, Steve Munroe wrote:
> > > The problem I see is with the final "result = temp;" which converts
> > double
> > > to float.
> > >
> > > The earlier steps are probably accurate to within 1ulp. But if temp (a
> > > double) is half way between two representable float values - while the
> > > original argument is very close to that half way value, but not exact -
> > > then the final conversion will round to even, which may or may not be
> > > correct depending on which side of that double value the original
> > > _Decimal128 value was. (Much the same applies in other rounding modes
> > > when the double value equals a float value but the original value isn't
> > > exactly that float value.)
> > >
> > Would changing the the decimal to binary conversion to be round to odd,
> > offset the following round double to float?
> > http://www.exploringbinary.com/gcc-avoids-double-rounding-errors-with-round-to-odd/
> No, because it would just offload the problem onto getting a conversion
> from _Decimal128 to double that is correctly rounded to odd, which is no
> easier (indeed, requires more work, not less) than the original problem of
> converting to float.
Joseph I think we are talking pass each other on this.
The PowerISA (2.05 and later ) Decimal Floating-point "Round to Prepare
for Shorter Precision" mode would not address the Decimal128
convert/truncate to shorter binary floating-point (double or float).
But it will address the Float128 convert/truncate to to shorter decimal
floating-pointer (_Decimal64 and _Decimal32).
> The existing code loses some of the original precision when taking just 15
> digits of the mantissa for conversion to double (not OK when you want to
> determine the exact value rounded to odd after further operations - in the
> hard cases, the final decimal digit will affect the correct rounding).
> Then the multiplications / divisions by precomputed powers of 10 use a
> table of long double values - while that gives extra precision (though
> probably not enough extra precision), it's also incompatible with doing
> rounding to odd, since IBM long double doesn't give meaningful "inexact"
> exceptions or work in non-default rounding modes, while rounding to odd
> requires working in round-to-zero mode and then checking the "inexact"
So in the case of TIMode or KFmode conversion to _Decimal64/_Decimal32
we can save the current rounding mode (fe_dec_getround()) then use
fe_dec_setround (DEC_ROUND_05UP) to set the "Round to Prepare for
Shorter Precision" before the multiply that converts the mantissa to the
target radix. Then just before the the instruction that rounds to the
final (_Decimal64 or _Decimal32) type, we restore the callers rounding
more and execute the final version in the correct rounding mode.
I believe that addresses you double rounding concern for these
And now we are address the similar issues with float128 with the recent
public release of the PowerISA-3.0:
The PowerISA-3.0 document is here: http://ibm.biz/power-isa3
As an addition to the Vector Scalar extents we add support for IEEE
128-bit binary floating point. This architecture provides the option to
override the default round-mode and force "round to odd" on a
instruction basis for all the key operators. This obviously includes
add/subtract/multiple/divide/sqrt and convert convert quad to double.
This will be sufficient to support the _Decimal128 to double and float
conversion that you mentioned as problematic.
While we are waiting for the hardware implementation of the
PowerISA-3.0, we need to address the specifics of these conversions in
the soft-fp implementation.
My observation is that a common element of these conversion is a large
precision multiply (to convert the radix of the mantissa) then a
possible truncation (with rounding) to the final precision in the new
It seem a simple effort to provide a soft-fp implementation that
combines the multiple and truncation, without intermediate rounding.
As the soft-fp implementation performs rounding in the FP_PACK
operation, it simple to avoid the intermediate PACK/UNPACK steps between
the FP_MUL and FP_TRUNC operations. The only trick is that arithmetic
operations user the canonical PACK/UNPACK operation while the Truncate
operations use the SEMIRAW PACK/UNPACK. Some adjustments of the exponent
are requires to flow the (unrounded) result of the FP_MUL directly into
the FP_TRUNC operation, but this does not seem to require a large
effort. The final FP_PACK operation following the FP_TRUNC will perform
the final and correct rounding to the require precision.
This seems sufficient to address the issues you have raised and seems
much simpler then wholesale additions of round to odd to the soft-fp