This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: IEEE128 binary float to decimal float conversion routines
- From: Joseph Myers <joseph at codesourcery dot com>
- To: Steve Munroe <sjmunroe at us dot ibm dot com>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, Michael R Meissner <mrmeissn at us dot ibm dot com>, "Paul E. Murphy" <murphyp at linux dot vnet dot ibm dot com>, Tulio Magno Quites Machado Filho <tuliom at linux dot vnet dot ibm dot com>
- Date: Wed, 18 Nov 2015 02:03:49 +0000
- Subject: Re: IEEE128 binary float to decimal float conversion routines
- References: <564A16D5 dot 3020105 at linux dot vnet dot ibm dot com> <alpine dot DEB dot 2 dot 10 dot 1511161803500 dot 30498 at digraph dot polyomino dot org dot uk> <564A6A90 dot 40607 at linux dot vnet dot ibm dot com> <alpine dot DEB dot 2 dot 10 dot 1511162356020 dot 32387 at digraph dot polyomino dot org dot uk> <201511180131 dot tAI1Vs2L023118 at d03av01 dot boulder dot ibm dot com>
On Wed, 18 Nov 2015, Steve Munroe wrote:
> I only see one rounding associated with the BINPOWOF10[sexp] multiply/divide.
>
> The mant = a_norm * 1E+15DL operation is a scaling in decimal and should be
> exact.
>
> The temp = mant operation is a decimal to long conversion which will cause
> a truncation to 15 digits.
>
> So my analysis is this code does not double round. Do you think the
> truncation is an issue?
The problem I see is with the final "result = temp;" which converts double
to float.
The earlier steps are probably accurate to within 1 ulp. But if temp (a
double) is exactly halfway between two representable float values, while
the original argument is very close to that halfway value but not equal to
it, then the final conversion will round to even, which may or may not be
correct depending on which side of that double value the original
_Decimal128 value was. (Much the same applies in other rounding modes
when the double value equals a float value but the original value isn't
exactly that float value.)
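
To make the failure mode concrete, here is a minimal sketch of the same
double-rounding effect using a 64-bit integer in place of the _Decimal128
argument (that substitution is mine, so the example runs without decimal
float support). The names round_once, round_twice and hard_case are
hypothetical and are not taken from the patch under review. It assumes the
default round-to-nearest mode and an implementation whose integer-to-double
conversion rounds to nearest.

```c
#include <stdint.h>

/* Reference: round a 64-bit integer to float precision (24 significant
   bits) exactly once, ties to even, using integer arithmetic only. */
float round_once(uint64_t x)
{
    int shift = 0;
    while ((x >> shift) >= (UINT64_C(1) << 24))
        shift++;                       /* number of low bits to discard */
    if (shift == 0)
        return (float)x;               /* already fits in 24 bits: exact */

    uint64_t kept = x >> shift;
    uint64_t rem  = x & ((UINT64_C(1) << shift) - 1);
    uint64_t half = UINT64_C(1) << (shift - 1);
    if (rem > half || (rem == half && (kept & 1)))
        kept++;                        /* round to nearest, ties to even */

    /* kept <= 2^24 and 2^shift are exact floats, and so is their product. */
    return (float)kept * (float)(UINT64_C(1) << shift);
}

/* Analogue of the final "result = temp;" step: round to double first,
   then to float.  Assumes round-to-nearest for both conversions. */
float round_twice(uint64_t x)
{
    volatile double temp = (double)x;  /* first rounding, to 53 bits */
    return (float)temp;                /* second rounding, to 24 bits */
}

/* 2^60 + 2^36 + 1: just above the midpoint of the adjacent floats
   2^60 and 2^60 + 2^37. */
const uint64_t hard_case =
    (UINT64_C(1) << 60) + (UINT64_C(1) << 36) + 1;
```

For hard_case, round_once returns 2^60 + 2^37, the nearer float. In
round_twice the conversion to double drops the trailing 1, leaving exactly
the midpoint 2^60 + 2^36, and the tie then rounds to even, giving 2^60.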
I haven't done the detailed analysis with continued fractions to determine
the worst cases for conversion of _Decimal128 to float (it is, however,
clearly possible to determine the worst cases like that with only a small
amount of computation needed, unlike the large exhaustive searches needed
for worst cases for correctly rounded transcendental functions). Nor have
I read the paper Christoph helpfully pointed out. But heuristically, if
you have a 128-bit input, you can expect there to be some input values for
which, on converting to binary, the initial 24 bits are followed either by
a 1, then about 127 0s, then other nonzero bits, or by a 0, then about 127
1s, just by random chance. So you expect to need about 24 + 128 bits of
internal precision for the conversion in order to get a result that rounds
correctly when truncated to float.
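
The counting behind that heuristic can be written out roughly as follows.
This is only a sketch: it assumes the bits after the first 24 behave like
independent fair coin flips, and it takes N = 2^128 as the number of
distinct inputs. Both are modelling assumptions, not results of the
continued-fraction analysis mentioned above, and constant factors are
ignored.

```latex
\[
  \Pr\bigl[\text{the } k \text{ bits after the first 24 are }
           1\,0\cdots0 \text{ or } 0\,1\cdots1\bigr] \approx 2^{-k},
  \qquad
  \mathbb{E}[\text{number of such inputs}] \approx N \cdot 2^{-k} = 2^{128-k}.
\]
\[
  2^{128-k} \approx 1 \iff k \approx 128
  \quad\Longrightarrow\quad
  \text{internal precision needed} \approx 24 + 128 \text{ bits}.
\]
```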
(Actually you expect a few bits less than that to be needed because almost
all the exponent range of _Decimal128 is outside the range of float. But
that doesn't change the basic analysis, that neither double, long double
nor __float128 is expected to have enough precision as an intermediate
type for correctly rounded results. Cf. the BID code in libgcc using at
least 256-bit precision for internal computations.)
--
Joseph S. Myers
joseph@codesourcery.com