This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: IEEE128 binary float to decimal float conversion routines


On Wed, 18 Nov 2015, Steve Munroe wrote:

> I only see one rounding associated with the BINPOWOF10[sexp] multiply/divide.
> 
> The mant = a_norm * 1E+15DL operation is a scaling in decimal and should be
> exact.
> 
> The temp = mant operation is a decimal to long conversion which will cause
> a truncation to 15 digits.
> 
> So my analysis is that this code does not double round. Do you think the
> truncation is an issue?

The problem I see is with the final "result = temp;" which converts double 
to float.

The earlier steps are probably accurate to within 1ulp.  But if temp (a
double) is exactly halfway between two representable float values, while
the original argument is very close to that halfway value but not equal
to it, then the final conversion will round to even, which may or may not
be correct depending on which side of that double value the original
_Decimal128 value was.  (Much the same applies in other rounding modes
when the double value equals a float value but the original value isn't
exactly that float value.)

I haven't done the detailed analysis with continued fractions to determine 
the worst cases for conversion of _Decimal128 to float (it is, however, 
clearly possible to determine the worst cases like that with only a small 
amount of computation needed, unlike the large exhaustive searches needed 
for worst cases for correctly rounded transcendental functions).  Nor have 
I read the paper Christoph helpfully pointed out.  But heuristically, if
you have a 128-bit input, you can expect there to be some input values for
which, on converting to binary, the initial 24 bits are followed by a 1
and then about 127 0s before any other nonzero bits (or likewise by a 0
and then about 127 1s), just by random chance.  So you expect to need
about 24 + 128 bits of internal precision for the conversion in order to
get a result that rounds correctly when narrowed to float.

(Actually you expect a few bits less than that to be needed because almost 
all the exponent range of _Decimal128 is outside the range of float.  But 
that doesn't change the basic analysis, that neither double, long double 
nor __float128 is expected to have enough precision as an intermediate 
type for correctly rounded results.  Cf. the BID code in libgcc using at 
least 256-bit precision for internal computations.)

-- 
Joseph S. Myers
joseph@codesourcery.com

