This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH v2 1/1] aarch64: Add optimized version of sinf and cosf

From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
To: Ashwin Sekhar T K <ashwin dot sekhar at caviumnetworks dot com>, libc-alpha at sourceware dot org
Cc: nd at arm dot com
Date: Thu, 06 Jul 2017 13:30:39 +0100
Subject: Re: [PATCH v2 1/1] aarch64: Add optimized version of sinf and cosf
Authentication-results: sourceware.org; auth=none
Authentication-results: arm.com; dkim=none (message not signed) header.d=none;arm.com; dmarc=none action=none header.from=arm.com;
Nodisclaimer: True
References: <20170613071707.43396-1-ashwin.sekhar@caviumnetworks.com> <20170706095516.34985-1-ashwin.sekhar@caviumnetworks.com> <20170706095516.34985-2-ashwin.sekhar@caviumnetworks.com>
Spamdiagnosticmetadata: NSPM
Spamdiagnosticoutput: 1:99

On 06/07/17 10:55, Ashwin Sekhar T K wrote:
> The algorithm is based on the sinf and cosf implementations
> for powerpc in ./powerpc/powerpc64/power8/fpu and that for
> x86 in ./x86_64/fpu.
> 
> 	* sysdeps/aarch64/fpu/chebyshev.h: New file.
> 	* sysdeps/aarch64/fpu/reduce.h: Likewise.
> 	* sysdeps/aarch64/fpu/s_cosf.c: Likewise.
> 	* sysdeps/aarch64/fpu/s_sinf.c: Likewise.

better than asm, but i think this should be the generic
implementation if all targets want it.

> +/* sin (x) = x+x^3*(S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4)))),
> +   where 2^-5<=|x|<Pi/4.  */
> +static inline double
> +chebyshev_sin_polynomial_1 (double x)
> +{
> +	double S0, S1, S2, S3, S4, y;
> +
> +	S0 = -1.66666666666265311791406134034332e-01;
> +	S1 = 8.33333332439092043519845987020744e-03;
> +	S2 = -1.98412633515623692105969699817081e-04;
> +	S3 = 2.75552591873811586009688362475245e-06;
> +	S4 = -2.47545996176987174320511533257699e-08;
> +	y = x * x;
> +	return  x * (1.0 + y * (S0 + y * (S1 + y * (S2 + y * (S3 + y * S4)))));
> +}

i think one less coefficient should be enough
(may be optimized for max norm instead of chebyshev)
and modern cores tend to have more than one fp
pipelines (most targets, not just aarch64) so
evaluating the polynomial differently would be
faster (more multiplications but less latency
since they can be done in parallel).

> +/* 4/Pi broken into sum of positive DP values.  */
> +static double INVPIO4_TABLE[25] = {
> +	0.00000000000000000000000000000000e+00,
> +	1.27323953807353973388671875000000e+00,
> +	6.66162294771233121082332218065858e-09,
> +	4.55202009562027200395410188316081e-18,
> +	5.05365203780056430379174094616698e-26,
> +	3.33657353140390501092355175630980e-34,
> +	2.31774265657771014271759915567725e-43,
> +	1.41079183488085906188916890478151e-51,
> +	1.78201357714620429447917285584916e-59,
> +	6.45440934111020426845713721490652e-68,
> +	2.96289605657163538186678664280422e-77,
> +	2.34290278673081796193027756956770e-85,
> +	6.89165747744598758512920848666612e-94,
> +	2.61827895738527799147711877424548e-102,
> +	5.22516501694879285329083609946870e-111,
> +	2.31723558129677581012126324525784e-119,
> +	5.36762980505877895920512518100769e-128,
> +	2.49914273179058265651805723374739e-135,
> +	3.32028340088181692034775714274009e-144,
> +	1.98261407071607720791943444409774e-152,
> +	1.34423330943468258263688817309715e-162,
> +	2.61360711509865349546753597258351e-169,
> +	2.26640747502086603170125306310953e-178,
> +	2.39096791372273354372455558796085e-186,
> +	2.14336400443697767564443045577643e-194
> +};
> +
> +/* Range Reduction for Pi/4<=|x|<Inf
> +
> +   Returns n and y, where
> +    y = |x| - j * Pi/4,
> +    j = (k + 1) & 0xfffffffe,
> +    n = (k + 1) & 0x7,
> +    k = trunc (|x| / (Pi / 4)).
> +
> +   In other words,
> +    n is the pi/4 octant that x falls into in a trignometric plane.
> +    y is the difference between x and the multiple of pi/2 closest to x.  */
> +

i guess this is fine, but it is possible to do arg reduction
in a simpler way with mostly int arithmetics and then less
data is needed (24bytes instead of 200).

> +float
> +__cosf (float x)
> +{
> +	uint32_t ix;
> +
> +	GET_FLOAT_WORD (ix, x);
> +	ix &= I_SPABS_MASK;
> +
> +	if (ix >= I_SPABS_PIO4)
> +		goto large_args;
> +
> +	if (ix < I_SPABS_2POWN5)
> +		goto small_args;
> +
> +	/* Here if 2^-5<=|x|<Pi/4.  */
> +	return chebyshev_cos_polynomial_1 (x);
> +
> +large_args:
> +	if (ix < I_SPABS_INF)
> +	{
> +		/* Here if Pi/4<=|x|<Inf.  */
> +		double t, sign, val;
> +		int n;
> +
> +		n = reducef (x, &t);
> +
> +		if (((n + 2) >> 2) & 0x1)
> +			sign = -1.0;
> +		else
> +			sign = 1.0;
> +
> +		if ((n + 2) & 0x2)
> +			val = chebyshev_cos_polynomial_1 (t);
> +		else
> +			val = chebyshev_sin_polynomial_1 (t);
> +
> +		return sign * val;
> +	}
> +	/* Here if x is Inf or Nan.  */
> +	if (ix == I_SPABS_INF)
> +		__set_errno (EDOM);
> +	return x - x;

i'd do the failure handling separately (tailcall)
so __sinf and __cosf are leaf functions.

References:
- [PATCH v2 0/1] aarch64: Add optimized version of sinf and cosf
  - From: Ashwin Sekhar T K
- [PATCH v2 1/1] aarch64: Add optimized version of sinf and cosf
  - From: Ashwin Sekhar T K

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]