[PATCH 1/4] aarch64: Add vector implementations of cos routines

Tue Jun 13 19:56:58 GMT 2023

On 08/06/23 10:39, Joe Ramsay via Libc-alpha wrote:
> Replace the loop-over-scalar placeholder routines with optimised
> implementations from Arm Optimized Routines (AOR).
> 
> Also add some headers containing utilities for aarch64 libmvec
> routines, and update libm-test-ulps.
> 
> AOR exposes a config option, WANT_SIMD_EXCEPT, to enable
> selective masking (and later fixing up) of invalid lanes, in
> order to trigger fp exceptions correctly (AdvSIMD only). This is
> tested and maintained in AOR, however it is configured off at
> source level here for performance reasons. We keep the
> WANT_SIMD_EXCEPT blocks in routine sources to greatly simplify
> the upstreaming process from AOR to glibc.
> ---
>  sysdeps/aarch64/fpu/cos_advsimd.c             |  81 ++++++-
>  sysdeps/aarch64/fpu/cos_sve.c                 |  73 ++++++-
>  sysdeps/aarch64/fpu/cosf_advsimd.c            |  76 ++++++-
>  sysdeps/aarch64/fpu/cosf_sve.c                |  70 ++++++-
>  sysdeps/aarch64/fpu/sv_math.h                 | 141 +++++++++++++
>  sysdeps/aarch64/fpu/sve_utils.h               |  55 -----
>  sysdeps/aarch64/fpu/v_math.h                  | 197 ++++++++++++++++++
>  .../fpu/{advsimd_utils.h => vecmath_config.h} |  30 ++-
>  sysdeps/aarch64/libm-test-ulps                |   2 +-
>  9 files changed, 629 insertions(+), 96 deletions(-)
>  create mode 100644 sysdeps/aarch64/fpu/sv_math.h
>  delete mode 100644 sysdeps/aarch64/fpu/sve_utils.h
>  create mode 100644 sysdeps/aarch64/fpu/v_math.h
>  rename sysdeps/aarch64/fpu/{advsimd_utils.h => vecmath_config.h} (57%)
> 
> diff --git a/sysdeps/aarch64/fpu/cos_advsimd.c b/sysdeps/aarch64/fpu/cos_advsimd.c
> index 40831e6b0d..1f7a7023f5 100644
> --- a/sysdeps/aarch64/fpu/cos_advsimd.c
> +++ b/sysdeps/aarch64/fpu/cos_advsimd.c
> @@ -17,13 +17,82 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>  
> -#include <math.h>
> +#include "v_math.h"
>  
> -#include "advsimd_utils.h"
> +static const volatile struct
> +{
> +  float64x2_t poly[7];
> +  float64x2_t range_val, shift, inv_pi, half_pi, pi_1, pi_2, pi_3;
> +} data = {
> +  /* Worst-case error is 3.3 ulp in [-pi/2, pi/2].  */
> +  .poly = { V2 (-0x1.555555555547bp-3), V2 (0x1.1111111108a4dp-7),
> +	    V2 (-0x1.a01a019936f27p-13), V2 (0x1.71de37a97d93ep-19),
> +	    V2 (-0x1.ae633919987c6p-26), V2 (0x1.60e277ae07cecp-33),
> +	    V2 (-0x1.9e9540300a1p-41) },
> +  .inv_pi = V2 (0x1.45f306dc9c883p-2),
> +  .half_pi = V2 (0x1.921fb54442d18p+0),
> +  .pi_1 = V2 (0x1.921fb54442d18p+1),
> +  .pi_2 = V2 (0x1.1a62633145c06p-53),
> +  .pi_3 = V2 (0x1.c1cd129024e09p-106),
> +  .shift = V2 (0x1.8p52),
> +  .range_val = V2 (0x1p23)
> +};
> +
> +#define C(i) data.poly[i]
> +
> +static float64x2_t VPCS_ATTR NOINLINE

Why is NOINLINE needed here?  Are you trying to optimize for code size?
With stack protector enabled I do see a small code size increase, which
does not happen without stack protector.

Otherwise, I don't think it gains you much in terms of code reorganization.