Bug 21236

Summary: NaN generation by optimized math functions
Product: glibc
Reporter: Charles Schwieters <charles>
Component: dynamic-link
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED DUPLICATE
Severity: normal
CC: carlos, charles, fweimer, hjl.tools
Priority: P2
Flags: fweimer: security-
Version: 2.24
Target Milestone: ---
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1421155
https://sourceware.org/bugzilla/show_bug.cgi?id=20495
https://sourceware.org/bugzilla/show_bug.cgi?id=21265
https://sourceware.org/bugzilla/show_bug.cgi?id=22636
Host:
Target:
Build:
Last reconfirmed: 2017-03-14 00:00:00

Description Charles Schwieters 2017-03-08 19:26:56 UTC
In code compiled with the Intel Fortran compiler (ifort-v11 (IFORT) 16.0.3 20160415) with optimization enabled, my very old code suddenly started misbehaving, producing NaNs unexpectedly. I traced this to Debian's upgrade of libc6 from 2.24-8 to 2.24-9. Further investigation showed that git commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604 was the culprit. Reverting this commit makes things work again (as this binary does on many, many different versions of glibc).

Rather than attempt a fix myself (I am not qualified), I can describe the behavior: the problem seems to occur only on machines with an AVX CPU flag. On machines that list no AVX CPU flags, the bad behavior does not occur. Unfortunately, I have been unable to run this on a machine with AVX512 instructions.
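
To sort test machines into the two groups described above, the CPU flags can be read from /proc/cpuinfo on Linux. The following is only a minimal sketch and not part of the original report: the helper names cpu_flags and has_avx are hypothetical, and it assumes the Linux-specific "flags" line format.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx(cpuinfo_text):
    """True if any AVX-family flag (avx, avx2, avx512f, ...) is listed."""
    return any(flag.startswith("avx") for flag in cpu_flags(cpuinfo_text))


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only; assumed location of the CPU info
    text = path.read_text() if path.exists() else ""
    avx_flags = sorted(f for f in cpu_flags(text) if f.startswith("avx"))
    print("AVX flags:", avx_flags)
```

Running it on each machine and comparing the output with whether the NaNs appear gives a quick way to confirm or refute the correlation.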
Comment 1 Florian Weimer 2017-03-09 07:40:18 UTC
Can you reproduce this with current upstream master?  Instructions for running a program against the newly built glibc are here:

https://sourceware.org/glibc/wiki/Testing/Builds#Compile_normally.2C_run_under_new_glibc

I suggest you engage Intel support.  Without access to your binary, we cannot tell if this is an Intel compiler issue, or a problem with the dynamic linker trampoline.  The trampoline is not obviously wrong.
Comment 2 Charles Schwieters 2017-03-09 17:09:49 UTC
I have reproduced the behavior with the upstream master, including fixing things by reverting that single commit. Thanks to the good documentation you pointed me to, I was able to run this on more platforms, including a Xeon Phi 7230. Here is a summary of the results:

Platform           NaN occurs
Xeon Phi 7230      no
Xeon CPU E5-2670   yes
Xeon E5-1650 v2    yes
Pentium 4405       no
AMD Opteron 2218   no

My binary is available for this sort of analysis, but Intel may be the correct place to turn.
Comment 3 Carlos O'Donell 2017-03-14 14:11:16 UTC
H.J.,

This looks like a serious issue with the dynamic loader support code that saves/restores the AVX-related registers.

Do you have time to look into this?
Comment 4 H.J. Lu 2017-03-14 14:54:08 UTC
Please verify that

+_dl_runtime_resolve_avx_slow:
+  cfi_startproc
+  cfi_adjust_cfa_offset(16) # Incorporate PLT
+  vorpd %ymm0, %ymm1, %ymm8
+  vorpd %ymm2, %ymm3, %ymm9
+  vorpd %ymm4, %ymm5, %ymm10
+  vorpd %ymm6, %ymm7, %ymm11
+  vorpd %ymm8, %ymm9, %ymm9
+  vorpd %ymm10, %ymm11, %ymm10
+  vpcmpeqd %xmm8, %xmm8, %xmm8
+  vorpd %ymm9, %ymm10, %ymm10
+  vptest %ymm10, %ymm8

is the cause by changing it to

_dl_runtime_resolve_avx_slow:
  jmp _dl_runtime_resolve_avx
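
It may help to read the instruction sequence above as a plain bit operation. The sketch below is an illustrative model only, not glibc code: register contents are modelled as Python integers, the function name upper_halves_are_zero is hypothetical, and the vptest semantics follow my reading of the instruction (CF is set when the source ANDed with the complement of the destination is zero). The branch that consumes the result is not shown in the excerpt, so the model stops at the test. Note that the sequence writes to %ymm8-%ymm11 as scratch, which matters if a caller keeps live values there.

```python
MASK_256 = (1 << 256) - 1  # all bits of a 256-bit register
LOW_128 = (1 << 128) - 1   # low 128 bits set, upper 128 bits clear


def upper_halves_are_zero(ymm):
    """Model the check on a list of 16 ints standing in for %ymm0-%ymm15.

    Returns True when the upper 128 bits of %ymm0-%ymm7 are all zero.
    Like the assembly, it overwrites entries 8-11 as scratch.
    """
    ymm[8] = ymm[0] | ymm[1]      # vorpd %ymm0, %ymm1, %ymm8
    ymm[9] = ymm[2] | ymm[3]      # vorpd %ymm2, %ymm3, %ymm9
    ymm[10] = ymm[4] | ymm[5]     # vorpd %ymm4, %ymm5, %ymm10
    ymm[11] = ymm[6] | ymm[7]     # vorpd %ymm6, %ymm7, %ymm11
    ymm[9] = ymm[8] | ymm[9]      # vorpd %ymm8, %ymm9, %ymm9
    ymm[10] = ymm[10] | ymm[11]   # vorpd %ymm10, %ymm11, %ymm10
    ymm[8] = LOW_128              # vpcmpeqd %xmm8, %xmm8, %xmm8 (upper half zeroed)
    ymm[10] = ymm[9] | ymm[10]    # vorpd %ymm9, %ymm10, %ymm10
    # vptest %ymm10, %ymm8: carry flag set when (ymm10 AND NOT ymm8) == 0
    return (ymm[10] & ~ymm[8] & MASK_256) == 0


if __name__ == "__main__":
    regs = [0] * 16
    regs[8] = 0x1234  # hypothetical live value a caller left in %ymm8
    print(upper_halves_are_zero(regs), hex(regs[8]))
```

In this model the value placed in entry 8 does not survive the call, which is the kind of clobbering a caller relying on %ymm8-%ymm11 would observe.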
Comment 5 Charles Schwieters 2017-03-14 15:27:59 UTC
Correct. This patch gets rid of the bad behavior:

diff --git a/sysdeps/x86_64/dl-trampoline.h b/sysdeps/x86_64/dl-trampoline.h
index b27fa06974..73c9003006 100644
--- a/sysdeps/x86_64/dl-trampoline.h
+++ b/sysdeps/x86_64/dl-trampoline.h
@@ -66,16 +66,7 @@
 	.align 16
 _dl_runtime_resolve_avx_slow:
 	cfi_startproc
-	cfi_adjust_cfa_offset(16) # Incorporate PLT
-	vorpd %ymm0, %ymm1, %ymm8
-	vorpd %ymm2, %ymm3, %ymm9
-	vorpd %ymm4, %ymm5, %ymm10
-	vorpd %ymm6, %ymm7, %ymm11
-	vorpd %ymm8, %ymm9, %ymm9
-	vorpd %ymm10, %ymm11, %ymm10
-	vpcmpeqd %xmm8, %xmm8, %xmm8
-	vorpd %ymm9, %ymm10, %ymm10
-	vptest %ymm10, %ymm8
+	jmp _dl_runtime_resolve_avx
 	# Preserve %ymm0 - %ymm7 registers if the upper 128 bits of any
 	# %ymm0 - %ymm7 registers aren't zero.
 	PRESERVE_BND_REGS_PREFIX
Comment 6 H.J. Lu 2017-03-14 15:54:13 UTC
According to x86-64 psABI, xmm0-xmm7 can be used to pass function
parameters.  But ICC also uses xmm8-xmm15 to pass function parameters
which violates x86-64 psABI.  As a workaround, you can set environment
variable LD_BIND_NOW=1 by

# export LD_BIND_NOW=1
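
With LD_BIND_NOW set, the dynamic linker resolves all symbols when the program is loaded instead of lazily at the first call, so the lazy-binding trampoline is not entered during function calls; the cost is that every symbol is resolved at startup. If exporting the variable for the whole shell is undesirable, it can be set for a single process. The following is a minimal sketch under assumptions: run_with_bind_now is a hypothetical helper, and the child command is a stand-in to be replaced with the affected binary.

```python
import os
import subprocess
import sys


def run_with_bind_now(argv):
    """Run argv with LD_BIND_NOW=1 added to a copy of the current environment."""
    env = dict(os.environ)
    env["LD_BIND_NOW"] = "1"
    return subprocess.run(argv, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child that only reports the variable; substitute the real
    # program (for example a hypothetical ./my_fortran_app) here.
    child = [sys.executable, "-c",
             "import os; print(os.environ.get('LD_BIND_NOW'))"]
    print(run_with_bind_now(child).stdout.strip())
```

Only the child process sees the variable; the parent environment is left unchanged.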
Comment 7 Charles Schwieters 2017-03-14 16:02:46 UTC
Yes. LD_BIND_NOW=1 fixes things up. Is this a bug in icc/ifort/icpc which should be fixed? I have a bug report open on the compilers. What are the performance implications for LD_BIND_NOW=1? Thanks.
Comment 8 Carlos O'Donell 2017-03-14 16:14:30 UTC
(In reply to H.J. Lu from comment #6)
> According to x86-64 psABI, xmm0-xmm7 can be used to pass function
> parameters.  But ICC also uses xmm8-xmm15 to pass function parameters
> which violates x86-64 psABI.  As a workaround, you can set environment
> variable LD_BIND_NOW=1 by
> 
> # export LD_BIND_NOW=1

Given that this used to work, do we need to carry a fix in glibc for ICC binaries?

Or are you going to take this to the ICC team? Is this fixed in a particular version of ICC?
Comment 9 H.J. Lu 2017-03-14 16:18:15 UTC
There is nothing to fix in glibc since it follows x86-64 psABI.
I am discussing with ICC team now to see how to address this.
Comment 10 Florian Weimer 2017-03-15 06:20:31 UTC
Thanks.  Closing as invalid based on comment 6 and comment 9.
Comment 11 Florian Weimer 2017-11-03 08:24:19 UTC

*** This bug has been marked as a duplicate of bug 21265 ***