This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.
_dl_runtime_resolve_avx_slow clobbering xmm8
- From: Ivan Tubert-Brohman <ivan dot tubert-brohman at schrodinger dot com>
- To: libc-help at sourceware dot org, "Colvin,Tor" <colvin at schrodinger dot com>
- Date: Fri, 25 Aug 2017 14:58:11 -0400
- Subject: _dl_runtime_resolve_avx_slow clobbering xmm8
TL;DR: we found that _dl_runtime_resolve_avx_slow clobbers xmm8, but
code generated by the Intel fortran compiler assumes the persistence
of xmm8 when calling a shared library function. Which one is wrong?
Long version:
We noticed strange behavior in our software after upgrading to RHEL
7.4. We were able to reproduce the bug with the simplified Fortran
function below (compiled with ifort 17.0.1):
subroutine multbox(actmin, actmax, bsize, nlev)
  ! adjust the size of a box with corners in actmin, actmax
  ! to be a multiple of 2**(nlev-1)*bsize.
  implicit none
  real*8 bsize
  integer k, lbig, ldiv, length, lnew, nlev
  real*8 actmin(2), actmax(2)
  do k = 1, 3
    length = int((actmax(k)-actmin(k))/bsize)+1
    ldiv = 2**(nlev-1)
    lbig = length/ldiv+1
    lnew = lbig*ldiv
    actmax(k) = actmin(k)+lnew*bsize
  enddo
  return
end subroutine multbox
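For readers who do not read Fortran, the per-element arithmetic of the loop can be restated as a minimal C sketch. The function name multbox_c, the explicit element count n, and the assertion on nlev are assumptions added for illustration; they are not part of the original reproducer.

```c
#include <assert.h>

/* Hypothetical C restatement of the multbox loop body.
 * n is the number of elements to adjust (a parameter added here). */
static void multbox_c(const double *actmin, double *actmax,
                      double bsize, int nlev, int n)
{
    /* 2**(nlev-1); the shift is only well defined for this range. */
    assert(nlev >= 1 && nlev <= 31);
    int ldiv = 1 << (nlev - 1);

    for (int k = 0; k < n; k++) {
        /* Fortran int() and the C cast both truncate toward zero. */
        int length = (int)((actmax[k] - actmin[k]) / bsize) + 1;
        int lbig = length / ldiv + 1;   /* integer division */
        int lnew = lbig * ldiv;         /* multiple of ldiv above length (length >= 0) */
        actmax[k] = actmin[k] + lnew * bsize;
    }
}
```

With the made-up inputs actmin = {0.0, -1.0}, actmax = {10.0, 2.5}, bsize = 0.5 and nlev = 3, this gives actmax = {12.0, 5.0}. These inputs are illustrative only and are not the values used in the report. The result depends only on the arguments, so two calls with identical arguments should agree.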
When the function is called twice with the same arguments, the first call returns wrong results.
For example, for certain input values we expect:
actmax after call 1 22.3282900000000 21.7469530000000
actmax after call 2 22.3282900000000 21.7469530000000
Observed result:
actmax after call 1 -536.000000000000 -536.004394531251
actmax after call 2 22.3282900000000 21.7469530000000
We stepped through the function and found that the first call to
__svml_idiv4 (an Intel runtime library function which apparently
divides the integers in xmm0 by those in xmm1 and stores the results in
xmm0) clobbers the xmm8 register, while the code in multbox_ assumes
that xmm8 keeps its value across the call. Stepping into that first
call, we found that with glibc-2.17-196.el7.x86_64 (found in RHEL 7.4),
symbol resolution goes through the recently introduced
_dl_runtime_resolve_avx_slow, which clobbers xmm8; with an older
version of glibc (we tried glibc-2.17-78.el7.x86_64), it goes through
_dl_runtime_resolve, which doesn't affect xmm8.
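That only the first call misbehaves is consistent with lazy binding: an unresolved PLT call passes through the dynamic linker's resolver once, and later calls go straight to the target. The C model below is a sketch of that behaviour, not glibc's code. The names sim_xmm8, got_slot, lazy_stub, real_idiv, resolver_runs and scale are hypothetical, and the "register" is an ordinary variable.

```c
#include <assert.h>

/* Stand-in for xmm8: a value the caller expects to survive the call. */
static int sim_xmm8;

typedef int (*idiv_fn)(int, int);

static int real_idiv(int a, int b) { return a / b; }
static int lazy_stub(int a, int b);

/* Simulated GOT slot: starts at the stub, like an unresolved PLT entry. */
static idiv_fn got_slot = lazy_stub;
static int resolver_runs;

/* Simulated resolver: runs once, uses the "register" as scratch,
 * patches the slot, then forwards to the real function. */
static int lazy_stub(int a, int b)
{
    resolver_runs++;
    sim_xmm8 = 0;            /* the clobber under discussion */
    got_slot = real_idiv;    /* later calls bypass the stub */
    return real_idiv(a, b);
}

/* Caller that keeps the divisor in sim_xmm8 across the call. */
static int scale(int length, int ldiv)
{
    sim_xmm8 = ldiv;
    int q = got_slot(length, ldiv);
    return (q + 1) * sim_xmm8;   /* reads the "register" after the call */
}
```

In this model the first scale(21, 4) returns 0 because the stub overwrote sim_xmm8, while the second returns 24 because the slot already points at real_idiv. Whether a caller may rely on such a register surviving the call is exactly the question raised in this message, and the model does not answer it.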
My question here is, who is at fault? Is ifort making unfounded
assumptions about the persistence of xmm8, or is
_dl_runtime_resolve_avx_slow wrong in not preserving it? I looked at
the latter's code and it looks like it tries to preserve xmm0-xmm7,
but not xmm8.
The problem goes away with LD_BIND_NOW, but that's not an option in production.
We'll file a ticket with Intel, but I'm interested in hearing the
glibc perspective on this question.
Thanks,
Ivan
PS: For reference, here's the disassembled multbox function:
0x0000000000403080 <+0>: push %r13
0x0000000000403082 <+2>: push %r14
0x0000000000403084 <+4>: push %rbx
0x0000000000403085 <+5>: mov %rsi,%r13
0x0000000000403088 <+8>: mov %rdi,%r14
0x000000000040308b <+11>: mov $0x1,%eax
0x0000000000403090 <+16>: movsd (%rdx),%xmm11
0x0000000000403095 <+21>: xor %ebx,%ebx
0x0000000000403097 <+23>: movaps %xmm11,%xmm9
0x000000000040309b <+27>: movups 0x0(%r13),%xmm2
0x00000000004030a0 <+32>: movups (%r14),%xmm10
0x00000000004030a4 <+36>: subpd %xmm10,%xmm2
0x00000000004030a9 <+41>: unpcklpd %xmm9,%xmm9
0x00000000004030ae <+46>: divpd %xmm9,%xmm2
0x00000000004030b3 <+51>: mov (%rcx),%ecx
0x00000000004030b5 <+53>: dec %ecx
0x00000000004030b7 <+55>: shl %cl,%eax
0x00000000004030b9 <+57>: cmp $0x1f,%ecx
0x00000000004030bc <+60>: cvttpd2dq %xmm2,%xmm0
0x00000000004030c0 <+64>: cmovbe %eax,%ebx
0x00000000004030c3 <+67>: movdqu 0x81794(%rip),%xmm12 # 0x484860
0x00000000004030cc <+76>: paddd %xmm12,%xmm0
0x00000000004030d1 <+81>: movd %ebx,%xmm8
0x00000000004030d6 <+86>: pshufd $0x0,%xmm8,%xmm8
0x00000000004030dc <+92>: movlhps %xmm8,%xmm8 # puts something in xmm8
0x00000000004030e0 <+96>: movdqa %xmm8,%xmm1
0x00000000004030e5 <+101>: callq 0x402610 <__svml_idiv4@plt> # the problematic call
0x00000000004030ea <+106>: movsd 0x10(%r14),%xmm5
0x00000000004030f0 <+112>: paddd %xmm12,%xmm0
0x00000000004030f5 <+117>: movsd 0x10(%r13),%xmm3
0x00000000004030fb <+123>: movaps %xmm8,%xmm2 # this expects that xmm8 still has the value set above.
0x00000000004030ff <+127>: pmuludq %xmm0,%xmm2
0x0000000000403103 <+131>: subsd %xmm5,%xmm3
0x0000000000403107 <+135>: divsd %xmm11,%xmm3
0x000000000040310c <+140>: cvttsd2si %xmm3,%eax
0x0000000000403110 <+144>: inc %eax
0x0000000000403112 <+146>: psrlq $0x20,%xmm8
0x0000000000403118 <+152>: cltd
0x0000000000403119 <+153>: idiv %ebx
0x000000000040311b <+155>: psrlq $0x20,%xmm0
0x0000000000403120 <+160>: pxor %xmm4,%xmm4
0x0000000000403124 <+164>: pmuludq %xmm0,%xmm8
0x0000000000403129 <+169>: inc %eax
0x000000000040312b <+171>: imul %eax,%ebx
0x000000000040312e <+174>: pand 0x8173a(%rip),%xmm2 # 0x484870
0x0000000000403136 <+182>: psllq $0x20,%xmm8
0x000000000040313c <+188>: por %xmm8,%xmm2
0x0000000000403141 <+193>: cvtdq2pd %xmm2,%xmm1
0x0000000000403145 <+197>: cvtsi2sd %ebx,%xmm4
0x0000000000403149 <+201>: mulpd %xmm1,%xmm9
0x000000000040314e <+206>: mulsd %xmm4,%xmm11
0x0000000000403153 <+211>: addpd %xmm9,%xmm10
0x0000000000403158 <+216>: addsd %xmm11,%xmm5
0x000000000040315d <+221>: movups %xmm10,0x0(%r13)
0x0000000000403162 <+226>: movsd %xmm5,0x10(%r13)
0x0000000000403168 <+232>: pop %rbx
0x0000000000403169 <+233>: pop %r14
0x000000000040316b <+235>: pop %r13
0x000000000040316d <+237>: retq
0x000000000040316e <+238>: xchg %ax,%ax
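The instructions after the call appear to build a packed 32-bit multiply out of pmuludq, which multiplies only the even 32-bit lanes. xmm8 feeds both the even-lane and the odd-lane product, so every lane of the result depends on it. The portable C model below sketches that sequence under stated assumptions: the name mul32x4_model is hypothetical, lanes are modelled as array elements, and the constant loaded from 0x484870 is assumed to be a mask that keeps the low 32 bits of each 64-bit half.

```c
#include <stdint.h>

/* Model of: pmuludq / psrlq $32 / pmuludq / pand / psllq $32 / por.
 * a models xmm0 after the call and paddd, b models xmm8. */
static void mul32x4_model(const uint32_t a[4], const uint32_t b[4],
                          uint32_t out[4])
{
    for (int half = 0; half < 2; half++) {
        /* pmuludq: 64-bit product of the even lanes (0 and 2). */
        uint64_t even = (uint64_t)a[2 * half] * b[2 * half];
        /* psrlq $32 on both inputs, then pmuludq: odd lanes (1 and 3). */
        uint64_t odd = (uint64_t)a[2 * half + 1] * b[2 * half + 1];
        /* pand with the assumed low-32-bit mask, psllq $32, por. */
        uint64_t merged = (even & 0xffffffffu) | (odd << 32);
        out[2 * half] = (uint32_t)merged;
        out[2 * half + 1] = (uint32_t)(merged >> 32);
    }
}
```

Each output lane is the low 32 bits of a[i] * b[i], so a changed b changes all four results.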