- Subject: libc/1712: Performance loss with glibc-2.1.[23] vs 2.1.1
- From: Various
- Date: Sun Apr 30 13:53:55 2000
Topics:
libc/1712: Performance loss with glibc-2.1.[23] vs 2.1.1
Re: libc/1712: Performance loss with glibc-2.1.[23] vs 2.1.1
----------------------------------------------------------------------
Date: Sun, 30 Apr 2000 09:32:35 +0200
From: Milan Hodoscek <milan@ala.cmm.ki.si>
To: bugs@gnu.org
Subject: libc/1712: Performance loss with glibc-2.1.[23] vs 2.1.1
Message-Id: <200004300732.e3U7WZc07141@ala.cmm.ki.si>
>Number: 1712
>Category: libc
>Synopsis: Performance loss with glibc-2.1.[23] vs 2.1.1
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: libc-gnats
>State: open
>Class: sw-bug
>Submitter-Id: unknown
>Arrival-Date: Sun Apr 30 03:40:02 EDT 2000
>Last-Modified:
>Originator: Milan Hodoscek
>Organization:
>
>Release: libc-2.1.3
>Environment:
Host type: i386-pc-linux-gnu
System: Linux ala 2.3.99-pre6 #1 Thu Apr 27 12:44:42 CEST 2000 i686 unknown
Architecture: i686
Addons: crypt linuxthreads nss-v1
Build CC: gcc
Compiler version: 2.95.2 20000220 (Debian GNU/Linux)
Kernel headers: UTS_RELEASE
Symbol versioning: yes
Build static: yes
Build shared: yes
Build pic-default: no
Build profile: yes
Build omitfp: no
Build bounded: no
Build static-nss: no
Stdio: libio
>Description:
I am working with the program CHARMM (~0.5M lines, molecular dynamics)
and I noticed some performance problems with it when using the 2.1.2
and 2.1.3 versions of glibc. With glibc-2.1.1 it runs at full speed
(three times faster!). Since the program is pretty big and there is no
easy model of this performance problem, I had to profile it and found
a few routines that show the problem with the new library (2.1.[23]
vs 2.1.1). I picked the smallest one and simplified it so that it can
be run as a standalone program. I am sure it is possible to optimize
this small model program, but that would be more difficult in the real
situation; I just want to show that there is a problem with the
library no matter how badly the program is written.
This is what I tested:
I am using Debian (woody), and the machine I tested on is a PII-450MHz.
I also tried an Athlon-700MHz, and the performance difference is
similar. I also tried a variety of optimization options and a
different compiler (pgcc). The difference is always significant and
appears only when I switch the libraries. I also recompiled
glibc-2.1.3 myself with all the optimization flags on, and the problem
is still there.
>How-To-Repeat:
If you run the program below with glibc-2.1.1, the timing looks like
this:
ala:~/test $ time gs1
F(n)= 1.44779895E+09 2.72745061E+09 1.94683947E+09
real 0m22.105s
user 0m22.100s
sys 0m0.010s
If it runs with glibc-2.1.3, it looks like this:
ala:~/test $ time gs1
F(n)= 1.44779895E+09 2.72745061E+09 1.94683947E+09
real 1m8.860s
user 1m7.720s
sys 0m0.760s
I compiled the program with the following script:
#! /bin/sh -x
FFLAGS="-O6 -m486 -malign-double -ffast-math -fomit-frame-pointer -funroll-loops -funroll-all-loops -mcpu=pentiumpro -march=pentiumpro -ffloat-store -fforce-mem -frerun-cse-after-loop -fexpensive-optimizations -fugly-complex -fno-backslash -fno-globals -Wno-globals"
g77 -o gs1 $FFLAGS gs1.f
And the program itself is this:
C This is to figure out why libc6-2.1.2 is so slow ????
C The most simplified version made after GRAD_SUM from nbonds/pme.src
C
      implicit none
C
      INTEGER ORDER,max,iseed
      parameter (max=100,order=200)
      REAL*8 ZERO,RECIP(9)
      REAL*8 FX(max),FY(max),FZ(max)
      REAL*8 THETA1(ORDER,max),THETA2(ORDER,max),
     $       THETA3(ORDER,max),CHARGE(max)
      REAL*8 DTHETA1(ORDER,max),DTHETA2(ORDER,max),
     $       DTHETA3(ORDER,max)
      REAL*8 Q(max)
C
      integer ig,igood,nfft1,nfft2,nfft3
      REAL*8 VAL1,VAL2,VAL3,VAL1A,VAL2A,VAL3A
      INTEGER IPT1,IPT2,IPT3
C
      INTEGER N,ITH1,ITH2,ITH3,I,J,K
      REAL*8 F1,F2,F3,TERM,CFACT,random
C
      CFACT=300.0d0
      igood=max
      zero=0.0d0
      nfft1=64
      nfft2=64
      nfft3=64
      iseed=1
C
C Initialize all the arrays with the random numbers...
C
      do i=1,9
         recip(i)=random(iseed)
      enddo
C
      do i=1,max
         q(i)=random(iseed)
         charge(i)=random(iseed)
         do j=1,order
            theta1(j,i)=random(iseed)
            dtheta1(j,i)=random(iseed)
            theta2(j,i)=random(iseed)
            dtheta2(j,i)=random(iseed)
            theta3(j,i)=random(iseed)
            dtheta3(j,i)=random(iseed)
         enddo
      enddo
C
      ipt2=0
      do ig = 1,igood
         n=ig
         F1 = ZERO
         F2 = ZERO
         F3 = ZERO
         ipt2=ipt2+1
C
         DO ITH3 = 1,ORDER
            VAL1A = CHARGE(N) * NFFT1 * THETA3(ITH3,ig)
            VAL2A = CHARGE(N) * NFFT2 * THETA3(ITH3,ig)
            VAL3A = CHARGE(N) * NFFT3 * DTHETA3(ITH3,ig)
C
            DO ITH2 = 1,ORDER
C
               VAL1= VAL1A * THETA2(ITH2,ig)
               VAL2= VAL2A * DTHETA2(ITH2,ig)
               VAL3= VAL3A * THETA2(ITH2,ig)
C
               DO ITH1 = 1,ORDER
C
                  F1 = F1 - VAL1 * Q(IPT2) * DTHETA1(ITH1,ig)
                  F2 = F2 - VAL2 * Q(IPT2) * THETA1(ITH1,ig)
                  F3 = F3 - VAL3 * Q(IPT2) * THETA1(ITH1,ig)
C
               ENDDO
            ENDDO
         ENDDO
C
         FX(N) = FX(N) - CFACT*(RECIP(1)*F1+RECIP(4)*F2+RECIP(7)*F3)
         FY(N) = FY(N) - CFACT*(RECIP(2)*F1+RECIP(5)*F2+RECIP(8)*F3)
         FZ(N) = FZ(N) - CFACT*(RECIP(3)*F1+RECIP(6)*F2+RECIP(9)*F3)
      ENDDO
C
      write(*,*)'F(n)=',fx(max),fy(max),fz(max)
C
      RETURN
      END
C
      REAL*8 FUNCTION RANDOM(ISEED)
      implicit none
      INTEGER ISEED
      REAL*8 DSEED,DIVIS,DENOM,MULTIP
C
      DATA DIVIS/2147483647.D0/
      DATA DENOM /2147483711.D0/
      DATA MULTIP/16807.D0/
C
      IF(ISEED.LE.1) ISEED=314159
      DSEED=MULTIP*ISEED
      DSEED=MOD(DSEED,DIVIS)
      RANDOM=DSEED/DENOM
      ISEED=DSEED
C
      RETURN
      END
>Fix:
>Audit-Trail:
>Unformatted:
------------------------------
Date: 30 Apr 2000 13:53:43 +0200
From: Andreas Jaeger <aj@arthur.rhein-neckar.de>
To: Milan Hodoscek <milan@ala.cmm.ki.si>
Cc: bugs@gnu.org
Subject: Re: libc/1712: Performance loss with glibc-2.1.[23] vs 2.1.1
Message-ID: <u8vh0z4xfc.fsf@gromit.rhein-neckar.de>
References: <200004300732.e3U7WZc07141@ala.cmm.ki.si>
Content-Type: text/plain; charset=us-ascii
>>>>> Milan Hodoscek writes:
>> Number: 1712
>> Category: libc
>> Synopsis: Performance loss with glibc-2.1.[23] vs 2.1.1
[...]
> I am working with the program CHARMM (~0.5M lines, molecular dynamics)
> and I noticed there are some performance problems with it when using
> both 2.1.2 and 2.1.3 versions of glibc. With glibc-2.1.1 it works at
> full speed (3 times faster!!). Since the program is pretty big and
> there is no easy model of this performance problem I had to profile it
> and found few routines which show this performance problem with the
> new library (2.1.[23] vs 2.1.1). I picked the smallest one and
> simplified it so it can be run as a standalone program. I am sure it
> is possible to optimize this small model program, but it would be more
> difficult in the real situation. I just want to show that there is a
> problem with the library no matter how badly the program is written.
> This is what I tested:
> I am using Debian (woody) and machine I tested on is PII-450MHz. I
> tried also Athlon-700MHz and the performance difference is similar. I
> also tried a variety of optimization options, and different compilers
> (pgcc). The difference is always significant and only when I switch
> the libraries. I also recompiled glibc-2.1.3 on my own with all the
> optimization flags on and the problem is still there.
I don't have both libraries installed on my system and therefore can't
test this.
To see which functions are the culprit, I profiled your program's use
of libc.so and libm.so with:
$ LD_PROFILE=libc.so.6 ./pr1712
$ sprof /lib/libc.so.6 /var/tmp/libc.so.6.profile
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
100.00      0.02     0.02   120212     0.17           isnan
  0.00      0.02     0.00       56     0.00           __overflow
  0.00      0.02     0.00       51     0.00           strncmp
  0.00      0.02     0.00       19     0.00           __errno_location
  0.00      0.02     0.00       16     0.00           flockfile
  0.00      0.02     0.00       16     0.00           funlockfile
[...]
(libm showed even less output)
So most of the time is actually spent in your program - and not in
glibc.
It might be that the stack is somehow misaligned for doubles and that
you therefore get such a slowdown - but looking at our glibc changes
between glibc 2.1.2 and 2.1.3, I don't see any change that might
affect this.
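A minimal sketch of how one could check this, assuming the double of
interest sits on main's stack (the variable name "probe" and the file
name align.c below are made up for illustration):

#include <stdio.h>

int main(void)
{
    /* A double on main's stack.  With an 8-byte aligned stack the
       address modulo 8 is 0; a value of 4 means the double is only
       4-byte aligned, which can slow down every access on ia32. */
    double probe = 0.0;
    unsigned long addr = (unsigned long)&probe;

    printf("&probe = %p, addr %% 8 = %lu (%s)\n",
           (void *)&probe, addr % 8,
           (addr % 8) ? "misaligned" : "8-byte aligned");
    return 0;
}

Comparing the output of the same binary run against 2.1.1 and 2.1.3
might show whether the startup code places main's stack frame at a
different offset.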
I would appreciate it if you could try to locate the exact cause of
this slowdown.
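One way to narrow it down would be a rough C transcription of the same
loop nest, to check whether the slowdown also appears outside the
g77/libg2c runtime. This is only a sketch; the file name pr1712.c and
the fill-in values are made up, and the exact numbers do not matter
for the timing:

#include <stdio.h>

#define MAXN  100
#define ORDER 200

static double theta1[MAXN][ORDER], dtheta1[MAXN][ORDER];
static double theta2[MAXN][ORDER], dtheta2[MAXN][ORDER];
static double theta3[MAXN][ORDER], dtheta3[MAXN][ORDER];
static double q[MAXN], charge[MAXN];
static double fx[MAXN], fy[MAXN], fz[MAXN], recip[9];

int main(void)
{
    int i, j, ig, ith1, ith2, ith3;
    double f1, f2, f3, val1, val2, val3, val1a, val2a, val3a;
    const double cfact = 300.0, nfft = 64.0;

    /* Arbitrary nonzero fill-in; only the amount of floating-point
       work matters here, not the values themselves. */
    for (i = 0; i < 9; i++)
        recip[i] = (i + 1) / 10.0;
    for (i = 0; i < MAXN; i++) {
        q[i] = charge[i] = (i % 7 + 1) / 8.0;
        for (j = 0; j < ORDER; j++) {
            theta1[i][j] = dtheta1[i][j] = ((i + j) % 11 + 1) / 12.0;
            theta2[i][j] = dtheta2[i][j] = ((i + 2 * j) % 13 + 1) / 14.0;
            theta3[i][j] = dtheta3[i][j] = ((2 * i + j) % 17 + 1) / 18.0;
        }
    }

    /* Same structure as the ITH3/ITH2/ITH1 nest in gs1.f. */
    for (ig = 0; ig < MAXN; ig++) {
        f1 = f2 = f3 = 0.0;
        for (ith3 = 0; ith3 < ORDER; ith3++) {
            val1a = charge[ig] * nfft * theta3[ig][ith3];
            val2a = charge[ig] * nfft * theta3[ig][ith3];
            val3a = charge[ig] * nfft * dtheta3[ig][ith3];
            for (ith2 = 0; ith2 < ORDER; ith2++) {
                val1 = val1a * theta2[ig][ith2];
                val2 = val2a * dtheta2[ig][ith2];
                val3 = val3a * theta2[ig][ith2];
                for (ith1 = 0; ith1 < ORDER; ith1++) {
                    f1 -= val1 * q[ig] * dtheta1[ig][ith1];
                    f2 -= val2 * q[ig] * theta1[ig][ith1];
                    f3 -= val3 * q[ig] * theta1[ig][ith1];
                }
            }
        }
        fx[ig] -= cfact * (recip[0] * f1 + recip[3] * f2 + recip[6] * f3);
        fy[ig] -= cfact * (recip[1] * f1 + recip[4] * f2 + recip[7] * f3);
        fz[ig] -= cfact * (recip[2] * f1 + recip[5] * f2 + recip[8] * f3);
    }

    printf("F(n)= %g %g %g\n", fx[MAXN - 1], fy[MAXN - 1], fz[MAXN - 1]);
    return 0;
}

If such a plain C binary (built with, say, gcc -O2 -o pr1712c pr1712.c
and timed the same way as the Fortran binary) runs at the same speed
against both library versions, the regression would more likely be in
the Fortran runtime or in program startup than in the loop code itself.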
I'll forward your email and my comments to the glibc list.
Andreas
--
Andreas Jaeger
SuSE Labs aj@suse.de
private aj@arthur.rhein-neckar.de
------------------------------
End of forward_IAP9F Digest
***************************