While using CalculiX (www.calculix.de) with the multithreaded version of
Spooles I ran into problems. It turned out that the malloc shipped with
Novell's SLES10 (glibc 2.4) isn't thread-safe. After linking in Wolfram
Gloger's ptmalloc3, the program runs properly.
Newer glibc versions up to 2.7 are still based on ptmalloc2, so I tried
ptmalloc2 instead of ptmalloc3 and got the same problems as with glibc 2.4.
Test conditions: 8 threads, either 4 dual-core Opterons or 2 quad-core Xeons,
an average of 10,000 mallocs/s.
Worthless report. Using a different implementation and saying it works does not
prove anything at all. Unless you provide a small reproducer, nothing will happen.
Created attachment 2986 [details]
The test case shows signs of a memory leak, with additional memory usage in
excess of 1000 MB when linked against glibc. When linked against
libptmalloc3.so, peak memory usage stays constant during the run.
The tests were run on a 4-socket Opteron 8356 system running 64-bit SLES9 SP3.
My original observation was a severe memory leak, which is easily reproducible,
plus two instances of deviating results. The memory leak disappeared when I
switched to ptmalloc3 (vs. ptmalloc2, which is in glibc), and I could not
reproduce the deviating results.
However, I can no longer reproduce the deviating results even when using
standard malloc/free; the memory leak is still present. I therefore no longer
consider the severity "critical"; I see it as "normal".
I ran some more tests. The memory leak seems to be related to free being called
inside threads. When I moved the free calls into the master thread, the memory
leak disappeared.
Created attachment 3002 [details]
Test case with multithreaded malloc and free in master thread only
This is the source for the test with all free() calls moved into the master
thread. The usleep and sleep calls are intended to slow the execution down, so
that it can be observed with top.
Created attachment 3722 [details]
improved test case which makes it easier to reproduce the effect
With our real-world example there is a striking dependence on RLIMIT_STACK. If
the stack limit is either low (the SuSE default of 8192 kB) or unlimited, the
memory leak is very pronounced; if the limit is 512 MB (ulimit -s 524288), it
takes several attempts to reproduce the problem.
The stack limit has almost no influence on my small example, but if I omit the
usleeps (second parameter 0), it may take dozens of runs to reproduce the effect.
I have modified my example so that two or three runs should be sufficient to
reproduce the problem. You will need a system with at least two quad-core or
four dual-core processors.
The example is compiled with:
gcc -pthread -o malloc_thread_test_pa malloc_thread_test_pa.c
If you want to watch the program, you should run top -d 1 in a second window.
Another way is to run it in a loop; my program now outputs the peak RSS:
while true; do ./malloc_thread_test_pa 8 1; done
If I run malloc_thread_test_pa 8 0 instead, it is nearly impossible to
reproduce the problem.
This is what the output should look like:
loop 0: VmHWM: 3080068 kB
loop 1: VmHWM: 3080188 kB
loop 2: VmHWM: 3080188 kB
loop 3: VmHWM: 3080188 kB
loop 4: VmHWM: 3080188 kB
loop 5: VmHWM: 3080188 kB
loop 6: VmHWM: 3080188 kB
loop 7: VmHWM: 3080188 kB
loop 8: VmHWM: 3080188 kB
loop 9: VmHWM: 3080188 kB
and that's what I typically get:
loop 0: VmHWM: 3079520 kB
loop 1: VmHWM: 3464160 kB
loop 2: VmHWM: 3464160 kB
loop 3: VmHWM: 3464280 kB
loop 4: VmHWM: 3464280 kB
loop 5: VmHWM: 3464280 kB
loop 6: VmHWM: 3849292 kB
loop 7: VmHWM: 3849292 kB
loop 8: VmHWM: 3849292 kB
loop 9: VmHWM: 3849292 kB
The problem demonstrated by the test program apparently depends on the
architecture. On a dual quad-core Intel Xeon box it is sufficient to run the
test with 6 threads; however, on a quad quad-core AMD Opteron system the test
program does not show any excessive memory consumption even when run with 24
threads.
Apparently fixed by