When trying to use Calculix (www.calculix.de) with the multithreaded version of Spooles, I ran into problems. It turned out that the malloc delivered with Novell's SLES10 (glibc 2.4) isn't thread-safe. After linking in Wolfram Gloger's ptmalloc3, the program runs properly. Newer glibc versions up to 2.7 are still based on ptmalloc2, so I tried ptmalloc2 instead of ptmalloc3 and got the same problems as with glibc 2.4. Test conditions: 8 threads, either 4 dual-core Opterons or 2 quad-core Xeons, an average of 10,000 malloc calls per second.
Worthless report. Using a different implementation and saying it works does not prove anything. Unless you provide a small reproducer, nothing will happen.
Created attachment 2986 [details]
Test case

The test case shows signs of a memory leak, with additional memory usage in excess of 1000 MB when linked against glibc. When linked against libptmalloc3.so, the peak memory usage stays constant during the run. The tests were run on a 4-socket Opteron 8356 system running 64-bit SLES9 SP3.
My original observation was a severe memory leak, which is easily reproducible, and 2 instances of deviating results. The memory leak disappeared when I switched to ptmalloc3 (vs. ptmalloc2, which is in glibc), and I could not reproduce the deviating results. With the standard malloc/free I can no longer reproduce the deviating results either, but the memory leak is still present. I no longer consider the severity "critical"; I see it as "normal".
I ran some more tests. The memory leak seems to be related to free being called inside threads. When I moved the free calls into the master thread, the memory leak disappeared.
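For illustration only, here is a minimal sketch of the kind of pattern meant above, with malloc() and free() both inside the worker threads. Block size, thread count and iteration count are made up and not taken from the attached test cases:

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS 8
    #define NBLOCKS  10000
    #define BLOCKSZ  (64 * 1024)

    /* Each worker allocates and immediately frees its own blocks.  According
       to the observation above, this is the pattern that leaves the RSS
       growing. */
    static void *worker(void *arg)
    {
        int i;
        (void)arg;
        for (i = 0; i < NBLOCKS; i++) {
            void *p = malloc(BLOCKSZ);
            if (p == NULL)
                break;
            memset(p, 0, BLOCKSZ);    /* touch the memory so it really gets mapped */
            free(p);                  /* free() inside the worker thread */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }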
Created attachment 3002 [details]
Test case with multithreaded malloc and free() in the master thread only

This is the source for the test with all free() calls moved into the master thread. The usleep and sleep calls are intended to slow the execution down so that it can be observed with top.
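The attachment itself is not reproduced here; the following is only a rough sketch of the structure it describes, with the workers allocating and the master thread doing all the free() calls (names, sizes and delays are invented):

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define NBLOCKS  1000
    #define BLOCKSZ  (64 * 1024)

    static void *blocks[NTHREADS][NBLOCKS];

    /* The workers only allocate; the pointers are handed back through the
       shared table and all free() calls are done by the master thread. */
    static void *worker(void *arg)
    {
        void **mine = arg;
        int i;
        for (i = 0; i < NBLOCKS; i++) {
            mine[i] = malloc(BLOCKSZ);
            if (mine[i] != NULL)
                memset(mine[i], 0, BLOCKSZ);
            usleep(100);              /* slow the run down so it can be watched with top */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        int i, j;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, blocks[i]);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        /* all free() calls happen in the master thread */
        for (i = 0; i < NTHREADS; i++)
            for (j = 0; j < NBLOCKS; j++)
                free(blocks[i][j]);

        sleep(2);                     /* keep the process alive briefly for observation */
        return 0;
    }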
Created attachment 3722 [details]
Improved test case which makes it easier to reproduce the effect

With our real-world example there is a striking dependence on RLIMIT_STACK. If the stack limit is either low (SuSE default 8192) or unlimited, the memory leak is very pronounced; if the limit is 512 MB (ulimit -s 524288), it takes several attempts to reproduce the problem. The stack limit has almost no influence on my small example, but if I omit the usleeps (second parameter 0), it may take dozens of runs to reproduce the problem. I have modified my example so that two or three runs should be sufficient to reproduce the problem. You will need a system with at least two quad-core or four dual-core processors.

The example is compiled with:

    gcc -l pthread -o malloc_thread_test_pa malloc_thread_test_pa.c

If you want to watch the program, you should run top -d 1 in a second window. Another way is to run it in a loop; my program now outputs the peak RSS:

    while true; do
        malloc_thread_test_pa 2>/dev/null
    done

If I run malloc_thread_test_pa 8 0 instead, it is nearly impossible to reproduce the problem.

This is how the output should look:

    loop 0: VmHWM: 3080068 kB
    loop 1: VmHWM: 3080188 kB
    loop 2: VmHWM: 3080188 kB
    loop 3: VmHWM: 3080188 kB
    loop 4: VmHWM: 3080188 kB
    loop 5: VmHWM: 3080188 kB
    loop 6: VmHWM: 3080188 kB
    loop 7: VmHWM: 3080188 kB
    loop 8: VmHWM: 3080188 kB
    loop 9: VmHWM: 3080188 kB

and this is what I typically get:

    loop 0: VmHWM: 3079520 kB
    loop 1: VmHWM: 3464160 kB
    loop 2: VmHWM: 3464160 kB
    loop 3: VmHWM: 3464280 kB
    loop 4: VmHWM: 3464280 kB
    loop 5: VmHWM: 3464280 kB
    loop 6: VmHWM: 3849292 kB
    loop 7: VmHWM: 3849292 kB
    loop 8: VmHWM: 3849292 kB
    loop 9: VmHWM: 3849292 kB
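The VmHWM values above are the peak resident set size reported by the kernel. One way a test program can print them (only a sketch; the attachment may do this differently) is to scan /proc/self/status:

    #include <stdio.h>
    #include <string.h>

    /* Print the VmHWM line (peak resident set size) of the current process,
       as reported by the kernel in /proc/self/status. */
    static void print_vmhwm(void)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (f == NULL)
            return;
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "VmHWM:", 6) == 0) {
                fputs(line, stdout);   /* e.g. "VmHWM:   3080188 kB" */
                break;
            }
        }
        fclose(f);
    }

    int main(void)
    {
        print_vmhwm();
        return 0;
    }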
The problem demonstrated by the test program apparently depends on the architecture. On a dual quad-core Intel Xeon box it is sufficient to run the test with 6 threads; on a quad quad-core AMD Opteron system, however, the test program does not show any excessive memory consumption even when run with 24 threads.
Apparently fixed by http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=4cd4c5d6a28c4fbdc86651c4578f4c4f24efce08