Thread memory allocation issue
Mark Geisert
mark@maxrnd.com
Tue Nov 19 04:01:00 GMT 2024
Hello Teepean,
On 11/17/2024 11:32 AM, Teepean via Cygwin wrote:
> I raised this issue a couple of years ago on cygwin-developer, but now that the problem has manifested again with recent versions of Cygwin, I decided to post it to the general discussion list.
This (main Cygwin) list is the correct place for reports like this.
There is no need to contact me (or other maintainers/devs) off-list.
Given that the result of the investigation a couple of years ago was,
essentially, no change to Cygwin's malloc*, why has the problem
manifested again recently? Have you been benchmarking/testing all
along? Can you be more specific about which recent Cygwin versions?
*My own benchmark, building the Cygwin tree, showed that there wasn't
much difference between the half-dozen malloc implementations I tried,
and they all spent more time in Windows' ntdll.dll than the
current Cygwin malloc (==dlmalloc) does, though a little less time in Cygwin
itself.
> Steps to Reproduce
>
> 1. Compile BWA normally
>
> https://github.com/lh3/bwa/
What's involved with that? Clone the repo, ./configure, make? Anything else?
> 2. Compile BWA with rpmalloc and the following patch:
>
>
> // In thread worker function:
> #ifdef __CYGWIN__
> rpmalloc_thread_initialize();
> #endif
>
>
> // ... thread work ...
> #ifdef __CYGWIN__
> rpmalloc_thread_finalize(1);
> #endif
Where does that patch go? Assume I know nothing about BWA.
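For orientation only, here is a minimal sketch of where per-thread allocator hooks like the ones in the patch normally sit: the first and last statements of whatever function is handed to pthread_create(), with rpmalloc_initialize() called once before any thread starts. It assumes the rpmalloc 1.4.x API (rpmalloc_thread_finalize() taking an int) and that rpmalloc is linked with its malloc override, so plain malloc()/free() are routed to it. The USE_RPMALLOC switch, worker(), run_workers() and the dummy workload are hypothetical stand-ins, not BWA's actual code; without -DUSE_RPMALLOC on Cygwin the hooks compile to nothing and the default malloc is used.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: -DUSE_RPMALLOC plus rpmalloc (1.4.x API) built with its malloc
 * override. Otherwise the hooks are no-ops and the default malloc is used.
 * rpmalloc_finalize() would belong at process exit and is omitted here. */
#if defined(__CYGWIN__) && defined(USE_RPMALLOC)
#include "rpmalloc.h"
#define ALLOC_PROCESS_INIT() rpmalloc_initialize()
#define ALLOC_THREAD_INIT()  rpmalloc_thread_initialize()
#define ALLOC_THREAD_FINI()  rpmalloc_thread_finalize(1)
#else
#define ALLOC_PROCESS_INIT() ((void)0)
#define ALLOC_THREAD_INIT()  ((void)0)
#define ALLOC_THREAD_FINI()  ((void)0)
#endif

#define MAX_THREADS 64

/* Hypothetical thread entry point standing in for the real worker. */
static void *worker(void *data)
{
    long *result = data;
    long sum = 0;
    assert(result != NULL);

    ALLOC_THREAD_INIT();   /* first thing the new thread does */

    /* ... thread work: here, many small malloc/free pairs ... */
    for (int i = 0; i < 1000; ++i) {
        int *buf = malloc(64 * sizeof *buf);
        if (buf == NULL) { sum = -1; break; }
        memset(buf, 0, 64 * sizeof *buf);
        buf[0] = i;
        sum += buf[0];
        free(buf);
    }
    *result = sum;

    ALLOC_THREAD_FINI();   /* last thing before the thread returns */
    return NULL;
}

/* Runs n workers; returns the sum of their results, or -1 on error. */
static long run_workers(int n)
{
    pthread_t tids[MAX_THREADS];
    long results[MAX_THREADS];
    int started = 0;
    long total = 0;

    if (n < 1 || n > MAX_THREADS)
        return -1;
    ALLOC_PROCESS_INIT();  /* once per process, before any thread starts */
    for (; started < n; ++started)
        if (pthread_create(&tids[started], NULL, worker, &results[started]) != 0)
            break;
    for (int i = 0; i < started; ++i) {
        pthread_join(tids[i], NULL);
        total = (total < 0 || results[i] < 0) ? -1 : total + results[i];
    }
    return started == n ? total : -1;
}
```

Each worker sums 0..999, so run_workers(4) should return 4 * 499500 when every thread starts and every allocation succeeds.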
> 3. Run both versions with the following command:
> time ./bwa mem -t 11 chr19_KI270866v1_alt.fasta test_1.fastq test_2.fastq > testorigsingle.sam
>
>
> Without Patch (Default malloc):
>
>
> [M::mem_process_seqs] Processed 120000 reads in 30.296 CPU sec, 3.743 real sec
> [main] Real time: 3.883 sec; CPU: 30.436 sec
> real 0m3.907s
> user 0m19.186s
> sys 0m11.265s
>
>
> With Patch (rpmalloc):
>
>
> [M::mem_process_seqs] Processed 120000 reads in 7.530 CPU sec, 0.702 real sec
> [main] Real time: 0.830 sec; CPU: 7.640 sec
> real 0m0.868s
> user 0m7.343s
> sys 0m0.327s
Are these examples of runs one would do "in production"? Or are you
running much longer-lasting processing in the usual case?
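For reference, the ratios behind the analysis below can be worked out directly from the time(1) figures quoted above: real time is about 4.5x longer with the default malloc, system time about 34x longer, and system time's share of total CPU time (user + sys) drops from roughly 37% to roughly 4%. A small C sketch of that arithmetic, using only the numbers from the two runs; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>

/* Timings copied from the two runs quoted above, in seconds. */
typedef struct {
    double real, user, sys;
} run_time_t;

static const run_time_t default_malloc = { 3.907, 19.186, 11.265 };
static const run_time_t rpmalloc_run   = { 0.868,  7.343,  0.327 };

/* How many times longer `slow` took than `fast`. */
static double ratio(double slow, double fast)
{
    assert(fast > 0.0);
    return slow / fast;
}

/* Fraction of total CPU time (user + sys) spent in system time. */
static double sys_share(run_time_t t)
{
    assert(t.user + t.sys > 0.0);
    return t.sys / (t.user + t.sys);
}

void print_summary(void)
{
    printf("real-time ratio: %.2fx\n",
           ratio(default_malloc.real, rpmalloc_run.real));
    printf("sys-time ratio:  %.1fx\n",
           ratio(default_malloc.sys, rpmalloc_run.sys));
    printf("sys share: %.1f%% (default) vs %.1f%% (rpmalloc)\n",
           100.0 * sys_share(default_malloc), 100.0 * sys_share(rpmalloc_run));
}
```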
> Analysis
>
> 1. The default malloc implementation shows extremely high system time (11.265s) compared to the rpmalloc version (0.327s)
> 2. Total real time is about 4.5x slower with default malloc
> 3. The dramatic difference in system time suggests heavy contention in the memory allocation subsystem
> 4. The issue only manifests on Cygwin with bwa; the same code performs normally on native Linux and macOS
Are you saying there is non-bwa code that runs on Cygwin comparably to
Linux and Mac?
> 5. The issue manifests with recent versions of Cygwin but not with older versions
Again, it would really help if you could give Cygwin versions or at
least dates here...
> The issue becomes more pronounced with higher thread counts
That I believe; dlmalloc as it is currently set up for Cygwin is not the
greatest performer for heavy thread usage.
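If it helps separate allocator behaviour from BWA itself, a minimal stress test along these lines could be timed under different Cygwin versions and thread counts. This is a sketch under stated assumptions: it is only a guess that many small, mixed-size malloc/free pairs per thread resemble BWA's allocation pattern, and bench(), churn() and the iteration counts are hypothetical. It reports wall-clock time via clock_gettime(CLOCK_MONOTONIC); running the whole program under time(1) would also show the sys figure discussed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_THREADS 64
#define SLOTS 16

/* malloc/free pairs each thread performs; set before threads start. */
static long iters_per_thread;

/* Keeps a small rolling window of live blocks of mixed small sizes. */
static void *churn(void *unused)
{
    void *keep[SLOTS] = { 0 };
    (void)unused;
    for (long i = 0; i < iters_per_thread; ++i) {
        size_t slot = (size_t)(i % SLOTS);
        free(keep[slot]);                           /* free(NULL) is a no-op */
        keep[slot] = malloc(16 + (size_t)(i % 512));
    }
    for (int s = 0; s < SLOTS; ++s)
        free(keep[s]);
    return NULL;
}

/* Wall-clock seconds for nthreads threads doing `iters` pairs each, or -1.0 on error. */
double bench(int nthreads, long iters)
{
    pthread_t tids[MAX_THREADS];
    struct timespec t0, t1;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS || iters < 0)
        return -1.0;
    iters_per_thread = iters;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (; started < nthreads; ++started)
        if (pthread_create(&tids[started], NULL, churn, NULL) != 0)
            break;
    for (int i = 0; i < started; ++i)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(t1.tv_sec >= t0.tv_sec);                 /* monotonic clock */
    if (started != nthreads)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling, for example, bench(n, 1000000) for n = 1, 2, 4, 8 and 11 (built with gcc -O2 -pthread) on both a recent and an older Cygwin would show whether wall time grows with thread count in the way item 6 describes; no results for this sketch are claimed here.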
> The patched code is located in the Cygwin branch here:
>
>
> https://github.com/WGSExtract/bwa.git
>
>
> A simple test suite is included; run it with bash testsuite.sh. It also includes a build compiled with an older version of Cygwin, called bwa_working.exe.
>
>
> https://drive.google.com/file/d/1jtbQVUAcCmpJM-8Exi0C6pzDXcEo4cV6/view?usp=drive_link
I'll glance at this stuff when I can but I hope to have some answers to
my questions above from you to save me some time.
..mark