Thread memory allocation issue

Mark Geisert mark@maxrnd.com
Tue Nov 19 04:01:00 GMT 2024


Hello Teepean,

On 11/17/2024 11:32 AM, Teepean via Cygwin wrote:
> I raised this issue couple of years ago on cygwin-developer but now when the problem has manifested again with recent versions of Cygwin I decided to post this to general discussion list.

This (main Cygwin) list is the correct place for reports like this. 
There is no need to contact me (or other maintainers/devs) off-list.

Given that the investigation a couple of years ago resulted in,
essentially, no change to Cygwin's malloc*, why has the problem
manifested again recently?  Have you been benchmarking/testing all
along?  Can you be more specific about which recent Cygwin versions?

*My own benchmark, building the Cygwin tree, showed that there wasn't 
much difference between the half-dozen malloc implementations I tried 
and they were all spending more time in Windows' ntdll.dll than the 
current Cygwin malloc (==dlmalloc), though a little less time in Cygwin 
itself.

> Steps to Reproduce
> 
> 1. Compile BWA normally
> 
> https://github.com/lh3/bwa/

What's involved with that? Clone the repo, ./configure, make? Anything else?

> 2. Compile BWA with rpmalloc and the following patch:
> 
> 
> // In thread worker function:
> #ifdef __CYGWIN__
> rpmalloc_thread_initialize();
> #endif
> 
> 
> // ... thread work ...
> #ifdef __CYGWIN__
> rpmalloc_thread_finalize(1);
> #endif

Where does that patch go? Assume I know nothing about BWA.

> 3. Run both versions with the following command:
> time ./bwa mem -t 11 chr19_KI270866v1_alt.fasta test_1.fastq test_2.fastq > testorigsingle.sam
> 
> 
> Without Patch (Default malloc):
> 
> 
> [M::mem_process_seqs] Processed 120000 reads in 30.296 CPU sec, 3.743 real sec
> [main] Real time: 3.883 sec; CPU: 30.436 sec
> real    0m3.907s
> user    0m19.186s
> sys     0m11.265s
> 
> 
> With Patch (rpmalloc):
> 
> 
> [M::mem_process_seqs] Processed 120000 reads in 7.530 CPU sec, 0.702 real sec
> [main] Real time: 0.830 sec; CPU: 7.640 sec
> real    0m0.868s
> user    0m7.343s
> sys     0m0.327s

Are these examples of runs one would do "in production"? Or are you 
running much longer-lasting processing in the usual case?

> Analysis
> 
> 1. The default malloc implementation shows extremely high system time (11.265s) compared to the rpmalloc version (0.327s)
> 2. Total real time is about 4.5x slower with default malloc
> 3. The dramatic difference in system time suggests heavy contention in the memory allocation subsystem
> 4. The issue only manifests on Cygwin with bwa; the same code performs normally on native Linux and MacOS

Are you saying there is non-bwa code that runs on Cygwin comparably to 
Linux and Mac?

> 5. The issue manifests with recent versions of Cygwin but does work with older versions

Again, it would really help if you could give Cygwin versions or at 
least dates here...

> The issue becomes more pronounced with higher thread counts

That I believe; dlmalloc as it is currently set up for Cygwin is not the 
greatest performer for heavy thread usage.

> The patched code is located here in branch Cygwin:
> 
> 
> https://github.com/WGSExtract/bwa.git
> 
> 
> Simple testsuite. Run bash testsuite.sh. The testsuite includes a version compiled with an older version of Cygwin called bwa_working.exe
> 
> 
> https://drive.google.com/file/d/1jtbQVUAcCmpJM-8Exi0C6pzDXcEo4cV6/view?usp=drive_link

I'll glance at this stuff when I can but I hope to have some answers to 
my questions above from you to save me some time.

..mark


