Cygwin malloc tune-up status
Mark Geisert
mark@maxrnd.com
Fri Sep 25 06:01:09 GMT 2020
Hi folks,
I've been looking into two potential enhancements of Cygwin's malloc operation
for Cygwin 3.2. The first is to replace the existing function-level locking
with something more fine-grained to help threaded processes; the second is to
implement thread-specific memory pools with the aim of lessening lock activity
even further.
Although I've investigated several alternative malloc packages, including
ptmalloc[23], nedalloc, and Windows Heaps, only the last of these appears to
improve on the performance of Cygwin's malloc. Unfortunately, using Windows
Heaps would require fiddling with undocumented heap structures to make them
work across fork(). I also looked at BSD's jemalloc and Google's tcmalloc, but
those two would require considerably more work to port to Cygwin, so I've
back-burnered them for the time being.
I decided to concentrate on Cygwin's malloc, which is actually the most recent
version of Doug Lea's dlmalloc package. Both of the desired enhancements can
largely be achieved by changing a couple of #defines in Cygwin's
malloc_wrapper.cc.
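For concreteness, here is a sketch of the kind of dlmalloc configuration
macros involved. The macro names below exist in stock dlmalloc 2.8.x; the
settings shown are illustrative only, not the actual malloc_wrapper.cc patch:

```c
/* Illustrative dlmalloc tuning knobs (not the real Cygwin patch) */
#define USE_LOCKS    2   /* 2 = caller supplies the lock implementation;
                            this is where function-level locking can be
                            replaced by a finer-grained data lock */
#define MSPACES      1   /* enable independent mspaces, the basis for
                            per-thread memory pools */
#define ONLY_MSPACES 0   /* keep the regular global malloc available as
                            the fallback pool */
```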
The following table shows some results of my investigation into various tunings
of Cygwin's malloc implementation. Each column below is a form of the existing
dlmalloc-based code, but making use of certain #defines to tune the allocator's
behavior. The "legacy" implementation (first data column) is the one used up
to Cygwin 3.1.x. See the NOTES at the end of this message for table details.
One thing that stands out in the profiling data is that Windows overhead is on
the order of 90% of total profiling counts (on a malloc torture test program).
So changes made within the Cygwin DLL are unlikely to speed up Cygwin's malloc
unless the changes lead to less reliance on Windows calls. (I think this data
boils down to "mmaps are slow on Windows", just as we already know "file I/O is
slow on Windows".) I also have Cygwin DLL hot spot profiling data for the "no
mspaces" and all mspace sizes shown here that justifies blaming mmaps.
locking strategy:        func locks  data lock   lockless*   lockless*   lockless*
                         ----------  ----------  ----------  ----------  ---------
malloc strategy:         legacy      no mspaces  mspace=64K  mspace=512K mspace=8M
                         ----------  ----------  ----------  ----------  ---------

*** MALTEST PROFILING DATA ***
profile count subtotals...
  KernelBase.dll                 94          97          45          43        36
  kernel32.dll                   19          29          27          27        18
  ntdll.dll                   46950       34389       74997       75765     82826
  cygwin1.dll                  1172        1545        4727        4906      4519
  maltest.exe                  1873        2472        3249        3573      3075
profile count totals          50108       38532       83045       84314     90474
perf vs legacy                 1.00        1.30        0.60        0.59      0.55

profile counts as percentage of totals...
  Windows dlls                93.9%       89.6%       90.4%       89.9%     91.6%
  cygwin1.dll                  2.3%        4.0%        5.7%        5.8%      5.0%
  maltest.exe                  3.7%        6.4%        3.9%        4.2%      3.4%

*** OTHER TEST DATA ***
raw data...
  maltest, ops/sec            13863       19500        8087        8417      9197
  cygwin config, secs         39.82       38.45       40.92       41.29     46.04
  cygwin make -j4, secs        1600        1555        1611        1589      1611

perf vs legacy...
  maltest                      1.00        1.41        0.58        0.61      0.66
  cygwin config                1.00        1.04        0.97        0.96      0.86
  cygwin make -j4              1.00        1.03        0.99        1.01      0.99
*** NOTES ***
- "lockless*" means no lock needed if request can be satisfied from thread's
own mspace, else a lock+unlock is needed on the global mspace.
- Each profile "count" equals 0.01 CPU seconds.
- Under OTHER TEST DATA, maltest and cygwin config data are averages of 5 runs,
cygwin make data are from single runs.
- "maltest" is a threaded malloc stress tester. In my testing it's set up to
use 4 threads that allocate and later release random-sized blocks <= 512kB.
A subset of block sizes is skewed somewhat smaller than uniformly random to
simulate frequent C++ class instantiation. Threads also touch each page of a
block on return from malloc(); this simulates actual app behavior better than
just doing mallocs+frees. A subset of mallocs (~6%) are morphed into
reallocs just to exercise that path.
- The profile counts were obtained with cygmon, a tool I ought to release.
- All investigation done on a 2C/4T 2.3GHz Windows 10 machine using an SSD.
This email is basically a state dump of where I'm currently at. Comments or
questions are welcome. I'm inclined to release the code implementing the first
enhancement (maybe a 3% speedup, more for multi-threaded processes) but to
leave mspaces for the future, if at all. Maybe the less-than-satisfying mspace
performance argues for trying harder to get jemalloc or tcmalloc working in
the Cygwin environment.
Thanks for reading.
..mark