Cygwin malloc tune-up status

Mark Geisert
Fri Sep 25 06:01:09 GMT 2020

Hi folks,
I've been looking into two potential enhancements of Cygwin's malloc operation
for Cygwin 3.2.  The first is to replace the existing function-level locking
with something more fine-grained to help threaded processes; the second is to
implement thread-specific memory pools with the aim of lessening lock activity
even further.

Although I've investigated several alternative malloc packages, including
ptmalloc[23], nedalloc, and Windows Heaps, only the last seems to improve on
the performance of Cygwin's malloc.  Unfortunately, using Windows Heaps would
require fiddling with undocumented heap structures to make them usable across
fork().  I also looked at BSD's jemalloc and Google's tcmalloc; those two would
require much more work to port to Cygwin, so I've back-burnered them for the
time being.

I decided to concentrate on Cygwin's malloc, which is actually the most recent
version of Doug Lea's dlmalloc package.  Both of the desired enhancements can
largely be achieved by changing a couple of #defines in Cygwin's build of the
dlmalloc source.
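For illustration, the relevant knobs look roughly like the following.  The
option names come from dlmalloc's malloc.c; the particular values here are
hypothetical, not the settings Cygwin would actually ship.

```c
/* Illustrative dlmalloc tuning knobs (names from dlmalloc's malloc.c).
   Values are hypothetical examples, not Cygwin's final configuration. */
#define USE_LOCKS 2                      /* 2 = caller supplies its own
                                            (finer-grained) lock routines */
#define MSPACES 1                        /* enable independent allocation
                                            spaces (per-thread pools) */
#define DEFAULT_GRANULARITY (64 * 1024)  /* unit of system memory requests */
```

These are compile-time switches, so trying a configuration means rebuilding the
DLL rather than changing anything at run time.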

The following table shows some results of my investigation into various tunings
of Cygwin's malloc implementation.  Each column below is a form of the existing
dlmalloc-based code, but making use of certain #defines to tune the allocator's
behavior.  The "legacy" implementation (first data column) is the one used up
to Cygwin 3.1.x.  See the NOTES at end of doc for table details.

One thing that stands out in the profiling data is that Windows overhead is on
the order of 90% of total profiling counts (on a malloc torture test program).
So changes made within the Cygwin DLL are unlikely to speed up Cygwin's malloc
unless the changes lead to less reliance on Windows calls.  (I think this data
boils down to "mmaps are slow on Windows", just as we already know "file I/O is
slow on Windows".)  I also have Cygwin DLL hot spot profiling data for the "no
mspaces" and all mspace sizes shown here that justifies blaming mmaps.

locking strategy: func locks    data lock    lockless*    lockless*    lockless*
                   ----------   ----------   ----------   ----------   ----------
malloc strategy:      legacy   no mspaces   mspace=64K  mspace=512K    mspace=8M
                   ----------   ----------   ----------   ----------   ----------
profile count subtotals...
KernelBase.dll            94           97           45           43           36
kernel32.dll              19           29           27           27           18
ntdll.dll              46950        34389        74997        75765        82826
cygwin1.dll             1172         1545         4727         4906         4519
maltest.exe             1873         2472         3249         3573         3075

profile count totals   50108        38532        83045        84314        90474
perf vs legacy          1.00         1.30         0.60         0.59         0.55

profile counts as percentage of totals...
Windows dlls           93.9%        89.6%        90.4%        89.9%        91.6%
cygwin1.dll             2.3%         4.0%         5.7%         5.8%         5.0%
maltest.exe             3.7%         6.4%         3.9%         4.2%         3.4%

raw data...
maltest, ops/sec       13863        19500         8087         8417         9197
cygwin config, secs    39.82        38.45        40.92        41.29        46.04
cygwin make -j4, secs   1600         1555         1611         1589         1611

perf vs legacy...
maltest                 1.00         1.41         0.58         0.61         0.66
cygwin config           1.00         1.04         0.97         0.96         0.86
cygwin make -j4         1.00         1.03         0.99         1.01         0.99

*** NOTES ***
- "lockless*" means no lock needed if request can be satisfied from thread's
   own mspace, else a lock+unlock is needed on the global mspace.

- Each profile "count" equals 0.01 CPU seconds.

- Under "raw data", the maltest and cygwin config figures are averages of 5
   runs; the cygwin make figures are from single runs.

- "maltest" is a threaded malloc stress tester.  In my testing it's set up to
   use 4 threads that allocate and later release random-sized blocks <= 512kB.
   A subset of block sizes are skewed somewhat smaller than truly random to
   simulate frequent C++ class instantiation.  Threads also touch each page of a
   block on return from malloc(); this simulates actual app behavior better than
   just doing mallocs+frees.  A subset of mallocs (~6%) are morphed into
   reallocs just to exercise that path.

- The profile counts were obtained with cygmon, a tool I ought to release.

- All investigation done on a 2C/4T 2.3GHz Windows 10 machine using an SSD.

This email is basically a snapshot of where I currently stand.  Comments or
questions are welcome.  I'm inclined to release what implements the first
enhancement (maybe a 3% speedup, more for multi-threaded processes) but leave
mspaces for the future, if at all.  Maybe the less-than-satisfying mspace
performance argues for trying harder to get jemalloc or tcmalloc working in
the Cygwin environment.

Thanks for reading.
