malloc per-thread cache: benchmarks
- From: DJ Delorie <dj at redhat dot com>
- To: libc-alpha at sourceware dot org
- Date: Tue, 24 Jan 2017 16:10:53 -0500
- Subject: malloc per-thread cache: benchmarks
The purpose of this email is to summarize my findings while
benchmarking the new per-thread cache (TCache) optimization to glibc's
malloc code. Please respond to this email with comments on
performance; a future email will contain the patch and start
patch-related conversations.
Executive summary: A per-thread cache reduces malloc call time to
about 83% of pristine, with a cost of about 1% more RSS[*]. TCache
doesn't replace Fastbins; combined performance is better than either
separately. Performance testing was done on an x86_64 system, as most
of my workloads won't fit on a 32-bit system :-)
[*] but 0.5% *less* RSS if you ignore synthetic benchmarks :-)
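
For those who haven't seen the patch yet, the core idea is a small
per-thread array of singly-linked free lists that is consulted before
the usual locked paths. The sketch below is only an illustration of
that shape, not the actual glibc code: the bin count, depth limit,
size-class mapping, my_* entry points, and backend_* helpers are all
invented for the example, and a real free() would recover the chunk
size from its header rather than take it as a parameter.

#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the rest of the allocator; they just call the system
   malloc/free here so the sketch is self-contained.  */
static void *backend_alloc(size_t size) { return malloc(size); }
static void backend_free(void *p) { free(p); }

#define TC_BINS 64   /* illustrative bin count */
#define TC_MAX   7   /* illustrative per-bin depth limit */

struct tc_entry { struct tc_entry *next; };
struct tc_bin { struct tc_entry *head; unsigned count; };

/* One cache per thread, so the fast paths below take no locks.  */
static __thread struct tc_bin cache[TC_BINS];

/* Illustrative size-class mapping: 16-byte classes.  */
static size_t tc_index(size_t size) { return size / 16; }

void *my_malloc(size_t size)
{
    size_t ix = tc_index(size);
    if (ix < TC_BINS && cache[ix].head != NULL) {
        /* Fast path: pop from this thread's cache; no locks, no
           atomics.  */
        struct tc_entry *e = cache[ix].head;
        cache[ix].head = e->next;
        cache[ix].count--;
        return e;
    }
    return backend_alloc(size);   /* slow path: the normal allocator */
}

void my_free(void *p, size_t size)
{
    size_t ix = tc_index(size);
    if (p != NULL && ix < TC_BINS && cache[ix].count < TC_MAX) {
        /* Push onto this thread's cache for quick reuse.  */
        struct tc_entry *e = p;
        e->next = cache[ix].head;
        cache[ix].head = e;
        cache[ix].count++;
        return;
    }
    backend_free(p);              /* bin full (or too big): hand back */
}

A cache hit costs a handful of non-atomic instructions and takes no
locks, which is the effect the malloc and free columns below are
showing.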
----------------------------------------
The rest of the mail is the raw data. It's a bit "busy" number-wise
and best viewed with a fixed font and a wide screen because of all the
tables. The benchmarks were done by running workloads captured from
various real-world apps and some synthetic benchmarks, using the tools
and simulator in the dj/malloc branch.
These first three charts show the breakdown of how many cycles each
API call used, for both a pristine glibc build and a build with TCache
enabled. "Total" is the total number of cycles used by the entire
test; the other columns are mean cycles per API call. The RSS
column shows the memory used (resident set size). Note that the
increase in mean calloc/realloc times is due to some overhead (which
needs to be done anyway) being moved from malloc to those calls; since
malloc is called far more often, this is a net win.
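
(For reference, "mean cycles per API call" can be thought of as a TSC
delta taken around each call, averaged over the run. Below is a
minimal sketch of that kind of measurement on x86_64; it is not the
dj/malloc simulator, and the size pattern is a stand-in for a
replayed trace.)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    enum { CALLS = 1000000, LIVE_MAX = 1000 };
    void *ptrs[LIVE_MAX];
    size_t live = 0;
    uint64_t cycles = 0;

    for (int i = 0; i < CALLS; i++) {
        size_t sz = 16 + (i % 256);   /* stand-in for a captured trace */

        /* rdtsc is not serializing; a careful harness would add
           fences or use rdtscp, but this shows the idea.  */
        uint64_t t0 = __rdtsc();
        void *p = malloc(sz);
        cycles += __rdtsc() - t0;

        if (live == LIVE_MAX)         /* keep a bounded live set */
            free(ptrs[--live]);
        ptrs[live++] = p;
    }
    while (live > 0)
        free(ptrs[--live]);

    printf("mean cycles per malloc: %llu\n",
           (unsigned long long)(cycles / CALLS));
    return 0;
}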
-------------------- Pristine --------------------
Workload                     Total  malloc  calloc  realloc   free        RSS
389-ds-2           182,211,121,679     165     707      239    123  1,012,798
dj                   1,237,241,201     443   1,796        0    173     28,896
dj2                 10,886,657,226     183     482      223    110     40,480
mt_test_one_alloc    8,737,515,199  15,593  55,927        0    943  1,820,204
oocalc               2,051,407,045     179     392      718    118    153,160
qemu-virtio          1,172,123,759     343     573      558    226    858,240
qemu-win7              892,111,785     401     676      658    211    690,810
customer-1         105,139,327,336     198   2,036      401    131  2,800,478
customer-2           3,328,843,534     186   2,298      559    147     99,600
-------------------- TCache --------------------
Workload                     Total  malloc  calloc  realloc   free        RSS
389-ds-2           165,443,672,318      90     712      204    108  1,013,340
dj                     858,955,060     310   1,818        0    118     27,509
dj2                  9,143,739,161     138     508      230     94     40,338
mt_test_one_alloc    7,469,894,433  13,292  54,600        0    841  2,096,968
oocalc               1,428,492,279     120     411      778     85    153,038
qemu-virtio          1,053,619,586     296     608      518    208    859,659
qemu-win7              809,406,757     331     701      630    199    694,089
customer-1          88,805,361,692     153   2,187      407    124  2,807,520
customer-2           2,641,419,852     132   2,687      688    131     97,196
-------------------- Change --------------------
Workload             Total  malloc  calloc  realloc   free    RSS
389-ds-2               91%     55%    101%      85%    88%   100%
dj                     69%     70%    101%             68%    95%
dj2                    84%     75%    105%     103%    85%   100%
mt_test_one_alloc      85%     85%     98%             89%   115%
oocalc                 70%     67%    105%     108%    72%   100%
qemu-virtio            90%     86%    106%      93%    92%   100%
qemu-win7              91%     83%    104%      96%    94%   100%
customer-1             84%     77%    107%     101%    95%   100%
customer-2             79%     71%    117%     123%    89%    98%
Mean:                  83%     74%    105%     101%    86%   101%
This chart shows what effects fastbins and tcache have independently,
and with both combined. Things to note: the contribution from
fastbins (the FB+ and TC:FB+ columns) shows that fastbins continue to
contribute to performance even with tcache enabled, but their
contribution is smaller once tcache is in place. Likewise, the TC+
and FB:TC+ columns show that adding tcache is a win, with or without
fastbins. Column definitions follow, with a worked example after the
list:
Neither = both fastbins and tcache are disabled.
FB+ = relative time when fastbins are added to "neither"
TC+ = when tcache is added to "neither"
FB:TC+ = time change from "fastbin only" to "fastbins+tcache" (i.e. when TC is added to FB)
TC:FB+ = likewise, when fastbins are added to tcache.
TC/FB = ratio of "how fast with just tcache" to "how fast with just fastbins".
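
To make the column definitions concrete, here is the "dj" row worked
out from its raw cycle totals:

    FB+    = Fastbins / Neither = 1,237,241,201 / 1,716,034,269 =  72%
    TC+    = TCache / Neither   =   833,276,021 / 1,716,034,269 =  49%
    FB:TC+ = Both / Fastbins    =   858,955,060 / 1,237,241,201 =  69%
    TC:FB+ = Both / TCache      =   858,955,060 /   833,276,021 = 103%
    TC/FB  = TCache / Fastbins  =   833,276,021 / 1,237,241,201 =  67%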
Test                       Neither        Fastbins          TCache            Both   FB+   TC+ FB:TC+ TC:FB+ TC/FB
389-ds-2           191,664,220,060 182,211,121,679 172,441,402,072 165,443,672,318   95%   90%    91%    96%   95%
dj                   1,716,034,269   1,237,241,201     833,276,021     858,955,060   72%   49%    69%   103%   67%
dj2                 11,852,642,821  10,886,657,226   9,470,287,095   9,143,739,161   92%   80%    84%    97%   87%
mt_test_one_alloc    8,776,157,170   8,737,515,199   7,359,052,251   7,469,894,433  100%   84%    85%   102%   84%
oocalc               2,343,558,811   2,051,407,045   1,455,081,145   1,428,492,279   88%   62%    70%    98%   71%
qemu-virtio          1,354,220,960   1,172,123,759   1,129,676,630   1,053,619,586   87%   83%    90%    93%   96%
qemu-win7              950,748,214     892,111,785     811,040,794     809,406,757   94%   85%    91%   100%   91%
customer-1         120,712,604,936 105,139,327,336 112,007,827,423  88,805,361,692   87%   93%    84%    79%  107%
customer-2           3,725,017,314   3,328,843,534   2,994,818,396   2,641,419,852   89%   80%    79%    88%   90%
Mean:                                                                                89%   78%    83%    95%   88%
This last chart shows the relative RSS overhead for the various
algorithms. The OH: columns are the overhead (actual RSS minus ideal
RSS) for each configuration, and the + columns show each
configuration's overhead relative to the "neither" overhead. The
last column (OH%) is the "neither" RSS as a percentage of the ideal
RSS. Note that the synthetic benchmarks (dj, dj2, and
mt_test_one_alloc) have unusual results because they're designed to
create worst-case scenarios.
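
As a worked example of the overhead columns, take the 389-ds-2 row:

    OH:N  = Neither - Ideal  = 1,013,122 - 735,268 = 277,854
    OH:FB = Fastbins - Ideal = 1,012,798 - 735,268 = 277,530
    FB+   = OH:FB / OH:N     =   277,530 / 277,854 =  99.9%
    OH%   = Neither / Ideal  = 1,013,122 / 735,268 = 137.8%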
Workload               Ideal   Neither  Fastbins    TCache      Both      OH:N     OH:FB     OH:TC      OH:B    FB+    TC+     B+       OH%
389-ds-2             735,268 1,013,122 1,012,798 1,013,277 1,013,340   277,854   277,530   278,009   278,072  99.9% 100.1% 100.1%    137.8%
dj                        53    33,577    28,896    30,238    27,509    33,524    28,843    30,185    27,456  86.0%  90.0%  81.9%  63352.8%
dj2                   17,023    40,853    40,480    40,582    40,338    23,830    23,457    23,559    23,315  98.4%  98.9%  97.8%    240.0%
mt_test_one_alloc     90,533 1,839,588 1,820,204 2,176,172 2,096,968 1,749,055 1,729,671 2,085,639 2,006,435  98.9% 119.2% 114.7%   2032.0%
oocalc                90,151   153,543   153,160   152,909   153,038    63,392    63,009    62,758    62,887  99.4%  99.0%  99.2%    170.3%
qemu-virtio          697,211   855,511   858,240   860,147   859,659   158,300   161,029   162,936   162,448 101.7% 102.9% 102.6%    122.7%
qemu-win7            634,275   689,756   690,810   691,965   694,089    55,481    56,535    57,690    59,814 101.9% 104.0% 107.8%    108.7%
customer-1         2,510,785 2,803,894 2,800,478 2,806,886 2,807,520   293,109   289,693   296,101   296,735  98.8% 101.0% 101.2%    111.7%
customer-2            75,579    99,889    99,600    98,026    97,196    24,310    24,021    22,447    21,617  98.8%  92.3%  88.9%    132.2%
Mean:                                                                                                         98.2% 100.8%  99.4%   7378.7%