malloc per-thread cache: benchmarks
- From: DJ Delorie <dj at redhat dot com>
- To: libc-alpha at sourceware dot org
- Date: Tue, 24 Jan 2017 16:10:53 -0500
- Subject: malloc per-thread cache: benchmarks
The purpose of this email is to summarize my findings while
benchmarking the new per-thread cache (TCache) optimization to glibc's
malloc code. Please respond to this email with comments on
performance; a future email will contain the patch and start
patch-related conversations.
Executive summary: A per-thread cache reduces malloc call time to
about 83% of pristine, with a cost of about 1% more RSS[*]. TCache
doesn't replace Fastbins; combined performance is better than either
separately. Performance testing was done on an x86_64 system, as most
of my workloads won't fit on a 32-bit system :-)
[*] but 0.5% *less* RSS if you ignore synthetic benchmarks :-)
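
For those who haven't seen the patch yet, the core idea is a small
per-thread array of singly-linked free lists that is consulted before
the usual locked paths. The sketch below is only an illustration of
that shape, not the actual glibc code: the bin count, depth limit,
size-class mapping, my_* entry points, and backend_* helpers are all
invented for the example, and a real free() would recover the chunk
size from its header rather than take it as a parameter.

#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the rest of the allocator; they just call the system
   malloc/free here so the sketch is self-contained.  */
static void *backend_alloc(size_t size) { return malloc(size); }
static void backend_free(void *p) { free(p); }

#define TC_BINS 64   /* illustrative bin count */
#define TC_MAX   7   /* illustrative per-bin depth limit */

struct tc_entry { struct tc_entry *next; };
struct tc_bin { struct tc_entry *head; unsigned count; };

/* One cache per thread, so the fast paths below take no locks.  */
static __thread struct tc_bin cache[TC_BINS];

/* Illustrative size-class mapping: 16-byte classes.  */
static size_t tc_index(size_t size) { return size / 16; }

void *my_malloc(size_t size)
{
    size_t ix = tc_index(size);
    if (ix < TC_BINS && cache[ix].head != NULL) {
        /* Fast path: pop from this thread's cache; no locks, no
           atomics.  */
        struct tc_entry *e = cache[ix].head;
        cache[ix].head = e->next;
        cache[ix].count--;
        return e;
    }
    return backend_alloc(size);   /* slow path: the normal allocator */
}

void my_free(void *p, size_t size)
{
    size_t ix = tc_index(size);
    if (p != NULL && ix < TC_BINS && cache[ix].count < TC_MAX) {
        /* Push onto this thread's cache for quick reuse.  */
        struct tc_entry *e = p;
        e->next = cache[ix].head;
        cache[ix].head = e;
        cache[ix].count++;
        return;
    }
    backend_free(p);              /* bin full (or too big): hand back */
}

A cache hit costs a handful of non-atomic instructions and takes no
locks, which is the effect the malloc and free columns below are
showing.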
----------------------------------------
The rest of the mail is the raw data. It's a bit "busy" number-wise
and best viewed with a fixed font and a wide screen because of all the
tables. The benchmarks were done by running workloads captured from
various real-world apps and some synthetic benchmarks, using the tools
and simulator in the dj/malloc branch.
These first three charts show the breakdown of how many cycles each
API call used, for both a pristine glibc build and a build with TCache
enabled. "Total" is the total number of cycles used by the entire
test; the other columns are mean cycles per API call. The RSS
column shows the memory used (resident set size). Note that the
increase in mean calloc/realloc times is due to some overhead (which
needs to be done anyway) being moved from malloc to those calls; since
malloc is called far more often, this is a net win.
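
(For reference, "mean cycles per API call" can be thought of as a TSC
delta taken around each call, averaged over the run. Below is a
minimal sketch of that kind of measurement on x86_64; it is not the
dj/malloc simulator, and the size pattern is a stand-in for a
replayed trace.)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    enum { CALLS = 1000000, LIVE_MAX = 1000 };
    void *ptrs[LIVE_MAX];
    size_t live = 0;
    uint64_t cycles = 0;

    for (int i = 0; i < CALLS; i++) {
        size_t sz = 16 + (i % 256);   /* stand-in for a captured trace */

        /* rdtsc is not serializing; a careful harness would add
           fences or use rdtscp, but this shows the idea.  */
        uint64_t t0 = __rdtsc();
        void *p = malloc(sz);
        cycles += __rdtsc() - t0;

        if (live == LIVE_MAX)         /* keep a bounded live set */
            free(ptrs[--live]);
        ptrs[live++] = p;
    }
    while (live > 0)
        free(ptrs[--live]);

    printf("mean cycles per malloc: %llu\n",
           (unsigned long long)(cycles / CALLS));
    return 0;
}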
-------------------- Pristine --------------------
Workload                     Total  malloc  calloc  realloc   free        RSS
389-ds-2           182,211,121,679     165     707      239    123  1,012,798
dj                   1,237,241,201     443   1,796        0    173     28,896
dj2                 10,886,657,226     183     482      223    110     40,480
mt_test_one_alloc    8,737,515,199  15,593  55,927        0    943  1,820,204
oocalc               2,051,407,045     179     392      718    118    153,160
qemu-virtio          1,172,123,759     343     573      558    226    858,240
qemu-win7              892,111,785     401     676      658    211    690,810
customer-1         105,139,327,336     198   2,036      401    131  2,800,478
customer-2           3,328,843,534     186   2,298      559    147     99,600
-------------------- TCache --------------------
Workload                     Total  malloc  calloc  realloc   free        RSS
389-ds-2           165,443,672,318      90     712      204    108  1,013,340
dj                     858,955,060     310   1,818        0    118     27,509
dj2                  9,143,739,161     138     508      230     94     40,338
mt_test_one_alloc    7,469,894,433  13,292  54,600        0    841  2,096,968
oocalc               1,428,492,279     120     411      778     85    153,038
qemu-virtio          1,053,619,586     296     608      518    208    859,659
qemu-win7              809,406,757     331     701      630    199    694,089
customer-1          88,805,361,692     153   2,187      407    124  2,807,520
customer-2           2,641,419,852     132   2,687      688    131     97,196
-------------------- Change --------------------
Workload             Total  malloc  calloc  realloc   free    RSS
389-ds-2               91%     55%    101%      85%    88%   100%
dj                     69%     70%    101%             68%    95%
dj2                    84%     75%    105%     103%    85%   100%
mt_test_one_alloc      85%     85%     98%             89%   115%
oocalc                 70%     67%    105%     108%    72%   100%
qemu-virtio            90%     86%    106%      93%    92%   100%
qemu-win7              91%     83%    104%      96%    94%   100%
customer-1             84%     77%    107%     101%    95%   100%
customer-2             79%     71%    117%     123%    89%    98%
Mean:                  83%     74%    105%     101%    86%   101%
This chart shows what effects fastbins and tcache have independently,
and with both combined. Things to note: the contribution from
fastbins (the FB+ and TC:FB+ columns) shows that fastbins continue to
contribute to performance even with tcache enabled, but their
contribution is smaller once tcache is in place. Likewise, the TC+
and FB:TC+ columns show that adding tcache is a win, with or without
fastbins. Column definitions follow, with a worked example after the
list:
Neither = both fastbins and tcache are disabled.
FB+ = relative time when fastbins are added to "neither"
TC+ = when tcache is added to "neither"
FB:TC+ = time change from "fastbin only" to "fastbins+tcache" (i.e. when TC is added to FB)
TC:FB+ = likewise, when fastbins are added to tcache.
TC/FB = ratio of "how fast with just tcache" to "how fast with just fastbins".
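
To make the column definitions concrete, here is the "dj" row worked
out from its raw cycle totals:

    FB+    = Fastbins / Neither = 1,237,241,201 / 1,716,034,269 =  72%
    TC+    = TCache / Neither   =   833,276,021 / 1,716,034,269 =  49%
    FB:TC+ = Both / Fastbins    =   858,955,060 / 1,237,241,201 =  69%
    TC:FB+ = Both / TCache      =   858,955,060 /   833,276,021 = 103%
    TC/FB  = TCache / Fastbins  =   833,276,021 / 1,237,241,201 =  67%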
Test                       Neither        Fastbins          TCache            Both   FB+   TC+ FB:TC+ TC:FB+ TC/FB
389-ds-2           191,664,220,060 182,211,121,679 172,441,402,072 165,443,672,318   95%   90%    91%    96%   95%
dj                   1,716,034,269   1,237,241,201     833,276,021     858,955,060   72%   49%    69%   103%   67%
dj2                 11,852,642,821  10,886,657,226   9,470,287,095   9,143,739,161   92%   80%    84%    97%   87%
mt_test_one_alloc    8,776,157,170   8,737,515,199   7,359,052,251   7,469,894,433  100%   84%    85%   102%   84%
oocalc               2,343,558,811   2,051,407,045   1,455,081,145   1,428,492,279   88%   62%    70%    98%   71%
qemu-virtio          1,354,220,960   1,172,123,759   1,129,676,630   1,053,619,586   87%   83%    90%    93%   96%
qemu-win7              950,748,214     892,111,785     811,040,794     809,406,757   94%   85%    91%   100%   91%
customer-1         120,712,604,936 105,139,327,336 112,007,827,423  88,805,361,692   87%   93%    84%    79%  107%
customer-2           3,725,017,314   3,328,843,534   2,994,818,396   2,641,419,852   89%   80%    79%    88%   90%
Mean:                                                                                89%   78%    83%    95%   88%
This last chart shows the relative RSS overhead for the various
algorithms. The OH: columns are the overhead (actual RSS minus ideal
RSS) for each configuration, and the + columns show each
configuration's overhead relative to the "neither" overhead. The
last column (OH%) is the "neither" RSS as a percentage of the ideal
RSS. Note that the synthetic benchmarks (dj, dj2, and
mt_test_one_alloc) have unusual results because they're designed to
create worst-case scenarios.
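
As a worked example of the overhead columns, take the 389-ds-2 row:

    OH:N  = Neither - Ideal  = 1,013,122 - 735,268 = 277,854
    OH:FB = Fastbins - Ideal = 1,012,798 - 735,268 = 277,530
    FB+   = OH:FB / OH:N     =   277,530 / 277,854 =  99.9%
    OH%   = Neither / Ideal  = 1,013,122 / 735,268 = 137.8%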
Workload               Ideal   Neither  Fastbins    TCache      Both      OH:N     OH:FB     OH:TC      OH:B    FB+    TC+     B+       OH%
389-ds-2             735,268 1,013,122 1,012,798 1,013,277 1,013,340   277,854   277,530   278,009   278,072  99.9% 100.1% 100.1%    137.8%
dj                        53    33,577    28,896    30,238    27,509    33,524    28,843    30,185    27,456  86.0%  90.0%  81.9%  63352.8%
dj2                   17,023    40,853    40,480    40,582    40,338    23,830    23,457    23,559    23,315  98.4%  98.9%  97.8%    240.0%
mt_test_one_alloc     90,533 1,839,588 1,820,204 2,176,172 2,096,968 1,749,055 1,729,671 2,085,639 2,006,435  98.9% 119.2% 114.7%   2032.0%
oocalc                90,151   153,543   153,160   152,909   153,038    63,392    63,009    62,758    62,887  99.4%  99.0%  99.2%    170.3%
qemu-virtio          697,211   855,511   858,240   860,147   859,659   158,300   161,029   162,936   162,448 101.7% 102.9% 102.6%    122.7%
qemu-win7            634,275   689,756   690,810   691,965   694,089    55,481    56,535    57,690    59,814 101.9% 104.0% 107.8%    108.7%
customer-1         2,510,785 2,803,894 2,800,478 2,806,886 2,807,520   293,109   289,693   296,101   296,735  98.8% 101.0% 101.2%    111.7%
customer-2            75,579    99,889    99,600    98,026    97,196    24,310    24,021    22,447    21,617  98.8%  92.3%  88.9%    132.2%
Mean:                                                                                                         98.2% 100.8%  99.4%   7378.7%