Performance of 389 Directory Server using glibc's malloc

In [1] the 389 Directory Server (389-ds) community published an analysis of jemalloc versus glibc's malloc. In response, this article clarifies the terminology and methodology for measuring the performance of a memory allocator, and then shows how that terminology and methodology can be applied to the analysis in [1] to improve on the results.

1. Terminology

These terms are discussed specifically in their GNU/Linux context.

1.1. RSS vs VSZ

The term RSS refers to "resident set size" and is the exact amount of memory a given process is using in resident memory (RAM). The RSS value does not account for memory that has been moved to swap. The term VSZ refers to "virtual set size" and is the total sum of virtual address space being used by the process.

The expectation is that VSZ reflects the true size of the process (the number of physical memory pages being used), but this is not always true. It is possible to consume address space without consuming resident memory. For example, a process can use mmap (under heuristic overcommit) to allocate very large amounts of memory, but until those pages are dirtied they do not count towards RSS. In addition, a process may use madvise with MADV_DONTNEED to give pages back to the OS. Pages given back via MADV_DONTNEED no longer count towards RSS (under heuristic overcommit, but not strict overcommit mode 2, where they do), and they also do not count towards an out-of-memory (OOM) killer score. However, such pages do still count towards the total virtual set size held by the process. So while VSZ may be large, that may be a consequence of a large unused mmap or of calls to madvise with MADV_DONTNEED that returned those pages to the OS. Astute readers will point out that a large VSZ does take up some amount of memory, in that OS page tables must remain in place to account for the memory ranges and to support eventually faulting new pages back in (on the order of 2MB per 1GB, i.e. 1/512th).
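To make the distinction concrete, the following minimal sketch (Linux-specific, and independent of the tests in [1]) reserves a large anonymous mapping, dirties it, and then returns it with madvise and MADV_DONTNEED, printing the VmSize (VSZ) and VmRSS (RSS) lines from /proc/self/status after each step; the 1 GiB size is just an example:

  /* vsz-vs-rss.c: illustrate the VSZ vs RSS distinction on Linux.
     Build: gcc -O2 vsz-vs-rss.c -o vsz-vs-rss */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  /* Print the VmSize (VSZ) and VmRSS (RSS) lines from /proc/self/status.  */
  static void show (const char *when)
  {
    FILE *f = fopen ("/proc/self/status", "r");
    char line[256];
    printf ("--- %s ---\n", when);
    while (f != NULL && fgets (line, sizeof line, f) != NULL)
      if (strncmp (line, "VmSize:", 7) == 0 || strncmp (line, "VmRSS:", 6) == 0)
        fputs (line, stdout);
    if (f != NULL)
      fclose (f);
  }

  int main (void)
  {
    size_t len = 1UL << 30;   /* 1 GiB of address space (example size).  */
    char *p;

    show ("start");

    /* Reserving address space raises VSZ but not RSS: nothing is resident
       until the pages are written.  */
    p = mmap (NULL, len, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
      return 1;
    show ("after mmap, untouched");

    /* Dirtying the pages makes them resident, so RSS now grows too.  */
    memset (p, 1, len);
    show ("after touching the pages");

    /* Giving the pages back with MADV_DONTNEED drops RSS again, but the
       mapping (and therefore VSZ) remains.  */
    madvise (p, len, MADV_DONTNEED);
    show ("after madvise (MADV_DONTNEED)");

    munmap (p, len);
    return 0;
  }

On a typical Linux system this shows VSZ rising by about 1 GiB as soon as the mapping is created, RSS rising only once the pages are written, and RSS (but not VSZ) falling back after the MADV_DONTNEED call.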

In [1] the analysis assumes that VSZ can usefully be used to determine the total memory being used by the process, but in the case of jemalloc and glibc this is not possible. Both allocators use MADV_DONTNEED to return memory to the operating system, and therefore VSZ is not indicative of actual process memory usage. The best indicator is RSS.

In [1] the unlabelled values under "Memory Profile" are: date, pid, VSZ, RSS, CPU%.

Given those values we can build the following tables from the analysis, based on RSS usage:

Search Tests:

  Test                           glibc RSS (KB)  jemalloc RSS (KB)  % RSS change
  100K Searches Per Thread                56872              58656         +3.13
  1 Million Searches Per Thread           47924              59252        +23.63

Modify Tests:

  Test                           glibc RSS (KB)  jemalloc RSS (KB)  % RSS change
  10k Modifies Per Thread                 48940              35712        -27.02
  100k Modifies Per Thread                48940              59252        +21.07

Add/Delete Tests:

  Test                           glibc RSS (KB)  jemalloc RSS (KB)  % RSS change
  10k Add/Delete                          49096              53844         +9.67
  Add/Delete 10k - Run 10 Times          106804              56808        -46.81

Unindexed Searches:

  Test                           glibc RSS (KB)  jemalloc RSS (KB)  % RSS change
  20 Searches Per Thread                  75420              59224        -21.47
  100 Searches Per Thread                 78928              60152        -23.78

Large Entry Cache (cache primed), Search Tests:

  Test                           glibc RSS (KB)  jemalloc RSS (KB)  % RSS change
  Search for 100k Entries               1022348             838540        -17.97
  Search for 1 Million Entries          1020960             838784        -17.84

Modify Tests (cache primed):

  Test                           glibc RSS (KB)  jemalloc RSS (KB)  % RSS change
  10k Modifies Per Thread               1197852             870444        -27.33
  100k Modifies Per Thread              1845452            1067028        -42.18

Unindexed Search Tests (cache primed):

  Test                           glibc RSS (KB)  jemalloc RSS (KB)  % RSS change
  20 Searches Per Thread                1021036             835312        -18.18
  100 Searches Per Thread               1020756             834760        -18.22

As can be seen in the RSS-based analysis there are some cases where glibc's RSS usage is lower than jemalloc's RSS usage. However, there are also several key tests where glibc's RSS usage is consistently 20-45% higher.

The side effect of this is that, for 389-ds usage, such a server would need on average (mean) roughly 14% more physical memory to operate under the same conditions as above (assuming the various modes of operation are equally likely) when using the glibc memory allocator.
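For reference, we assume the 14% figure is the arithmetic mean of the % RSS change column over the fourteen tests above, each mode of operation weighted equally:

  \[
    \overline{\Delta}_{\mathrm{RSS}}
      = \frac{3.13 + 23.63 - 27.02 + 21.07 + 9.67 - 46.81 - 21.47 - 23.78
              - 17.97 - 17.84 - 27.33 - 42.18 - 18.18 - 18.22}{14}
      \approx -14.5\%
  \]

That is, in these runs jemalloc's RSS is on average roughly 14-15% lower than glibc's.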

1.2. Fragmentation

When discussing the fragmentation of a given allocator, care must be given to the definition of fragmentation. We distinguish two types of fragmentation: internal and external. Internal fragmentation is space wasted inside allocated blocks, for example by rounding requests up to a minimum size or alignment and by per-block bookkeeping overhead. External fragmentation is free space that has been split into pieces too small or too scattered to satisfy new requests, even though the total amount of free space would be sufficient.

In [1] the term fragmentation is used to mean "increase in the VSZ over time", which is a consequence-based definition of fragmentation, but one that requires more precision if we are to act on the information we have collected. What the long-running tests are looking for is fragmentation that might lead to unbounded growth. In theory, given that such packing problems are NP-hard and that we want malloc to complete in bounded time, all allocators are susceptible to unbounded storage growth under some allocation pattern that causes external fragmentation.
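As a toy illustration of external fragmentation (a self-contained sketch, not a model of how 389-ds allocates memory), the following program frees every other small block and then requests larger blocks that cannot fit into the resulting holes, so the heap keeps growing even though plenty of memory is free:

  /* fragmentation-demo.c: a toy external-fragmentation pattern.
     Build: gcc -O2 fragmentation-demo.c -o fragmentation-demo */
  #include <malloc.h>
  #include <stdlib.h>

  #define NBLOCKS 50000

  int main (void)
  {
    static void *blocks[NBLOCKS];
    int i;

    /* Fill the heap with small blocks...  */
    for (i = 0; i < NBLOCKS; i++)
      blocks[i] = malloc (4000);

    /* ...then free every other one.  Lots of memory is now free in total,
       but it is scattered in ~4000-byte holes between live blocks.  */
    for (i = 0; i < NBLOCKS; i += 2)
      free (blocks[i]);

    malloc_stats ();   /* "in use" is about half of "system bytes".  */

    /* Larger requests cannot be satisfied from the 4000-byte holes, so
       the heap (and with it RSS) grows even though free space exists.  */
    for (i = 0; i < NBLOCKS; i += 2)
      blocks[i] = malloc (9000);

    malloc_stats ();   /* "system bytes" grows: external fragmentation.  */
    return 0;
  }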

Unfortunately glibc's malloc does not make it easy to determine whether unbounded RSS growth is due to:

 * external fragmentation within the allocator,
 * a genuine memory leak in the application, or
 * a genuine increase in the application's resource requirements.

If an alternate allocator shows stable RSS usage over time, given the same workload, then it is very likely that there is a fragmentation problem between the workload and the original allocator.

The data collected for long-running processes in [1] is based on VSZ measurements and is therefore not suitable for determining exactly how RSS behaves over time. A long-running test that captures RSS over time needs to be carried out in order to distinguish between the above three causes of RSS growth.
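One way to gather that data from a live process, when glibc is the allocator, is to periodically dump allocator statistics and attempt to trim the heap. The sketch below uses the glibc-specific malloc_info() and malloc_trim() interfaces; how and where to hook it into the server is left open:

  /* A sketch of periodic allocator introspection using the glibc-specific
     interfaces declared in <malloc.h>; call it from a maintenance thread
     or timer in the long-running process under study.  */
  #include <stdio.h>
  #include <malloc.h>

  void report_allocator_state (void)
  {
    /* malloc_info dumps per-arena statistics as XML, including how much
       memory the allocator holds that is currently free.  If that free
       figure grows while application demand stays flat, the RSS growth is
       allocator retention/fragmentation rather than an application leak.  */
    malloc_info (0, stderr);

    /* malloc_trim asks glibc to return unused memory to the kernel.  If
       RSS drops sharply after this call the growth was memory held by the
       allocator; if it barely moves, suspect a real leak or a genuine
       increase in the application's working set.  */
    malloc_trim (0);
  }

Sampling RSS (for example VmRSS from /proc/<pid>/status) alongside this output over the length of the run gives the time series that the VSZ-based data in [1] cannot provide.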

2. Methodology

2.1. Statistical significance

While the tests themselves run for a long period of time, they produce a single value for total RSS usage. Comparing a single RSS value against another single RSS value from a completely distinct implementation may not give an accurate or meaningful comparison (particularly if the two random variables have high variance).

We should run each test enough times to capture a series of RSS usage numbers (enough for the desired confidence interval) and use those numbers to determine whether the two implementations differ to some level of statistical significance. That is to say, the difference between glibc's malloc and jemalloc should be framed as a question of statistical significance between the two sets of results.

The use of statistical significance becomes even more important when considering the performance differences between the glibc and jemalloc implementations. When looking at performance, the nominal variance in a desktop system causes a considerable amount of noise in the measurements (see [4] for examples), so you need to take many measurements or special precautions to remove such noise.

In summary, we strongly recommend multiple runs of each test followed by a check for statistically significant differences as part of the standard operating procedure for comparing any two random variables, e.g. the time taken to complete the same task. Suitable tests include the paired difference tests [5], most commonly a Student's t-test.
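As a sketch of what such a check can look like, the following computes the paired Student's t statistic over repeated RSS measurements of one test under both allocators. The sample values are made up for illustration and are not measurements from [1]:

  /* paired-t.c: paired Student's t statistic over repeated RSS samples.
     The sample values below are placeholders, not data from [1].
     Build: gcc -O2 paired-t.c -o paired-t -lm */
  #include <math.h>
  #include <stdio.h>

  /* t statistic for paired samples a[i], b[i], i = 0..n-1; it has n - 1
     degrees of freedom.  */
  static double paired_t (const double *a, const double *b, int n)
  {
    double mean = 0.0, var = 0.0;
    int i;

    for (i = 0; i < n; i++)
      mean += (a[i] - b[i]) / n;
    for (i = 0; i < n; i++)
      {
        double d = (a[i] - b[i]) - mean;
        var += d * d / (n - 1);
      }
    return mean / sqrt (var / n);
  }

  int main (void)
  {
    /* Hypothetical RSS samples (KB) from five runs of the same test.  */
    double glibc_rss[]    = { 48940, 49102, 48876, 49310, 48990 };
    double jemalloc_rss[] = { 59252, 59104, 59388, 59220, 59176 };
    int n = 5;

    printf ("t = %f with %d degrees of freedom\n",
            paired_t (glibc_rss, jemalloc_rss, n), n - 1);
    return 0;
  }

Comparing the statistic against a t distribution with n-1 degrees of freedom (or letting any statistics package compute the p-value) tells you whether the observed RSS difference is significant at your chosen level.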

2.2. VSZ and Linux OOM Killer

The size of VSZ does not significantly impact the commit charge of the process (aside from the page table usage mentioned earlier), and thus VSZ does not increase the scoring used by the OOM killer in any appreciable way. When one looks at reasons for switching from glibc's malloc to jemalloc, the increased VSZ is not a good one. In fact there are at least two solutions one can take to protect production environments from failing randomly due to the OOM killer while still reporting reasonable errors: switching the system to strict overcommit, or disabling the OOM killer for the key application.

If you disable overcommit (that is, switch to strict overcommit, vm.overcommit_memory=2), the application will always be able to write to the pages of memory it has been granted by malloc, without the scenario where touching a page kills the process (not enough memory, we were overcommitted, and the OOM killer chose us). In some scenarios switching to strict overcommit will require significantly more swap. More swap is required because poorly written applications that allocate large amounts of memory but never use it would previously have left the system overcommitted, whereas the kernel must now reserve backing store to honour every request. Thus, depending on the mix of applications, the first solution may not work.

Disabling the OOM killer for key applications is often the simplest solution. Other processes will be killed to reclaim the pages needed by the critical processes. This change has less of a global impact on an existing production environment.
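A minimal sketch of the second solution follows: the process writes -1000 to /proc/self/oom_score_adj, which tells the kernel never to select it for OOM killing (lowering the score requires root or CAP_SYS_RESOURCE, and deployments usually set it from the service manager rather than in the program itself). The first solution is a system-wide sysctl instead, vm.overcommit_memory=2.

  /* Exempt a critical process from the OOM killer by writing -1000 to
     /proc/self/oom_score_adj.  */
  #include <stdio.h>

  static int disable_oom_killing (void)
  {
    FILE *f = fopen ("/proc/self/oom_score_adj", "w");
    int rc = 0;

    if (f == NULL)
      return -1;
    /* -1000 means "never select this process"; 0 is the default, and
       +1000 makes the process the preferred victim.  */
    if (fprintf (f, "-1000\n") < 0)
      rc = -1;
    if (fclose (f) != 0)
      rc = -1;
    return rc;
  }

  int main (void)
  {
    if (disable_oom_killing () != 0)
      perror ("oom_score_adj");
    /* ... start the critical service ... */
    return 0;
  }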

2.3. Tuning glibc's malloc

There are a number of tunable parameters that can be used to adjust the behaviour of glibc's malloc. As a general purpose allocator it has several parameters that could impact RSS usage or growth. When attempting to limit RSS usage there are two key things to try, discussed below: limiting the number of arenas, and forcing the application's regular allocations to be served directly by mmap.

The first suggestion, limiting the number of arenas (for example with the MALLOC_ARENA_MAX environment variable), will increase contention between threads that share the same arena (allocations are serialized). However, it may decrease fragmentation, because each arena's free lists then service a wider variety of allocations and may find it easier to match a broader set of request sizes (as opposed to just one allocation pattern). This kind of tuning can be applied to production systems once performance measurements have been made; a sketch follows.
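The sketch below limits the arena count from inside the program via mallopt; the limit of two arenas is only an example value to experiment with, and the same effect is available without a rebuild by exporting MALLOC_ARENA_MAX in the server's environment.

  /* A sketch of capping the number of malloc arenas.  */
  #include <malloc.h>
  #include <stdio.h>

  int main (void)
  {
    /* All threads then share two arenas: more allocations are serialized,
       but fragmentation may go down.  */
    if (mallopt (M_ARENA_MAX, 2) == 0)
      fprintf (stderr, "mallopt (M_ARENA_MAX) failed\n");

    /* ... start worker threads, run the workload, measure RSS and
       throughput ... */
    return 0;
  }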

The second suggestion is perhaps the most dramatic change; it has a serious performance impact, but produces some interesting experimental results. Using ltrace or systemtap, determine the largest regular allocation made by the application. Take that size and set MALLOC_MMAP_THRESHOLD_ to just under that value (rounded down to a page size). This forces most allocations to be handled by mmap/munmap, which means there can be no allocator-algorithm fragmentation, because the blocks are taken from the OS and handed straight back (any fragmentation is then in the OS virtual address space layout, which for 64-bit applications is not an issue). Increase MALLOC_MMAP_MAX_ to account for the increased number of mmap-based allocations, and test your application. If RSS usage continues to grow, it is a strong indicator of a real leak or an increase in resource requirements, since the likelihood of fragmentation in the allocator is now low.
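A minimal sketch of that experiment is below. The 64 KiB threshold and the M_MMAP_MAX value are placeholders to be replaced with values derived from your own trace; the equivalent environment variables are MALLOC_MMAP_THRESHOLD_ and MALLOC_MMAP_MAX_.

  /* A sketch of the mmap-threshold experiment.  */
  #include <malloc.h>
  #include <stdio.h>

  int main (void)
  {
    /* Example: if the workload's common allocations are ~70 KiB, force
       anything of 64 KiB or more to be served directly by mmap/munmap.
       Setting the threshold via mallopt also disables its normal dynamic
       adjustment.  */
    if (mallopt (M_MMAP_THRESHOLD, 64 * 1024) == 0)
      fprintf (stderr, "mallopt (M_MMAP_THRESHOLD) failed\n");

    /* Raise the cap on simultaneous mmap-based allocations to match the
       heavier use of mmap.  */
    if (mallopt (M_MMAP_MAX, 1024 * 1024) == 0)
      fprintf (stderr, "mallopt (M_MMAP_MAX) failed\n");

    /* ... run the workload and watch RSS over time ... */
    return 0;
  }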

3. Conclusion

In the case of 389-ds it is not clear that switching to jemalloc will reduce deployed-server RSS usage, given the results of the analysis in [1], the missing analysis of RSS usage and of long-term RSS behaviour, and the missing statistical significance numbers. More testing is certainly required to capture the data needed to make an informed choice.

Through the use of RSS measurements, statistical testing, system configuration, and malloc tuning, one can improve the accuracy of the results when deciding whether to change from the default system allocator to any alternate allocator.

Note that the original 389-ds tests used a single-threaded client. Since then, doubts have been expressed about how accurately these tests model typical workloads with many clients. Thus there is an incentive to redo the tests with multiple client threads and, at the same time, use the recommendations given here to improve the accuracy of the results.

In general, the primary reasons for using glibc's malloc are that it is very well tested and supported by most major distributions as part of their support for key runtimes, and that developer tooling is therefore also well tested with glibc's malloc. Lastly, the security profile of glibc's malloc is constantly being reviewed and improved, since it is a major target for attackers. All of these aspects combined make glibc's malloc a key place to join the open source collaboration around memory allocation techniques.

The Red Hat glibc team is working on a variety of malloc performance and feature enhancements for glibc 2.24 (August 1st 2016). One of these is a thread-local cache, similar to those used in tcmalloc [2] and jemalloc [3], available on the dj/malloc branch. As part of these enhancements we are looking at real user workloads, tracing them, modelling them, and using those models to improve malloc. We encourage developers to approach us with their malloc-related problems so that we can gather workload models that we, as a community, can use to improve glibc's malloc.

4. References

[1] http://www.port389.org/docs/389ds/FAQ/jemalloc-testing.html

[2] http://goog-perftools.sourceforge.net/doc/tcmalloc.html

[3] https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919

[4] http://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/

[5] https://en.wikipedia.org/wiki/Paired_difference_test
