Monitoring Red Hat Enterprise Linux 6 Transparent Huge Pages Performance with SystemTap

Problem

Red Hat Enterprise Linux 6 automatically uses huge pages (2MB) for anonymous memory allocations. The user would like to know when the use of huge pages might have a performance impact. There are cases where additional latency is incurred zeroing the memory in a huge page or modifying the page management information so the region can be accessed as normal sized pages.

Introduction

Note that you can still collect this data using SystemTap; however, recent RHEL 6 kernels now export this data directly in /proc/vmstat:

# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 2018
thp_fault_alloc 7302
thp_fault_fallback 0
thp_collapse_alloc 401
thp_collapse_alloc_failed 0
thp_split 21

The typical memory page size in Linux is on the order of kilobytes; for example, on x86 processors each memory page is normally 4096 bytes. Using a small, uniform-sized page makes it easier to implement virtual memory. However, the small page size can lead to unwanted overhead due to the amount of memory used to manage the many small virtual memory pages, more frequent virtual memory bookkeeping operations, and a higher cost for individual virtual memory bookkeeping operations such as virtual to physical address lookups.
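
The base page size and the huge page size on a particular machine can be confirmed from the command line; the values shown below are what one would typically see on x86_64 and may differ on other systems:

# getconf PAGESIZE
4096
# grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB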

The transparent huge page support in the Red Hat Enterprise Linux 6 kernel is an implicit mechanism that allows normal applications which allocate large amounts of memory to use huge (2MB) pages. The mechanism limits the use of huge pages to anonymous (heap) allocated memory, but it provides the following benefits:

The advantage of using huge (2MB) pages is that each Translation Lookaside Buffer (TLB) entry mapping virtual to physical address space covers a larger region of memory, reducing the amount of memory required for bookkeeping. A single virtual to physical mapping for a 2MB page would require 512 mappings if implemented with 4096 byte pages, so less memory is needed for the mappings when huge pages are used. The processor caches only a limited number of these virtual to physical mappings as TLB entries; using huge pages allows the processor to address a larger portion of memory without having to recompute TLB entries. Finally, the virtual to physical translation is quicker for huge pages than for normal sized pages because fewer levels of indirection are required to look up the information.
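
Whether the mechanism is currently active on a given machine can be checked through sysfs and /proc/meminfo. The redhat_transparent_hugepage path below is the RHEL 6 naming; the bracketed entry is the active mode, and the exact modes and counter values will vary from system to system:

# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never
# grep AnonHugePages /proc/meminfo
AnonHugePages:    528384 kB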

The transparent huge page mechanism can improve performance. However, there are cases where it adds cost:

Page Faults

The common case that causes new huge pages to be allocated is a large allocation of heap memory followed by an access to that memory. The Linux kernel does not actually create a page for the allocation until the user program touches it. When the user process attempts to access the memory, a page fault occurs, and the kernel maps the page into the user process and clears out the memory. The result is that the latency for handling a minor page fault may be greater for huge pages than for normal sized pages. Thus, the first access to a newly allocated but unmapped page of memory may be on the order of hundreds of microseconds rather than on the order of 10 microseconds. This can be observed using the SystemTap pfaults.stp example script and filtering with grep:

pfaults.stp |grep ":minor" |grep "[0-9][0-9][0-9]$"

The latency for the first access to the huge page may be longer. However, the overhead of the page fault is only encountered once for the 2MB region rather than 512 times for the equivalent 4KB pages if all the memory is touched.
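
If the pfaults.stp example script is not at hand, a minimal sketch of the same idea can be written against the vm.pagefault tapset probes (assuming they resolve on the installed kernel). Unlike pfaults.stp it does not distinguish minor from major faults; it simply records per-process page fault handling latency in microseconds, printed as an aggregate when the script exits:

stap -e 'global start, faults
probe vm.pagefault { start[tid()] = gettimeofday_us() }
probe vm.pagefault.return {
  if (tid() in start) {
    faults[execname(), pid()] <<< (gettimeofday_us() - start[tid()])
    delete start[tid()]
  }
}'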

The following script uses the @entry operator in SystemTap 1.3 to directly probe the function clearing out the memory in the huge pages:

stap -e 'global huge_clear
probe kernel.function("clear_huge_page").return {
  huge_clear[execname(), pid()] <<< (gettimeofday_us() - @entry(gettimeofday_us()))
}'

When Ctrl-C is pressed, the time spent in the clear_huge_page function is printed, grouped by executable name and process ID. The @count is the number of times the clear_huge_page function was used by the process, @min and @max are the shortest and longest times in microseconds required to clear out a page, and @sum is the total number of microseconds the process spent clearing huge pages. Below is a sample output:

huge_clear["plugin-config",23481] @count=7 @min=253 @max=303 @sum=1945 @avg=277
huge_clear["firefox",23493] @count=3 @min=246 @max=248 @sum=740 @avg=246
huge_clear["firefox",23125] @count=2 @min=215 @max=239 @sum=454 @avg=227

Splitting Huge Pages

There are portions of the Linux kernel that are only able to handle normal sized pages. For those sections of the kernel the 2MB huge page must be split into 4KB pages before being used. More accurately, bookkeeping information must be generated so the 2MB huge page can be accessed as 512 4KB pages. This splitting can also add latency to operations. The following script will show when huge pages are being split:

stap -e 'probe kernel.function("split_huge_page") { printf("%s: %s(%d)\n", pp(), execname(), pid());}'

It will generate output like the following:

kernel.function("split_huge_page@mm/huge_memory.c:1298"): firefox(23493)
kernel.function("split_huge_page@mm/huge_memory.c:1298"): firefox(23493)
kernel.function("split_huge_page@mm/huge_memory.c:1298"): firefox(23493)
kernel.function("split_huge_page@mm/huge_memory.c:1298"): firefox(23493)
kernel.function("split_huge_page@mm/huge_memory.c:1298"): firefox(23493)
kernel.function("split_huge_page@mm/huge_memory.c:1298"): firefox(23493)

If you want more context about why the split_huge_page function is being called, you can add a print_backtrace() call to the previous script:

stap -e 'probe kernel.function("split_huge_page") { printf("%s: %s(%d):\n", pp(), execname(), pid()); print_backtrace()}'

Each time the split_huge_page function is called, you will see output like the following printed:

kernel.function("split_huge_page@mm/huge_memory.c:1298"): thunderbird-bin(29374):
 0xffffffff81166040 : split_huge_page+0x0/0x7f0 [kernel]
 0xffffffff811668b1 : __split_huge_page_pmd+0x81/0xc0 [kernel] (inexact)
 0xffffffff8113364e : unmap_vmas+0xa1e/0xc00 [kernel] (inexact)
 0xffffffff811342e1 : zap_page_range+0x81/0xf0 [kernel] (inexact)
 0xffffffff8113027d : sys_madvise+0x54d/0x760 [kernel] (inexact)
 0xffffffff810d40a2 : audit_syscall_entry+0x272/0x2a0 [kernel] (inexact)
 0xffffffff8101f7c9 : ftrace_raw_event_sys_enter+0xd9/0x130 [kernel] (inexact)
 0xffffffff8101ea88 : syscall_trace_enter+0x1d8/0x1e0 [kernel] (inexact)
 0xffffffff81013387 : tracesys+0xd9/0xde [kernel] (inexact)

In the example output above it looks like the huge page is being split so that a madvise operation can remove part of the region it maps.

Merging Pages

The transparent huge page mechanism also has the ability to merge normal sized pages into a huge page. This is accomplished by a kernel thread, khugepaged, which runs periodically. The kernel thread is controlled through the /sys/kernel/mm/redhat_transparent_hugepage/khugepaged directory. This directory contains the following entries to control the operation of khugepaged:

alloc_sleep_millisecs  full_scans     pages_collapsed  scan_sleep_millisecs
defrag                 max_ptes_none  pages_to_scan
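
The current settings can be read in one pass, for example with:

# grep -r . /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/

This prints each control file together with its current value. The pages_collapsed and full_scans entries report how much work khugepaged has already done, while scan_sleep_millisecs, alloc_sleep_millisecs, and pages_to_scan tune how aggressively it runs.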

One could monitor when pages are collapsed into huge pages with the following script:

stap -e 'probe kernel.function("collapse_huge_page") { printf("%s: %s(%d)\n", pp(), execname(), pid());}'
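
Because the collapsing is performed by the khugepaged kernel thread rather than by the application itself, the process reported by the script will usually be khugepaged. The kernel's own counters offer a cheap cross-check; for example, the thp_collapse_alloc value shown in the Introduction can be sampled periodically:

# watch -n 10 "egrep 'thp_collapse|nr_anon_transparent' /proc/vmstat"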
