Measuring Linux Filesystem Cache Hit Rate

Script written by Jake Maul (jakemaul@gmail.com), with lots of input from the mailing list (Josh Stone, Will Cohen, Thiago Manel) and from Dave Wright.

Problem

None of the standard tools gives a good indication of how effective the Linux filesystem cache is at caching your data. You can guess by clearing the cache and watching how fast it grows back, but this is a fairly crude measurement.

If you know about the data, you may be able to predict whether adding more RAM will be useful. If you don't know the data, this is much harder to diagnose: you can't just glance at the amount of free RAM, see that it's low, and say "Add RAM!". It's entirely possible that the cache has built up over a long time and contains many things that are never accessed again, or that everything in it is vital for performance; there's no good way to tell from the memory numbers alone. You could flush the cache and watch how quickly it grows back, but that is only acceptable if you can afford the performance impact and have time to watch it refill. Even if it fills up quickly, you can't say with complete confidence that you don't have exactly the right amount of RAM for that data set.

Knowing what to do is much simpler if you know the cache hit rate:

If the cache is full and the hit rate is low, then more RAM will probably help I/O performance.

If the cache is full and the hit rate is very high, more RAM probably will not help I/O performance much.

If the cache is not full and the hit rate is low, then your data is not particularly cache-friendly for some reason. You might benefit from trying to restructure it such that it caches better. This might be a no-win scenario. :(

If the cache is not full and the hit rate is high, then you have successfully cached all of your data, or at least as much as is useful to cache. More RAM won't help improve I/O performance.

Scripts

cache-hit-rate.stp (note: for "smoother" results you may want to increase the length of the timer, e.g. change timer.s(5) to timer.s(30)):

global total_bytes, disk_bytes, counter, overall_cache_bytes, overall_disk_bytes

# every read that goes through the VFS layer (cache hits and misses alike)
probe vfs.read {
        if (bytes_to_read > 0 && devname != "N/A") {
                total_bytes <<< bytes_to_read
        }
}

# reads that actually reach the block layer (i.e. cache misses);
# device-mapper (dm-*) devices are skipped to avoid double-counting LVM I/O
probe ioblock.request {
        mydevname = substr(devname,0,3)
        if (rw == 0 && size > 0 && devname != "N/A" && mydevname != "dm-") {
                disk_bytes <<< size
        #       printf("%10d %s %s\n", size, mydevname, devname)
        }
}

probe begin {
        printf("Starting...\n")
        # seed the overall aggregates so that "probe end" never sees them empty
        overall_cache_bytes <<< 0
        overall_disk_bytes <<< 0
}

# print VFS hits and misses every 5 seconds, plus the hit and miss rates in %
probe timer.s(5) {
        if (counter%15 == 0) {
                printf ("\n%18s %18s %18s %10s %10s\n", 
                        "Total Reads (KB)", "Cache Reads (KB)", "Disk Reads (KB)", "Miss Rate", "Hit Rate")
        }
        counter++

        total_bytes <<< 0 # seeding in case no reads came in
        disk_bytes <<< 0 # seeding in case no reads came in
        sum_disk_bytes = @sum(disk_bytes)
        sum_total_bytes = @sum(total_bytes)
        cache_bytes = sum_total_bytes - sum_disk_bytes
        if (cache_bytes < 0)
                cache_bytes = 0
        if (cache_bytes+sum_disk_bytes > 0) {
                hitrate =  10000 * cache_bytes / (cache_bytes+sum_disk_bytes)
                missrate = 10000 * sum_disk_bytes / (cache_bytes+sum_disk_bytes)
        } else {
                hitrate = 0
                missrate = 0
        }
        printf ("%18d %18d %18d %6d.%02d%% %6d.%02d%%\n",
                sum_total_bytes/1024, cache_bytes/1024, sum_disk_bytes/1024,
                missrate/100, missrate%100, hitrate/100, hitrate%100)
        overall_cache_bytes <<< cache_bytes
        overall_disk_bytes <<< sum_disk_bytes
        delete total_bytes
        delete disk_bytes
}

probe end {
        overall_total = @sum(overall_cache_bytes) + @sum(overall_disk_bytes)
        if (overall_total > 0) {
                avg_hitrate =  10000 * @sum(overall_cache_bytes) / overall_total
                avg_missrate = 10000 * @sum(overall_disk_bytes) / overall_total
        } else {
                avg_hitrate = 0
                avg_missrate = 0
        }
        printf("\n%s: %d.%02d\n%s: %d.%02d\n",
                " Average Hit Rate", avg_hitrate/100, avg_hitrate%100,
                "Average Miss Rate", avg_missrate/100, avg_missrate%100)
}

Output

This is a sample of the script running while cat'ing a big file to /dev/null once, then immediately again. You can see how the first read is almost entirely disk reads spread over the first few intervals, while the second read is served entirely from cache (and completes in one update cycle rather than three).

# stap cache-hit-rate.stp 
Starting...

  Total Reads (KB)   Cache Reads (KB)    Disk Reads (KB)  Miss Rate   Hit Rate
            167041                  0             167452    100.00%      0.00%
            395648                  0             396036    100.00%      0.00%
            228904                 36             228868     99.98%      0.01%
                 0                  0                  0      0.00%      0.00%
            791577             791577                  0      0.00%    100.00%
                 0                  0                  0      0.00%      0.00%

 Average Hit Rate: 49.97
Average Miss Rate: 50.02

Lessons

Do not trust the accuracy of existing scripts... especially this one, if you happened to read the original version, which was based on "vfs.read.return" and the "N/A" devname. That approach is a red herring. :)
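
If you want to see for yourself which devname values vfs.read actually reports on your box, a small throwaway script along these lines can help (just a sketch, separate from cache-hit-rate.stp; the variable names and the 30-second runtime are arbitrary choices of mine):

# tally VFS read requests per device name, just to see what shows up
global reads_by_dev

probe vfs.read {
        if (bytes_to_read > 0)
                reads_by_dev[devname] <<< bytes_to_read
}

# stop after 30 seconds and dump the totals
probe timer.s(30) { exit() }

probe end {
        foreach (dev in reads_by_dev) {
                printf("%-10s %10d reads %15d bytes\n",
                        dev, @count(reads_by_dev[dev]), @sum(reads_by_dev[dev]))
        }
}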

ioblock.request is pretty reliable, but a naive use of it can sometimes drastically over-report your disk I/O. In my case this was because of LVM: each read on an LVM volume triggers two ioblock.request probe hits, one for dm-X (hitting LVM) and another for sdXY (hitting the physical volume that hosts that particular block). I've taken a somewhat blunt approach and simply excluded all "dm-X" devices from the count. Assuming an I/O request can't hit only device mapper (and not an actual block device), this should be fine, although there's probably a better way to do it; one alternative is sketched below.
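
For example, a drop-in replacement for the ioblock.request probe above could count only the physical device you name on the stap command line, rather than excluding device-mapper (a sketch: it reuses the disk_bytes global from the main script, takes the device name as @1, and matches it as a prefix so that "stap cache-hit-rate.stp sda" also catches sda1, sda2, and so on):

# variant of the ioblock.request probe: count reads only for devices whose
# name starts with the prefix given as the first command-line argument
probe ioblock.request {
        if (rw == 0 && size > 0 && substr(devname, 0, strlen(@1)) == @1) {
                disk_bytes <<< size
        }
}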

There is still a small problem: ioblock.request sometimes records reads that vfs.read does not. You can see this by adding up the cache reads and disk reads columns and noticing that they don't always match the total reads column. Alternatively, comment out the check for "cache_bytes < 0" and you'll sometimes see small negative cache read values. If anyone can point out what causes this and provide a suitable fix, I'd be happy to improve the script. :)
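
One way to hunt for the source of the discrepancy (a sketch for investigation, not a fix; the per-process breakdown, variable names and 60-second cutoff are my choices) is to total read bytes from both layers per process and compare them, which should at least show whether the extra block-layer reads come from kernel threads or from the same processes that are doing the VFS reads:

# compare bytes read at the VFS layer vs. the block layer, per process
global vfs_by_task, blk_by_task

probe vfs.read {
        if (bytes_to_read > 0 && devname != "N/A")
                vfs_by_task[execname()] <<< bytes_to_read
}

probe ioblock.request {
        if (rw == 0 && size > 0 && devname != "N/A" && substr(devname,0,3) != "dm-")
                blk_by_task[execname()] <<< size
}

# stop after 60 seconds and print the comparison
probe timer.s(60) { exit() }

probe end {
        printf("%-20s %15s %15s\n", "process", "vfs bytes", "block bytes")
        foreach (task in blk_by_task) {
                if (task in vfs_by_task)
                        vfs_bytes = @sum(vfs_by_task[task])
                else
                        vfs_bytes = 0
                printf("%-20s %15d %15d\n", task, vfs_bytes, @sum(blk_by_task[task]))
        }
}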

Another hurdle to overcome was the lack of floating-point arithmetic in the SystemTap language. I stole the workaround from another SystemTap example I found relating to measuring user and system CPU time.
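
The trick is plain integer arithmetic: scale the ratio by 10000, then print the quotient and remainder of a division by 100 as the integer and fractional parts of a percentage. A minimal standalone illustration (the input numbers are made up):

# print 12345 bytes out of 67890 as a percentage with two decimal places,
# using only integer arithmetic
probe begin {
        part = 12345
        whole = 67890
        rate = 10000 * part / whole       # the percentage scaled by 100: 1818 means 18.18%
        printf("%d.%02d%%\n", rate/100, rate%100)
        exit()
}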
