UDP Fragment Loss

Problem

The following war story comes to us from Kevin Rudd at IBM.

Kevin's customer reported intermittent throughput drops in a UDP-based client-server application. On further investigation, Kevin correlated the drops with an intermittent loss of UDP fragments somewhere in the Linux network stack. He was able to reproduce the symptoms in his lab and used SystemTap to help narrow down the cause of the problem.

Scripts

The following is the tapscript used for debugging:

%{

#include <linux/netdevice.h>

#include <linux/skbuff.h>

#include <linux/ip.h>

%}

function isfrag:long (skbuff:long) %{

%}

global frag

probe kernel.function("ip_expire") {

}

probe kernel.function("ip_fragment") {

}

probe kernel.function("ip_finish_output") {

}

probe kernel.function("ip_finish_output").return {

}

probe kernel.function("neigh_resolve_output") {

}

probe kernel.function("neigh_event_send") {

}

probe kernel.function("neigh_event_send").return {

}

Lessons

In my lab testing, I noticed an intermittent loss of fragments when doing simple 8k pings with an MTU of 1500 between RHEL4 systems. I decided to write a SystemTap script to help trace the handling of fragmented messages. What I found was that the fragments were being silently dropped by the sending system. The SystemTap output helped to show that it was the neigh_event_send() function that was dropping packets during ARP resolution when the number of packets on the queue exceeded queue_len.

The default for queue_len is 3, so only 3 fragments would be queued if the system ARP entry had timed out. The rest would be dropped. I recommended that the customer add the following to their /etc/sysctl.conf file in addition to the other MTU and ipfrag_high_thresh recommendations already made:

# ARP changes to deal with large fragmented messages
net.ipv4.neigh.default.unres_qlen = 24
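The effect of the queue limit can be illustrated with a small simulation. This is a sketch in Python, not kernel code: the names `fragment_count` and `queue_while_unresolved` are hypothetical, the 20-byte IP header assumes no IP options, and the drop-oldest policy is an assumption (it is consistent with the tcpdump capture later on this page, where only the last three fragments of the reply arrive).

```python
from collections import deque

IP_HEADER = 20  # bytes; assumes no IP options


def fragment_count(ip_payload_len, mtu=1500):
    """Number of IP fragments needed for ip_payload_len bytes of IP payload."""
    # Fragment data (except in the last fragment) must be a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return -(-ip_payload_len // per_fragment)  # ceiling division


def queue_while_unresolved(n_fragments, queue_len):
    """Return (kept, dropped) fragment indices while the neighbor is unresolved.

    Assumption: when the queue is full, the oldest queued entry is
    discarded to make room for the new one.
    """
    queue = deque()
    dropped = []
    for frag in range(n_fragments):
        if len(queue) >= queue_len:
            dropped.append(queue.popleft())
        queue.append(frag)
    return list(queue), dropped


# An 8000-byte ping carries 8008 bytes of IP payload (8-byte ICMP header).
n = fragment_count(8008)
kept_default, dropped_default = queue_while_unresolved(n, queue_len=3)
kept_tuned, dropped_tuned = queue_while_unresolved(n, queue_len=24)
print(n, kept_default, dropped_default, kept_tuned, dropped_tuned)
```

Under these assumptions, a queue of 24 entries at an MTU of 1500 holds every fragment of a datagram with up to 24 x 1480 = 35520 bytes of IP payload.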

Here is the associated output, along with tcpdump data taken at the time of the testing.

Running ping from blade3 without a valid neighbor (ARP) entry for blade8:

ip_fragment: skb==0xf7048680, len==8028

(note the nud_state of 0 above)

The final test was done with a valid (although stale) entry on blade3, but an expired entry on blade8 (to force blade8 to drop some fragments in the response):

ip_fragment: skb==0xf70e0e80, len==8028

ip_expire:

qp->user = 0

qp->saddr = 0x94412f09

qp->daddr = 0x8f412f09

qp->id = 0x4f08

qp->protocol = 1

qp->last_in = 1

qp->len = 8008

qp->meat = 3568

qp->iif = 3
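The raw ip_expire fields above can be decoded with a few lines of Python. This is only a sketch: it assumes the values were printed as host-order integers on a little-endian (x86) machine while the underlying fields are in network byte order, and the helper names are hypothetical.

```python
import socket
import struct


def decode_addr(value):
    """Dotted quad for a network-order address read as a little-endian integer."""
    return socket.inet_ntoa(struct.pack("<I", value))


def decode_id(value):
    """Host value of a network-order 16-bit field read as a little-endian integer."""
    return struct.unpack(">H", struct.pack("<H", value))[0]


saddr = decode_addr(0x94412f09)   # qp->saddr
daddr = decode_addr(0x8f412f09)   # qp->daddr
ip_id = decode_id(0x4f08)         # qp->id
protocol = {1: "ICMP", 6: "TCP", 17: "UDP"}[1]  # qp->protocol = 1
print(saddr, daddr, ip_id, protocol)
```

Decoded this way, the expired queue belongs to an ICMP datagram with IP id 2127, which is the id of the blade8 > blade3 reply fragments in the tcpdump capture that follows.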

14:34:03 IP (id 32751, offset 0, length: 1500) blade3 > blade8: icmp 1480: echo request seq 0

14:34:03 IP (id 32751, offset 1480, length: 1500) blade3 > blade8: icmp

14:34:03 IP (id 32751, offset 2960, length: 1500) blade3 > blade8: icmp

14:34:03 IP (id 32751, offset 4440, length: 1500) blade3 > blade8: icmp

14:34:03 IP (id 32751, offset 5920, length: 1500) blade3 > blade8: icmp

14:34:03 IP (id 32751, offset 7400, length: 628) blade3 > blade8: icmp

14:34:03 IP (id 2127, offset 4440, length: 1500) blade8 > blade3: icmp

14:34:03 IP (id 2127, offset 5920, length: 1500) blade8 > blade3: icmp

14:34:03 IP (id 2127, offset 7400, length: 628) blade8 > blade3: icmp
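The numbers in the capture can be cross-checked against the ip_expire output. The sketch below recomputes the fragment layout for an 8000-byte ping at an MTU of 1500 and adds up the reply fragments that actually arrived. The function name `fragment_layout` is hypothetical, and a 20-byte IP header without options is assumed.

```python
IP_HEADER = 20   # bytes; assumes no IP options
ICMP_HEADER = 8  # bytes


def fragment_layout(icmp_data_len, mtu=1500):
    """Return (offset, total_length) pairs in the form tcpdump reports them."""
    payload = icmp_data_len + ICMP_HEADER
    step = (mtu - IP_HEADER) // 8 * 8  # data bytes per full fragment
    layout = []
    for offset in range(0, payload, step):
        data = min(step, payload - offset)
        layout.append((offset, data + IP_HEADER))
    return layout


layout = fragment_layout(8000)

# Offsets of the blade8 > blade3 reply fragments seen in the capture.
received_offsets = {4440, 5920, 7400}

# Payload bytes that reached the reassembly queue, and the offsets that did not.
meat = sum(length - IP_HEADER for offset, length in layout
           if offset in received_offsets)
missing = [offset for offset, _ in layout if offset not in received_offsets]
print(layout)
print(meat, missing)
```

The computed layout matches the six request fragments in the capture, and the three reply fragments that arrived carry 1480 + 1480 + 608 = 3568 bytes, which agrees with qp->meat = 3568 out of qp->len = 8008. The first three fragments of the reply (offsets 0, 1480 and 2960) never left blade8.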

