WSUDPfragmentLoss - Systemtap Wiki

UDP Fragment Loss

Problem

The following war story comes to us from Kevin Rudd at IBM.

Kevin's customer reported intermittent throughput drops in his UDP based client server application. With further investigation he correlated the drops outs with an intermittent loss of UDP fragments somewhere in the linux network stack. Keven was able to reproduce the symptoms in his lab and used systemtap to help narrow down the cause of the problem.

Scripts

The following is the tapscript used for debugging:

#include <linux/netdevice.h>

#include <linux/skbuff.h>

#include <linux/ip.h>

function isfrag:long (skbuff:long) %{

unsigned long ptr = (unsigned long)THIS->skbuff;
struct sk_buff * skbuff = (struct sk_buff *)ptr;
if (skbuff->nh.iph->frag_off & htons(0x3FFF)) {
- THIS->retvalue = 1;
} else {
- THIS->retvalue = 0;
} return;

fragment flag

global frag

dump info when a fragment queue expires

probe kernel.function("ip_expire") {

printf ("\n%s:\n", probefunc())
printf ("qp->user = %d\n", $qp->user)
printf ("qp->saddr = %p\n", $qp->saddr)
printf ("qp->daddr = %p\n", $qp->daddr)
printf ("qp->id = 0x%x\n", $qp->id)
printf ("qp->protocol = %d\n", $qp->protocol)
printf ("qp->last_in = %d\n", $qp->last_in)
printf ("qp->len = %d\n", $qp->len)
printf ("qp->meat = %d\n", $qp->meat)
printf ("qp->iif = %d\n", $qp->iif)
}

probe kernel.function("ip_fragment") {

printf ("\n%s: skb==%p, len==%d\n", probefunc(), $skb, $skb->len)

}

probe kernel.function("ip_finish_output") {

if (isfrag($skb)) {
- frag = 1
  dev_name = kernel_string($skb->dst->dev->name)
- printf (" %s: skb==%p, len==%04d, dev==%s\n", probefunc(), $skb, $skb->len, dev_name)
}

}

probe kernel.function("ip_finish_output").return {

delete frag

}

probe kernel.function("neigh_resolve_output") {

if (frag) {
- printf (" %s: skb==%p\n", probefunc(), $skb)
}

}

probe kernel.function("neigh_event_send") {

if (frag) {
- printf (" %s: skb==%p, nud_state==%d, ", probefunc(), $skb, $neigh->nud_state)
}

}

probe kernel.function("neigh_event_send").return {

if (frag) {
- printf ("ret==%d\n", $return)
}

}

Lessons

In my lab testing, I noticed an intermittent loss of fragments when I was doing simple 8k pings with an MTU of 1500 between RHEL4 systems. I decided to write a system tap script to help trace the handling of fragmented messages. What I found was that the fragments were being silently dropped by the sending system. The systemtap output helped to show that it was the neigh_event_send() function that was dropping packets during ARP negotiation if the number on the queue exceeded queue_len:

if (neigh->nud_state == NUD_INCOMPLETE) {
- if (skb) {
  - if (skb_queue_len(&neigh->arp_queue) >=
  - neigh->parms->queue_len) {
  - struct sk_buff *buff;
    - buff = neigh->arp_queue.next;
    - skb_unlink(buff, &neigh->arp_queue);
    - kfree_skb(buff);
    }
    skb_queue_tail(&neigh->arp_queue, skb);
  } rc = 1;
}

The default for queue_len is 3, so only 3 fragments would be queued if the system ARP entry had timed out. The rest would be dropped. I recommended that the customer add the following to their /etc/sysctl.conf file in addition to the other MTU and ipfrag_high_thresh recommendations already made:

# ARP changes to deal with large fragmented messages net.ipv4.neigh.default.unres_qlen = 24

Here is the associated output along with tcpdump data taken at the time of the testing. Running ping from blade3 without a valid neighbor (ARP) entry for blade8: ip_fragment: skb==0xf7048680, len==8028

ip_finish_output: skb==0xf7048680, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf7048680
  - neigh_event_send: skb==0xf7048680, nud_state==0, ret==1
ip_finish_output: skb==0xf6b29d80, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf6b29d80
  - neigh_event_send: skb==0xf6b29d80, nud_state==1, ret==1
ip_finish_output: skb==0xf70e0e80, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf70e0e80
  - neigh_event_send: skb==0xf70e0e80, nud_state==1, ret==1
ip_finish_output: skb==0xf70e0780, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf70e0780
  - neigh_event_send: skb==0xf70e0780, nud_state==1, ret==1
ip_finish_output: skb==0xf6911780, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf6911780
  - neigh_event_send: skb==0xf6911780, nud_state==1, ret==1
ip_finish_output: skb==0xf7048580, len==0628, dev==eth1
- neigh_resolve_output: skb==0xf7048580
  - neigh_event_send: skb==0xf7048580, nud_state==1, ret==1

(note the nud_state of 0 above)

The following is the tcpdump trace that shows only the last 3 of the above fragments being transmitted.
14:33:14 IP (id 32749, offset 4440, length: 1500) blade3 > blade8: icmp
- 14:33:14 IP (id 32749, offset 5920, length: 1500) blade3 > blade8: icmp 14:33:14 IP (id 32749, offset 7400, length: 628) blade3 > blade8: icmp
The next test was right after the ARP entry was created: ip_fragment: skb==0xf721ba80, len==8028
ip_finish_output: skb==0xf721ba80, len==1500, dev==eth1
ip_finish_output: skb==0xf6b42680, len==1500, dev==eth1
ip_finish_output: skb==0xf721b880, len==1500, dev==eth1 i
ip_finish_output: skb==0xf6b42e80, len==1500, dev==eth1
ip_finish_output: skb==0xf70e0780, len==1500, dev==eth1
ip_finish_output: skb==0xf721b280, len==0628, dev==eth1
- 14:33:33 IP (id 32750, offset 0, length: 1500) blade3 > blade8: icmp 1480: echo request seq 0 14:33:33 IP (id 32750, offset 1480, length: 1500) blade3 > blade8: icmp 14:33:33 IP (id 32750, offset 2960, length: 1500) blade3 > blade8: icmp 14:33:33 IP (id 32750, offset 4440, length: 1500) blade3 > blade8: icmp 14:33:33 IP (id 32750, offset 5920, length: 1500) blade3 > blade8: icmp 14:33:33 IP (id 32750, offset 7400, length: 628) blade3 > blade8: icmp 14:33:33 IP (id 2126, offset 0, length: 1500) blade8 > blade3: icmp 1480: echo reply seq 0 14:33:33 IP (id 2126, offset 1480, length: 1500) blade8 > blade3: icmp 14:33:33 IP (id 2126, offset 2960, length: 1500) blade8 > blade3: icmp 14:33:33 IP (id 2126, offset 4440, length: 1500) blade8 > blade3: icmp 14:33:33 IP (id 2126, offset 5920, length: 1500) blade8 > blade3: icmp 14:33:33 IP (id 2126, offset 7400, length: 628) blade8 > blade3: icmp

The final test was done with a valid (although stale) entry on blade3, but an expired entry on blade8

(to force blade8 to drop some fragments in the response): ip_fragment: skb==0xf70e0e80, len==8028

ip_finish_output: skb==0xf70e0e80, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf70e0e80
  - neigh_event_send: skb==0xf70e0e80, nud_state==4, ret==0
ip_finish_output: skb==0xf6b18180, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf6b18180
ip_finish_output: skb==0xf6b29d80, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf6b29d80
ip_finish_output: skb==0xf7048680, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf7048680
ip_finish_output: skb==0xf2a7eb80, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf2a7eb80
ip_finish_output: skb==0xf6b42880, len==0628, dev==eth1
- neigh_resolve_output: skb==0xf6b42880

ip_expire:

qp->user = 0

qp->saddr = 0x94412f09

qp->daddr = 0x8f412f09

qp->id = 0x4f08

qp->protocol = 1

qp->last_in = 1 qp->len = 8008

qp->meat = 3568

qp->iif = 3

14:34:03 IP (id 32751, offset 0, length: 1500) blade3 > blade8: icmp 1480: echo request seq 0

14:34:03 IP (id 32751, offset 1480, length: 1500) blade3 > blade8: icmp

14:34:03 IP (id 32751, offset 2960, length: 1500) blade3 > blade8: icmp 1

4:34:03 IP (id 32751, offset 4440, length: 1500) blade3 > blade8: icmp

14:34:03 IP (id 32751, offset 5920, length: 1500) blade3 > blade8: icmp

14:34:03 IP (id 32751, offset 7400, length: 628) blade3 > blade8: icmp

14:34:03 IP (id 2127, offset 4440, length: 1500) blade8 > blade3: icmp

14:34:03 IP (id 2127, offset 5920, length: 1500) blade8 > blade3: icmp

14:34:03 IP (id 2127, offset 7400, length: 628) blade8 > blade3: icmp

WarStories