UDP Fragment Loss
Problem
The following war story comes to us from Kevin Rudd at IBM.
Kevin's customer reported intermittent throughput drops in his UDP based client server application. With further investigation he correlated the drops outs with an intermittent loss of UDP fragments somewhere in the linux network stack. Keven was able to reproduce the symptoms in his lab and used systemtap to help narrow down the cause of the problem.
Scripts
The following is the tapscript used for debugging:
%{
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
%}
function isfrag:long (skbuff:long) %{
unsigned long ptr = (unsigned long)THIS->skbuff;
- struct sk_buff * skbuff = (struct sk_buff *)ptr;
if (skbuff->nh.iph->frag_off & htons(0x3FFF)) {
THIS->retvalue = 1;
THIS->retvalue = 0;
%}
global frag
probe kernel.function("ip_expire") {
- printf ("\n%s:\n", probefunc())
printf ("qp->user = %d\n", $qp->user)
printf ("qp->saddr = %p\n", $qp->saddr)
printf ("qp->daddr = %p\n", $qp->daddr)
printf ("qp->id = 0x%x\n", $qp->id)
printf ("qp->protocol = %d\n", $qp->protocol)
printf ("qp->last_in = %d\n", $qp->last_in)
printf ("qp->len = %d\n", $qp->len)
printf ("qp->meat = %d\n", $qp->meat)
printf ("qp->iif = %d\n", $qp->iif)
- }
probe kernel.function("ip_fragment") {
printf ("\n%s: skb==%p, len==%d\n", probefunc(), $skb, $skb->len)
}
probe kernel.function("ip_finish_output") {
- if (isfrag($skb)) {
- frag = 1
dev_name = kernel_string($skb->dst->dev->name)
printf (" %s: skb==%p, len==%04d, dev==%s\n", probefunc(), $skb, $skb->len, dev_name)
- frag = 1
}
probe kernel.function("ip_finish_output").return {
- delete frag
}
probe kernel.function("neigh_resolve_output") {
- if (frag) {
- printf (" %s: skb==%p\n", probefunc(), $skb)
}
probe kernel.function("neigh_event_send") {
- if (frag) {
printf (" %s: skb==%p, nud_state==%d, ", probefunc(), $skb, $neigh->nud_state)
}
probe kernel.function("neigh_event_send").return {
- if (frag) {
- printf ("ret==%d\n", $return)
}
Lessons
In my lab testing, I noticed an intermittent loss of fragments when I was doing simple 8k pings with an MTU of 1500 between RHEL4 systems. I decided to write a system tap script to help trace the handling of fragmented messages. What I found was that the fragments were being silently dropped by the sending system. The systemtap output helped to show that it was the neigh_event_send() function that was dropping packets during ARP negotiation if the number on the queue exceeded queue_len:
if (neigh->nud_state == NUD_INCOMPLETE) {
- if (skb) {
if (skb_queue_len(&neigh->arp_queue) >=
neigh->parms->queue_len) {
- struct sk_buff *buff;
buff = neigh->arp_queue.next;
skb_unlink(buff, &neigh->arp_queue);
kfree_skb(buff);
skb_queue_tail(&neigh->arp_queue, skb);
- if (skb) {
The default for queue_len is 3, so only 3 fragments would be queued if the system ARP entry had timed out. The rest would be dropped. I recommended that the customer add the following to their /etc/sysctl.conf file in addition to the other MTU and ipfrag_high_thresh recommendations already made:
# ARP changes to deal with large fragmented messages net.ipv4.neigh.default.unres_qlen = 24
Here is the associated output along with tcpdump data taken at the time of the testing. Running ping from blade3 without a valid neighbor (ARP) entry for blade8: ip_fragment: skb==0xf7048680, len==8028
- ip_finish_output: skb==0xf7048680, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf7048680
neigh_event_send: skb==0xf7048680, nud_state==0, ret==1
- neigh_resolve_output: skb==0xf6b29d80
- neigh_event_send: skb==0xf6b29d80, nud_state==1, ret==1
- neigh_resolve_output: skb==0xf70e0e80
neigh_event_send: skb==0xf70e0e80, nud_state==1, ret==1
- neigh_resolve_output: skb==0xf70e0780
- neigh_event_send: skb==0xf70e0780, nud_state==1, ret==1
- neigh_resolve_output: skb==0xf6911780
- neigh_event_send: skb==0xf6911780, nud_state==1, ret==1
- neigh_resolve_output: skb==0xf7048580
- neigh_event_send: skb==0xf7048580, nud_state==1, ret==1
- neigh_resolve_output: skb==0xf7048680
(note the nud_state of 0 above)
- The following is the tcpdump trace that shows only the last 3 of the above fragments being transmitted.
14:33:14 IP (id 32749, offset 4440, length: 1500) blade3 > blade8: icmp
14:33:14 IP (id 32749, offset 5920, length: 1500) blade3 > blade8: icmp 14:33:14 IP (id 32749, offset 7400, length: 628) blade3 > blade8: icmp
- The next test was right after the ARP entry was created: ip_fragment: skb==0xf721ba80, len==8028
- ip_finish_output: skb==0xf721ba80, len==1500, dev==eth1
- ip_finish_output: skb==0xf6b42680, len==1500, dev==eth1
- ip_finish_output: skb==0xf721b880, len==1500, dev==eth1 i
- ip_finish_output: skb==0xf6b42e80, len==1500, dev==eth1
- ip_finish_output: skb==0xf70e0780, len==1500, dev==eth1
- ip_finish_output: skb==0xf721b280, len==0628, dev==eth1
14:33:33 IP (id 32750, offset 0, length: 1500) blade3 > blade8: icmp 1480: echo request seq 0 14:33:33 IP (id 32750, offset 1480, length: 1500) blade3 > blade8: icmp 14:33:33 IP (id 32750, offset 2960, length: 1500) blade3 > blade8: icmp 14:33:33 IP (id 32750, offset 4440, length: 1500) blade3 > blade8: icmp 14:33:33 IP (id 32750, offset 5920, length: 1500) blade3 > blade8: icmp 14:33:33 IP (id 32750, offset 7400, length: 628) blade3 > blade8: icmp 14:33:33 IP (id 2126, offset 0, length: 1500) blade8 > blade3: icmp 1480: echo reply seq 0 14:33:33 IP (id 2126, offset 1480, length: 1500) blade8 > blade3: icmp 14:33:33 IP (id 2126, offset 2960, length: 1500) blade8 > blade3: icmp 14:33:33 IP (id 2126, offset 4440, length: 1500) blade8 > blade3: icmp 14:33:33 IP (id 2126, offset 5920, length: 1500) blade8 > blade3: icmp 14:33:33 IP (id 2126, offset 7400, length: 628) blade8 > blade3: icmp
The final test was done with a valid (although stale) entry on blade3, but an expired entry on blade8
(to force blade8 to drop some fragments in the response): ip_fragment: skb==0xf70e0e80, len==8028
- ip_finish_output: skb==0xf70e0e80, len==1500, dev==eth1
- neigh_resolve_output: skb==0xf70e0e80
neigh_event_send: skb==0xf70e0e80, nud_state==4, ret==0
- neigh_resolve_output: skb==0xf6b18180
- neigh_resolve_output: skb==0xf6b29d80
- neigh_resolve_output: skb==0xf7048680
- neigh_resolve_output: skb==0xf2a7eb80
- neigh_resolve_output: skb==0xf6b42880
- neigh_resolve_output: skb==0xf70e0e80
ip_expire:
qp->user = 0
qp->saddr = 0x94412f09
qp->daddr = 0x8f412f09
qp->id = 0x4f08
qp->protocol = 1
qp->last_in = 1 qp->len = 8008
qp->meat = 3568
qp->iif = 3
14:34:03 IP (id 32751, offset 0, length: 1500) blade3 > blade8: icmp 1480: echo request seq 0
14:34:03 IP (id 32751, offset 1480, length: 1500) blade3 > blade8: icmp
14:34:03 IP (id 32751, offset 2960, length: 1500) blade3 > blade8: icmp 1
4:34:03 IP (id 32751, offset 4440, length: 1500) blade3 > blade8: icmp
14:34:03 IP (id 32751, offset 5920, length: 1500) blade3 > blade8: icmp
14:34:03 IP (id 32751, offset 7400, length: 628) blade3 > blade8: icmp
14:34:03 IP (id 2127, offset 4440, length: 1500) blade8 > blade3: icmp
14:34:03 IP (id 2127, offset 5920, length: 1500) blade8 > blade3: icmp
14:34:03 IP (id 2127, offset 7400, length: 628) blade8 > blade3: icmp