IP Reassembly
The ip_local_deliver() function, defined in net/ipv4/ip_input.c, is called by ip_rcv_finish(). Its
job is to reassemble IP fragments that are destined for this host and to call
ip_local_deliver_finish() to deliver a complete packet to the transport layer.
293 int ip_local_deliver(struct sk_buff *skb)
294 {
295 /*
296 * Reassemble IP fragments.
297 */
The constants IP_MF and IP_OFFSET are defined in include/net/ip.h and are used to access the
fragment management field of the IP header.
73 #define IP_MF 0x2000 /* Flag: "More Fragments" */
74 #define IP_OFFSET 0x1FFF /* "Fragment Offset" part */
When an IP packet has the IP_MF flag set or the 13-bit fragment offset is not 0, a call to the
ip_defrag() function is made. The reason for the or condition is that the last fragment of a
fragmented packet will not have IP_MF set but will have a non-zero offset. If the packet to which
the received fragment belongs is still incomplete, ip_defrag() returns NULL. In this case a return is
made immediately. If this fragment completes the packet, a pointer to the reassembled packet is
returned, and the packet is forwarded to ip_local_deliver_finish() via the NF_HOOK() call
specifying the NF_IP_LOCAL_IN chain.
299 if (skb->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) {
300 skb = ip_defrag(skb);
301 if (!skb)
302 return 0;
303 }
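The frag_off test above can be reproduced in user space. In this sketch the helper names are invented for illustration; only the constants and the htons() usage come from the kernel code. Note that the 13-bit offset field counts 8-byte units, so the byte offset is the field value times 8.

```c
#include <stdint.h>
#include <arpa/inet.h> /* htons, ntohs */

#define IP_MF     0x2000 /* Flag: "More Fragments" */
#define IP_OFFSET 0x1FFF /* "Fragment Offset" part */

/* Nonzero when the packet is (part of) a fragmented datagram:
 * either more fragments follow, or this fragment sits at a
 * non-zero offset (which covers the final fragment, whose
 * IP_MF bit is clear). frag_off_net is in network byte order,
 * exactly as it appears in the IP header. */
static int needs_reassembly(uint16_t frag_off_net)
{
    return (frag_off_net & htons(IP_MF | IP_OFFSET)) != 0;
}

/* The offset field counts 8-byte units. */
static unsigned int frag_byte_offset(uint16_t frag_off_net)
{
    return (ntohs(frag_off_net) & IP_OFFSET) * 8U;
}
```

For example, with 1480 payload bytes per fragment the second fragment carries an offset field of 185 (1480/8), so needs_reassembly() is true for it even when its IP_MF bit is clear.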
When the packet is not fragmented or was completely reassembled by ip_defrag(), a call to
ip_local_deliver_finish() is made to deliver it to the transport layer.
305 return NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb,
306 skb->dev, NULL, ip_local_deliver_finish);
307 }
The remainder of this section is dedicated to the operation of ip_defrag(), which is responsible for
the reassembly of fragmented packets and is defined in net/ipv4/ip_fragment.c. Key data
structures are also defined in ip_fragment.c.
Each packet that is being reassembled is defined by a struct ipq which is defined in
net/ipv4/ip_fragment.c.
68 struct ipq {
69 struct ipq *next; /* linked list pointers */
70 u32 saddr; /* These fields comprise */
71 u32 daddr; /* the lookup key */
72 u16 id;
73 u8 protocol;
74 u8 last_in;
75 #define COMPLETE 4
76 #define FIRST_IN 2
77 #define LAST_IN 1
78
79 struct sk_buff *fragments; /* linked list of frags*/
80 int len; /* total length of original pkt */
81 int meat; /* number of bytes so far. */
82 spinlock_t lock;
83 atomic_t refcnt;
84 struct timer_list timer; /* when will queue expire? */
85 struct ipq **pprev;
86 int iif;
87 struct timeval stamp;
88 };
Functions of structure elements include:
next: Used to link ipq structures in the same hash bucket.
len: Offset of last data byte in the fragment queue. It is equal to the maximum
value of fragment offset plus fragment length seen so far.
fragments: Points to first element in a list of received fragments.
meat: Sum of the lengths of the fragments that have been received so far. When the
last fragment has been received and meat == len, reassembly has succeeded.
last_in: Flags field.
COMPLETE: Fragments queue is complete.
FIRST_IN: First fragment (has offset zero) is on queue.
LAST_IN: Last fragment is on queue.
timer: A timer used for cleaning up an old incomplete fragments queue.
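The cooperation of len, meat and last_in is easiest to see in a toy model. The sketch below is not kernel code: note_fragment() is an invented stand-in for the accounting that ip_frag_queue() performs, and it assumes fragments never overlap (the real code trims overlaps before counting bytes into meat).

```c
#define FIRST_IN 2
#define LAST_IN  1

struct frag_state {
    int len;     /* highest end offset seen so far */
    int meat;    /* total bytes received so far */
    int last_in; /* FIRST_IN / LAST_IN flags */
};

/* Account for one fragment: 'offset' and 'bytes' give its
 * position and size, 'mf' is its More Fragments flag. */
static void note_fragment(struct frag_state *q, int offset,
                          int bytes, int mf)
{
    int end = offset + bytes;

    if (end > q->len)
        q->len = end;       /* len tracks the maximum end offset */
    q->meat += bytes;       /* meat accumulates received bytes */
    if (offset == 0)
        q->last_in |= FIRST_IN;
    if (!mf)
        q->last_in |= LAST_IN;
}

/* The completeness test applied by ip_defrag(). */
static int is_complete(const struct frag_state *q)
{
    return q->last_in == (FIRST_IN | LAST_IN) && q->meat == q->len;
}
```

With both end fragments present and no overlaps, meat == len guarantees the interior is gap-free: every received byte lies below len and none is counted twice.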
The variable, ip_frag_mem, is used to track the total amount of memory used for packet
reassembly. It is a global defined in net/ipv4/ip_fragment.c and initialized to 0.
/* Memory used for fragments */
130 atomic_t ip_frag_mem = ATOMIC_INIT(0);
The variable sysctl_ipfrag_high_thresh which is mapped in the /proc file system is declared and
initialized in net/ipv4/ip_fragment.c.
/* Fragment cache limits. We will commit 256K at one time.
Should we cross that limit we will prune down to 192K.
This should cope with even the most extreme cases
without allowing an attacker to measurably harm machine
performance.
*/
51 int sysctl_ipfrag_high_thresh = 256*1024;
52 int sysctl_ipfrag_low_thresh = 192*1024;
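The pair of thresholds implies a hysteresis policy: nothing happens until usage crosses the high watermark, and eviction then prunes all the way down to the low watermark so the expensive path does not run on every packet. A user-space sketch of that policy follows; drop_oldest_queue() is an invented stub standing in for the per-queue work that ip_evictor() actually does.

```c
static int high_thresh = 256 * 1024; /* sysctl_ipfrag_high_thresh */
static int low_thresh  = 192 * 1024; /* sysctl_ipfrag_low_thresh */
static int frag_mem;                 /* stands in for ip_frag_mem */

/* Invented stub: pretend each evicted queue frees 16 KiB. */
static int drop_oldest_queue(void)
{
    return frag_mem > 0 ? 16 * 1024 : 0;
}

/* Once over the high watermark, prune down to the low one. */
static void evict_if_needed(void)
{
    if (frag_mem <= high_thresh)
        return;
    while (frag_mem > low_thresh) {
        int freed = drop_oldest_queue();

        if (freed == 0)
            break; /* nothing left to evict */
        frag_mem -= freed;
    }
}
```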
The ip_defrag() function is passed a pointer to the sk_buff which is known here to contain an
element of a fragmented IP packet.
596 /* Process an incoming IP datagram fragment. */
597 struct sk_buff *ip_defrag(struct sk_buff *skb)
598 {
599 struct iphdr *iph = skb->nh.iph;
600 struct ipq *qp;
601 struct net_device *dev;
602
603 IP_INC_STATS_BH(IpReasmReqds);
Its first order of business is to determine whether there is a shortage of reassembly storage. When the
value of ip_frag_mem exceeds the high threshold (sysctl_ipfrag_high_thresh), a call is made to
the ip_evictor() function so that some partially reassembled packets can be discarded.
605 /* Start by cleaning up the memory. */
606 if (atomic_read(&ip_frag_mem) >
sysctl_ipfrag_high_thresh)
607 ip_evictor();
If the fragment being processed is the first fragment of a new packet to arrive, a queue is created to
manage its reassembly. Otherwise, the fragment is enqueued in the existing queue. The ip_find()
function is responsible for finding the queue to which a fragment belongs or creating a new queue if
that is required. Its operation will be considered later.
609 dev = skb->dev;
610
611 /* Lookup (or create) queue header */
612 if ((qp = ip_find(iph)) != NULL) {
613 struct sk_buff *ret = NULL;
614
615 spin_lock(&qp->lock);
616
If the queue was found or newly created, the ip_frag_queue() function is used to add the sk_buff to
the fragment queue.
617 ip_frag_queue(qp, skb);
When both the first and last fragments have been received and the fragment queue (packet) becomes
complete, ip_frag_reasm() is called to perform the reassembly. (How could meat == len without
FIRST_IN and LAST_IN set?)
619 if (qp->last_in == (FIRST_IN|LAST_IN) &&
620 qp->meat == qp->len)
621 ret = ip_frag_reasm(qp, dev);
622
Whether or not reassembly took place, the reference to the queue obtained from ip_find() is released here by ipq_put(), which destroys the queue when its reference count reaches zero.
623 spin_unlock(&qp->lock);
624 ipq_put(qp);
625 return ret;
626 }
If ip_find() fails to find or create a queue, the fragment is discarded and the failure is counted.
628 IP_INC_STATS_BH(IpReasmFails);
629 kfree_skb(skb);
630 return NULL;
631 }
Finding the ipq that owns the arriving sk_buff
The ip_find() function is defined in net/ipv4/ip_fragment.c. Mapping of a fragment to a struct ipq
is hash based.
/* Find the correct entry in the "incomplete
datagrams" queue for this IP datagram, and create
new one, if nothing is found.
*/
345 static inline struct ipq *ip_find(struct iphdr *iph)
346 {
347 __u16 id = iph->id;
348 __u32 saddr = iph->saddr;
349 __u32 daddr = iph->daddr;
350 __u8 protocol = iph->protocol;
351 unsigned int hash = ipqhashfn(id, saddr, daddr,
protocol);
The ipqhashfn() function is defined below. It returns a hash value based on the identification number, source
address, destination address and protocol number of the fragment.
/*
Was: ((((id) >> 1) ^ (saddr) ^ (daddr) ^ (prot)) &
(IPQ_HASHSZ - 1))
I see, I see evil hand of bigendian mafia. On Intel all
the packets hit one hash bucket with this hash function.
8)
*/
120 static __inline__ unsigned int ipqhashfn(u16 id, u32
saddr, u32 daddr, u8 prot)
121 {
122 unsigned int h = saddr ^ daddr;
123
124 h ^= (h>>16)^id;
125 h ^= (h>>8)^prot;
126 return h & (IPQ_HASHSZ - 1);
127 }
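The same function can be compiled and probed in user space; only the fixed-width types below differ from the kernel version shown above. The folding of the high halves into the low bits is what makes the bucket index depend on every byte of both addresses, which the old hash failed to do on little-endian machines.

```c
#include <stdint.h>

#define IPQ_HASHSZ 64

static unsigned int ipqhashfn(uint16_t id, uint32_t saddr,
                              uint32_t daddr, uint8_t prot)
{
    unsigned int h = saddr ^ daddr;

    h ^= (h >> 16) ^ id;  /* fold the top 16 bits, mix in id */
    h ^= (h >> 8) ^ prot; /* fold again, mix in the protocol */
    return h & (IPQ_HASHSZ - 1);
}
```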
ipq_hash is a hash table with sixty-four buckets used to keep track of the fragment queues.
ip_frag_nqueues denotes the total number of such queues in the hash table. The ipfrag_lock is
a read/write lock used to protect insertion and removal of ipq structures.
90 /* Hash table. */
91
92 #define IPQ_HASHSZ 64
93
94 /* Per−bucket lock is easy to add now. */
95 static struct ipq *ipq_hash[IPQ_HASHSZ];
96 static rwlock_t ipfrag_lock = RW_LOCK_UNLOCKED;
97 int ip_frag_nqueues = 0;
The ip_find() function continues by searching the chain indexed by hash for an ipq that matches the
fragment's identification number, source address, destination address and protocol number. If one
is found, its reference count is incremented and a pointer to it is returned.
352 struct ipq *qp;
353
354 read_lock(&ipfrag_lock);
355 for(qp = ipq_hash[hash]; qp; qp = qp->next) {
356 if(qp->id == id &&
357 qp->saddr == saddr &&
358 qp->daddr == daddr &&
359 qp->protocol == protocol) {
360 atomic_inc(&qp->refcnt);
361 read_unlock(&ipfrag_lock);
362 return qp;
363 }
364 }
365 read_unlock(&ipfrag_lock);
When the first fragment of a packet arrives, the search will fail. In this case, ip_frag_create() is
called to create a new fragment queue for enqueuing the received fragment.
367 return ip_frag_create(hash, iph);
368 }
Creating a new ipq element
The ip_frag_create() function, defined in net/ipv4/ip_fragment.c, creates a new ipq element and inserts it
into the proper hash chain.
310 /* Add an entry to the ’ipq’ queue for a newly
received IP datagram. */
311 static struct ipq *ip_frag_create(unsigned hash,
struct iphdr *iph)
312 {
313 struct ipq *qp;
314
315 if ((qp = frag_alloc_queue()) == NULL)
316 goto out_nomem;
frag_alloc_queue() is an inline function defined below. atomic_add() is used to add the size of the
kmalloc'd struct ipq to the atomic_t variable ip_frag_mem. Recall that ip_frag_mem denotes
the amount of memory used in keeping track of fragments. Why was the slab allocator not used
here?? Rareness of fragmentation??
145 static __inline__ struct ipq *frag_alloc_queue(void)
146 {
147 struct ipq *qp = kmalloc(sizeof(struct ipq),
GFP_ATOMIC);
148
149 if(!qp)
150 return NULL;
151 atomic_add(sizeof(struct ipq), &ip_frag_mem);
152 return qp;
153 }
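Every addition to ip_frag_mem made in frag_alloc_queue() must be matched by a subtraction when the queue is freed; otherwise the eviction watermarks would drift. A minimal user-space sketch of this paired accounting follows; the names mirror the kernel's frag_alloc_queue()/frag_free_queue() pair, but the bodies are simplified stand-ins.

```c
#include <stdlib.h>

struct ipq_sketch { int placeholder; }; /* stands in for struct ipq */

static int frag_mem; /* stands in for atomic_t ip_frag_mem */

static struct ipq_sketch *frag_alloc_queue(void)
{
    struct ipq_sketch *qp = malloc(sizeof(struct ipq_sketch));

    if (!qp)
        return NULL;
    frag_mem += sizeof(struct ipq_sketch); /* charge the accounting */
    return qp;
}

static void frag_free_queue(struct ipq_sketch *qp)
{
    frag_mem -= sizeof(struct ipq_sketch); /* matching credit */
    free(qp);
}
```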
On return to ip_frag_create the newly created queue is initialized.
318 qp->protocol = iph->protocol;
319 qp->last_in = 0;
320 qp->id = iph->id;
321 qp->saddr = iph->saddr;
322 qp->daddr = iph->daddr;
323 qp->len = 0;
324 qp->meat = 0;
325 qp->fragments = NULL;
326 qp->iif = 0;
Continuing in ip_frag_create the data and function members of the timer for this queue are
initialized. Note that expires is not set and the timer is not yet added.
328 /* Initialize a timer for this entry. */
329 init_timer(&qp->timer);
330 qp->timer.data = (unsigned long) qp; /* pointer to
queue */
331 qp->timer.function = ip_expire; /* expire
function */
332 qp->lock = SPIN_LOCK_UNLOCKED;
333 atomic_set(&qp->refcnt, 1);
ip_frag_intern() is called to add the newly created fragment queue to the hash table that
manages all such queues.
335 return ip_frag_intern(hash, qp);
On failure to allocate a fragment queue structure, NULL is returned.
337 out_nomem:
338 NETDEBUG(if (net_ratelimit()) printk(KERN_ERR
"ip_frag_create: no memory left!\n"));
339 return NULL;
340 }
Inserting the new ipq into the hash chain.
The ip_frag_intern() function inserts the newly created ipq in the proper hash queue.
270 /* Creation primitives. */
271
272 static struct ipq *ip_frag_intern(unsigned int hash,
struct ipq *qp_in)
273 {
274 struct ipq *qp;
275
On an SMP kernel, to avoid a race condition in which another CPU creates an equivalent queue and adds it
to the hash table, a recheck is performed here. If the queue was added by another CPU, a pointer to
the existing ipq is returned and the newly created ipq is destroyed.
276 write_lock(&ipfrag_lock);
277 #ifdef CONFIG_SMP
278 /* With SMP race we have to recheck hash table,
because such entry could be created on other
cpu, while we promoted read lock to write lock.
281 */