
[Linux kernel · IP fragment reassembly] IP Reassembly -- a fairly thorough set of annotations explaining IP fragment reassembly

Published 2019-07-13 20:46

IP Reassembly

The ip_local_deliver() function, defined in net/ipv4/ip_input.c, is called by ip_rcv_finish(). Its job is to reassemble IP fragments that are destined for this host and to call ip_local_deliver_finish() to deliver a complete packet to the transport layer.

```c
293 int ip_local_deliver(struct sk_buff *skb)
294 {
295     /*
296      *  Reassemble IP fragments.
297      */
```

The constants IP_MF and IP_OFFSET are defined in include/net/ip.h and are used to access the fragment-management field of the IP header.

```c
73 #define IP_MF      0x2000   /* Flag: "More Fragments" */
74 #define IP_OFFSET  0x1FFF   /* "Fragment Offset" part */
```

When an IP packet has the IP_MF flag set or a non-zero 13-bit fragment offset, ip_defrag() is called. The condition is an "or" because the last fragment of a fragmented packet does not have IP_MF set but does have a non-zero offset. If the packet to which the received fragment belongs is still incomplete, ip_defrag() returns NULL, and a return is made immediately. If this fragment completes the packet, a pointer to the reassembled packet is returned.

```c
299     if (skb->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) {
300         skb = ip_defrag(skb);
301         if (!skb)
302             return 0;
303     }
```

When the packet is not fragmented, or has been completely reassembled by ip_defrag(), it is forwarded to ip_local_deliver_finish() via the NF_HOOK() call specifying the NF_IP_LOCAL_IN chain.

```c
305     return NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb,
306                    skb->dev, NULL, ip_local_deliver_finish);
307 }
```

The remainder of this section is dedicated to the operation of ip_defrag(), which is responsible for the reassembly of fragmented packets and is defined in net/ipv4/ip_fragment.c. The key data structures are also defined in ip_fragment.c. Each packet that is being reassembled is described by a struct ipq.
```c
68 struct ipq {
69     struct ipq        *next;      /* linked list pointers      */
70     u32               saddr;      /* These fields comprise     */
71     u32               daddr;      /* the lookup key            */
72     u16               id;
73     u8                protocol;
74     u8                last_in;
75 #define COMPLETE   4
76 #define FIRST_IN   2
77 #define LAST_IN    1
78
79     struct sk_buff    *fragments; /* linked list of frags          */
80     int               len;        /* total length of original pkt  */
81     int               meat;       /* number of bytes so far        */
82     spinlock_t        lock;
83     atomic_t          refcnt;
84     struct timer_list timer;      /* when will queue expire?       */
85     struct ipq        **pprev;
86     int               iif;
87     struct timeval    stamp;
88 };
```

The functions of the structure elements include:

next: Links ipq structures in the same hash bucket.
len: Offset of the last data byte in the fragment queue; it equals the maximum value of fragment offset plus fragment length seen so far.
fragments: Points to the first element in a list of received fragments.
meat: Sum of the lengths of the fragments received so far. When the last fragment has been received and meat == len, reassembly has succeeded.
last_in: Flags field. COMPLETE: the fragment queue is complete. FIRST_IN: the first fragment (offset zero) is on the queue. LAST_IN: the last fragment is on the queue.
timer: A timer used for cleaning up an old, incomplete fragment queue.

The variable ip_frag_mem tracks the total amount of memory used for packet reassembly. It is a global defined in net/ipv4/ip_fragment.c and initialized to 0.

```c
/* Memory used for fragments */
130 atomic_t ip_frag_mem = ATOMIC_INIT(0);
```

The variable sysctl_ipfrag_high_thresh, which is mapped into the /proc file system, is declared and initialized in net/ipv4/ip_fragment.c.

```c
/* Fragment cache limits. We will commit 256K at one time. Should we
   cross that limit we will prune down to 192K. This should cope with
   even the most extreme cases without allowing an attacker to
   measurably harm machine performance. */
```
```c
51 int sysctl_ipfrag_high_thresh = 256*1024;
52 int sysctl_ipfrag_low_thresh = 192*1024;
```

The ip_defrag() function is passed a pointer to an sk_buff which is known at this point to contain one fragment of a fragmented IP packet.

```c
596 /* Process an incoming IP datagram fragment. */
597 struct sk_buff *ip_defrag(struct sk_buff *skb)
598 {
599     struct iphdr *iph = skb->nh.iph;
600     struct ipq *qp;
601     struct net_device *dev;
602
603     IP_INC_STATS_BH(IpReasmReqds);
```

Its first order of business is to determine whether there is a shortage of reassembly storage. When the value of ip_frag_mem exceeds the high threshold (sysctl_ipfrag_high_thresh), ip_evictor() is called so that some partially reassembled packets can be discarded.

```c
605     /* Start by cleaning up the memory. */
606     if (atomic_read(&ip_frag_mem) > sysctl_ipfrag_high_thresh)
607         ip_evictor();
```

If the fragment being processed is the first fragment of a new packet to arrive, a queue is created to manage its reassembly; otherwise, the fragment is enqueued on the existing queue. The ip_find() function is responsible for finding the queue to which a fragment belongs, or for creating a new queue when that is required. Its operation is considered later.

```c
609     dev = skb->dev;
610
611     /* Lookup (or create) queue header */
612     if ((qp = ip_find(iph)) != NULL) {
613         struct sk_buff *ret = NULL;
614
615         spin_lock(&qp->lock);
616
```

If the queue was found, the ip_frag_queue() function is used to add the sk_buff to the fragment queue.

```c
617         ip_frag_queue(qp, skb);
```

When both the first and last fragments have been received and the fragment queue (packet) becomes complete, ip_frag_reasm() is called to perform reassembly. (Could meat == len ever hold without both FIRST_IN and LAST_IN set?)

```c
619         if (qp->last_in == (FIRST_IN|LAST_IN) &&
620             qp->meat == qp->len)
621             ret = ip_frag_reasm(qp, dev);
622
```

With reassembly complete, the queue is no longer needed and is destroyed here.
```c
623         spin_unlock(&qp->lock);
624         ipq_put(qp);
625         return ret;
626     }
```

In case of any error, the fragment is discarded.

```c
628     IP_INC_STATS_BH(IpReasmFails);
629     kfree_skb(skb);
630     return NULL;
631 }
```

Finding the ipq that owns the arriving sk_buff

The ip_find() function is defined in net/ipv4/ip_fragment.c. The mapping of a fragment to a struct ipq is hash based.

```c
/* Find the correct entry in the "incomplete datagrams" queue for this
   IP datagram, and create new one, if nothing is found. */
345 static inline struct ipq *ip_find(struct iphdr *iph)
346 {
347     __u16 id = iph->id;
348     __u32 saddr = iph->saddr;
349     __u32 daddr = iph->daddr;
350     __u8 protocol = iph->protocol;
351     unsigned int hash = ipqhashfn(id, saddr, daddr, protocol);
```

The ipqhashfn() function, defined below, returns a hash value based on the identification number, source address, destination address, and protocol number of the fragment.

```c
/* Was: ((((id) >> 1) ^ (saddr) ^ (daddr) ^ (prot)) & (IPQ_HASHSZ - 1))

   I see, I see evil hand of bigendian mafia. On Intel all
   the packets hit one hash bucket with this hash function. 8)
 */
120 static __inline__ unsigned int ipqhashfn(u16 id, u32 saddr, u32 daddr, u8 prot)
121 {
122     unsigned int h = saddr ^ daddr;
123
124     h ^= (h>>16)^id;
125     h ^= (h>>8)^prot;
126     return h & (IPQ_HASHSZ - 1);
127 }
```

ipq_hash is a hash table with sixty-four buckets used to keep track of the various fragment queues, and ip_frag_nqueues is the total number of such queues in the table. The ipfrag_lock is a read/write lock that protects insertion and removal of ipq's.

```c
90 /* Hash table. */
91
92 #define IPQ_HASHSZ  64
93
94 /* Per-bucket lock is easy to add now. */
95 static struct ipq *ipq_hash[IPQ_HASHSZ];
96 static rwlock_t ipfrag_lock = RW_LOCK_UNLOCKED;
97 int ip_frag_nqueues = 0;
```

The ip_find() function continues by searching the chain indexed by hash for an ipq that matches the fragment's identification number, source address, destination address, and protocol number.
If one is found, a pointer to it is returned.

```c
352     struct ipq *qp;
353
354     read_lock(&ipfrag_lock);
355     for (qp = ipq_hash[hash]; qp; qp = qp->next) {
356         if (qp->id == id &&
357             qp->saddr == saddr &&
358             qp->daddr == daddr &&
359             qp->protocol == protocol) {
360             atomic_inc(&qp->refcnt);
361             read_unlock(&ipfrag_lock);
362             return qp;
363         }
364     }
365     read_unlock(&ipfrag_lock);
```

When the first fragment of a packet arrives, the search fails. In this case, ip_frag_create() is called to create a new fragment queue on which to enqueue the received fragment.

```c
367     return ip_frag_create(hash, iph);
368 }
```

Creating a new ipq element

The ip_frag_create() function, defined in net/ipv4/ip_fragment.c, creates a new ipq element and inserts it into the proper hash chain.

```c
310 /* Add an entry to the 'ipq' queue for a newly received IP datagram. */
311 static struct ipq *ip_frag_create(unsigned hash, struct iphdr *iph)
312 {
313     struct ipq *qp;
314
315     if ((qp = frag_alloc_queue()) == NULL)
316         goto out_nomem;
```

frag_alloc_queue() is an inline function, defined below. atomic_add() adds the size of the kmalloc'd struct ipq to the atomic_t variable ip_frag_mem. Recall that ip_frag_mem is the amount of memory used in keeping track of fragments. (Why wasn't the slab allocator used here? The rarity of fragmentation?)

```c
145 static __inline__ struct ipq *frag_alloc_queue(void)
146 {
147     struct ipq *qp = kmalloc(sizeof(struct ipq), GFP_ATOMIC);
148
149     if (!qp)
150         return NULL;
151     atomic_add(sizeof(struct ipq), &ip_frag_mem);
152     return qp;
153 }
```

On return to ip_frag_create(), the newly created queue is initialized.

```c
318     qp->protocol = iph->protocol;
319     qp->last_in = 0;
320     qp->id = iph->id;
321     qp->saddr = iph->saddr;
322     qp->daddr = iph->daddr;
323     qp->len = 0;
324     qp->meat = 0;
325     qp->fragments = NULL;
326     qp->iif = 0;
```

Continuing in ip_frag_create(), the data and function members of the timer for this queue are initialized. Note that expires is not set and the timer is not yet added.
```c
328     /* Initialize a timer for this entry. */
329     init_timer(&qp->timer);
330     qp->timer.data = (unsigned long) qp;  /* pointer to queue */
331     qp->timer.function = ip_expire;       /* expire function  */
332     qp->lock = SPIN_LOCK_UNLOCKED;
333     atomic_set(&qp->refcnt, 1);
```

ip_frag_intern() is called to add the newly created fragment queue to the hash table that manages all such queues.

```c
335     return ip_frag_intern(hash, qp);
```

On failure to allocate a fragment queue structure, NULL is returned.

```c
337 out_nomem:
338     NETDEBUG(if (net_ratelimit()) printk(KERN_ERR "ip_frag_create: no memory left !\n"));
339     return NULL;
340 }
```

Inserting the new ipq into the hash chain

The ip_frag_intern() function inserts the newly created ipq into the proper hash queue.

```c
270 /* Creation primitives. */
271
272 static struct ipq *ip_frag_intern(unsigned int hash, struct ipq *qp_in)
273 {
274     struct ipq *qp;
275
```

On an SMP kernel, to avoid a race in which another CPU creates an equivalent queue and adds it to the hash table first, a recheck is performed here. If such a queue was added by another CPU, a pointer to the existing ipq is returned and the newly created ipq is destroyed.

```c
276     write_lock(&ipfrag_lock);
277 #ifdef CONFIG_SMP
278     /* With SMP race we have to recheck hash table, because
           such entry could be created on other cpu, while we
           promoted read lock to write lock.
281      */
```