Understanding Linux Network Internals (2005), Part 5 (PDF)

18.3.3. Record Route Option

The purpose of this option is to ask the routers along the way between source and destination to store the IP addresses of the outgoing interfaces they use to forward the packet. Because of limited space in the header, only nine addresses at most can be stored (and even fewer, if the header contains other options). Therefore, the packet arrives with the first nine [*] addresses stored in the option; the receiver has no way of knowing what routers were used after that. Since recording addresses would otherwise make the header (and therefore the IP packet) grow along the way, and since other options may be present in the header, the sender is supposed to reserve the space that will be used to store the addresses. If the reserved space becomes full before the packet gets to its destination, the additional addresses are not added to the list even if the maximum size of an IP header would permit it. No errors (ICMP messages) are generated when there is no room to store a new address. For obvious reasons, the sender is supposed to reserve an amount of space that is a multiple of 4 bytes (the size of an IP address). [†]

[*] (40-3)/4=9 (rounded down), where 40 is the maximum size of the IP options, 3 is the size of the option header, and 4 is the size of an IPv4 address.

[†] The value of length is not an exact multiple of 4 because the option header (type, length, and pointer) is 3 bytes long. This means that the 32-bit IP addresses are inconveniently split across 32-bit word boundaries.

Figure 18-7 shows how the IP header portion dedicated to the option changes hop by hop. As each router fills in its address, it also updates the pointer field to indicate the end of the data in the option. The offsets at the bottom of the figure start from 1 so that you can compare them to the value of the pointer field.
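
As a rough illustration of the Record Route bookkeeping, the following C sketch stores one address in the option and advances the pointer, or silently does nothing when the reserved space is full. The names rr_record and rr_max_addrs and the two constants are hypothetical, chosen only for this example; this is not the kernel's code, and it assumes the option bytes were validated elsewhere.

```c
#include <stdint.h>
#include <string.h>

#define RR_HDR_LEN 3   /* type, length, pointer */
#define RR_MIN_PTR 4   /* first address slot, counting from 1 */

/*
 * Try to record one 4-byte address in a Record Route option.
 * opt points at the option's type byte; opt[1] is the length the sender
 * reserved and opt[2] is the 1-based pointer to the next free slot.
 * Returns 1 if the address was stored, 0 if there was no room.
 * No error is raised in the second case, as the text describes.
 */
static int rr_record(uint8_t *opt, const uint8_t addr[4])
{
    uint8_t len = opt[1];
    uint8_t ptr = opt[2];

    if (ptr < RR_MIN_PTR || len < RR_HDR_LEN)
        return 0;                    /* sketch only: ignore odd input */
    if (ptr + 3 > len)
        return 0;                    /* fewer than 4 free bytes left */

    memcpy(&opt[ptr - 1], addr, 4);  /* pointer counts from 1 */
    opt[2] = (uint8_t)(ptr + 4);     /* now marks the end of the data */
    return 1;
}

/* Number of whole addresses that fit in option_space bytes of options. */
static int rr_max_addrs(int option_space)
{
    return (option_space - RR_HDR_LEN) / 4;
}
```

With an 11-byte option (room for two addresses), two calls succeed and move the pointer from 4 to 8 to 12; a third call finds no room and leaves the option untouched.
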
Figure 18-7. Example of Record Route option

18.3.4. Timestamp Option

This option is the most complicated one because it contains suboptions and, unlike the Record Route option, it handles overflows. To manage those two additional concepts, it needs an additional byte in its header, as shown in Figure 18-8.

Figure 18-8. IP Timestamp option header

The first three bytes have the same meaning as in the other options: type, length, and pointer. The fourth byte is actually split into two fields of four bits each. The rightmost four bits (the least significant ones) represent a subcommand code that can change the effect of the option. Its possible values are:

RECORD TIMESTAMPS
Each router records the time at which it received the packet.

RECORD ADDRESSES AND TIMESTAMPS
Similar to the previous subcommand, but the IP address of the receiving interface is saved, too.

RECORD TIMESTAMPS ONLY AT THE PRESPECIFIED SYSTEMS
Each router records the time at which it received the packet (as with RECORD TIMESTAMPS), but only at specific IP addresses selected by the sender.

In all three cases, the time is expressed in milliseconds (in a 32-bit variable) since midnight UTC of the current day. [*]

[*] UTC stands for Coordinated Universal Time; it is commonly treated as equivalent to GMT (Greenwich Mean Time).

The other four bits represent what is called the overflow field. Because the TIMESTAMP option is used to record information along the route, and because the space available in the IP header for that purpose is limited to 40 bytes, there can be cases where a router is unable to record information for lack of space.
While the Record Route option processing simply ignores that case, leaving the receiver ignorant of how many times it happened, the TIMESTAMP option increments the overflow field every time it happens. Unfortunately, overflow is a 4-bit field and therefore can have a maximum value of 15: in modern networks, it may itself easily overflow. When that happens, the router that experiences the overflow has to return an ICMP Parameter Problem message to the original sender.

While the first two suboptions are similar (they differ only in what to save on each hop), the third suboption is slightly different and deserves a few more words. The packet's original sender lists the IP addresses in which it is interested, following each with four bytes of space. At each hop, the option's pointer field indicates the offset of the next 4-byte space. Each router that appears in the address list fills in the appropriate space with a timestamp and updates the pointer field. See Figure 18-9. The underlined hosts in the sequence at the top of the figure are the hosts that add the timestamps. The offsets at the bottom of the figure start from 1 so that you can compare them to the value of the pointer field.

Figure 18-9. Example of storing the Timestamp option for pre-specified systems

18.3.5. Router Alert Option

This option was added to the IP protocol definition in 1997 and is described in RFC 2113. It marks packets that require special handling beyond simply looking at the destination address and forwarding the packet. For instance, the Resource Reservation Protocol (RSVP), which attempts to create better QoS for a stream of packets, uses this option to tell routers that they must treat the packets in that stream in a special way. Right now, the last two bytes of the option have only one assigned value, zero. This simply means that the router should examine the packet. Packets carrying other values are illegal and should be discarded, generating an ICMP error message to the source that generated them.

18.4. Packet Fragmentation/Defragmentation

Packet fragmentation and defragmentation is one of the main jobs of the IP protocol. The IP protocol limits the size of a packet to 65,535 bytes (just under 64 KB), which comes from the fact that the len field of the header, which represents the size of the packet in bytes, is a 16-bit value. However, not many interface types can send packets of a size up to 64 KB. This means that when the IP layer needs to transmit a packet whose size is bigger than the MTU of the egress interface, it needs to split the packet into smaller pieces. We will see later in this chapter that the MTU used is not necessarily the one associated with the egress device; it could be, for instance, the one associated with the routing table entry used to route the packet. The latter would depend on several factors, one of which is the egress device's MTU.

Regardless of how the MTU is computed, the fragmentation process creates a series of equal-size fragments, as shown in Figure 18-10. The MF and OFFSET fields shown in the picture are described later in this section. If the fragment size does not divide the original size of the packet exactly, the final fragment is smaller than the others.

Figure 18-10. IP packet fragmentation
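
The splitting into equal-size fragments with a shorter last one can be sketched in C as follows. This is a simplified model rather than the kernel's fragmentation code: the struct frag type and the plan_fragments function are hypothetical names, the header is assumed to have the same length in every fragment, and the payload length is assumed to be nonzero. It relies on one fact the section introduces later under the Fragment Offset field, namely that offsets are expressed in 8-byte units, so every fragment but the last carries a multiple of 8 bytes.

```c
#include <stddef.h>

struct frag {
    unsigned offset_units;   /* OFFSET field value, in 8-byte units */
    unsigned payload_len;    /* bytes of the original payload carried */
    int more_fragments;      /* MF: 1 on every fragment except the last */
};

/*
 * Plan how payload_len bytes are split so that each fragment, including
 * an IP header of hdr_len bytes, fits in mtu. Fills at most max entries
 * of out and returns the number of fragments needed. Returns 0 when the
 * MTU cannot hold a header plus at least 8 bytes of payload.
 */
static size_t plan_fragments(unsigned payload_len, unsigned mtu,
                             unsigned hdr_len, struct frag *out, size_t max)
{
    unsigned per_frag, off = 0;
    size_t n = 0;

    if (mtu < hdr_len + 8)
        return 0;
    per_frag = (mtu - hdr_len) & ~7u;    /* round down to a multiple of 8 */

    while (off < payload_len) {
        unsigned left = payload_len - off;
        unsigned take = left > per_frag ? per_frag : left;

        if (n < max) {
            out[n].offset_units = off / 8;
            out[n].payload_len = take;
            out[n].more_fragments = (off + take < payload_len);
        }
        off += take;
        n++;
    }
    return n;
}
```

For a 4,000-byte payload, a 1,500-byte MTU, and a 20-byte header, this yields three fragments carrying 1,480, 1,480, and 1,040 bytes at offsets 0, 185, and 370 (in 8-byte units), with MF set on the first two only.
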

A fragmented IP packet is normally defragmented by the destination host, but intermediate devices that need to look at the entire IP packet may have to defragment it, too. Two examples of such devices are firewalls and Network Address Translation (NAT) routers.

Some time ago, it was an acceptable solution for the receiver to allocate a buffer the size of the original IP packet and put fragments there as they arrived. In fact, the receiver might just allocate a buffer of the maximum possible size, because the size of the original IP packet was known only after receiving the last fragment. That simple approach is now avoided because it wastes memory, and a malicious attack could bring a router to its knees just by sending a burst of very small fragments that lie about their original size.

Because every IP packet can be fragmented, and because each fragment can be further fragmented along the path for the same reason, there must be a way for the receiver to understand which IP packet each fragment belongs to, and at what position inside the original IP packet each fragment should be placed. The receiver must also be told the original size of the IP packet to know when it has received all of the fragments.

Several other aspects have to be considered to accomplish fragmentation. When copying the IP header of the original packet into its fragments, the kernel does not copy all of the options, but only those with the copied field set, as described earlier in the section "IP Options." However, when the IP fragments are merged, the resulting IP packet will look like the original one and therefore include all the options again.
Moreover, the IP checksum covers only the IP header (the payload is usually covered by the higher-layer protocols). When fragments are created, the headers are all different, so a checksum has to be computed for each one of them, and checked on the receiving side.

18.4.1. Effect of Fragmentation on Higher Layers

Fragmenting and defragmenting a packet takes both CPU time and memory. For a heavily loaded server, the extra resources involved may be quite significant. Fragmentation also introduces overhead in the bandwidth used for transmission, because each fragment has to contain both the L2 and L3 headers. If the size of the fragments is small, that overhead can be significant.

Higher layers are theoretically unaware of when the L3 layer chooses to fragment a packet. [*]

[*] The section "The ip_append_data Function" in Chapter 21 shows how the interface between L3 and L4 has evolved to optimize the fragmentation task for locally generated packets.

However, even if TCP and UDP are unaware of the fragmentation/defragmentation processes, [†] the applications built on top of those two protocols are not. Some have to worry about fragmentation for performance reasons.

[†] As we will see in the section "Putting Together the Transmission Functions" in Chapter 21, L4 protocols actually provide some options that can influence fragmentation.

Fragmentation/defragmentation is theoretically a transparent process, but it can have negative effects on performance because it always adds extra delay. A typical application that is very sensitive to delays, and that therefore tries to avoid fragmentation as much as possible, is a videoconferencing system. If you have ever tried one, or even if you have ever had an international phone call, you know what it means to have too long a delay: conversing becomes very difficult. Some sources of delay cannot be avoided (such as network congestion, in the absence of robust QoS), but if something can be done to reduce that delay, the applications will take extraordinary steps to do it.

Many applications are smart enough to try to avoid fragmentation by taking a few factors into consideration:

The kernel, first of all, does not have to simply use the MTU of the egress interface, but can also use a feature called path MTU discovery to discover the largest packet size it can use while avoiding fragmentation along a particular path (see the section "Path MTU Discovery").

The MTU can be set to a fairly safe, small value of 576. This reflects the specification in RFC 791 that each host must be prepared to accept packets of up to 576 octets. This restriction on packet size thus drastically reduces the likelihood of fragmentation. Many applications end up using that MTU by default, if not explicitly configured to use a different value.

When a sender decides to use a packet size smaller than its available MTU just to avoid fragmentation, it incurs the same overhead of extra headers that fragmentation requires. However, avoiding fragmentation by routers along the way reduces processing considerably along the route and therefore can be critical for improving response time.

18.4.2. IP Header Fields Used by Fragmentation/Defragmentation

Here are the fields of the IP header that are used to handle the fragmentation/defragmentation process. We will see how they are used in Chapter 22.

DF (Don't Fragment)
There are cases where fragmentation may be bad for the upper layers. For instance, interactive, streaming multimedia can produce terrible performance if it is fragmented.
And sometimes, the transmitter knows that the receiver has a simple, lightweight IP protocol implementation and therefore cannot handle defragmentation. For such purposes, a field is provided in the IP packet header to say whether fragmentation is allowed. If fragmentation is not allowed and the packet exceeds the MTU of some link along the path, it is dropped. The section "Path MTU Discovery" shows a use for this flag associated with path MTU discovery.

MF (More Fragments)
When a node fragments a packet, it sets this flag to TRUE in each fragment except the last. The recipient knows the size of the original, unfragmented packet when it receives the last fragment created from this packet, even if some fragments have not been received yet.

Fragment Offset
This represents the offset within the original IP packet at which to place the fragment. It is a 13-bit field. Since len is a 16-bit field and the offset has only 13 bits, fragments always have to be created on 8-byte boundaries and the value of this field is read as a multiple of 8 bytes (that is, shifted left 3 bits). An offset of 0 indicates that this fragment is the first within the packet; that information is important because the first fragment contains header information related to the entire original packet.

ID
IP packet ID, which is the same for all fragments of an IP packet. It is thanks to this parameter that the receiver knows what fragments should be rejoined. We will see how the value of this field is chosen in the section "Long-Living IP Peer Information" in Chapter 23. Linux stores the last ID used in a structure named inet_peer, where it keeps information about the remote hosts with which it is communicating.

18.4.3. Examples of Problems with Fragmentation/Defragmentation

Fragmentation is a pretty simple process: the node simply has to choose a fragment size that fits the MTU. It should not come as a surprise that most of the issues have to do with defragmentation.
In the next two sections, we cover two of the most common issues: handling retransmissions and reassembling packets properly, along with the special problem of Network Address Translation (NAT). Another reason not to use fragmentation is that it is incompatible with congestion control algorithms.

18.4.3.1. Retransmissions

I said earlier that an IP packet cannot be delivered to the next-higher layer until it has been completely defragmented. However, this does not mean that fragments are kept in the host's memory indefinitely. Otherwise, it would be very easy to render a host unusable through a simple Denial of Service (DoS) attack. A fragment might not be received for several reasons: for instance, it might be dropped along the way by a router that has run out of memory to store it due to congestion, it might become corrupted and be discarded after failing the CRC (error check), or it could be held up by a firewall because the firewall wants to view the header in the first fragment before forwarding any fragments. Therefore, each router and host has a timer that cleans up the resources used by the fragments of an IP packet if some fragments are not received within a given amount of time.

If a sender could tell that a fragment was lost or dropped along the path, it would be nice if the sender could retransmit just the missing fragment. This is completely unfeasible to implement, though. A sender cannot know even whether its packet was fragmented by a router later on in the path, much less what the fragments are. So each sender must simply wait for a higher layer to tell it to resend an entire packet. A retransmitted packet does not reuse the same ID as the original.
However, it is still possible for a host to receive copies of the same IP fragment with the same packet ID, so a host must be able to handle this situation. Note that the same fragment may be received multiple times even without retransmissions: a common example is when there's a loop at the L2 layer. We saw this case in Part IV. This waste provides another good reason to avoid fragmentation at the source and to try to use packet sizes that minimize the likelihood of fragmentation along the way if delays are bad for the application (e.g., in videoconferencing software).

Since the kernel cannot swap its data out to disk (it swaps only user-space data), the memory waste due to handling fragments has a heavy impact on router performance. Linux puts a limit on the amount of memory usable by fragments, as described in the section "Tuning via /proc Filesystem" in Chapter 23.

Since IP is a connectionless protocol, there is no flow control and it is up to the upper-layer protocols (or the applications) to take care of losses. Some applications, of course, do not care much about the loss of data, and others do.

Let's suppose the upper layer detects the loss of some data by some means (for instance, with a timer that expires due to the lack of acknowledgment) and tries a retransmission. Since it is not possible to selectively resend only the missing fragments, the L4 protocol has to retransmit the entire IP packet. Each retransmission can lead to some special conditions that have to be handled by the receiver side (and sometimes by intermediate routers as well, when the latter implement some form of firewalling that requires packets to be defragmented). Here are some of them:

Overlapping
A fragment could contain some of the data that already arrived in a previously received fragment. Retransmitted packets have a different ID and therefore their fragments are not supposed to be mixed with the fragments of a previous transmission.
However, a buggy operating system that does not use a different ID for retransmitted packets, or the wraparound problem I'll introduce in the next section, can make overlapping possible.

Duplicates
This can be considered a special case of overlapping, where the two fragments are identical. A fragment is considered a duplicate if it starts at the same offset and it has the same length. There is no check on the actual payload content. Unless you are in the middle of a security attack, there is no reason why payload content should change between retransmissions of the same packet. The L2 loop mentioned previously can also be a source of duplicates.

Reception once reassembly is already complete
In this case, the IP layer considers the fragment the first of a new IP packet. If not all of the new fragments are received, the IP layer will simply clean up the duplicates during its garbage collection process; otherwise, it re-creates the whole packet and it is the job of the upper-layer protocol to recognize the packet as a duplicate.

Things can get more complicated if you consider that fragments can get fragmented, too.

18.4.3.2. Associating fragments with their IP packets

Because fragments could arrive out of order, defragmentation is a complex process that requires each fragment to be recognized and put in its proper place as it arrives. The insert, delete, and merge operations must be easy and quick.

To identify the IP packet a fragment belongs to, the kernel takes the following parameters into consideration:

Source and destination IP addresses
IP packet ID
L4 protocol

Unfortunately, it is possible for different packets to share all of these parameters. For instance, two different senders could happen to choose the same packet ID for packets that happen to arrive at the same time.
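
One way to picture the parameters used to match a fragment to its packet is as a lookup key. The C sketch below uses a hypothetical struct frag_key and helper frag_key_equal invented for this illustration; it is not the structure the kernel uses. The point it makes is only that two fragments whose keys compare equal are indistinguishable to the reassembly code, whatever hosts actually produced them.

```c
#include <stdint.h>

/* Hypothetical reassembly key built from the parameters listed above. */
struct frag_key {
    uint32_t saddr;      /* source IPv4 address */
    uint32_t daddr;      /* destination IPv4 address */
    uint16_t id;         /* IP packet ID (only 16 bits) */
    uint8_t  protocol;   /* L4 protocol number */
};

/* Returns 1 when both fragments would be queued for the same packet. */
static int frag_key_equal(const struct frag_key *a, const struct frag_key *b)
{
    return a->saddr == b->saddr &&
           a->daddr == b->daddr &&
           a->id == b->id &&
           a->protocol == b->protocol;
}
```

Nothing in the key records which physical host built the packet, so any mechanism that makes two senders present the same addresses and ID defeats the comparison.
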
One might suppose that the source IP addresses would distinguish the packets, but what if both hosts sat behind a NAT router that put its own IP address on the packets? There is no way the recipient IP layer can distinguish fragments under these conditions. You cannot count on the IP ID field either, because it is a 16-bit field and can therefore wrap around pretty quickly on a fast network.

Since the IP ID field plays a central role in the defragmentation process, let's see how IP fragments are organized in memory and how the IP IDs are generated.

The most obvious implementation of an IP ID generator would be one that increments a global counter and uses it as the ID each time the IP layer is asked to send a packet. This would ensure sequential IDs and easy implementation. This simple model, however, has some problems:

For all possible higher-layer protocols to share a global ID, some sort of locking mechanism would be required (especially in multiprocessor machines) to prevent race conditions. However, the use of such a lock would limit symmetric multiprocessing (SMP) scalability.

IDs would be predictable, which would lead to some well-known methods of attacking a machine.

The ID value could wrap around quickly and lead to duplicate IDs. Because the ID field is a 16-bit value, allowing a total of 65,536 unique numbers, nodes with high traffic and fast connections might find themselves reusing the same ID for a new packet before the old one has reached its destination. For instance, with an average packet size of 512 bytes, a gigabit interface would send 65,536 packets in well under half a second (about 0.27 seconds at full line rate). A highly loaded server could easily wrap around a global IP ID counter in less than 1 second!

Thus, we have to accept the likelihood that the IP layer occasionally mixes together data from completely different packets. When that happens, something is wrong, and only the higher layers can fix the problem, usually with error checking.
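
The wraparound estimate is simple arithmetic, sketched here in C with a hypothetical helper named id_wrap_seconds. It assumes the link is saturated with back-to-back packets of the stated size and ignores L2 framing overhead, so it gives a lower bound on the wrap time rather than a measurement.

```c
/*
 * Seconds needed to send ids packets of pkt_bytes bytes each
 * over a link that carries bits_per_sec bits per second.
 */
static double id_wrap_seconds(double ids, double pkt_bytes,
                              double bits_per_sec)
{
    return ids * pkt_bytes * 8.0 / bits_per_sec;
}
```

Calling id_wrap_seconds(65536.0, 512.0, 1e9) gives roughly 0.27 seconds: 65,536 packets of 512 bytes are about 268 million bits, which a gigabit link can carry comfortably inside the one-second bound quoted above.
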

The following section shows one way in which Linux reduces the likelihood of (but does not solve) the wraparound problem and ID prediction. The section "Selecting the IP Header's ID Field" in Chapter 23 shows the precise algorithm and code.

18.4.3.3. Example of IP ID generation

The wraparound problem is partially addressed by means of multiple, concurrent, global counters. Instead of a single global IP ID, the Linux kernel keeps a different one for each destination IP address (up to the maximum number of possible IP destinations). Note that by using multiple IP IDs, you make the IDs take a little longer to wrap around, but eventually they will do so anyway.

Figure 18-11 shows an example. Let's suppose we have traffic addressed to two servers with addresses IP1 and IP2. Let's suppose also that for each IP address we have different independent streams of traffic, such as HTTP, Telnet, and FTP. Because the IP IDs are shared by all the streams of traffic going to the same destination, the packets will have sequential IDs if you look at traffic to the destination as a whole, but the traffic of each application will not have sequential IDs. For instance, the IP packets to destination IP1 that are generated by a Telnet session are not sequential. Note that this is merely the solution chosen by Linux, and is not a standard. Other alternatives are available.

18.4.3.4. Example of unsolvable defragmentation problem: NAT

Despite all manner of cleverness at the IP layer, the rules of fragmentation lead to potential situations that the IP layer cannot solve. Figure 18-12 shows one of them. Let's suppose that R is a router doing NAT for all the hosts on its network. To be more precise, let's suppose R did masquerading: [*] the source IP addresses in the headers of the IP packets generated by the hosts in the internal network and addressed to the Internet are replaced with router R's IP address, 140.105.1.1. [†]

[*] What Linux calls masquerading is also commonly called Port Address Translation (PAT).

[†] Note that since the return traffic from the Internet and addressed to the hosts in the internal network will all have a destination IP address of 140.105.1.1, R uses the destination UDP/TCP port number to find the right internal host to...

...You might well be able to increase the PMTU and still not have fragmentation. A simple example is where two Ethernet LANs are connected by a router. On both sides of the network, the MTU is 1,500, but hosts of each LAN use the MTU of 576 to talk to the hosts of the other LAN because they are not directly connected. This is not optimal. If you increase the size of the packets in a probe to their optimal...

...using ip route to set the PMTU, it is possible to lock it with the lock keyword. The following example adds a route to the 10.10.1.0/24 network via the next hop gateway 100.100.100.1 and locks the PMTU to 750 bytes:

ip route add 10.10.1.0/24 via 100.100.100.1 mtu lock 750

If the PMTU you are supposed to use as a consequence of a received ICMP FRAGMENTATION NEEDED message is smaller than the minimum allowed...

...problem arises when the two IP packets transmitted by R get fragmented before arriving at server S. In this case, server S receives fragments with the same source and destination IP address (140.105.1.1, 151.41.21.194) and the same IP ID (1,000), and therefore tries to put them together and potentially mixes the fragments of two different IP packets. As a consequence of this, both of the packets will...

18.5. Checksums

A checksum is a redundant field used by network protocols to recognize transmission errors. Some checksums can not only detect errors, but also automatically fix errors of certain types. The idea behind a checksum is simple...

Chapter 19. Internet Protocol Version 4 (IPv4): Linux Foundations and Features

The previous chapter laid out what an operating system needs to do to support the IP protocol; this chapter introduces the data structures and basic activities through which Linux supports IP, such as how ingress IP packets are delivered to the IP reception routine, how the...

...The Simple Network Management Protocol (SNMP) employs a type of object called a Management Information Base (MIB) to collect statistics about systems. A data structure called ipstats_mib keeps statistics about the IP layer. The section "IP Statistics" in Chapter 23 covers this structure in more detail.

in_device structure
The in_device structure stores all the IPv4-related configuration for a network device,...

...aspects of the IPv4 implementation we discuss in this part of the book. Firewalling, essentially, hooks into certain places in the network stack code that packets always pass through when the packets or the kernel meet certain conditions; at those points, the firewall allows network administrators to manipulate the contents or disposition of the traffic. Those points in the kernel, as shown in Figure 18-1...

...is smaller than 5 it means there is an error. The second check in the if statement is rather fussy. Currently there are two versions of the IP protocol: IPv4 and IPv6. The if statement makes sure the packet is an IPv4 packet. But because the two protocols are handled by two different functions, the ip_rcv function should never have been called for IPv6 in the first place.

if (iph->ihl < 5 || iph->version...

...from the NF_IP_PRE_ROUTING point within the network stack (which means the packet was received but no routing decision was taken yet). If you decide not to drop the packet, execute ip_rcv_finish."

return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);

See the earlier section "Interaction with Netfilter" for background information.

19.2.5. The ip_rcv_finish Function

ip_rcv did not do...

