Understanding Linux Network Internals (2005), part 9


33.3. Major Cache Operations

The protocol-independent (DST) part of the cache is a set of dst_entry data structures. Most of the activities in this chapter happen through a dst_entry structure. The IPv4 and IPv6 data structures rtable and rt6_info both include a dst_entry data structure.

The dst_entry structure offers a set of virtual functions in a field named dst_ops, which allows higher-layer protocols to run protocol-specific functions that manipulate the entries. The DST code is located in net/core/dst.c and include/net/dst.h.

All the routines that manipulate dst_entry structures start with a dst_ prefix. Note that even though they operate on dst_entry structures, they actually affect the outer rtable structures, too.

DST is initialized with dst_init, invoked at boot time by net_dev_init (see Chapter 5).

33.3.1. Cache Locking

Read-only operations, such as lookups, use a different locking mechanism from read-write operations such as insertion and deletion, but they naturally have to cooperate. Here is how they are handled:

Read-only operations
    These use the routines presented in the section "Cache Lookup" and are protected by a read-copy-update (RCU) read lock, as in the following snapshot:

        rcu_read_lock();
        perform lookup
        rcu_read_unlock();

    This code actually does no locking, because read operations can proceed simultaneously without interfering with each other.

Read-write operations
    The insertion of an entry (see the section "Adding Elements to the Cache") and the deletion of an entry (see the section "Deleting DST Entries") use the spin lock embedded in each bucket's element and shown in Figure 33-1. Note that the provision of a per-bucket lock lets different processors write simultaneously to different buckets.

Chapter 1 explains the RCU algorithm used to implement locking in the routing table cache, and how read-write spin locks coexist with RCU.

33.3.2. Cache Entry Allocation and Reference Counts

A memory pool used to allocate new cache entries is created by ip_rt_init at boot time. Cache entries are allocated with dst_alloc, which returns a void pointer that is cast by the creator to the right data type. Despite the function's name, it does not allocate dst_entry structures, but instead allocates the larger entries that contain those structures: rtable structures for IPv4 (as shown in Figure 33-1), rt6_info for IPv6, and so on. Because the function can be called to allocate structures of different sizes for different protocols, the size of the structure to allocate is indicated through an entry_size virtual function, described in the section "Interface Between the DST and Calling Protocols."

33.3.3. Adding Elements to the Cache

Every time a cache lookup required to route an ingress or egress packet fails, the kernel consults the routing table and stores the result in the routing cache. The kernel allocates a new cache entry with dst_alloc, initializes some of its fields based on the results from the routing table, and finally calls rt_intern_hash to insert the new entry into the cache at the head of the bucket's list. A new route is also added to the cache upon receipt of an ICMP REDIRECT message (see Chapter 25). Figures 33-2(a) and 33-2(b) show the logic of rt_intern_hash. When the kernel is compiled with support for multipath caching, a cache miss may lead to the insertion of multiple routes into the cache, as discussed in the section "Multipath Caching."

The function first checks whether the new route already exists by issuing a simple cache lookup. Even though the function was called because a cache lookup failed, the route could have been added in the meantime by another CPU.
If the lookup succeeds, the existing cached route is simply moved to the head of the bucket's list. (This assumes the route is not associated with a multipath route; i.e., that its DST_BALANCED flag is not set.) If the lookup fails, the new route is added to the cache.

As a simple way to keep the size of the cache under control, rt_intern_hash tries to remove an entry every time it adds a new one. Thus, while browsing the bucket's list, rt_intern_hash keeps track of the most eligible route for deletion and measures the length of the bucket's list. A route is removed only if it is eligible for deletion (that is, its reference count is 0) and only when the bucket list is longer than the configurable parameter ip_rt_gc_elasticity. If these conditions are met, rt_intern_hash invokes the rt_score routine to choose the best route to remove. rt_score ranks routes, according to many criteria, into three classes, ranging from most-valuable routes (least eligible to be removed) to least-valuable routes (most eligible to be removed): [*]

    [*] See the section "Examples of eligible cache victims" in Chapter 30.

  • Routes that were inserted via ICMP redirects, are being monitored by user-space commands, or are scheduled for expiration.
  • Output routes (the ones used to route locally generated packets), broadcast routes, multicast routes, and routes to local addresses (for packets generated by this host for itself).
  • All other routes in decreasing order of timestamp of last use: that is, least recently used routes are removed first.

[Figure 33-2a. rt_intern_hash function]

rt_score simply stores the time the entry has not been used in the lower 30 bits of a local 32-bit variable, then sets the 31st bit for the first class of routes and the 32nd bit for the second class of routes.
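The ranking just described can be sketched in user-space C. This is a minimal illustration, not the kernel's rt_score: the structure route_sketch, its fields, and the helper pick_victim are hypothetical names invented here. Two assumptions are worth flagging. First, the idle time is inverted before it is stored, so that a longer idle period produces a lower score. Second, the sketch gives the most valuable class the higher of the two flag bits so that it always outranks the second class; treat the exact bit positions as an assumption of this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a cached route. The field names are
 * assumptions made for this sketch, not the kernel's rtable layout. */
struct route_sketch {
    uint32_t idle_ticks; /* time elapsed since the entry was last used */
    bool     valuable;   /* class 1: redirect-learned, monitored, or expiring */
    bool     special;    /* class 2: output, broadcast, multicast, or local */
};

#define IDLE_MASK  0x3FFFFFFFu /* lower 30 bits hold the (inverted) idle time */
#define CLASS2_BIT (1u << 30)
#define CLASS1_BIT (1u << 31)

/* Lower score means better eviction candidate. */
static uint32_t route_score(const struct route_sketch *rt)
{
    /* Invert the idle time so that long-unused routes score lower.
     * Idle times that do not fit in 30 bits are truncated by the mask. */
    uint32_t score = ~rt->idle_ticks & IDLE_MASK;

    if (rt->valuable)
        score |= CLASS1_BIT;
    if (rt->special)
        score |= CLASS2_BIT;
    return score;
}

/* Return the entry with the lowest score, or NULL for an empty list. */
static const struct route_sketch *pick_victim(const struct route_sketch *routes,
                                              size_t n)
{
    const struct route_sketch *victim = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (victim == NULL || route_score(&routes[i]) < route_score(victim))
            victim = &routes[i];
    }
    return victim;
}
```

Packing the class flags above the idle time lets a single integer comparison apply both criteria at once: class membership dominates, and idle time only breaks ties within a class.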
The final value is a score that represents how important that route is considered to be: the lower the score, the more likely the route is to be selected as a victim by rt_intern_hash.

[Figure 33-2b. rt_intern_hash function]

33.3.4. Binding the Route Cache to the ARP Cache

Most routing cache entries are bound to the ARP cache entry of the route's next hop. This means that a routing cache entry requires either an existing ARP cache entry or a successful ARP lookup for the same next hop. In particular, the binding is done for output routes used to route locally generated packets (identified by a NULL ingress device identifier) and for unicast forwarding routes. In both cases, ARP is asked to resolve the next hop's L2 address.

Forwarding to broadcast addresses, multicast addresses, and local host addresses does not require an ARP resolution because the addresses are resolved using other means. Egress routes that lead to broadcast and multicast addresses do not need associated ARP entries, because the associated L2 addresses can be derived from the L3 addresses (see the section "Special Cases" in Chapter 26). Routes that lead to local addresses do not need ARP either, because packets matching the route are delivered locally.

ARP binding for routes is created by arp_bind_neighbour. When that function fails due to lack of memory, rt_intern_hash forces an aggressive garbage collection operation on the routing cache by calling rt_garbage_collect (see the section "Garbage Collection").
The aggressive garbage collection is done by lowering the thresholds ip_rt_gc_elasticity and ip_rt_gc_min_interval and then calling rt_garbage_collect. The garbage collection is tried only once, and only when rt_intern_hash has not been called from software interrupt context, because otherwise it would be too costly in CPU time. Once garbage collection has completed, the insertion of the new cache entries starts over from the cache lookup step.

33.3.5. Cache Lookup

Anytime there is a need to find a route, the kernel consults the routing cache first and falls back to the routing table if there is a cache miss. The routing table lookup process is described in Chapter 35; in this section, we will look at the cache lookup.

The routing subsystem provides two different functions to do route lookups, one for ingress and one for egress:

ip_route_input
    Used for input traffic, which could be either delivered locally or forwarded. The function determines how to handle generic packets (whether to deliver locally, forward, drop, etc.) but is also used by other subsystems to decide how to handle their ingress traffic. For instance, ARP uses this function to see whether an ARPOP_REQUEST should be answered (see Chapter 28).

ip_route_output_key
    Used for output traffic, which is generated locally and could be either delivered locally or transmitted out.

Possible return values from the two routines include:

0
    The routing lookup was successful. This case includes a cache miss that triggers a successful routing table lookup.

-ENOBUFS
    The lookup failed due to a memory problem.

-ENODEV
    The lookup key included a device identifier and it was invalid.

-EINVAL
    Generic lookup failure.

The kernel also provides a set of wrappers around the two basic functions, used under specific conditions.
See, for example, how TCP uses ip_route_connect and ip_route_newports.

Figure 33-3 shows the internals of the two main routing cache lookup routines. The egress function shown in the figure is __ip_route_output_key, which is indirectly called by ip_route_output_key.

[Figure 33-3. (a) ip_route_input_key function; (b) __ip_route_output_key function]

The routing cache is used to store both ingress and egress routes, so a cache lookup is tried in both cases. In case of a cache miss, the functions call ip_route_input_slow or ip_route_output_slow, which consult the routing tables via the fib_lookup routine that we will cover in Chapter 35. The names of the functions end in _slow to underline the difference in speed between a lookup that is satisfied from the cache and one that requires a query of the routing tables. The two paths are also referred to as the fast and slow paths.

Once the routing decision has been taken, through either a cache hit or a routing table lookup, and resulting in either success or failure, the lookup routines return the input buffer skb with the skb->dst->input and skb->dst->output virtual functions initialized. skb->dst is the cache entry that satisfied the routing request; in case of a cache miss, a new cache entry is created and linked to skb->dst.

The packet will then be further processed by calling either one or both of the virtual functions skb->dst->input (called via a simple wrapper named dst_input) and skb->dst->output (called via a wrapper named dst_output). Figure 18-1 in Chapter 18 shows where those two virtual functions are invoked in the IP stack, and what routines they can be initialized to depending on the direction of the traffic.

Chapter 35 goes into detail on the slow routines for the routing table lookups.
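The division of labor between the fast path, the slow path, and the per-entry virtual functions can be sketched as follows. This is a toy user-space model under stated assumptions, not kernel code: dst_sketch, packet, cache_slot, route_slow, route_input, dst_input_sketch, the verdict values, and LOCAL_ADDR are all hypothetical names, and the cache is a direct-mapped array (a colliding destination simply overwrites the slot) rather than the kernel's hash table of chained buckets.

```c
#include <stddef.h>
#include <stdint.h>

struct packet;

/* Hypothetical stand-in for a cache entry: it carries the two handlers
 * that decide what happens to a packet matching the route. */
struct dst_sketch {
    int (*input)(struct packet *);
    int (*output)(struct packet *);
};

struct packet {
    uint32_t           daddr; /* destination address used as the lookup key */
    struct dst_sketch *dst;   /* filled in by a successful lookup */
};

enum { VERDICT_LOCAL = 1, VERDICT_FORWARD = 2, VERDICT_TRANSMIT = 3 };

#define LOCAL_ADDR  0x0A000001u /* assumption: this host's only address */
#define CACHE_SLOTS 16

struct cache_slot {
    int               used;
    uint32_t          daddr;
    struct dst_sketch dst;
};

static struct cache_slot cache[CACHE_SLOTS];
static unsigned slow_path_calls; /* counts routing-table consultations */

static int deliver_local(struct packet *p) { (void)p; return VERDICT_LOCAL; }
static int forward(struct packet *p)       { (void)p; return VERDICT_FORWARD; }
static int transmit(struct packet *p)      { (void)p; return VERDICT_TRANSMIT; }

/* Slow path: stands in for the routing table query. On success it
 * installs a new cache entry with its handlers initialized. */
static int route_slow(uint32_t daddr, struct cache_slot *slot)
{
    slow_path_calls++;
    if (daddr == 0)
        return -1; /* pretend there is no route to address 0 */
    slot->used = 1;
    slot->daddr = daddr;
    slot->dst.input = (daddr == LOCAL_ADDR) ? deliver_local : forward;
    slot->dst.output = transmit;
    return 0;
}

/* Fast path first; fall back to the slow path on a cache miss. */
static int route_input(struct packet *p)
{
    struct cache_slot *slot = &cache[p->daddr % CACHE_SLOTS];

    if (!(slot->used && slot->daddr == p->daddr)) {
        int err = route_slow(p->daddr, slot);
        if (err)
            return err;
    }
    p->dst = &slot->dst;
    return 0;
}

/* Thin wrapper that dispatches through the entry's input handler. */
static int dst_input_sketch(struct packet *p)
{
    return p->dst->input(p);
}
```

The point of the indirection is that the caller never needs to know which kind of route matched: once the lookup has attached an entry to the packet, processing continues through whatever handlers that entry carries.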
The next two sections describe the internals of the two cache lookup routines in Figure 33-3. Their code is very similar; the only differences are:

  • On ingress, the device of the ingress route needs to match the ingress device, whereas the egress device is not yet known and is therefore simply compared against the null device (0). The opposite applies to egress routes.
  • In case of a cache hit, the functions update the in_hit and out_hit counters, respectively, using the RT_CACHE_STAT_INC macro. Statistics related to both the routing cache and the routing tables are described in Chapter 36.
  • Egress lookups need to take the RTO_ONLINK flag into account (see the section "Egress lookup").
  • Egress lookups support multipath caching, the feature introduced in the section "Cache Support for Multipath" in Chapter 31.

33.3.5.1. Ingress lookup

ip_route_input is used to route ingress packets. Here is its prototype and the meaning of its input parameters:

    int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr,
                       u8 tos, struct net_device *dev)

skb
    Packet that triggered the route lookup. This packet does not necessarily have to be routed itself. For example, ARP uses ip_route_input to consult the local routing table for other reasons. In this case, skb would be an ingress ARP request.

saddr
daddr
    Source and destination addresses to use for the lookup.

tos
    TOS field, a field of the IP header.

dev
    Device the packet was received from.

ip_route_input selects the bucket of the hash table that should contain the route, based on the input criteria. It then browses the list of routes in that bucket one by one, comparing all the necessary fields until it either finds a match or gets to the end without a match.

The lookup fields passed as input to ip_route_input are compared to the fields stored in the fl field [*] of the routing cache entry's rtable, as shown in the following code extract. The bucket (hash variable) is chosen through a combination of input parameters.
The route itself is represented by the rth variable.

    [*] See the description of the flowi structure in the section "Main Data Structures" in Chapter 32.

    hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos);

    rcu_read_lock();
    for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
         rth = rcu_dereference(rth->u.rt_next)) {
        if (rth->fl.fl4_dst == daddr &&
            rth->fl.fl4_src == saddr &&
            rth->fl.iif == iif &&
            rth->fl.oif == 0 &&
    #ifdef CONFIG_IP_ROUTE_FWMARK
            rth->fl.fl4_fwmark == skb->nfmark &&
    #endif
            rth->fl.fl4_tos == tos) {
            rth->u.dst.lastuse = jiffies;
            dst_hold(&rth->u.dst);
            rth->u.dst.__use++;
            RT_CACHE_STAT_INC(in_hit);
            rcu_read_unlock();
            skb->dst = (struct dst_entry*)rth;
            return 0;
        }
        RT_CACHE_STAT_INC(in_hlist_search);
    }
    rcu_read_unlock();

In the case of a cache miss for a destination address that is multicast, the packet is passed to the multicast handler ip_route_input_mc if one of the following two conditions is met, and is dropped otherwise:

  • The destination address is a locally configured multicast address. This is checked with ip_check_mc.
  • The destination address is not locally configured, but the kernel is compiled with support for multicast routing (CONFIG_IP_MROUTE).

This decision is shown in the following code:

    if (MULTICAST(daddr)) {
        struct in_device *in_dev;

        rcu_read_lock();
        if ((in_dev = __in_dev_get(dev)) != NULL) {
            int our = ip_check_mc(in_dev, daddr, saddr,
                                  skb->nh.iph->protocol);
            if (our
    #ifdef CONFIG_IP_MROUTE
                || (!LOCAL_MCAST(daddr) && IN_DEV_MFORWARD(in_dev))
    #endif
                ) {
                rcu_read_unlock();
                return ip_route_input_mc(skb, daddr, saddr,
                                         tos, dev, our);
            }
        }
        rcu_read_unlock();
        return -EINVAL;
    }
Finally, in the case of a cache miss for a destination address that is not multicast, ip_route_input calls ip_route_input_slow, which consults the routing table:

        return ip_route_input_slow(skb, daddr, saddr, tos, dev);
    }

33.3.5.2. Egress lookup

__ip_route_output_key is used to route locally generated packets and is very similar to ip_route_input: it checks the cache first and relies on ip_route_output_slow in the case of a cache miss. When the cache supports multipath, a cache hit requires some more work: more than one entry in the cache may be eligible for selection and the right one has to be selected based on the caching algorithm in use. The selection is done with multipath_select_route. More details can be found in the section "Multipath Caching."

Here is the prototype of __ip_route_output_key and the meaning of its input parameters:

    int __ip_route_output_key(struct rtable **rp, const struct flowi *flp)

rp
    When the routine returns success, *rp is initialized to point to the cache entry that matched the search key flp.

flp
    Search key.

A successful egress cache lookup needs to match the RTO_ONLINK flag, if it is set:

    !((rth->fl.fl4_tos ^ flp->fl4_tos) & (IPTOS_RT_MASK | RTO_ONLINK))

The preceding condition is true when both of the following conditions are met:

  • The TOS of the routing cache entry matches the one in the search key. Note that the TOS field is saved in the bits 2, 3, 4 and 5 of the 8-bit tos variable (as shown in Figure 18-3 in Chapter 18). [*]

        [*] The TOS field, as shown in Figure 18-3 in Chapter 18, is an 8-bit field, of which bit 0 is ignored and bit 1 through 7 are used. However, the routing code uses only the bits 1, 2, 3 and 4. It does not take the precedence component (bits 5, 6, 7) into consideration for egress routes. Those bits are masked out with the macro RT_TOS.

  • The RTO_ONLINK flag is set on both the routing cache entry and the search key or on neither of them.
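The XOR-and-mask idiom in that condition can be tried in isolation. The following is a small user-space sketch; tos_key_matches is a hypothetical helper name, and the three constant values are written from memory of the 2.6-era kernel headers, so treat them as assumptions rather than as a reference.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, recalled from 2.6-era kernel headers. */
#define IPTOS_TOS_MASK 0x1E
#define IPTOS_RT_MASK  (IPTOS_TOS_MASK & ~3) /* 0x1C: TOS bits used for routing */
#define RTO_ONLINK     0x01                  /* flag carried in an unused TOS bit */

/* XOR leaves a 1 in every bit position where the two values differ.
 * Masking keeps only the positions the lookup cares about, so the result
 * is zero exactly when the routing TOS bits and the RTO_ONLINK bit agree. */
static bool tos_key_matches(uint8_t cached_tos, uint8_t key_tos)
{
    return !((cached_tos ^ key_tos) & (IPTOS_RT_MASK | RTO_ONLINK));
}
```

With these values, two keys that differ only in the precedence bits still match, while keys that differ in the RTO_ONLINK bit never do, which is the "both or neither" requirement stated above.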
You will see the RTO_ONLINK flag in the section "Search Key Initialization" in Chapter 35. The flag is passed via the TOS variable, but it has nothing to do with the IP header's TOS field; it simply uses an unused bit of the TOS field (see Figure 18-1 in Chapter 18). When the flag is set, it means the destination is located in a local subnet and there is no need to do a routing lookup (or, in other words, a routing lookup could fail but that would not be a problem). This is not a flag the administrator sets when configuring routes, but it is used when doing routing lookups to specify that the route type searched must have scope RT_SCOPE_LINK, which means the destination is directly connected.

The flag is then saved in the associated routing cache entries when they are created. Lookups with the RTO_ONLINK flag set are made, for example, by the following protocols:

ARP
    When an administrator manually configures an ARP mapping, the kernel makes sure that the IP address belongs to one of the locally configured subnets. For example, the command arp -s 10.0.0.1 11:22:33:44:55:66 adds the mapping of 10.0.0.1 to 11:22:33:44:55:66 to the ARP cache. This command would be rejected by the kernel if, according to its routing table, the IP address 10.0.0.1 did not belong to one of the locally configured subnets (see arp_req_set and Chapter 26).

Raw IP and UDP
    When sending data over a socket, the user can set the MSG_DONTROUTE flag. This flag is used when an application is transmitting a packet out from a known interface to a destination that is directly connected (there is no need for a gateway), so the kernel does not have to determine the egress device. This kind of transmission is used, for instance, by routing protocols and diagnostic applications.
[...]

... database would look like after configuration of the following two multipath routes:

    # ip route add 10.0.1.0/24 mpath wrandom nexthop via 192.168.1.1 weight 1 nexthop via 192.168.2.1 weight 2
    # ip route add 10.0.2.0/24 mpath wrandom nexthop via 192.168.1.1 weight 5 nexthop via 192.168.2.1 weight 1

The database is actually not built right away when the multipath routes are defined: it is populated at lookup...

[...]

... notification chain netdev_chain, introduced in Chapter 4. The only two events the DST is interested in are the ones generated when a network device goes down (NETDEV_DOWN) and when a device is unregistered (NETDEV_UNREGISTER). You can find the complete list of NETDEV_XXX events in include/linux/notifier.h. When a device becomes unusable, either because it is not available anymore (for instance, it has been unregistered...

[...]

... upper protocol about the unreachability of a given IPv4 address, it calls dst_link_failure for the associated dst_entry structure (remember that cached routes are associated with IP addresses, not with networks), which will invoke the ipv4_link_failure routine registered by IPv4 via ipv4_dst_ops. It is also possible for the calling protocol to intervene directly in DST's behavior. For example, when IPv4...

[...]

... reachable through a different device with a better route.

An IP address is added to or removed from a device
    We saw in the sections "Adding an IP address" and "Removing an IP address" in Chapter 32 that Linux creates a special route for each locally configured IP address. When an address is removed, any associated route in the cache also has to be removed. The removed address was most likely configured...

[...]
... a cleanup to keep down memory use by restricting the cache to a fixed size. gc_thresh is configurable via /proc (see the section "Tuning via /proc Filesystem" in Chapter 36). The next section gives the internals of rt_garbage_collect.

33.7.2. rt_garbage_collect Function

[...]

... eligibility criteria. The number of entries to remove (goal) depends on how heavily loaded the hash table is. The goal is to expire entries faster when the table is more heavily loaded. With the help of Figure 33-9, let's clarify some of the thresholds used by rt_garbage_collect to define goal:

  • The size of the hash table is rt_hash_mask+1, or 2^rt_hash_log.
  • rt_garbage_collect is called when the number of entries in...

[...]

... ip_rt_gc_elasticity*(2^rt_hash_log), which by default is eight times the size of the hash table, the cache is considered to be dangerously large and the garbage collection starts setting goal more aggressively.

[Figure 33-9. Garbage collection thresholds]

Once the thresholds have been defined, rt_garbage_collect browses the hash table elements looking for victims. The table is not simply browsed from the first to the last...

Date posted: 13/08/2014, 04:21


Table of contents

  • Understanding Linux Network Internals
  • Table of Contents
  • Copyright
  • Preface
    • The Audience for This Book
    • Background Information
    • Organization of the Material
    • Conventions Used in This Book
    • Using Code Examples
    • We'd Like to Hear from You
    • Safari Enabled
    • Acknowledgments
  • Part I: General Background
    • Chapter 1. Introduction
      • Section 1.1. Basic Terminology
      • Section 1.2. Common Coding Patterns
      • Section 1.3. User-Space Tools
      • Section 1.4. Browsing the Source Code
      • Section 1.5. When a Feature Is Offered as a Patch
    • Chapter 2. Critical Data Structures
      • Section 2.1. The Socket Buffer: sk_buff Structure
      • Section 2.2. net_device Structure
      • Section 2.3. Files Mentioned in This Chapter
    • Chapter 3. User-Space-to-Kernel Interface
      • Section 3.1. Overview
