Characterizing Performance: Latency and Effective Bandwidth

The routing, switching, and arbitration functionality described above introduces additional components of packet transport latency that must be taken into account in the expression for total packet latency. Assuming there is no contention for network resources, as would be the case in an unloaded network, total packet latency is given by the following:

$$\text{Latency} = \text{Sending overhead} + (T_{\text{TotalProp}} + T_R + T_A + T_S) + \frac{\text{Packet size}}{\text{Bandwidth}} + \text{Receiving overhead}$$

Here T_R, T_A, and T_S are the total routing time, arbitration time, and switching time experienced by the packet, respectively, and are either measured quantities or calculated quantities derived from more detailed analyses. These components are added to the total propagation delay through the network links, T_TotalProp, to give the overall time of flight of the packet.

The expression above gives only a lower bound for the total packet latency, as it does not account for additional delays due to contention for resources. When the network is heavily loaded, several packets may request the same network resources concurrently, causing contention that degrades performance. Packets that lose arbitration have to be buffered, which increases packet latency by a contention delay spent waiting. This additional delay is not included in the above expression. When the network or part of it approaches saturation, contention delay may be several orders of magnitude greater than the total packet latency suffered by a packet under zero load or even under slightly loaded network conditions. Unfortunately, it is not easy to compute the total packet latency analytically when the network is more than moderately loaded. Measuring these quantities using cycle-accurate simulation of a detailed network model is a better and more precise way of estimating packet latency under such circumstances. Nevertheless, the expression given above is useful in calculating best-case lower bounds for packet latency.
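The zero-load latency expression can be sketched in a few lines of Python. This is only an illustration: the function name, parameter names, and the numbers in the usage line are hypothetical and do not come from the text. Times are in nanoseconds and bandwidth in Gbps, so that bits divided by Gbps yields nanoseconds.

```python
def unloaded_latency_ns(sending_overhead_ns, t_total_prop_ns, t_r_ns, t_a_ns,
                        t_s_ns, packet_size_bits, bandwidth_gbps,
                        receiving_overhead_ns):
    """Best-case (zero-load) packet latency in ns; contention delay is ignored."""
    # Time of flight: link propagation plus routing, arbitration, and switching.
    time_of_flight_ns = t_total_prop_ns + t_r_ns + t_a_ns + t_s_ns
    # 1 Gbps is 1 bit per ns, so bits / Gbps gives ns.
    transmission_time_ns = packet_size_bits / bandwidth_gbps
    return (sending_overhead_ns + time_of_flight_ns
            + transmission_time_ns + receiving_overhead_ns)


# Illustrative (made-up) inputs: 5 ns overheads, 10 ns propagation,
# 2.5 ns each for routing, arbitration, and switching, 800 bits at 8 Gbps.
example_ns = unloaded_latency_ns(5.0, 10.0, 2.5, 2.5, 2.5, 800, 8.0, 5.0)
```

Because contention is excluded, the returned value should be read as a lower bound, not as a prediction for a loaded network.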

For similar reasons, effective bandwidth is not easy to compute exactly, but we can estimate best-case upper bounds for it by appropriately extending the model presented at the end of the previous section. What we need to do is find the narrowest section of the end-to-end network pipe by finding the network injection bandwidth (BW_NetworkInjection), the network reception bandwidth (BW_NetworkReception), and the network bandwidth (BW_Network) across the entire network interconnecting the devices.

The BW_NetworkInjection can be calculated simply by multiplying the expression for link injection bandwidth, BW_LinkInjection, by the total number of network injection links. The BW_NetworkReception is calculated similarly using BW_LinkReception, but it must also be scaled by a factor that reflects application traffic and other characteristics. For more than two interconnected devices, it is no longer valid to assume a one-to-one relationship among sources and destinations when analyzing the effect of flow control on link reception bandwidth. It could happen, for example, that several packets from different injection links arrive concurrently at the same reception link for applications that have many-to-one traffic characteristics, which causes contention at the reception links. This effect can be taken into account by an average reception factor parameter, σ, which is either a measured quantity or a calculated quantity derived from detailed analysis. It is defined as the average fraction or percentage of packets arriving at reception links that can be accepted. Only those packets can be immediately delivered, thus reducing network reception bandwidth by that factor. This reduction occurs as a result of application behavior regardless of internal network characteristics. Finally, BW_Network takes into account the internal characteristics of the network, including contention. We will progressively derive expressions in the following sections that will enable us to calculate this as more details are revealed about the internals of our black box interconnection network.

Overall, the effective bandwidth delivered by the network end-to-end to an application is determined by the minimum across the three sections, as described by the following:

$$\begin{aligned}
\text{Effective bandwidth} &= \min(BW_{\text{NetworkInjection}},\ BW_{\text{Network}},\ \sigma \times BW_{\text{NetworkReception}}) \\
&= \min(N \times BW_{\text{LinkInjection}},\ BW_{\text{Network}},\ \sigma \times N \times BW_{\text{LinkReception}})
\end{aligned}$$

Let’s use the above expressions to compare the latency and effective bandwidth of shared-media networks against switched-media networks for the four interconnection network domains: OCNs, SANs, LANs, and WANs.

Example Plot the total packet latency and effective bandwidth as the number of interconnected nodes, N, scales from 4 to 1024 for shared-media and switched-media OCNs, SANs, LANs, and WANs. Assume that all network links, including the injection and reception links at the nodes, each have a data bandwidth of 8 Gbps, and unicast packets of 100 bytes are transmitted. Shared-media networks share one link, and switched-media networks have at least as many network links as there are nodes. For both, ignore latency and bandwidth effects due to contention within the network. End nodes have per-packet sending and receiving overheads of x + 0.05 ns/byte and 4/3(x) + 0.05 ns/byte, respectively, where x is 0 μs for the OCN, 0.3 μs for the SAN, 3 μs for the LAN, and 30 μs for the WAN, and interconnection distances are 0.5 cm, 5 m, 5000 m, and 5000 km, respectively. Also assume that the total routing, arbitration, and switching times are constants or functions of the number of interconnected nodes: T_R = 2.5 ns, T_A = 2.5(N) ns, and T_S = 2.5 ns for shared-media networks and T_R = T_A = T_S = 2.5(log₂ N) ns for switched-media networks. Finally, taking into account application traffic characteristics for the network structure, the average reception factor, σ, is assumed to be N^(-1) for shared media and polylogarithmic, (log₂ N)^(-1/4), for switched media.

Answer All components of total packet latency are the same as in the example given in the previous section except for time of flight, which now has additional routing, arbitration, and switching delays. For shared-media networks, the additional delays total 5 + 2.5(N) ns; for switched-media networks, they total 7.5(log₂ N) ns.
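The latency side of this example can be reproduced with a short Python sketch. Two points are assumptions, since they are set in the previous section and not restated here: signals are taken to propagate at two-thirds of the speed of light, and each overhead is read as x plus 0.05 ns per byte of the 100-byte packet. All function and constant names are hypothetical.

```python
import math

PACKET_BYTES = 100
LINK_GBPS = 8.0  # 1 Gbps is 1 bit per ns
# Assumption: propagation at 2/3 of the speed of light (about 0.2 m/ns).
PROP_SPEED_M_PER_NS = (2 / 3) * 0.299792458

# Per-domain parameters from the example: x in ns, distance in meters.
X_NS = {"OCN": 0.0, "SAN": 300.0, "LAN": 3_000.0, "WAN": 30_000.0}
DISTANCE_M = {"OCN": 0.005, "SAN": 5.0, "LAN": 5_000.0, "WAN": 5_000_000.0}


def extra_delay_ns(n, media):
    """T_R + T_A + T_S for n nodes, as specified in the example."""
    if media == "shared":
        return 2.5 + 2.5 * n + 2.5      # 5 + 2.5(N) ns
    return 3 * 2.5 * math.log2(n)       # 7.5(log2 N) ns


def total_latency_ns(domain, n, media):
    """Zero-load packet latency in ns for one domain and network size."""
    x = X_NS[domain]
    sending = x + 0.05 * PACKET_BYTES
    receiving = (4 / 3) * x + 0.05 * PACKET_BYTES
    propagation = DISTANCE_M[domain] / PROP_SPEED_M_PER_NS
    transmission = PACKET_BYTES * 8 / LINK_GBPS
    return (sending + propagation + extra_delay_ns(n, media)
            + transmission + receiving)


if __name__ == "__main__":
    for n in (4, 64, 1024):
        print(n, total_latency_ns("OCN", n, "shared"),
              total_latency_ns("OCN", n, "switched"))
```

Sweeping N over powers of two from 4 to 1024 and plotting the results on a logarithmic axis should give curves of the kind discussed next, under the stated assumptions.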

Latency is plotted only for OCNs and SANs in Figure F.9 as these networks give the more interesting results. For OCNs, T_R, T_A, and T_S combine to dominate time of flight and are much greater than each of the other latency components for a moderate to large number of nodes. This is particularly so for the shared-media network. The latency increases much more dramatically with the number of nodes for shared media as compared to switched media given the difference in arbitration delay between the two. For SANs, T_R, T_A, and T_S dominate time of flight for most network sizes but are greater than each of the other latency components in shared-media networks only for large-sized networks; they are less than the other latency components for switched-media networks but are not negligible. For LANs and WANs, time of flight is dominated by propagation delay, which dominates other latency components as calculated in the previous section; thus, T_R, T_A, and T_S are negligible for both shared and switched media.

Figure F.9 Latency versus number of interconnected nodes plotted in semi-log form for OCNs and SANs. Routing, arbitration, and switching have more of an impact on latency for networks in these two domains, particularly for networks with a large number of nodes, given the low sending and receiving overheads and low propagation delay.

[Plot not reproduced. Vertical axis: latency (ns). Horizontal axis: number of nodes (N), from 4 to 1024. Series: shared SAN, shared OCN, switched SAN, switched OCN.]

Figure F.10 plots effective bandwidth versus number of interconnected nodes for the four network domains. The effective bandwidth for all shared-media networks is constant through network scaling as only one unicast packet can be received at a time over all the network reception links, and that is further limited by the receiving overhead of each network for all but the OCN. The effective bandwidth for all switched-media networks increases with the number of interconnected nodes, but it is scaled down by the average reception factor. The receiving overhead further limits effective bandwidth for all but the OCN.
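The effective bandwidth expression can be evaluated for this example with a short Python sketch. It rests on two assumptions that are not stated in this section. First, the per-link model of the previous section is taken to be packet size divided by the larger of the relevant overhead and the transmission time. Second, BW_Network for switched media is taken to be exactly N links of 8 Gbps, although the example only says "at least as many". All names are hypothetical.

```python
import math

PACKET_BYTES = 100
PACKET_BITS = PACKET_BYTES * 8
LINK_GBPS = 8.0  # 1 Gbps is 1 bit per ns
X_NS = {"OCN": 0.0, "SAN": 300.0, "LAN": 3_000.0, "WAN": 30_000.0}


def link_bandwidth_gbps(overhead_ns):
    """Assumed per-link model: packet size / max(overhead, transmission time)."""
    transmission_ns = PACKET_BITS / LINK_GBPS
    return PACKET_BITS / max(overhead_ns, transmission_ns)


def effective_bandwidth_gbps(domain, n, media):
    """min(N * BW_LinkInjection, BW_Network, sigma * N * BW_LinkReception)."""
    x = X_NS[domain]
    bw_link_injection = link_bandwidth_gbps(x + 0.05 * PACKET_BYTES)
    bw_link_reception = link_bandwidth_gbps((4 / 3) * x + 0.05 * PACKET_BYTES)
    if media == "shared":
        sigma = 1 / n                    # N^(-1) from the example
        bw_network = LINK_GBPS           # one shared link
    else:
        sigma = math.log2(n) ** -0.25    # (log2 N)^(-1/4) from the example
        bw_network = n * LINK_GBPS       # assumption: exactly N network links
    return min(n * bw_link_injection, bw_network,
               sigma * n * bw_link_reception)


if __name__ == "__main__":
    for domain in X_NS:
        print(domain,
              effective_bandwidth_gbps(domain, 64, "shared"),
              effective_bandwidth_gbps(domain, 64, "switched"))
```

For shared media the factor σ × N equals 1, so the result does not depend on N, which is the flat behavior described above.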

Figure F.10 Effective bandwidth versus number of interconnected nodes plotted in semi-log form for the four network domains. The disparity in effective bandwidth between shared- and switched-media networks for all interconnect domains widens significantly as the number of nodes in the network increases. Only the switched on-chip network is able to achieve an effective bandwidth equal to the aggregate bandwidth for the parameters given in this example.

[Plot not reproduced. Vertical axis: effective bandwidth (Gbits/sec). Horizontal axis: number of nodes (N). Series: switched and shared variants of OCN, SAN, LAN, and WAN.]

Given the obvious advantages, why weren’t switched networks always used? Earlier computers were much slower and could share the network media with little impact on performance. In addition, the switches for earlier LANs and WANs took up several large boards and were about as large as an entire computer. As a consequence of Moore’s law, switches have shrunk considerably, and systems have a much greater need for high-performance communication.

Switched networks allow communication to harvest the same rapid advancements from silicon as processors and main memory. Whereas switches from telecommunication companies were once the size of mainframe computers, today we see single-chip switches and even entire switched networks within a chip. Thus, technology and application trends favor switched networks today. Just as single-chip processors led to processors replacing logic circuits in a surprising number of places, single-chip switches and switched on-chip networks are increasingly replacing shared-media networks (i.e., buses) in several application domains. As an example, PCI-Express (PCIe), a switched network, was introduced in 2005 to replace the traditional PCI-X bus on personal computer motherboards.

The previous example also highlights the importance of optimizing the routing, arbitration, and switching functions in OCNs and SANs. For these network domains in particular, the interconnect distances and overheads typically are small enough to make latency and effective bandwidth much more sensitive to how well these functions are implemented, particularly for larger-sized networks. This leads mostly to implementations based on the faster hardware solutions for these domains. In LANs and WANs, implementations based on the slower but more flexible software solutions suffice given that performance is largely determined by other factors. The design of the topology for switched-media networks also plays a major role in determining how close the network can come to the lower bound on latency and the upper bound on effective bandwidth for OCN and SAN domains.

The next three sections touch on these important issues in switched networks, with the next section focused on topology.

F.4 Network Topology

When the number of devices is small enough, a single switch is sufficient to interconnect them within a switched-media network. However, the number of switch ports is limited by existing very-large-scale integration (VLSI) technology, cost considerations, power consumption, and so on. When the number of required network ports exceeds the number of ports supported by a single switch, a fabric of interconnected switches is needed. To embody the necessary property of full access (i.e., connectedness), the network switch fabric must provide a path from every end node device to every other device. All the connections to the network fabric and between switches within the fabric use point-to-point links as opposed to shared links; that is, links with only one switch or end node device on either end. The interconnection structure across all the components, including switches, links, and end node devices, is referred to as the network topology.
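The full-access property described in this section can be checked mechanically for a candidate fabric. The sketch below is one possible way to do so, not a method given in the text: it treats every point-to-point link as bidirectional (an assumption) and runs a breadth-first search from one end node. The function name and the device labels are hypothetical.

```python
from collections import deque


def has_full_access(links, end_nodes):
    """Return True if every end node can reach every other end node.

    links: iterable of (a, b) pairs, each a point-to-point link between two
    devices (switch or end node); links are assumed to be bidirectional.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    end_nodes = list(end_nodes)
    if not end_nodes:
        return True

    # With bidirectional links, reaching all end nodes from one of them
    # implies that every pair of end nodes is connected.
    start = end_nodes[0]
    seen = {start}
    queue = deque([start])
    while queue:
        device = queue.popleft()
        for neighbor in adjacency.get(device, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return all(node in seen for node in end_nodes)


# Two switches, each with two end nodes, joined by one inter-switch link.
fabric = [("n0", "s0"), ("n1", "s0"), ("n2", "s1"), ("n3", "s1"), ("s0", "s1")]
nodes = ["n0", "n1", "n2", "n3"]
connected = has_full_access(fabric, nodes)
# Removing the inter-switch link splits the fabric into two islands.
split = has_full_access(fabric[:-1], nodes)
```

A fabric with unidirectional links would need a stronger check (reachability from every end node), which this sketch does not attempt.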

The number of network topologies described in the literature would be difficult to count, but the number that have been used commercially is no more than about a dozen or so. During the 1970s and early 1980s, researchers struggled to propose new topologies that could reduce the number of switches that packets must traverse, referred to as the hop count. In the 1990s, thanks to the introduction of pipelined transmission and switching techniques, the hop count became less critical. Nevertheless, today, topology is still important, particularly for OCNs and SANs, as subtle relationships exist between topology and other network design parameters that impact performance, especially when the number of end nodes is very large (e.g., 64 K in the Blue Gene/L supercomputer) or when the latency is critical (e.g., in multicore processor chips). Topology also greatly impacts the implementation cost of the network.

Topologies for parallel supercomputer SANs have been the most visible and imaginative, usually converging on regularly structured ones to simplify routing, packaging, and scalability. Those for LANs and WANs tend to be more haphazard or ad hoc, having more to do with the challenges of long distance or connecting across different communication subnets. Switch-based topologies for OCNs are only recently emerging but are quickly gaining in popularity. This section describes the more popular topologies used in commercial products. Their advantages, disadvantages, and constraints are also briefly discussed.
