

Chapter 3   Transport-Layer Assisted Routing

3.3  Proposed Algorithm: Downward-Lateral Deterministic Routing

The proposed 3D routing algorithm in TLAR, Downward-Lateral Deterministic Routing (DLDR), combines downward routing with a lateral deterministic routing. Downward routing moves packets up and down in the vertical direction, and lateral deterministic routing moves packets in the lateral direction. The path diversity is two, because a packet can be routed laterally in either the source layer or the bottom layer. To reduce the computational complexity of checking routability, we adopt XY routing, a dimension-ordered routing (DOR), as the deterministic routing.

Fig. 3-6 (a) TLAR examples, and (b) checking dependency.

An example of routability is shown in Fig. 3-6(a). There are three kinds of destination routers. First, the gray blocks are throttled destinations. Messages toward these destinations are kept in the message queue until the destinations become routable; otherwise, the packets would be blocked in the network because the destination router is not active. Second, the white blocks are destinations that are routable with XY routing; an example path is shown by the green line. Third, the pale blue blocks are destinations that are only downward-routable; an example downward path is shown by the dotted line. In summary, if the path is lateral-first routable, the packet first traverses the lateral path in the source layer and then goes up or down to the destination router. Otherwise, the downward path is the only path, so the packet first travels down to the bottom layer, is routed laterally in the bottom layer, and then goes up to the destination router.
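The two routing modes can be illustrated with a short sketch. It is not the thesis implementation: the function names are hypothetical, coordinates are assumed to be (x, y, z) tuples, and the bottom layer is assumed to have index z = 0. Whether a destination is lateral-first routable is taken as an input here.

```python
def lateral_xy(x, y, dx, dy):
    """Yield the (x, y) hops from (x, y) to (dx, dy), X first and then Y."""
    while x != dx:
        x += 1 if dx > x else -1
        yield x, y
    while y != dy:
        y += 1 if dy > y else -1
        yield x, y


def vertical(z, dz):
    """Yield the layer indices visited when moving from layer z to layer dz."""
    while z != dz:
        z += 1 if dz > z else -1
        yield z


def dldr_path(src, dst, lateral_first_routable, bottom_layer=0):
    """Return the list of routers visited from src to dst (both inclusive).

    Assumption: bottom_layer=0 is the layer that is never throttled.
    """
    sx, sy, sz = src
    dx, dy, dz = dst
    # Lateral routing takes place in the source layer or in the bottom layer.
    lateral_layer = sz if lateral_first_routable else bottom_layer
    path = [src]
    for z in vertical(sz, lateral_layer):          # downward mode only
        path.append((sx, sy, z))
    for x, y in lateral_xy(sx, sy, dx, dy):        # XY routing in one layer
        path.append((x, y, lateral_layer))
    for z in vertical(lateral_layer, dz):          # final vertical segment
        path.append((dx, dy, z))
    return path
```

Both modes use the same number of lateral hops; the downward mode only adds vertical hops to and from the bottom layer.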

When the topology changes, the routing mode must be decided again for each destination, and the decisions are saved in the network interface. The controller in the transport layer uses the topology table to check whether there is any fully throttled router on each path. The routability of all destinations in the source layer can be checked in an incremental breadth-first search (BFS) style, as shown in Fig. 3-6(b). The dependency follows XY routing: a node is routable only if the previous node on its XY path is also routable. For an 8x8x4 network, the operation is completed in 63 cycles.
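The dependency rule can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware controller: the function name is hypothetical, the source layer is modeled as a 2D grid of (x, y) nodes, and `throttled` is the set of throttled nodes in that layer.

```python
from collections import deque


def xy_routable_nodes(width, height, source, throttled):
    """Return the (x, y) nodes of one layer that are reachable from `source`
    by XY routing without crossing a throttled node."""
    if source in throttled:
        return set()
    sx, sy = source
    routable = {source}
    frontier = deque([source])
    while frontier:
        x, y = frontier.popleft()
        successors = []
        if y == sy:
            # Still on the source row: the X phase may continue away from the source.
            if x >= sx:
                successors.append((x + 1, y))
            if x <= sx:
                successors.append((x - 1, y))
        # The Y phase may start or continue away from the source row.
        if y >= sy:
            successors.append((x, y + 1))
        if y <= sy:
            successors.append((x, y - 1))
        for node in successors:
            nx, ny = node
            in_mesh = 0 <= nx < width and 0 <= ny < height
            # A node becomes routable only through its unique XY predecessor.
            if in_mesh and node not in throttled:
                routable.add(node)
                frontier.append(node)
    return routable
```

Each node has exactly one predecessor under XY routing, so every non-source node is examined at most once. For an 8x8 layer that is 63 nodes, which would be consistent with the 63-cycle figure if one node is checked per cycle (an assumption about the hardware, not stated in the text).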

3.4 Summary

In this chapter, we propose the Transport-Layer Assisted Routing (TLAR) scheme. TLAR ensures that injected packets reach their destination routers successfully. The operation flow and the routability checking of TLAR are also described.

Based on TLAR, we propose DLDR for sending packets with XYZ routing or downward routing.


Chapter 4

Performance Evaluation

4.1 Setting of Simulation Environments

Currently there is no real chip implementation of a 3D NoC system, so we start by modeling the 2D NoC system implemented in [30] and stack it into multiple layers. For network simulation, we start from Noxim [28] and extend it to the third dimension. For temperature simulation, we use HotSpot [29], and we adopt the tile geometry and power model of Intel's 80-core processor [30]. We first add models of the basic 3D router and the Dimensionally-Decomposed (DimDe) router [23], and then modify Noxim to generate the 3D NoC architecture and the floorplan from user-defined dimension parameters. During network traffic simulation, a power trace is generated based on the power model of the NoC. The power trace and the physical floorplan are input to the thermal model. Fig. 4-1 shows the construction of the co-simulation model, and Fig. 4-2 shows the floorplan we used.

We adopt basic wormhole flow control and use random arbitration for switch allocation.

We construct a 3D 8×8×4 NoC, and the packet length is chosen randomly from 2 to 10 flits. The queue depth of each input channel is 16 flits, and the link-level flow control protocol is a full handshake with request and ack signals. Because TSVs generally have high bandwidth, a crossbar-based vertical connection is assumed in our 3D router [10]. For each tile in the NoC, the tile area is 2.0mm × 1.5mm and the router area is 0.65mm × 0.53mm.

Fig. 4-1 Framework of co-simulation platform.

Fig. 4-2 Construction of the model of a 4x4x4 3D NoC with the simplified tile model from [30].

To keep the performance indices representative and comparison as fair as possible, several modifications of the simulator are required for modeling the TLAR.

In Noxim [28], the statistics of received packet number, packet latency, and network throughput are based on the packets received during the simulation period while the network is assumed stable. The payloads toward fully throttled destinations are held in the transport layer and not packetized; only the deliverable payloads are packetized and injected into the network. Originally, the injection rate is simulated by generating packets with Poisson arrivals for a given traffic distribution of destinations over the network. To keep the network injection rate of the active routers at the specified value, packets toward throttled destinations are skipped and more packets toward non-throttled destinations are generated instead. Because we assume that the application layer and the transport layer share the topology information, the packet injection process of a throttled router is paused until the router is no longer throttled. In this setting, the total injection rate of the network is obtained by multiplying the per-router injection rate by the number of active routers, and the statistics of the performance indices are not affected by throttling.

In this chapter we show the performance of the proposed TLAR algorithms.

We use two cases of vertical throttling: (a) one throttled router, and (b) two 2x2x3 throttled regions. In case (a), the throttled router is located in the center of the topmost layer of the 8x8x4 network. In case (b), eight 1x1x3 pillars, which form the two 2x2 regions, are throttled on the diagonal of the upper three layers of the 8x8x4 network. Both cases are shown in Fig. 4-3.
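As a quick check of the two cases, the following sketch counts the active routers and scales a per-router injection rate to a network-wide rate. The function names and the per-router rate of 0.01 are illustrative assumptions, not values from the experiments.

```python
def active_routers(dims, n_throttled):
    """Number of routers still injecting packets in an X-by-Y-by-Z mesh."""
    x, y, z = dims
    return x * y * z - n_throttled


def total_injection_rate(per_router_rate, n_active):
    """Total network injection rate: per-router rate times active routers."""
    return per_router_rate * n_active


DIMS = (8, 8, 4)
case_a = active_routers(DIMS, 1)        # (a) one throttled router
case_b = active_routers(DIMS, 8 * 3)    # (b) eight 1x1x3 pillars = 24 routers

HYPOTHETICAL_RATE = 0.01                # assumed per-router injection rate
rate_a = total_injection_rate(HYPOTHETICAL_RATE, case_a)
rate_b = total_injection_rate(HYPOTHETICAL_RATE, case_b)
```

Case (a) leaves 255 active routers and case (b) leaves 232, so at the same per-router rate the total offered load in case (b) is lower.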

Fig. 4-3 Throttling cases: (a) one router; (b) two 2x2x3 regions.


4.2 Traffic balance and rate of transmitting packet under different routings

First, we use the statistical traffic load distribution (STLD) [34] and the routing mode decision distribution to show the network loading. All the experiments in Fig. 4-4 and Fig. 4-5 use the same injection rate, chosen so that the average latency of TLAR-DLDR is twice the zero-load latency.

Fig. 4-4(a) shows the STLD of the baseline downward routing, and Fig. 4-4(b) shows that of the TLAR-DLDR algorithm, which combines downward routing with TLAR. Although only one router is throttled in the network, some packets still have to be routed downward through the bottom layer. With the proposed DLDR, the congestion in the bottom layer is reduced and the load of the network is more balanced. The load in the upper layers is also more balanced, because the congestion in the bottom layer is relieved. Fig. 4-4(c) shows the distribution of the routing mode decisions. With DLDR, 80% of the packets are routed on the deterministic paths in their source layers and 20% are routed in the downward mode. This shows that the proposed DLDR reduces the share of downward-routed packets.


Fig. 4-4 (a) Statistical traffic load distribution (STLD) of conventional design; (b) STLD of proposed TLAR framework with DLDR algorithm; (c) distribution of routing mode decisions.

Fig. 4-5 shows similar results when we increase the number of disconnected throttled regions and their size. We increase the number of throttled routers from 1 to 24. The congestion in the bottom layer under downward routing is still larger than under DLDR, because downward routing cannot balance the load of the bottom layer. The load in the upper layers is more balanced because the congestion in the bottom layer is relieved by the proposed DLDR algorithm. Fig. 4-5(c) shows the distribution of the routing mode decisions. With DLDR, 40% of the packets are routed on the deterministic paths in their source layers, and 60% are routed in the downward mode. DLDR has fewer packets choosing the downward mode, so it is more vertically balanced than downward routing.


[Fig. 4-5(c): vertical axis from 0% to 100%; legend: XY (Lateral-First), Downward (Z-First)]

Fig. 4-5 (a) Statistical traffic load distribution (STLD) of conventional design; (b) STLD of proposed TLAR framework with DLDR algorithm; (c) distribution of routing mode decisions.

Table 4-1 shows the statistics of the statistical traffic load distribution (STLD). The mean packet number is increased by adopting the TLAR scheme, and both the total and the inter-layer standard deviations are reduced by applying the DLDR algorithm. These statistics are consistent with the performance simulations shown in Fig. 4-6. In the case of one throttled router, 255 active routers inject packets into the network. In the case of Fig. 4-3(b), only 232 routers are active. Only these active routers can transmit packets to, and receive packets from, the network. Because the network load is more balanced, DLDR performs better than the baseline downward routing in both the one-router and the two 2x2x3 throttling cases. The throughput is improved by 95% in Fig. 4-6(a) and by 70% in Fig. 4-6(b) by adopting the DLDR algorithm. The proposed DLDR algorithm therefore outperforms downward routing.

Table 4-1 Statistics of statistical traffic load distribution.

One Throttled router Two 2x2x3 Throttled Pillars

Fig. 4-6 Average latency vs injection rate with (a) one router throttled and (b) two 2x2x3 pillars throttled.


4.2.1 Network Sustainability and Degree of Graceful Degradation

Here we show the network sustainability and the degree of graceful degradation.

Network sustainability describes the total throughput provided by the network while some parts of the network are not working. If a router is throttled, it cannot provide bandwidth for packet delivery. With higher network sustainability, the 3D NoC can provide larger throughput when some routers are fully throttled. Because we cannot simulate all throttling cases, we simulate extreme cases ranging from one throttled router (1x1x1) to a 7x7x3 throttled region, as shown in Fig. 4-7. All the throttled regions are located in the center of the xy-plane. Here we want to observe the throughput degradation under the different irregular topologies that occur in the NSI-mesh, each with a different degree of vertical load imbalance. As shown in Fig. 4-8, all algorithms degrade as the size of the throttled region increases. In all cases, the DLDR algorithm performs better than the conventional reactive downward routing. Compared with conventional reactive downward routing, the proposed TLAR-DLDR improves the sustainable throughput by 48% to 85.5% across these cases.

[Fig. 4-7 panels: 1x1x1, 1x1x2, 1x1x3, 2x2x3, 4x4x3, 6x6x3, 7x7x3]

Fig. 4-7 Different throttling cases.


[Fig. 4-8: horizontal axis, size of throttled region (1x1x1, 1x1x2, 1x1x3, 2x2x3, 4x4x3, 6x6x3, 7x7x3); vertical axis, network throughput (flits/cycle); series: Downward, TLAR-DLDR]

Fig. 4-8 Network sustainability of NSI-mesh 3D NoC, uniform traffic offered.

4.2.2 Run-Time Temperature and Throughput

Here we show a realistic case that simulates TLAR under run-time thermal management (RTM). The simulation setup is the same as before; the only difference is that the locations of the throttled routers are not fixed. Thermal-Aware Vertical Throttling (TAVT) is adopted in RTM to throttle overheated routers, and TLAR detects the throttled routers and detours around them. As shown in Fig. 4-9, the total number of simulated cycles in the network simulator is M + 1000·K. The total simulated time for temperature is 10 seconds, divided into 1000 intervals of 10 ms for observing the transient-state temperature. The network simulator first runs for M cycles to warm up the network, and the thermal simulator sets the ambient temperature T_amb and initializes the temperature distribution T_0. For each 10 ms interval, the network simulator uses K cycles to estimate the power distribution, denoted P(t, t+10). The thermal simulator is then called to estimate the transient-state temperature T_{t+10} from the short-term power distribution P(t, t+10) and the temperature distribution T_t at the beginning of the interval. In this simulation, K = 50000 cycles, which lets the traffic reach a steady state between thermal checks, and we focus on the transient temperature and throughput of each interval.
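The mutual coupling described above can be sketched as a simple loop. This is only an outline under stated assumptions: `run_network` and `run_thermal` are hypothetical callables standing in for the network and thermal simulators, temperature is reduced to a single number, and the toy models below exist only to make the sketch executable.

```python
def co_simulate(run_network, run_thermal, t_init,
                warmup_cycles, cycles_per_interval, n_intervals):
    """Alternate network and thermal simulation, one interval at a time.

    run_network(cycles, temperature) -> (power, throughput)
    run_thermal(temperature, power)  -> temperature at the end of the interval
    """
    run_network(warmup_cycles, t_init)   # warm-up; its results are discarded
    total_cycles = warmup_cycles
    temperature = t_init
    temperatures = [t_init]
    throughputs = []
    for _ in range(n_intervals):
        # K cycles of traffic give the power distribution P(t, t+10).
        power, throughput = run_network(cycles_per_interval, temperature)
        # The thermal model maps T_t and P(t, t+10) to T_{t+10}.
        temperature = run_thermal(temperature, power)
        total_cycles += cycles_per_interval
        temperatures.append(temperature)
        throughputs.append(throughput)
    return temperatures, throughputs, total_cycles


def toy_network(cycles, temperature):
    """Toy stand-in: throttling above 80 degrees halves throughput and power."""
    throughput = 10.0 if temperature > 80.0 else 20.0
    return 2.0 * throughput, throughput


def toy_thermal(temperature, power):
    """Toy stand-in: heating proportional to power, cooling toward 25 degrees."""
    return temperature + 0.05 * power - 0.02 * (temperature - 25.0)


temps, thrs, cycles = co_simulate(toy_network, toy_thermal, 25.0,
                                  warmup_cycles=1000,
                                  cycles_per_interval=50000,
                                  n_intervals=1000)
```

With M = 1000 (an arbitrary warm-up length chosen for this sketch) and K = 50000, the loop accounts for M + 1000·K cycles in total.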

Fig. 4-9 Mutual-coupling co-simulation for throughput and temperature evaluation.

We show the temperature and the number of throttled routers from 7.1s to 7.6s. As shown in Fig. 4-10(a), the transient temperature stays below the thermal limit, because TAVT effectively controls the temperature. The changing number of throttled routers reflects the changing topology, from regular to irregular and back to regular. The throughput of downward routing is shown in Fig. 4-10(b); the average throughput between 7.1s and 7.6s is 15.4 flits/cycle.

Fig. 4-11 shows the temperature and throughput of TLAR with the same simulation setup. The average temperature of TLAR is 0.15°C higher than that of downward routing, which is small relative to the temperature variation. However, as shown in Fig. 4-11(b), the average throughput of TLAR is 25.4 flits/cycle, an improvement of 66% over downward routing. The slightly higher temperature of TLAR results from its larger throughput: TLAR delivers more packets, so the power is higher, and therefore the transient temperature is higher as well.


Fig. 4-10 (a) Temperature and numbers of throttled routers of downward routing. (b) Throughput of downward routing. (c) Statistics of downward routing.


Fig. 4-11 (a) Temperature and numbers of throttled routers of TLAR (b) Throughput of TLAR. (c) Statistics of TLAR.


4.3 Summary

In this chapter, we present the performance results of TLAR. In terms of traffic load distribution, the proposed TLAR balances the vertical load better than downward routing because more packets are routed laterally in their source layers. The throughput is also higher than that of downward routing for both fixed and non-fixed throttled regions. Finally, we simulate TLAR in a realistic run-time thermal management case. Although TLAR has a slightly higher temperature, its throughput is better than that of downward routing.


Chapter 5   Architecture Design for Transport-Layer Assisted Routing

To achieve successful data delivery with good performance, we have proposed the transport-layer assisted routing (TLAR) scheme. Here we propose a low-cost and low-latency architecture for TLAR.

5.1 Traditional Architecture and Dataflow

An NoC is composed of five layers [33]: the application layer, transport layer, network layer, data link layer, and physical layer. For architecture design, we focus on the transport layer and the network layer, because we use the transport layer to assist the network layer in making routing decisions, and we must consider the area overhead compared with the traditional design. The implementation is composed of a router, which transfers data hop by hop, and a network interface (NI), which implements the interface to the IP modules.

The traditional architecture is shown in Fig. 5-1. The network interface is located in the transport layer, as shown in Fig. 5-2.

The network interface (NI) is the component that converts between the packet-based communication of the NoC and the higher-level communication of the application layer; it sits between the application layer and the network layer. The NI packetizes messages from the application layer into packets and passes them to the network layer. The traditional NI and router in Fig. 5-1 support the conventional functions of a network interface but cannot support transport-layer assisted routing.


Fig. 5-1 Architecture of network interface and router in the traditional 3D NoC.

Routers receive packets from the network interface and choose a routing path toward the destination router. Routers transmit flits rather than packets, as shown in Fig. 5-2. Because the transport layer only provides packets to the network layer, routers must find the routing path to the destination hop by hop. Because the traditional NI and router cannot support the TLAR scheme, we propose a TLAR network interface and a dual-mode router architecture, which are discussed in the following sections.


Fig. 5-2 Data flow between application layer, transport layer and network layer.

5.2 Network Interface Design

5.2.1 Control Logic and FSM

The network interface resides in the transport layer, and it must support both the original functions of a network interface and the proposed transport-layer assisted routing. Because the traditional network interface cannot support the TLAR scheme, we propose the TLAR network interface shown in Fig. 5-3. The network interface is divided into four major parts:

• Baseline Datapath and Tx/Rx Queues (Tx/Rx): Tx handles messages from the application layer and packetizes the payloads into packets for the network layer. Conversely, Rx receives packets from the network layer, de-packetizes them, and reassembles them into messages for the application layer. Tx and Rx require data queues for storing payloads and packets, respectively.

• Topology Table (TT): This table stores 1-bit throttling information for each destination and is updated on each topology change. The application layer and the transport layer share this information. TT is required for all NSI-mesh networks to handle the cases in which (i) the source router is not serving and (ii) the destination router is not serving. Direct implementation of TT requires XYZ bits for an X-by-Y-by-Z 3D NoC.

• Routing Mode Memory (RMM): RMM reduces the timing overhead of checking the routing mode for each packet. The mode for each destination is checked once when the topology changes and is stored in RMM. Before a packet is injected into the network layer, the corresponding routing mode is queried from RMM. Direct implementation of RMM also requires XYZ bits for an X-by-Y-by-Z 3D NoC.

• Control Logic (CL): In the baseline NI, CL controls the functionality of Tx/Rx. For the TLAR network interface, CL also includes the TLAR routing mode checking and the controllers for reconfiguring the topology table. Finite-state machines (FSMs) are implemented in CL for timing and signal control.
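The interaction of the four parts on the transmit side can be sketched behaviorally. This is a software illustration under assumptions, not the hardware design: the class and method names are hypothetical, TT and RMM are modeled as dictionaries keyed by destination (the direct implementation), and a packet is modeled as a dictionary.

```python
from collections import deque
from enum import Enum


class Mode(Enum):
    LATERAL_FIRST = 0
    DOWNWARD = 1


class TlarTx:
    """Behavioral model of the Tx path of a TLAR network interface."""

    def __init__(self, topology_table, routing_modes):
        self.tt = topology_table      # TT: destination -> True if throttled
        self.rmm = routing_modes      # RMM: destination -> Mode
        self.held = deque()           # messages waiting for throttled destinations

    def send(self, dest, payload):
        """Packetize one message, or hold it if the destination is throttled."""
        if self.tt.get(dest, False):
            self.held.append((dest, payload))
            return None
        return {"dest": dest, "mode": self.rmm[dest], "payload": payload}

    def on_topology_change(self, topology_table, routing_modes):
        """Reload TT and RMM, then release messages that became deliverable."""
        self.tt, self.rmm = topology_table, routing_modes
        waiting, self.held = list(self.held), deque()
        packets = (self.send(dest, payload) for dest, payload in waiting)
        return [p for p in packets if p is not None]
```

Messages whose destinations are still throttled after a topology change are simply queued again by `send`, so they stay out of the network until they become deliverable.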

Fig. 5-3 Proposed architecture of TLAR.


5.2.2 Techniques of Memory Reduction

To limit the area overhead, we propose two memory reduction techniques. They are based on three characteristics of the NSI-mesh under TAVT-RTM: 1) TAVT never throttles a router in the bottom layer; 2) if a router is throttled, all the routers above it are throttled; 3) if a router is not throttled, all the routers below it are not throttled.

For the topology table (TT), if throttling could be applied to any router independently, 1 bit of information would be required for each destination in TT. Because of throttling characteristics (2) and (3), we only need to store, for each XY location, the topmost non-throttled layer, shown as the green nodes in Fig. 5-4(a). Therefore, the number of bits can be reduced from XYZ to XY·log2(Z), as shown in Fig. 5-4(b) and Fig. 5-4(c). For example, for a 3D NoC of size (N, M, K) in the (x, y, z) directions, the direct implementation needs N*M*K bits for the topology table (one bit per router to indicate whether it is throttled). By exploiting the throttling characteristics, we only need N*M*log2(K) bits. Some examples are shown in Table 5-1.


Fig. 5-4 (a) Reduce the size of topology table by storing the first non-throttled layer for each XY location. (b) Direct implementation. (c) Implementation with proposed TT reduction technique.

Table 5-1 Topology table comparison.

3D NoC (NxMxK)    Original TT, N*M*K (bits)    Improved TT, N*M*log2(K) (bits)    Reduction
8x8x2             128                          64                                 50%
8x8x4             256                          128                                50%
8x8x8             512                          192                                62.5%
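The table sizes above can be reproduced with a short sketch, which also shows how a reduced table could answer a throttling query. The names are hypothetical, and the sketch assumes layer index 0 is the bottom layer and that each XY location stores the index of its topmost non-throttled layer.

```python
def tt_bits_direct(n, m, k):
    """Direct TT: one bit per router."""
    return n * m * k


def tt_bits_reduced(n, m, k):
    """Reduced TT: ceil(log2(k)) bits per XY location encode a layer index."""
    return n * m * (k - 1).bit_length()


class ReducedTopologyTable:
    """Stores, for each (x, y), the topmost non-throttled layer."""

    def __init__(self, n, m, k):
        # Initially nothing is throttled, so the topmost active layer is k - 1.
        self.top_active = [[k - 1] * m for _ in range(n)]

    def throttle_from(self, x, y, lowest_throttled_layer):
        """Throttle the pillar at (x, y) from the given layer upward."""
        if lowest_throttled_layer < 1:
            raise ValueError("the bottom layer is never throttled")
        self.top_active[x][y] = lowest_throttled_layer - 1

    def is_throttled(self, x, y, z):
        return z > self.top_active[x][y]
```

For example, `tt_bits_direct(8, 8, 8)` and `tt_bits_reduced(8, 8, 8)` give the 512 and 192 bits listed for the 8x8x8 case.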

The other memory reduction concerns the routing mode memory (RMM). TLAR only requires XY bits to store the routing modes for all destinations. We use the example in Fig. 5-5(a) to illustrate the reason. Because the source-destination pairs (s,d0), (s,d1), (s,d2), and (s,d3) share the same lateral-first path in the source layer, their routing modes are identical. Therefore, CL can obtain the routing mode of the destination (x,y,z) by querying RMM for the entry at (x,y), as shown in Fig. 5-5(b). A direct implementation needs N*M*K bits to store the routing modes for an N*M*K 3D NoC, whereas the proposed technique needs only N*M bits. Some examples are shown in Table 5-2.
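A reduced RMM of this kind can be sketched as a bit vector indexed by (x, y) only. The class and method names are hypothetical, and the encoding (bit set means downward mode) is an assumption made for illustration.

```python
class RoutingModeMemory:
    """N*M-bit routing mode memory: one bit per XY location.

    Assumed encoding: bit = 1 means downward mode, bit = 0 means lateral-first.
    """

    def __init__(self, n, m):
        self.n, self.m = n, m
        self.bits = 0                      # N*M bits packed into one integer

    def size_in_bits(self):
        return self.n * self.m

    def set_downward(self, x, y, downward):
        mask = 1 << (x * self.m + y)
        self.bits = (self.bits | mask) if downward else (self.bits & ~mask)

    def is_downward(self, dest):
        x, y, _z = dest                    # the layer of the destination is ignored
        return bool((self.bits >> (x * self.m + y)) & 1)


rmm = RoutingModeMemory(8, 8)
rmm.set_downward(3, 4, True)
reduction = 1 - rmm.size_in_bits() / (8 * 8 * 8)   # versus N*M*K bits for 8x8x8
```

All destinations in the pillar at (3, 4) return the same mode, regardless of their layer, which is the property that allows the reduction from N*M*K to N*M bits.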

Fig. 5-5 (a) The source-destination pairs (s,d0), (s,d1), (s,d2), and (s,d3) have the same source layer for the lateral-first path, so their routing modes are identical. (b) For an X-by-Y-by-Z network, the size of RMM is XY bits.


3D NoC (NxMxK)    Original RMM, N*M*K (bits)    Improved RMM, N*M (bits)    Reduction
8x8x8             512                           64                          87.5%

5.3 Dual-Mode Router Design

The proposed dual-mode router is shown in Fig. 6. The router is based on wormhole flow control, which is broadly adopted in NoC routers for its low memory requirement. The router consists of five major functional modules: 1) routing computation logic (RC); 2) switch allocation logic (SA); 3) crossbar switch (CS); 4) input queues (IQs); and 5) inter-router physical channels (ICs and OCs). The router is two-stage pipelined, and deeper pipelining is possible for higher performance. Compared with a 2D router, the 3D router requires two extra physical channels for vertical connections. Consequently, the size of CS increases from 5×5 to 7×7, and the number of IQs increases from 5 to 7.

