Chapter 2 Related Works and Problem Description
2.5 Summary
We reviewed related work on thermal-aware 3D NoC and identified a new problem: the Non-Stationary Irregular Mesh (NSI-Mesh). We identified four problems in packet delivery and showed how they can be addressed in the transport layer and the network layer. Finally, we conclude that the transport layer and the network layer must work jointly to solve the problems caused by NSI-Mesh.
Chapter 3
Transport-Layer Assisted Routing
Transport-layer assisted routing is composed of the transport-layer assisted routing scheme and the dual-mode routing algorithms. The transport layer shares topology information with the network layer to achieve high performance in NSI-mesh. The network layer follows the initial routing decision provided by the transport layer and tries to balance the lateral traffic load. In this chapter, we introduce the proposed Transport-Layer Assisted Routing (TLAR) schemes and algorithms.
3.1 Operation Flow of Transport-Layer
The proposed operation flow of transport-layer assisted routing is as follows.
The 3D NoC system switches between the normal stage and the reconfiguration stage. In the normal stage, the 3D NoC works as an ordinary irregular or regular mesh network. In this stage, we assume that a distributed thermal sensing mechanism is embedded in the network so that each router can obtain its own temperature, and that each router has a timer for synchronizing the operation stages. After an N-cycle normal stage, the network enters the R-cycle reconfiguration stage. In the reconfiguration stage, the management and control operations are carried out that allow the 3D NoC to keep operating correctly in the next normal stage. Compared with the number of cycles in the normal stage, the number of cycles required for reconfiguration is very small. Here we assume the network operates at 1 GHz. In each 10 ms interval, 10^4 cycles are sufficient for each tile to reconfigure, and N is around 10^7. The reconfiguration stage occupies only 0.1% of the total available time, so its overhead is negligible. If the interval is 100 ms, the overhead is 0.01%, which is even smaller. The reconfiguration stage, shown in Fig. 3-1, consists of three sub-stages: (i) cleaning up and policy determination; (ii) synchronization of topology information; (iii) routing mode checking and throttling. The details are described as follows:
Fig. 3-1 Network states and operation stages in transforming topology for run-time thermal management.
(i) Cleaning up and policy determination: To make sure packets reach their destination routers successfully in the next normal stage, the network has to be cleaned up before the topology changes. In this stage, the packetization of payloads from the transport layer to the network layer is paused. As shown in Fig. 3-2, the payloads stay in the transmitter payload queue. We must not only stop transmitting packets from the transport layer to the network layer, but also deliver the packets that remain in the network layer, so the transmitter packet queue becomes empty after a short period of time. In the meantime, the distributed thermal-aware controller in each tile determines the throttling of the router within the tile for the next normal stage.
The thermal-aware management can be implemented in the transport layer controller or in the application layer as a software routine. No matter in which layer the policy is determined, the application layer and the transport layer share the control policy information of the tile. What matters is that the new throttling takes effect in the next normal stage, and we must guarantee that no packet remaining in the network layer is blocked in the next normal stage.
Fig. 3-2 Block diagram of transport layer in the tile of thermal-aware 3D NoC.
(ii) Synchronization of topology information: If throttling is triggered, every router in the 3D NoC must learn which routers are throttled and how the topology changes in the next normal stage. In this stage, all routers transmit packets containing their throttling information to all their upstream and downstream routers. Regardless of whether a router is fully throttled in the current normal stage, it is not throttled in this sub-stage. Because no router is throttled in this sub-stage, the network is a regular mesh in each layer. The topology table in Fig. 3-2 is shared by the application layer and the transport layer. The table holds one bit per router, representing the state of that router in the next normal stage. If a router is fully throttled in the next normal stage, the corresponding bit is set to inactive; otherwise the bit is active. The topology information is then synchronized to each tile. The mechanism for transmitting the throttling information is outside the scope of this work. Because throttling is triggered every 10 ms and the transmission of throttling information takes at most hundreds of cycles, the transmission stays well below 0.1% of the 10 ms interval. More than 99.9% of the time is therefore spent in normal operation, during which the routers hold correct throttling information and make correct routing selections. As illustrated in Fig. 3-3, the transmission of throttling information is therefore not a concern in NSI-mesh.
Fig. 3-3 Required time of transmitting throttling information.
(iii) Decisions of routing mode and throttling: In this stage, the throttling of routers is applied. If no router in the 3D NoC is throttled, routing is the same as in a regular mesh. When throttling is triggered, however, the transport layer must determine the routing mode for transmission toward each destination router. If the source router is throttled, the payload stays in the transmitter payload queue. If the source router is not throttled, transport-layer assisted routing (TLAR) is executed to check the routing mode for every destination router, ensuring that no packet is blocked by run-time thermal management. After TLAR is executed, the network returns to the normal stage, and packet injection resumes for the tiles whose routers are not throttled.
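The overhead figures quoted for the reconfiguration stage can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name and its parameters are illustrative assumptions and are not part of the proposed design.

```python
def reconfiguration_overhead(clock_hz, interval_ms, reconfig_cycles):
    """Return (normal_cycles, overhead_fraction) for one throttling interval.

    clock_hz        -- network clock frequency in Hz (1 GHz in the text)
    interval_ms     -- length of one interval in milliseconds
    reconfig_cycles -- cycles R reserved for the reconfiguration stage
    """
    total_cycles = clock_hz * interval_ms // 1000   # cycles in one interval
    normal_cycles = total_cycles - reconfig_cycles  # cycles N left for normal work
    return normal_cycles, reconfig_cycles / total_cycles


if __name__ == "__main__":
    for interval_ms in (10, 100):
        n, overhead = reconfiguration_overhead(10**9, interval_ms, 10**4)
        print(f"{interval_ms} ms interval: N = {n} cycles, overhead = {overhead:.2%}")
```

With a 1 GHz clock and R = 10^4 cycles, the sketch gives an overhead of 0.1% for a 10 ms interval and 0.01% for a 100 ms interval, matching the values stated above.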
3.2 Proposed Framework of Transport Layer Assisted Routing
To correctly select a path on which packet delivery succeeds, we propose the Transport Layer Assisted Routing (TLAR) scheme for packets whose source and destination routers are not fully throttled. The routing in TLAR is based on our previously proposed downward routing, which is a combination of vertical routing and lateral routing. The key idea of TLAR is that the throttling information in the transport layer is used to assist the selection of the layer for lateral routing and the decision of the routing algorithm in the network layer. The selection and decision results, which we define as the routing mode, are saved in the packet header when the packet is injected into the network layer, and the routers then route according to this mode.
Fig. 3-4 Framework of proposed transport layer assisted routing scheme. The determination of lateral routability relies on the throttling information in transport layer.
Fig. 3-4 shows the flow chart of path selection in TLAR. The lateral path toward each destination is checked during the reconfiguration stage in the transport layer above the source router. A packet going to a laterally routable destination is first routed laterally. Otherwise the downward path is selected because it is guaranteed to be routable. As shown in Fig. 3-2, the overhead of TLAR is the small memory that stores the checking results as routing modes. In normal operation, the transport layer controller reads the routing mode from the memory and sets the packetizer. The payload is then packetized with the routing mode specified in the header.
Fig. 3-5 Path selection of proposed transport layer assisted routing scheme.
In TLAR, a packet changes its z-location only when it is at the source or the destination xy-location, and the selection of the routing path depends on the relative vertical location of source and destination, as shown in Fig. 3-5. As mentioned before, the routers in the bottom layer are never throttled. Therefore, if the source and destination routers are not fully throttled, the vertical paths and the lateral path through the bottom layer are guaranteed to be routable. If there is no fully throttled router on the non-guaranteed path, TLAR chooses this path for lateral routing. Owing to the bandwidth required for downward routing, TLAR avoids choosing layers below the source router and above the bottom layer for lateral routing. Checking whether the lateral path is routable in these layers would also multiply the computation overhead of path selection. Any lateral path located above the source router is forbidden owing to the limitation of the turn model. As proved in our previous work [33], the combined routing is deadlock-free if the lateral routing is deadlock-free and the {UN, UE, US, UW} turns are removed.
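The turn restriction can be illustrated with a small checker. This is a sketch under the assumption that a turn such as "UN" denotes an Up hop immediately followed by a North hop; the direction labels, the function name, and the example paths are illustrative only.

```python
LATERAL = {"N", "E", "S", "W"}


def has_forbidden_turn(hops):
    """Return True if an Up hop is immediately followed by a lateral hop.

    hops -- list of direction labels drawn from N, E, S, W, U, D.
    Assumption: the removed turns {UN, UE, US, UW} are exactly these pairs.
    """
    return any(prev == "U" and cur in LATERAL
               for prev, cur in zip(hops, hops[1:]))


# Lateral routing in the source layer, then a vertical move at the destination.
lateral_first = ["E", "E", "N", "U"]
# Down at the source, lateral routing in the bottom layer, up at the destination.
downward = ["D", "D", "E", "N", "U", "U"]
# Lateral routing in a layer above the source requires an Up-then-lateral turn.
above_source = ["U", "E", "E", "D"]

if __name__ == "__main__":
    for name, hops in [("lateral-first", lateral_first),
                       ("downward", downward),
                       ("above source", above_source)]:
        print(name, "forbidden" if has_forbidden_turn(hops) else "allowed")
```

Under this reading, both path shapes used by TLAR avoid the removed turns, while a lateral path above the source layer cannot, which is consistent with such paths being forbidden.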
3.3 Proposed Algorithm: Downward-Lateral Deterministic Routing (DLDR)
The proposed 3D routing algorithm in TLAR, Downward-Lateral Deterministic Routing (DLDR), combines downward routing with a deterministic routing. The downward routing moves packets up and down in the vertical direction, and the lateral deterministic routing routes packets in the lateral direction. The path diversity is two because the lateral routing can take place in either the source layer or the bottom layer. To reduce the computational complexity of checking routability, we adopt XY routing, a dimension-ordered routing (DOR), as the deterministic routing.
Fig. 3-6 (a) TLAR examples, and (b) checking dependency.
An example of routability is shown in Fig. 3-6(a). There are three kinds of destination routers. First, the gray blocks are throttled destinations. Messages toward these destinations are kept in the message queue until the destinations become routable; otherwise, the packets would be blocked in the network because the destination router is not active. Second, the white blocks are destinations that are routable with XY routing; an example path is shown by the green line. Third, the pale blue blocks are destinations that are only downward-routable; an example downward path is shown by the dotted line. In conclusion, if the path is lateral-first routable, the packet first traverses the lateral path in the source layer and then goes up or down to the destination router. Otherwise, the downward path is the only path, so the packet first descends to the bottom layer, is routed laterally in the bottom layer, and then goes up to the destination router.
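The two path shapes described above can be sketched as a hop-by-hop path builder. This is an illustrative sketch only: it assumes coordinates of the form (x, y, z) with the bottom layer at z = 0, and the function names are hypothetical rather than taken from the simulator.

```python
def _step_toward(value, target):
    """Move one hop from value toward target along one axis."""
    return value + (1 if target > value else -1)


def dldr_path(src, dst, lateral_first, bottom_z=0):
    """Return the list of (x, y, z) routers visited from src to dst.

    lateral_first=True  -- XY routing in the source layer, then vertical.
    lateral_first=False -- descend to the bottom layer, XY routing there,
                           then ascend at the destination xy-location.
    Assumption: the bottom layer is z == bottom_z (0 by default).
    """
    x, y, z = src
    dx, dy, dz = dst
    lateral_z = z if lateral_first else bottom_z
    path = [(x, y, z)]
    while z != lateral_z:            # vertical move at the source xy-location
        z = _step_toward(z, lateral_z)
        path.append((x, y, z))
    while x != dx:                   # X dimension first (XY routing)
        x = _step_toward(x, dx)
        path.append((x, y, z))
    while y != dy:                   # then Y dimension
        y = _step_toward(y, dy)
        path.append((x, y, z))
    while z != dz:                   # vertical move at the destination xy-location
        z = _step_toward(z, dz)
        path.append((x, y, z))
    return path


if __name__ == "__main__":
    print(dldr_path((1, 1, 2), (3, 2, 3), lateral_first=True))
    print(dldr_path((1, 1, 2), (3, 2, 3), lateral_first=False))
```

In both modes the z-coordinate changes only at the source or destination xy-location, which is the property the path selection relies on.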
When the topology changes, the routing mode must be decided again for each destination, and the decisions are saved in the network interface. The controller in the transport layer checks, based on the topology table, whether there is any fully throttled router on the paths. The routability of all destinations in the source layer can be checked in an incremental breadth-first search (BFS) style, as shown in Fig. 3-6(b). The dependency is based on XY routing: a node is routable only if its previous node on the path is also routable. For an 8x8x4 network, the operation completes in 63 cycles.
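The incremental check can be sketched in software as follows. This is a minimal illustration, not the hardware controller: it assumes one check per destination, a topology table stored as a 2D list of booleans for the source layer, and XY routing that resolves the X dimension before the Y dimension. All names are hypothetical.

```python
def lateral_routable_map(active, src):
    """Mark which destinations in the source layer are reachable by XY routing.

    active -- active[y][x] is True when the router at (x, y) is not fully
              throttled (one bit per router, as in the topology table).
    src    -- (x, y) location of the source router.
    Returns (routable, checks), where checks counts the destinations examined.
    """
    height, width = len(active), len(active[0])
    sx, sy = src
    routable = [[False] * width for _ in range(height)]
    routable[sy][sx] = active[sy][sx]
    checks = 0
    # X phase: walk along the source row, away from the source in both directions.
    for step in (1, -1):
        x = sx + step
        while 0 <= x < width:
            # A node is routable only if it is active and its predecessor is routable.
            routable[sy][x] = active[sy][x] and routable[sy][x - step]
            checks += 1
            x += step
    # Y phase: in every column, walk away from the source row in both directions.
    for x in range(width):
        for step in (1, -1):
            y = sy + step
            while 0 <= y < height:
                routable[y][x] = active[y][x] and routable[y - step][x]
                checks += 1
                y += step
    return routable, checks


if __name__ == "__main__":
    layer = [[True] * 8 for _ in range(8)]
    layer[3][4] = False  # hypothetical fully throttled router at (x=4, y=3)
    routable, checks = lateral_routable_map(layer, src=(1, 3))
    print("destinations checked:", checks)
    for row in routable:
        print("".join("R" if r else "." for r in row))
```

For an 8x8 layer the sketch examines 63 destinations, one for every router other than the source, which agrees with the 63-cycle figure if one destination is checked per cycle.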
3.4 Summary
In this chapter, we propose the Transport-Layer Assisted Routing (TLAR) scheme. TLAR ensures that packets reach their destination routers successfully; the operation flow and the routability checking of TLAR are also described. Based on TLAR, we propose DLDR, which sends packets with either XYZ routing or downward routing.
Chapter 4
Performance Evaluation
4.1 Setting of Simulation Environments
Currently there is no real chip implementation of a 3D NoC system, so we start by modeling the 2D NoC system implemented in [30] and stacking it into multiple layers. For network simulation, we start from Noxim [28] and extend it to the third dimension. For temperature simulation, we use HotSpot [29], and we adopt the tile geometry and power model of Intel's 80-core processor [30]. We first add models of the basic 3D router and the Dimensionally-Decomposed (DimDe) router [23], and modify Noxim to generate the 3D NoC architecture and the floorplan from user-defined dimension parameters. During network traffic simulation, a power trace is generated based on the power model of the NoC. The power trace and the physical floorplan are input to the thermal model. Fig. 4-1 shows the construction of the co-simulation model and Fig. 4-2 shows the floorplan we used.
We adopt basic wormhole flow control and use random arbitration for switch allocation. We construct an 8×8×4 3D NoC, and the packet length is chosen randomly from 2 to 10 flits. The queue depth of each input channel is 16 flits, and the link-level flow control protocol is a full handshake with request and acknowledgment. Because TSVs generally have high bandwidth, a crossbar-based vertical connection is assumed in our 3D router [10]. For each tile in the NoC, the tile area is 2.0 mm × 1.5 mm and the router area is 0.65 mm × 0.53 mm.
Fig. 4-1 Framework of co-simulation platform.
Fig. 4-2 Construction of the model of a 4x4x4 3D NoC with simplified tile model from [30].
[Fig. 4-1 block labels: Network Model: Noxim (3D); Power Model: Intel 80-core NoC; Thermal Model: HotSpot (3D); signals: traffic activity, power trace, temperature]
To keep the performance indices representative and the comparison as fair as possible, several modifications of the simulator are required for modeling TLAR. In Noxim [28], the statistics of received packet count, packet latency, and network throughput are based on the packets received during the simulation period while the network is assumed stable. The payloads toward fully throttled destinations are held in the transport layer and not packetized; only the deliverable payloads are packetized and injected into the network. Originally, the injection rate is simulated by generating packets with Poisson arrivals for a given traffic distribution of destinations over the network. In our setting, the injection rate of each active router is kept at the specified value by skipping packets addressed to throttled destinations and generating replacement packets toward non-throttled destinations. Because we assume the application layer and the transport layer share the topology information, the packet injection process of a throttled router is paused until the router is no longer throttled. In this setting, the total injection rate of the network is obtained by multiplying the injection rate by the number of active routers, and the statistics of the performance indices are not affected by throttling.
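The relation between the per-router injection rate and the total network injection rate can be written out as a short sketch. The function names and the per-router rate used in the example are illustrative assumptions; only the mesh size and the throttled-router counts come from the experiments in this chapter.

```python
def active_router_count(mesh_dims, throttled_count):
    """Number of routers that still inject packets in an X x Y x Z mesh."""
    x, y, z = mesh_dims
    return x * y * z - throttled_count


def total_injection_rate(per_router_rate, mesh_dims, throttled_count):
    """Total network injection rate = per-router rate * number of active routers."""
    return per_router_rate * active_router_count(mesh_dims, throttled_count)


if __name__ == "__main__":
    mesh = (8, 8, 4)
    rate = 0.01  # hypothetical per-router injection rate in flits/cycle
    for throttled in (1, 24):
        print(throttled, "throttled:",
              active_router_count(mesh, throttled), "active routers,",
              "total rate", total_injection_rate(rate, mesh, throttled))
```

For the 8x8x4 network, one throttled router leaves 255 active routers and 24 throttled routers leave 232, the same counts reported with the throughput results in Section 4.2.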
In this chapter we show the performance of the proposed TLAR algorithms.
We use two cases of vertical throttling: (a) one throttled router, and (b) two 2x2 throttled regions. In case (a), the single throttled router is located in the center of the topmost layer of the 8x8x4 network. In case (b), eight 1x1x3 pillars (two 2x2x3 regions, 24 routers in total) are throttled on the diagonal of the upper three layers of the 8x8x4 network. Both cases are shown in Fig. 4-3.
Fig. 4-3 Throttling cases: (a) one router; (b) two 2x2x3 regions.
4.2 Traffic balance and rate of transmitting packet under different routings
First we use the statistical traffic load distribution (STLD) [34] and the decision distribution to show the network loading. All experiments in Fig. 4-4 and Fig. 4-5 use the same injection rate, chosen so that the average latency of TLAR-DLDR is twice the zero-load latency.
Fig. 4-4(a) shows the STLD of the baseline downward routing and Fig. 4-4(b) shows that of the TLAR-DLDR algorithm, which combines downward routing with TLAR. Although only one router is throttled in the network, some packets have to be routed downward through the bottom layer. With the proposed DLDR, the congestion in the bottom layer is reduced and the load of the network is more balanced. The traffic in the upper layers is also more balanced because the congestion in the bottom layer is relieved. Fig. 4-4(c) shows the distribution of routing mode decisions. In the DLDR scheme, 80% of the packets are routed on the deterministic paths in their source layers and 20% are routed in the downward mode. This shows that the proposed DLDR reduces the share of downward routing.
Fig. 4-4 (a) Statistical traffic load distribution (STLD) of conventional design; (b) STLD of proposed TLAR framework with DLDR algorithm; (c) distribution of routing mode decisions.
Fig. 4-5 shows similar results when we increase the number of disconnected throttled routers and the size of the region. We increase the number of throttled routers from 1 to 24. The congestion of downward routing in the bottom layer is still higher than that of DLDR, because downward routing cannot balance the load of the bottom layer. The packets in the upper layers are more balanced because the congestion in the bottom layer is relieved by the proposed DLDR algorithm. Fig. 4-5(c) shows the distribution of routing mode decisions. In the DLDR scheme, 40% of the packets are routed on the deterministic paths in their source layers, and 60% are routed in the downward mode. The DLDR algorithm has fewer packets choosing the downward mode, so DLDR is more vertically balanced than downward routing.
[Routing mode decision chart: vertical axis 0% to 100%; legend: XY (Lateral-First), Downward (Z-First)]
Fig. 4-5 (a) Statistical traffic load distribution (STLD) of conventional design; (b) STLD of proposed TLAR framework with DLDR algorithm; (c) distribution of routing mode decisions.
Table 4-1 shows the statistics of the statistical traffic load distribution (STLD). As can be seen, the mean packet number increases with the TLAR scheme, and both the total and inter-layer standard deviations are reduced by the DLDR algorithm. These statistics correspond to the performance simulations shown in Fig. 4-6. In the case of one throttled router, 255 active routers inject packets into the network. For the case in Fig. 4-3(b), only 232 routers are active. These active routers can transmit packets to and receive packets from the network. Because of the more balanced network load, DLDR performs better than the baseline downward routing in both the one-router and the two 2x2x3 throttling cases. The throughput in Fig. 4-6(a) is improved by 95% by adopting the DLDR algorithm. In Fig. 4-6(b), the throughput is improved by 70%. The proposed DLDR algorithm thus outperforms downward routing.
Table 4-1 Statistics of statistical traffic load distribution.
[Column/panel titles: One Throttled Router; Two 2x2x3 Throttled Pillars]
Fig. 4-6 Average latency vs injection rate with (a) one router throttled and (b) two 2x2x3 pillars throttled.
4.2.1 Network Sustainability and Degree of Graceful Degradation
Here we show the network sustainability and the degree of graceful degradation. Network sustainability describes the total throughput provided by the network while some parts of the network are not working. If a router is throttled, it cannot provide bandwidth for packet delivery. With higher network sustainability, the 3D NoC can provide larger throughput when some routers are fully throttled. Because we cannot simulate all throttling cases, we simulate extreme cases ranging from one throttled router to a 7x7x3 throttled region, as shown in Fig. 4-7. As shown in Fig. 4-8, all algorithms degrade as the size of the throttled region increases. Here we want to observe the throughput degradation under the different vertical load balancing of the irregular topologies that occur in NSI-mesh. We start from the 1x1x1 throttling case and extend to the 7x7x3 throttling case. All throttled regions are located in the center of the xy-plane. In all cases, the DLDR algorithm performs better than the conventional reactive downward routing. Compared with the conventional reactive downward routing, the proposed TLAR-DLDR improves the sustainable throughput by 48% to 85.5%.
Fig. 4-7 Different throttling cases: 1x1x1, 1x1x2, 1x1x3, 2x2x3, 4x4x3, 6x6x3, 7x7x3.
[Plot axes: network throughput (flits/cycle) vs. size of throttled region (1x1x1, 1x1x2, 1x1x3, 2x2x3, 4x4x3, 6x6x3, 7x7x3); series: Downward, TLAR-DLDR]
Fig. 4-8 Network sustainability of NSI-mesh 3D NoC, uniform traffic offered.
4.2.2 Run-Time Temperature and Throughput
Here we show a realistic case that simulates TLAR under run-time thermal management (RTM). The simulation setup is the same; the only difference is that the locations of the throttled routers are not fixed in the simulation. Thermal-Aware Vertical Throttling (TAVT) is adopted in RTM to throttle overheated routers, and TLAR detects the throttled routers and detours around them. As shown in Fig. 4-9, the total simulation cycle in