
CHAPTER 3 MODELLING OF BIDIRECTIONAL-CHANNEL NOC

3.3 Time Division Multiplexing on BiNoC

3.3.3 Fluidity Mechanism in TDM-BiNoC

Conventional congestion control schemes use buffer fill-level information, which indicates how full a buffer is. The fluidity concept was presented by Tsai et al. in [22]. Intuitively, fluidity reflects the degree of congestion in a buffer. Beyond the buffer fill level, the fluidity concept indicates not only how full a buffer is but also how smoothly packets flow through it. In other words, a buffer containing many packets can still pass packets at a high rate, and the fluidity concept can represent such a case.

The fluidity of a buffer is tracked by the finite state machine (FSM) shown in Fig. 3-11, whose states are Inactive, Fluid0, Fluid1, and Nonfluid. The states are drawn in different color depths: the deeper the color, the less fluid the buffer is. In the Inactive state, the buffer is empty, so any incoming flit of a packet can pass through the buffer quickly. Once the buffer is no longer empty, the state changes to Fluid0. Fluid0 changes to a less fluid state, Fluid1, if no flit leaves the buffer during a predefined period. The transition from Fluid1 to Nonfluid is similar. Because the current state changes according to the behavior of the buffer, this FSM reflects the fluidity of the buffer. All three non-empty states (Fluid0, Fluid1, and Nonfluid) return to Inactive when all flits have left the buffer.

Fig. 3-11. Finite State Machine of Fluidity.
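The state transitions described above can be sketched in a few lines of Python. This is only an illustrative model, not the thesis implementation: the class name `FluidityFSM`, the `timeout` parameter (standing in for the predefined period), and the choice to keep the current state while restarting the timer when a flit leaves are all assumptions, since the text does not specify them.

```python
from enum import IntEnum


class Fluidity(IntEnum):
    # Larger value = less fluid buffer (numeric encoding is an assumption).
    INACTIVE = 0
    FLUID0 = 1
    FLUID1 = 2
    NONFLUID = 3


class FluidityFSM:
    def __init__(self, timeout=4):
        self.timeout = timeout          # hypothetical "pre-defined period" in cycles
        self.state = Fluidity.INACTIVE
        self.stalled = 0                # cycles since a flit last left the buffer

    def step(self, occupancy, flit_left):
        """Advance one clock cycle and return the new state."""
        if occupancy == 0:
            # All flits have passed out: every state returns to Inactive.
            self.state = Fluidity.INACTIVE
            self.stalled = 0
        elif self.state == Fluidity.INACTIVE:
            # Buffer became non-empty.
            self.state = Fluidity.FLUID0
            self.stalled = 0
        elif flit_left:
            # Assumption: a departing flit restarts the timer, state is kept.
            self.stalled = 0
        else:
            self.stalled += 1
            if self.stalled >= self.timeout and self.state < Fluidity.NONFLUID:
                self.state = Fluidity(self.state + 1)
                self.stalled = 0
        return self.state
```

With `timeout=2`, a buffer that holds flits but never forwards one moves from Fluid0 to Fluid1 and then to Nonfluid, and drops back to Inactive as soon as it is empty.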

Our TDM-BiNoC uses fluidity as the index for allocating time units to each router. A router with lower fluidity acquires more time units for delivery.

However, as shown in Fig. 3-12, packets from a low-fluidity router may be blocked downstream and become non-fluid again, so the extra time units allocated to those packets are wasted. Therefore, we modify the arbitration in the crossbar to support our TDM-BiNoC: the arbiter in a crossbar gives a router with lower fluidity a higher priority. By doing this, we can make packets flow smoothly in the BiNoC.

Fig. 3-12. Stuck Problem in TDM-BiNoC

The aim of both BI-Routing and TDM-BiNoC is to improve load balance in a communication network, but they pursue this goal in different ways. BI-Routing has more routing paths to select from, thus improving balance across the whole chip, whereas TDM-BiNoC addresses router-to-router load balance. Unfortunately, it is hard to combine BI-Routing with TDM-BiNoC, because both exploit the same bidirectional characteristic of a channel. They cannot cooperate with each other, and a bidirectional channel cannot afford over-exploitation. The detailed simulation results will be given in Chapter 5.

CHAPTER 4 ROUTER ARCHITECTURE

An NoC is composed of several basic nodes. Fig. 4-1 shows a basic node in BiNoC, where data generated by an IP are transformed into packets and sent to the Network Interface. On-chip communication is then carried out by the On-Chip Router. We present our router architecture in this chapter. The original BiNoC router [11], composed of several basic components, is presented in Section 4.1. Two versions of this router, modified for BI-Routing and TDM-BiNoC, are presented in Section 4.2. In Chapter 5, we show the simulation results of BiNoC with 64 routers.

Fig. 4-1. A Node in Bidirectional Network-on-Chip.

4.1 BiNoC Router Architecture

The block diagram of a BiNoC router is shown in Fig. 4-2. Packets conveyed in the datapath are controlled by the control blocks. The datapath is composed of an InOut Buffer, an Input Buffer Unit, and a Crossbar. The control blocks contain a Routing Computation Unit, a Switch Allocator, a Request Manager, and a Channel Controller.

In addition, we use registers in our design to protect the router from glitches and to pipeline the router. Pipelining shortens the critical path of a router design and improves throughput. We use a pipeline of three stages in our design: routing computation, switch arbitration, and flit transmission. We do not use virtual channels in our router, in order to save area, lower power consumption, and reduce latency.

Fig. 4-2. Block Diagram of BiNoC Router.

anarchy that the output data may influence the input blocks in the neighboring router.

An InOut Buffer, as shown in Fig. 4-3, solves this problem. The InOut Buffer is composed of two tri-state buffers. Either the output enable or the input enable can be asserted, which controls how the bidirectional channel is connected to the router.

Fig. 4-3. InOut Buffer Block (signals: Input Data, Output Data, Dir_select).

Buffers can be organized as a centralized buffer, as independent buffers at the input ports, or as independent buffers at the output ports. In this work, we use an independent buffer at each input port, each of which is a First-In-First-Out buffer composed of shift registers.

4.1.2 Routing Computation Unit

We can implement a routing computation unit with two kinds of mechanisms: table-based routing and algorithm-based routing. In table-based routing, routing decisions for a packet are obtained from a look-up table, located either at the source node or at each node along the route, depending on the application. The major advantage of table-based routing is its generality: source-table routing computes the route of a packet only once, while node-table routing is more appropriate for adaptive routing. A routing table can support any routing relation and is suitable for any topology, simply by reprogramming the contents of the table.

Algorithmic routing implements the routing algorithm as a combinational logic circuit dedicated to the routing strategy and topology. Fig. 4-4 shows the algorithmic routing mechanism implementing an XY routing algorithm. This architecture uses six comparators and one direction selector. For every header flit, the relative location of the current router and the destination router is recomputed by comparing the destination router ID with the current router ID, and the Direction Selector then picks an optimal direction. Because an NoC demands low area overhead, we use an algorithmic routing mechanism instead of a table-based routing mechanism in our Routing Computation Unit.

Fig. 4-4. Algorithmic XY Routing Mechanism.
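As a rough software analogue of the comparator-plus-selector structure, the following Python sketch shows dimension-ordered XY routing on a mesh. The function name `xy_route`, the `(x, y)` coordinate convention, and the direction labels are assumptions made for illustration; the hardware in Fig. 4-4 may encode router IDs and directions differently.

```python
def xy_route(current, destination):
    """Return the output direction for a header flit under XY routing.

    Both arguments are (x, y) mesh coordinates. The X offset is resolved
    first; Y is considered only once the packet is in the right column.
    Assumption: x grows to the EAST and y grows to the NORTH.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # arrived: deliver to the attached IP
```

For example, a header flit at router (1, 1) destined for (3, 0) is sent EAST until it reaches column 3, and only then turns SOUTH.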

4.1.3 Request Manager and Channel Controller

As mentioned in Section 2-1, BiNoC is a request-based design. If a router requests to use a channel with low priority, it must make sure that the downstream router does not need to use it. All requests are sent by the Request Manager. The concept of the Request Manager is very simple: it sends a request to the low-priority channel if the router has more than one packet to deliver; otherwise, it sends a request only to the high-priority channel. Our Channel Controller is implemented with a high-priority and a low-priority channel-control FSM, as shown in Fig. 4-5. Both FSMs have three states: wait, free, and idle.

• Idle state: The channel cannot deliver data, and it is being used to receive data.

• Wait state: An intermediate state from the idle state to the free state.

Fig. 4-5. Channel-Control FSMs: (a) High-Priority FSM and (b) Low-Priority FSM.

A bidirectional channel is controlled by a high-priority FSM and a low-priority FSM. The two FSMs coordinate with each other so that only one direction is used at a time. In other words, the two FSMs are never both in the free state. If one FSM is in the free state, the other is in the idle or the wait state, depending on the transition condition. The high-priority FSM uses the free state as its default state. If the neighboring router wants to deliver a packet, input_req is asserted, and if there is no channel request in this router, the FSM moves to the idle state. Then, if there is a channel request in this router, the FSM moves to the wait state right away. After waiting two clock cycles for the neighboring FSM to complete its operation, the FSM in this router moves to the free state. For the low-priority FSM, the transition condition is stricter: the wait state in the low-priority FSM may return to the idle state whenever an input request occurs.
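The high-priority transitions described in this paragraph can be modelled as follows. This is a simplified sketch under stated assumptions: the class name `HighPriorityChannelFSM`, the string state names, and the cycle-counting details are hypothetical, and the low-priority FSM (which may fall back from wait to idle) is not modelled.

```python
class HighPriorityChannelFSM:
    WAIT_CYCLES = 2  # cycles spent in wait, as stated in the text

    def __init__(self):
        self.state = "free"   # default state of the high-priority FSM
        self.wait_count = 0

    def step(self, input_req, channel_req):
        """Advance one clock cycle.

        input_req:   the neighboring router asks to send on this channel.
        channel_req: this router itself wants to send on this channel.
        """
        if self.state == "free":
            # Yield the channel only when we have nothing to send.
            if input_req and not channel_req:
                self.state = "idle"
        elif self.state == "idle":
            # A local request starts reclaiming the channel right away.
            if channel_req:
                self.state = "wait"
                self.wait_count = 0
        else:  # "wait"
            self.wait_count += 1
            if self.wait_count >= self.WAIT_CYCLES:
                self.state = "free"
        return self.state
```

Starting from free, an input request with no local request yields the channel (idle); a later local request moves the FSM to wait, and two cycles afterwards it is free again.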

4.1.4 Switch Allocator and Crossbar

After a routing direction is determined, packets contend for the channel. Therefore, we need an arbiter to allocate the channel bandwidth to the requesters. We use ten arbiters in our Switch Allocator block. All requests to an output channel are arbitrated by one arbiter, and these requests may be masked by the channel-available signal. The arbiters are implemented as matrix arbiters, as shown in Fig. 4-6. A matrix arbiter implements a least-recently-served priority scheme by updating a triangular array of state bits $w_{ij}$ for $i < j$. The state bit in row $i$ and column $j$ indicates that request $i$ takes priority over request $j$. We only update the upper triangular portion of the matrix, because each value in the lower triangular portion is just the inverse of the corresponding upper one. After a request is granted, the bits in its row are cleared and the bits in its column are set, which gives that request the lowest priority since it was the most recently served. Notice that not all combinations of state-bit values are legal. For example, if $w_{01} = w_{12} = 1$ and $w_{02} = 0$ and requests 0, 1, and 2 are all asserted, the requests will disable each other. A matrix arbiter is easy and inexpensive to implement, and it provides strong fairness [6].

Fig. 4-6. Architecture of Matrix Arbiter.
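The least-recently-served behavior of a matrix arbiter can be illustrated with a small Python model. It is a behavioral sketch, not the gate-level design of Fig. 4-6: the class name `MatrixArbiter` is hypothetical, the full matrix is stored instead of only the upper triangle for readability, and the initial priority order (lower index first) is an assumption.

```python
class MatrixArbiter:
    def __init__(self, n):
        self.n = n
        # w[i][j] is True when request i currently has priority over request j.
        # Assumed initial order: lower index = higher priority (a legal state).
        self.w = [[i < j for j in range(n)] for i in range(n)]

    def arbitrate(self, requests):
        """Grant one asserted request; return its index, or None if none."""
        for i in range(self.n):
            if not requests[i]:
                continue
            # Request i is disabled if any other asserted request beats it.
            beaten = any(
                requests[j] and self.w[j][i] for j in range(self.n) if j != i
            )
            if not beaten:
                self._mark_served(i)
                return i
        return None

    def _mark_served(self, i):
        # Clear row i and set column i: request i becomes lowest priority.
        for j in range(self.n):
            if j != i:
                self.w[i][j] = False
                self.w[j][i] = True
```

When three requesters keep all their requests asserted, the grants rotate through 0, 1, 2, 0, and so on, which is the fairness property the text refers to.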

We need a Crossbar to connect every input buffer to every output channel so that flits can be switched. In this work, we implement the Crossbar with multiplexers, as shown in Fig. 4-7. The crossbar consists of ten 5-to-1 multiplexers, where each multiplexer corresponds to an output port. The select signals come from the Switch Allocator. Although a BiNoC crossbar has twice as many inputs as that of a conventional NoC, its area overhead is acceptable. However, for n inputs, an n² area is needed to contain the n² crosspoints, and another n² area is needed to hold n multiplexers. This is another reason why we do not use virtual channels: adding more input buffers in each direction of the channel would make the cost of the crossbar too high. Our router must be small enough to fit in an NoC.

Fig. 4-7. An Example of 5x5 Multiplexer-Based Crossbar.

4.2 Implementations of BI-Routing and TDM-BiNoC Routers

We need to modify the Routing Computation Unit, the Switch Allocator, and the Request Manager to implement a BI-Routing router, as shown in Fig. 4-8. We replace the algorithmic OE-routing in the Routing Computation Unit with our BI-Routing algorithm, and we modify the selection function as follows. We prefer directions that do not involve Rule 1. Only after all the channels of the selected direction are occupied do we use a direction that applies Rule 1. The Routing Computation Unit outputs a routing direction and a signal called reverse_channel. This signal is asserted when the result of the Routing Computation Unit requires a reverse channel. It is transmitted to the Switch Allocator, so that the packets can be delivered via this reverse channel, and to the downstream router, in order to acquire the reverse channel.

We present the implementation of our TDM-BiNoC router in the following paragraphs. Compared with the TDM-BiNoC router, the overhead of the BI-Routing router is lower. More discussion and results will be presented in Chapter 5.

Fig. 4-8. Block Diagram of BI-Routing Router.

We build our TDM-BiNoC router on the architecture shown in Fig. 4-8. The Input Buffer Unit is equipped with a fluidity FSM to monitor the buffer, as mentioned in Section 3.3.3. The FSM passes the fluidity information to a directional fluidity generator (DF-Generator). The DF-Generator takes both the fluidity information from the Input Buffer Unit and the result from the Routing Computation Unit as inputs, and computes the directional fluidity. Here the directional fluidity is defined as the total fluidity of the packets that would be delivered to each direction of the channel. The directional fluidity is passed on to every neighboring router. Therefore, given such directional fluidity information, a router has both its local congestion information and its neighboring routers' congestion information for each direction of the channel. A Request Controller is added to implement our TDM concept. The Request Controller arbitrates the access time slots to control the direction of a channel based on the directional fluidity information.

The Request Controller cooperates with the Request Controller in the neighboring router. Thus, the channel can switch direction according to the congestion conditions. We show the block diagram of the TDM-BiNoC router in Fig. 4-9 and its router interface in Fig. 4-10.
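The definition of directional fluidity as a per-direction total can be illustrated with a short Python sketch. The function name `directional_fluidity`, the direction labels, and the integer encoding of fluidity levels (0 for Inactive up to 3 for Nonfluid) are assumptions for illustration only; the thesis does not state how the levels are encoded or how wide the signal is.

```python
DIRECTIONS = ("NORTH", "EAST", "SOUTH", "WEST", "LOCAL")


def directional_fluidity(buffers):
    """Sum the fluidity level of each input buffer per requested direction.

    `buffers` is an iterable of (fluidity_level, requested_direction) pairs:
    the level comes from the buffer's fluidity FSM, the direction from the
    Routing Computation Unit. A direction of None means the buffer has no
    packet waiting to be routed.
    """
    totals = {direction: 0 for direction in DIRECTIONS}
    for level, direction in buffers:
        if direction is not None:
            totals[direction] += level
    return totals
```

For instance, two buffers with levels 3 and 1 that both request EAST give an EAST total of 4, which a router could then report to its eastern neighbor.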

Fig. 4-9. Block Diagram of TDM-BiNoC Router.

Fig. 4-10. Router Interface in TDM-BiNoC

In the Router Interface of the TDM-BiNoC, as shown in Fig. 4-10, a Directional Fluidity signal is added to control the communication between each pair of neighboring routers. The wiring area increases significantly, because this cost is incurred in every direction of a channel in a router. However, we can tune the width of the directional fluidity signal in the design to obtain the best performance. We will show the detailed area overhead in Chapter 5.

CHAPTER 5 EXPERIMENTAL RESULTS

In order to compare the performance and hardware cost of several routing algorithms with those of our bidirectional routing algorithm, we implemented a cycle-accurate simulation environment in the Verilog hardware description language [23]. We simulated the routing algorithms in an 8x8 mesh NoC composed of 64 routers, without considering the effect of the processing elements.

In the following sections, we first introduce the performance evaluation, synthetic traffic patterns, and real traffic patterns in Section 5.1. Simulation results with several routing algorithms are given in Section 5.2, and simulation results with TDM-BiNoC in Section 5.3. Implementation overhead is presented in Section 5.4.

5.1 Background of Network Simulation

We introduce the background of network simulation in this section. First, performance evaluation is presented in Subsection 5.1.1. Then, traffic patterns are presented in Subsection 5.1.2.

5.1.1 Performance of Interconnection Networks

Performance of an interconnection network can be described by its latency vs. injection rate curve, as shown in Fig. 5-1. At a low injection rate, latency is close to the zero-load latency, $T_0$. Zero-load latency is the latency of a packet that never contends for network resources with other packets. As the injection rate increases, latency goes to infinity at the saturation throughput, $\lambda_s$. The saturation throughput can be slightly less than the bound $\Theta_R$ if a flow control method is applied to the network, but it is limited by the topology bound of $2B_C/N$, and no routing algorithm $R$ can exceed this limit.

Fig. 5-1. Latency vs. Injection Rate Curve [6].

Although the above latency graph reveals the performance of a network at its extremes, it does not show the average behavior of the network. The throughput of a network as shown in Fig. 5-2 is linear at low injection rate until the injection rate reaches a saturation point. Using this information, we can calculate the average behavior of a network.

Fig. 5-2. Throughput vs. Injection Rate Curve [6].

5.1.2 Synthetic Traffic Patterns and Real Traffic Patterns

Our simulation environment comprised an 8x8 mesh array that can handle both synthetic traffic patterns [24] and real traffic patterns [25] as the simulation inputs.

Three types of synthetic traffic patterns were used to run simulations: uniform, transpose, and hotspot traffic. In the uniform traffic, a node sends packets to a randomized destination with a probability based on the injection rate. In the transpose traffic, a source node with coordinate (i, j) sends packets to the destination with coordinate (j, i). In the hotspot traffic, 20% of the packets change their destination to one of the selected hotspots, while the remaining 80% of the traffic stays uniform. In this work, we chose (3, 3), (3, 2), (3, 1), and (3, 0) as hotspots.
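The destination selection for these synthetic patterns can be sketched as below. This is an illustrative model rather than the Verilog testbench used in the thesis: the function names, the exclusion of the source node from uniform destinations, and the uniform choice among the four hotspots are assumptions.

```python
import random

# Hotspot coordinates as listed in the text.
HOTSPOTS = [(3, 3), (3, 2), (3, 1), (3, 0)]


def transpose_destination(src):
    """Transpose traffic: node (i, j) sends to node (j, i)."""
    i, j = src
    return (j, i)


def uniform_destination(src, size=8, rng=random):
    """Uniform traffic: pick a random node of the size x size mesh.

    Assumption: a node never selects itself as destination.
    """
    while True:
        dst = (rng.randrange(size), rng.randrange(size))
        if dst != src:
            return dst


def hotspot_destination(src, size=8, hotspot_fraction=0.2, rng=random):
    """Hotspot traffic: 20% of packets go to a hotspot, the rest are uniform.

    Assumption: the hotspot is drawn uniformly from HOTSPOTS.
    """
    if rng.random() < hotspot_fraction:
        return rng.choice(HOTSPOTS)
    return uniform_destination(src, size, rng)
```

Passing a seeded `random.Random` instance as `rng` makes a traffic trace reproducible across simulation runs.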

In addition to the synthetic traffic, we used the Embedded System Synthesis Benchmarks Suite (E3S) to demonstrate the performance variations under real traffic. Three applications of E3S, auto-indust-cords, consumer-cords, and telecom-cords, are mapped to 5x5, 4x4, and 6x6 mesh networks, respectively. In this work, we adopted a common Simulated Annealing algorithm to map the task graphs given in E3S to our NoC.

5.2 Simulation Results of Routing Algorithms

We simulated the XY-Routing, west-first routing (WF-Routing), odd-even routing (OE-Routing), and our bidirectional routing (BI-Routing) algorithms, and we show the results in this section. The packets in our experiments were composed of 16 flits, including one header flit and one tail flit. The capacity of the buffer in each of the 5 channel directions was 8 flits. Considering the practicality of an NoC, we used wormhole switching without virtual channels to manage the buffers, because the area overhead of virtual channels is considerable. We simulated our network by injecting loads, from 20 flits per clock cycle to 500 flits per clock cycle, at every node. For each injection rate, the simulation time was 25000 clock cycles. The latency results for the three synthetic traffic patterns are shown in Fig. 5-3, and the throughput results are shown in Fig. 5-4.

We define the average latency as the average transmission delay needed for a flit generated from a processing element to arrive at its destination node.

The formula of average latency is defined as follows:

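$$\text{Average Latency} = \frac{\sum \text{flit\_transmission\_delay}}{\text{execution\_time}}, \qquad (5\text{-}1)$$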

where the numerator is the total sum of all the flit transmission delays, and execution_time is the total simulation time. Throughput is the data rate that an NoC can communicate. The formula of throughput is defined as follows:

$$\text{Throughput} = \frac{\sum \text{received\_flits}}{\text{execution\_time}}, \qquad (5\text{-}2)$$

where received_flits represents the total amount of flits consumed in each destination terminal.

The simulation results show that our bidirectional routing, BI-Routing, has the best performance among the four algorithms. XY-Routing outperforms OE-Routing and WF-Routing because XY-Routing can distribute packets evenly under the uniform traffic condition. This part of the results agrees with [10], [15]. Our bidirectional routing algorithm still had a better saturation throughput than XY-Routing, by about 6.9%, as shown in Fig. 5-3. However, the throughput of bidirectional routing decreases much more than that of XY-Routing at high injection rates.

The transpose and hotspot traffic patterns are asymmetric, and they are close to real-case traffic, because in an SoC most IPs communicate with the main CPU core. Adaptive routing algorithms perform better than XY-Routing under the transpose and hotspot traffic patterns, because adaptive routing algorithms have more paths to route. Our bidirectional routing algorithm achieved improvements of 14.78% and 16.51% over the odd-even routing algorithm in transpose traffic and hotspot traffic, respectively. This is because our bidirectional routing has more paths to route and can spread the traffic load over the whole chip. Fig. 5-5 shows the flit distribution in the network.

Fig. 5-3. Latency versus Injection Rate under OE-Routing, WF-Routing, XY-Routing, and BI-Routing.

Fig. 5-4. Throughput versus Injection Rate under OE-Routing, WF-Routing, XY-Routing, and BI-Routing.

Fig. 5-5. Flit Distribution Graph under (a) Odd-Even Routing Algorithm and (b) Bidirectional Routing Algorithm.

Figure 5-5 shows the flit distribution at an injection rate of 0.288 under the uniform traffic pattern. The coordinates in the graph are positions in the network. We can observe
