CHAPTER 4 ROUTER ARCHITECTURE
4.1 BiNoC Router Architecture
The block diagram of a BiNoC router is shown in Fig. 4-2. Packets traveling along the datapath are controlled by the control blocks. The datapath consists of an InOut Buffer, an Input Buffer Unit, and a Crossbar, while the control blocks comprise a Routing Computation Unit, a Switch Allocator, a Request Manager, and a Channel Controller.
In addition, we insert registers into the design to prevent glitches and to pipeline the router. Pipelining shortens the critical path of the router and improves throughput. Our design uses a three-stage pipeline: routing computation, switch arbitration, and flit transmission. We do not use virtual channels in our router, which saves area, lowers power consumption, and reduces latency.
Fig. 4-2. Block Diagram of BiNoC Router.
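To illustrate how the three pipeline stages overlap, the sketch below records which stage each flit of a packet occupies in each clock cycle. It is a simplified model under stated assumptions: one flit enters per cycle, each stage takes exactly one cycle, there is no contention, and every flit is treated as passing through all three stages (in many routers only the header flit needs routing computation). The function name `pipeline_schedule` is hypothetical.

```python
STAGES = ("RC", "SA", "ST")  # routing computation, switch arbitration, flit transmission

def pipeline_schedule(num_flits):
    """Return {cycle: {flit: stage}} for an ideal, stall-free 3-stage pipeline.

    Simplifying assumption: flit k enters the RC stage in cycle k and
    advances one stage per cycle.
    """
    schedule = {}
    for flit in range(num_flits):
        for offset, stage in enumerate(STAGES):
            schedule.setdefault(flit + offset, {})[flit] = stage
    return schedule

if __name__ == "__main__":
    for cycle, occupancy in sorted(pipeline_schedule(4).items()):
        row = ", ".join(f"flit {f}: {s}" for f, s in sorted(occupancy.items()))
        print(f"cycle {cycle}: {row}")
```

With four flits, the last flit leaves the ST stage in cycle 5, so under these assumptions the packet needs num_flits + 2 cycles rather than 3 × num_flits cycles, because the stages work on different flits at the same time.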
Without isolation, a bidirectional channel could fall into a conflict in which the output data of one router interferes with the input blocks of the neighboring router.
The InOut Buffer shown in Fig. 4-3 solves this problem. It is composed of two tri-state buffers. Either the output enable or the input enable is asserted, which determines whether the bidirectional channel is connected to the router's output or to its input.
Fig. 4-3. InOut Buffer Block (signals: Input Data, Output Data, Dir_select).
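The behavior of the two tri-state buffers can be modeled at the signal level as follows. This is a sketch under stated assumptions: `HIGH_Z` is a placeholder for a high-impedance (undriven) value, and we assume that `Dir_select = 1` enables the output driver while `Dir_select = 0` enables the input driver, since the polarity is not specified here. The function names are hypothetical.

```python
HIGH_Z = "Z"  # placeholder for a high-impedance (undriven) signal

def tristate(enable, value):
    """A tri-state buffer passes its input when enabled and floats otherwise."""
    return value if enable else HIGH_Z

def inout_buffer(dir_select, output_data, channel_in):
    """Model one InOut Buffer.

    dir_select: assumed polarity, 1 = output (drive the channel), 0 = input.
    output_data: data this router wants to send on the bidirectional channel.
    channel_in: value currently driven onto the channel by the neighbor.
    Returns (value this router drives onto the channel, value seen as Input Data).
    """
    output_enable = dir_select == 1
    input_enable = not output_enable  # exactly one enable is asserted
    to_channel = tristate(output_enable, output_data)
    input_data = tristate(input_enable, channel_in)
    return to_channel, input_data
```

Because both enables are derived from a single select signal, they can never be asserted together, so the router never drives the channel while also taking its input from it.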
Router buffers can be organized as centralized buffers, as independent buffers at each input port, or as independent buffers at each output port. In this work we use independent buffers at each input port, each of which is a first-in-first-out (FIFO) buffer built from shift registers.
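A FIFO built from shift registers can be sketched as a fixed-depth register array in which every read shifts the stored entries one position toward the head. The class name, the example depth of 4, and the full/empty behavior (writes to a full FIFO and reads from an empty one are rejected) are illustrative assumptions, not parameters given in this work.

```python
class ShiftRegisterFIFO:
    """Behavioral model of a FIFO built from a chain of shift registers."""

    def __init__(self, depth=4):  # depth is an assumed example value
        self.regs = [None] * depth  # regs[0] is the head (oldest flit)
        self.count = 0

    def full(self):
        return self.count == len(self.regs)

    def empty(self):
        return self.count == 0

    def write(self, flit):
        """Store a flit in the first free register; return False if full."""
        if self.full():
            return False
        self.regs[self.count] = flit
        self.count += 1
        return True

    def read(self):
        """Return the head flit and shift the remaining entries forward."""
        if self.empty():
            return None
        head = self.regs[0]
        # Every register loads the value of its successor: a one-step shift.
        self.regs = self.regs[1:] + [None]
        self.count -= 1
        return head
```

A write always lands in the register right behind the last valid entry and a read always takes register 0, so no read or write pointers are needed; the cost is that every read moves all stored data.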
4.1.2 Routing Computation Unit
We can implement a routing computation unit with one of two mechanisms: table-based routing or algorithmic routing. In table-based routing, the routing decision is obtained from a look-up table, stored either at the source node or at each node along the route, depending on the application. The major advantage of table-based routing is its generality: source-table routing computes the route of a packet only once, at the source, whereas node-table routing is better suited to adaptive routing. A routing table can support any routing relation on any topology simply by reprogramming its contents.
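As an illustration of node-table routing, the sketch below precomputes a per-router table mapping each destination router ID to an output port and then routes by pure lookup. Router IDs as (x, y) mesh coordinates, the port names E/W/N/S/L, the convention that larger y is north, and the YX rule used to fill the table are all assumptions for the example; the point is that the routing relation is whatever the table holds.

```python
from typing import Callable, Dict, Tuple

Coord = Tuple[int, int]  # assumed (x, y) router ID on a mesh

def yx_rule(cur: Coord, dst: Coord) -> str:
    """Example routing relation (YX dimension order), used only to fill the table."""
    if dst[1] != cur[1]:
        return "N" if dst[1] > cur[1] else "S"  # assumption: larger y is north
    if dst[0] != cur[0]:
        return "E" if dst[0] > cur[0] else "W"
    return "L"  # local port: the packet has arrived

def build_node_table(cur: Coord, width: int, height: int,
                     rule: Callable[[Coord, Coord], str]) -> Dict[Coord, str]:
    """Precompute the output port for every destination, as a node table would store."""
    return {(x, y): rule(cur, (x, y)) for x in range(width) for y in range(height)}

if __name__ == "__main__":
    table = build_node_table((1, 1), 3, 3, yx_rule)
    print(table[(2, 2)])  # N: a run-time lookup, no routing logic evaluated
    table[(2, 2)] = "E"   # reprogram one entry to change the routing relation
    print(table[(2, 2)])  # E
```

Changing the routing relation only requires rewriting entries, which is the generality described above; the price is storage proportional to the number of destinations at every router.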
Algorithmic routing implements the routing algorithm as a combinational logic circuit dedicated to a particular routing strategy and topology. Fig. 4-4 shows an algorithmic routing mechanism that implements the XY routing algorithm. This architecture uses six comparators and one Direction Selector. For every header flit, the position of the destination router relative to the current router is resolved by comparing the destination router ID with the current router ID, and the Direction Selector then picks the output direction. Because the area overhead of an NoC router must be kept low, we use an algorithmic routing mechanism instead of a table-based one in our Routing Computation Unit.
Fig. 4-4. Algorithmic XY Routing Mechanism.
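The comparator-plus-selector structure can be sketched behaviorally as follows. We assume that router IDs are (x, y) mesh coordinates, that the six comparators are greater-than, less-than, and equal comparisons on the x and y fields, that larger y means north, and that the port names E, W, N, S, and L (local) are illustrative; the actual logic partitioning in Fig. 4-4 may differ.

```python
def compare(dst, cur):
    """Three comparators on one coordinate field: (greater, less, equal)."""
    return dst > cur, dst < cur, dst == cur

def xy_route(cur, dst):
    """Return the output port for a header flit under XY routing."""
    x_gt, x_lt, x_eq = compare(dst[0], cur[0])
    y_gt, y_lt, y_eq = compare(dst[1], cur[1])
    # Direction Selector: resolve the X offset first, then Y.
    if x_gt:
        return "E"
    if x_lt:
        return "W"
    # Here x_eq holds, so the packet moves in Y.
    if y_gt:
        return "N"
    if y_lt:
        return "S"
    return "L"  # x_eq and y_eq: the packet has reached its destination
```

The selector needs only the six comparison results, so the routing logic stays small compared with a table holding an entry per destination.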
4.1.3 Request Manager and Channel Controller
As mentioned in Section 2-1, BiNoC is a request-based design: before a router uses a channel on which it has low priority, it must make sure that the downstream router does not need that channel. All requests are sent by the Request Manager, whose policy is simple: if the router has more than one packet to deliver, it sends a request to the low-priority channel; otherwise, it sends a request only to the high-priority channel. Our Channel Controller is implemented with a high-priority and a low-priority channel-control FSM, as shown in Fig. 4-5. Both FSMs have three states: wait, free, and idle.
• Free state: The channel can be used to deliver data.
• Idle state: The channel cannot deliver data and is being used to receive data.
• Wait state: An intermediate state on the way from the idle state to the free state.
Fig. 4-5. Channel-Control FSMs: (a) High-Priority FSM and (b) Low-Priority FSM.
A bidirectional channel is controlled by a high-priority FSM and a low-priority FSM, which coordinate with each other so that the channel is used in only one direction at a time. In other words, the two FSMs are never both in the free state: if one FSM is in the free state, the other is in the idle or wait state, depending on the transition conditions. The high-priority FSM uses the free state as its default state. If the neighboring router wants to deliver a packet, input_req is asserted; if there is no channel request in this router, the FSM transitions to the idle state. Then, when a channel request occurs in this router, the FSM transitions to the wait state right away. After waiting two clock cycles for the neighboring FSM to complete its operation, the FSM in this router transitions to the free state. The transition conditions of the low-priority FSM are stricter: its wait state returns to the idle state whenever any input request occurs.
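The transition rules above can be captured in a small cycle-based model. The text fully specifies the high-priority FSM (default free; free to idle on input_req with no local request; idle to wait on a local request; wait to free after two cycles) and one extra rule for the low-priority FSM (wait returns to idle on any input request). As assumptions for illustration, the low-priority FSM otherwise follows the same transitions and resets to idle, and `channel_req` is a hypothetical name for this router's channel request. This is a sketch, not the exact design of Fig. 4-5.

```python
from enum import Enum

class State(Enum):
    FREE = "free"
    WAIT = "wait"
    IDLE = "idle"

WAIT_CYCLES = 2  # "two clock cycles of waiting" from the text

class ChannelFSM:
    """Cycle-based sketch of one channel-control FSM (high or low priority)."""

    def __init__(self, low_priority: bool = False):
        self.low_priority = low_priority
        # High-priority FSM defaults to FREE (stated); low-priority reset to IDLE is assumed.
        self.state = State.IDLE if low_priority else State.FREE
        self.wait_count = 0

    def step(self, input_req: bool, channel_req: bool) -> State:
        """Advance one clock cycle and return the new state."""
        if self.state is State.FREE:
            # Neighbor wants the channel and this router has no request: yield it.
            if input_req and not channel_req:
                self.state = State.IDLE
        elif self.state is State.IDLE:
            # A local channel request moves the FSM to WAIT right away.
            if channel_req:
                self.state = State.WAIT
                self.wait_count = 0
        elif self.state is State.WAIT:
            if self.low_priority and input_req:
                # Stricter low-priority rule: any input request aborts the wait.
                self.state = State.IDLE
            else:
                # Give the neighboring FSM time to finish, then take the channel.
                self.wait_count += 1
                if self.wait_count >= WAIT_CYCLES:
                    self.state = State.FREE
        return self.state
```

Feeding the high-priority FSM an input_req without a local request, then a local request, then two quiet cycles walks it through idle, wait, wait, and back to free; the same local request on the low-priority FSM is abandoned as soon as an input request arrives.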
4.1.4 Switch Allocator and Crossbar
After a routing direction is determined, packets contend for the output channels, so an arbiter is needed to allocate channel bandwidth among the requesters. Our Switch Allocator block uses ten arbiters: all requests for the same output channel are arbitrated by one arbiter, and these requests may be masked by the channel-available signal. The arbiters are implemented as matrix arbiters, as shown in Fig. 4-6. A matrix arbiter implements a least-recently-served priority scheme by maintaining a triangular array of state bits $w_{ij}$ for $i < j$. The state bit $w_{ij}$ in row $i$ and column $j$ indicates that request $i$ takes priority over request $j$. Only the upper triangular portion of the matrix is stored and updated, because each bit in the lower triangular portion is simply the inverse of the corresponding upper bit ($w_{ji} = \lnot w_{ij}$). When a request is granted, the bits in its row are cleared and the bits in its column are set, giving that request the lowest priority because it was the most recently served. Note that not all combinations of state bits are legal. For example, if $w_{01} = w_{12} = 1$ and $w_{02} = 0$ and requests 0, 1, and 2 are all asserted, the priorities form a cycle (0 over 1, 1 over 2, and 2 over 0), so the requests disable one another and no grant is issued. A matrix arbiter is simple and inexpensive to implement and provides strong fairness [6].
Fig. 4-6. Architecture of Matrix Arbiter.
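The update rule and the illegal-state example can be checked with a small behavioral model. The class `MatrixArbiter`, its reset state (lower index has priority), and the `channel_available` masking parameter are assumptions for illustration; the bit convention follows the text, with `w[(i, j)]` true when request i has priority over request j and only the upper triangle stored.

```python
from typing import Dict, List, Optional, Tuple

class MatrixArbiter:
    """Least-recently-served matrix arbiter storing only the upper triangle (i < j)."""

    def __init__(self, n: int):
        self.n = n
        # True means request i has priority over request j (i < j).
        # Assumed reset state: lower index has priority, which is a legal order.
        self.w: Dict[Tuple[int, int], bool] = {
            (i, j): True for i in range(n) for j in range(i + 1, n)
        }

    def has_priority(self, i: int, j: int) -> bool:
        """Lower-triangle bits are the inverse of the stored upper-triangle bits."""
        return self.w[(i, j)] if i < j else not self.w[(j, i)]

    def arbitrate(self, req: List[bool], channel_available: bool = True) -> Optional[int]:
        """Grant the requester that has priority over every other active requester."""
        active = [r and channel_available for r in req]  # mask by channel availability
        for i in range(self.n):
            if active[i] and all(self.has_priority(i, j)
                                 for j in range(self.n) if j != i and active[j]):
                self._update(i)
                return i
        return None  # no request, masked, or an illegal (cyclic) state

    def _update(self, g: int) -> None:
        """Clear row g and set column g so request g becomes lowest priority."""
        for j in range(self.n):
            if j > g:
                self.w[(g, j)] = False  # row g, upper triangle
            elif j < g:
                self.w[(j, g)] = True   # column g, upper triangle
```

With all three requests held for four rounds, this model grants 0, 1, 2, 0, which is the least-recently-served rotation. Setting `w[(0, 2)] = False` from the reset state reproduces the illegal cyclic case from the text, and `arbitrate` then returns `None`.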
A Crossbar connects every input buffer to every output channel so that flits can be switched. In this work, we implement the Crossbar with multiplexers, as shown in Fig. 4-7. The crossbar consists of ten 5-to-1 multiplexers, each corresponding to one output port, and the select signals come from the Switch Allocator. Although a BiNoC crossbar has twice as many inputs as that of a conventional NoC router, its area overhead is acceptable. Crossbar area nevertheless grows quadratically: for $n$ inputs, an area proportional to $n^2$ is needed to contain the $n^2$ crosspoints, and another $n^2$ area is needed to hold the $n$ multiplexers. This is another reason we do not use virtual channels: adding more input buffers in each direction of the channel would make the crossbar too costly. Our router must remain small enough to fit in an NoC.
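The sketch below models a multiplexer-based crossbar behaviorally and counts crosspoints to make the quadratic growth concrete. The function names and the convention that a select of `None` leaves an output idle are assumptions for illustration; port counts are parameters, so nothing here fixes the sizes used in Fig. 4-7.

```python
from typing import List, Optional, Sequence

def mux_crossbar(inputs: Sequence[object],
                 selects: Sequence[Optional[int]]) -> List[Optional[object]]:
    """Multiplexer-based crossbar: one multiplexer per output port.

    inputs: flit (or None) available at each crossbar input.
    selects: per output, the input index chosen by the Switch Allocator,
             or None when that output is not granted this cycle.
    """
    return [None if sel is None else inputs[sel] for sel in selects]

def crosspoints(n_inputs: int, n_outputs: int) -> int:
    """Number of input/output crosspoints; grows as n^2 when both equal n."""
    return n_inputs * n_outputs
```

Doubling a square crossbar from n to 2n ports quadruples the crosspoint count (for example, `crosspoints(5, 5)` is 25 while `crosspoints(10, 10)` is 100), which is why adding per-direction input buffers for virtual channels would be costly.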