Chapter 4: Two-Level FIFO Buffer Design for Routers
4.2 Buffer Implementations and Architectures
Queuing buffers are used in routers and network interfaces to store un-routed data. Buffer size and management are directly linked to the flow control policy, which affects OCIN performance and resource utilization [4.3]. Buffer architectures can be classified by the location of the buffers and by their circuit implementation. Queuing buffers consume the most area and power among the building blocks of OCINs [4.5], [4.11].
However, insufficient buffer size induces head-of-line blocking. Fig. 4.3 shows an example: when the head data of a virtual channel cannot be routed, the data queued behind it keep occupying the queuing buffers and cannot advance, so network performance decreases. Head-of-line blocking also increases the power consumed during on-chip data communication. Therefore, head-of-line blocking is a key factor when evaluating different buffer architectures.
Fig. 4.3 Head-of-line blocking problem induced by insufficient buffer.
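The effect can be illustrated with a minimal Python sketch. It is a toy model only, not the router studied in this chapter: the function name `drain`, the service rate of one data word per cycle, and the set of blocked outputs are assumptions made for illustration.

```python
from collections import deque

def drain(queue, blocked_outputs, cycles):
    """Serve a single FIFO for a number of cycles (hypothetical toy model).

    Each entry of `queue` is the output port a data word wants to reach.
    Only the head may leave; if its output is blocked, nothing leaves.
    Returns the list of output ports that were served, in order.
    """
    fifo = deque(queue)
    served = []
    for _ in range(cycles):
        if fifo and fifo[0] not in blocked_outputs:
            served.append(fifo.popleft())
    return served

# The head word targets output 0, which is congested. The words behind it
# target the free outputs 1 and 2 but cannot overtake the head.
stalled = drain([0, 1, 2], blocked_outputs={0}, cycles=3)
# Same words, but output 0 is free: everything drains in order.
flowing = drain([0, 1, 2], blocked_outputs=set(), cycles=3)
print(stalled, flowing)
```

In the first call no word is delivered, although two of the three words are destined for idle outputs; this is the wasted buffer occupancy that the comparison of buffer architectures has to account for.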
The buffer circuits can be implemented using registers (flip-flops) or SRAM, depending on the buffer size. For large-capacity queuing, an SRAM-based queuing buffer with separate read/write ports is preferred over a register-based buffer [4.12], [4.13]. However, SRAM incurs a large latency overhead [4.5]. To achieve high-performance OCINs, register-based buffers are therefore usually used in routers with small buffer sizes, since register-based implementations have limited capacity due to rapidly increasing power consumption and circuit area [4.6], [4.10]. In most OCINs, register-based buffers are adopted to provide high bandwidth for on-chip data communication. Register-based buffers can be classified into four implementations: (a) Shift Register, (b) Bus-In Shift-Out Register, (c) Bus-In Bus-Out Register, and (d) Bus-In MUX-Out Register [4.1].
Fig. 4.4 Different buffer implementations (a) Shift Register (b) Bus-In Shift-Out Register (c) Bus-In Bus-Out Register (d) Bus-In MUX-Out Register.
Fig. 4.4(a) shows a conventional shift register. When a consumer sends a request to the buffer, the shift register enables all registers and shifts the data toward the output port. A shift register is simpler to implement than the other three structures. However, intermediate empty cells, which arise when packets arrive and depart at different rates, add unnecessary latency and temporarily degrade network performance. Moreover, shifting all registers in the buffer consumes a large amount of power. For these reasons, implementing a shift register on a chip is not desirable.
Fig. 4.4(b) shows the Bus-In Shift-Out Register, which shifts only the full cells and thereby removes intermediate empty bubbles. An arriving packet is stored in the empty cell directly behind the full cells. Hence, this register removes the unnecessary latency and power consumption caused by empty bubbles. However, as queuing capacity increases, the driving ability of the sender must be increased to handle the larger fan-out.
Furthermore, a bus-in shift-out register still consumes a large amount of power by shifting all occupied cells. To reduce the power consumed by shifting, the Bus-In Bus-Out Register in Fig. 4.4(c) connects all register outputs to a shared output bus via tri-state buffers. The writing and reading tokens are organized as rings and track the two ends of the run of full cells. The reading token controls the tri-state buffers to read the first-in packet, while the writing token activates the register directly behind the full cells to store the input packet. As queuing capacity increases, the capacitance of the shared input and output buses also increases, especially that of the output bus. The parasitic capacitance of the tri-state buffers increases both delay and power consumption. Therefore, the Bus-In MUX-Out Register, which uses output multiplexers, can be utilized to eliminate the parasitic capacitance of the tri-state buffers.
Fig. 4.4(d) illustrates the Bus-In MUX-Out Register. A bus-in MUX-out register additionally needs an extra adder that serves as a pointer and calculates the address of the output packet.
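The difference between shifting and pointer-based organizations can be sketched behaviorally in Python. The class names `BusInShiftOut` and `BusInMuxOut` and the `reg_writes` counter are hypothetical; counting register updates is only a crude proxy for switching activity and is not a power model of the circuits in Fig. 4.4.

```python
class BusInShiftOut:
    """Write to the first empty cell; read cell 0 and shift occupied cells."""
    def __init__(self, depth):
        self.cells = [None] * depth
        self.count = 0          # number of full cells
        self.reg_writes = 0     # register updates (assumed activity proxy)

    def push(self, word):
        if self.count == len(self.cells):
            raise OverflowError("buffer full")
        self.cells[self.count] = word       # cell directly behind full cells
        self.count += 1
        self.reg_writes += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("buffer empty")
        word = self.cells[0]
        for i in range(1, self.count):      # every occupied cell moves forward
            self.cells[i - 1] = self.cells[i]
            self.reg_writes += 1
        self.count -= 1
        self.cells[self.count] = None
        return word


class BusInMuxOut:
    """Cells never move; pointers pick the write cell and the MUX input."""
    def __init__(self, depth):
        self.cells = [None] * depth
        self.rd = 0             # read pointer, selects the output MUX input
        self.wr = 0             # write pointer, selects the cell to load
        self.count = 0
        self.reg_writes = 0

    def push(self, word):
        if self.count == len(self.cells):
            raise OverflowError("buffer full")
        self.cells[self.wr] = word
        self.wr = (self.wr + 1) % len(self.cells)   # pointer adder
        self.count += 1
        self.reg_writes += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("buffer empty")
        word = self.cells[self.rd]                  # output multiplexer
        self.rd = (self.rd + 1) % len(self.cells)
        self.count -= 1
        return word


def run(fifo, words):
    """Fill the FIFO with `words`, then drain it completely."""
    for w in words:
        fifo.push(w)
    return [fifo.pop() for _ in words]

shift, mux = BusInShiftOut(4), BusInMuxOut(4)
out_shift, out_mux = run(shift, "abcd"), run(mux, "abcd")
print(out_shift == out_mux, shift.reg_writes, mux.reg_writes)
```

Both models deliver the words in first-in first-out order, but in this sketch the shifting variant updates registers on every read, whereas the pointer-based variant updates one register per write and none per read.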
Queuing buffers can be placed before or after the interconnection matrix in a router; these are called input buffers and output buffers, respectively. The two placements behave differently under congestion. If a data word is delayed in a router with input buffers, it stalls all data words arriving at the same input: none can be processed until the first data word has been forwarded successfully.
With output buffers, this situation differs because switching is performed prior to buffering. If a router cannot send data through one of its outputs, the buffers at that output will fill up. However, congestion on outputs has no immediate influence on inputs; that is, successive data words can still be received. An architectural disadvantage of output buffering is that, in one cycle, data from multiple input ports may be written to the same output port. This shortcoming can be addressed by implementing a multiple-access buffer at the output to handle the parallel writes. Both output buffers and input buffers can cause the head-of-line blocking problem and stall input data. Fig. 4.5 shows the input buffers, middle buffers and output buffers in routers.
In middle buffering, the buffers are placed in the middle of the switching circuits. Middle buffer architectures have O(N^2) buffer blocks for an N-port router, while input and output buffering architectures have only O(N) buffer blocks. The middle buffer architecture, however, can reduce the effects of head-of-line blocking via multiple virtual channels during switching. This is a trade-off between traffic problems and buffer size.
Fig. 4.5 Diagram of input buffer, middle buffer and output buffer.
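The growth in the number of buffer blocks can be made concrete with a short Python sketch. The text gives only the asymptotic orders; the exact counts used here (one block per port for input and output buffering, one block per input/output pair for middle buffering) are an assumption, and `buffer_blocks` is a hypothetical helper.

```python
def buffer_blocks(num_ports):
    """Buffer blocks per placement for an N-port router.

    Assumes unit constants: N blocks for input or output buffering,
    N * N blocks (one per input/output pair) for middle buffering.
    """
    return {
        "input": num_ports,
        "output": num_ports,
        "middle": num_ports * num_ports,
    }

for n in (4, 5, 8):
    print(n, buffer_blocks(n))
```

Under these assumptions, doubling the port count from 4 to 8 doubles the input or output buffer blocks but quadruples the middle buffer blocks.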
Since buffer resources are costly in resource-constrained OCIN environments, minimizing buffer size without adversely affecting performance is essential. However, buffer size and architecture cannot be changed dynamically during operation in response to observed traffic patterns. Therefore, some approaches [4.6], [4.7] optimize a pre-determined buffer size at the design stage through a detailed analysis of application-specific traffic patterns. Additionally, static virtual channel allocation techniques were proposed to optimize performance, area and power for target applications based on their traffic characteristics [4.9], [4.14].
Fig. 4.6 Concepts of (a) dynamic virtual channel allocation (b) centralized shared buffer.
For general-purpose and reconfigurable SoCs that execute different applications, advanced buffer architectures maximize buffer utilization under the different traffic patterns of NoC applications. Because virtual channels are not used equally in different applications, dynamically allocated multi-queue (DAMQ) buffer schemes were proposed to share a common buffer [4.15]-[4.18]. However, these approaches are not suited to OCIN implementation, which is typically resource-constrained [4.19].
Moreover, NoC applications cannot tolerate large latency under quality-of-service constraints. Hence, in view of resource and latency overhead, dynamic virtual channel allocation schemes were proposed to maximize throughput for resource-constrained OCINs [4.19]-[4.23]. Fig. 4.6(a) shows the concept of dynamic virtual channel allocation, which shares the virtual channels and arbitrates output packets based on the traffic conditions. The dynamic virtual channel regulator (ViChaR) proposed in [4.19] introduced a unified buffer structure that dynamically allocates virtual channels and buffer resources based on network traffic patterns. ViChaR consists of a unified buffer structure and unified control logic. The unified buffer structure shares the buffers among the virtual channels of each input port. The unified control logic controls the arriving/departing pointers and the virtual channel allocation of each virtual channel via virtual channel control tables and dispensers.
However, its hardware overhead increases non-linearly. In view of this, other dynamically allocated virtual channel architectures were proposed that inspect the physical link state and speculate on packet transfers [4.20]-[4.23]. However, when the shared buffers of an input port are full, these approaches provide no mechanism for accessing the buffers of virtual channels at other input ports.
Furthermore, the performance of these dynamic virtual channel allocation schemes is limited by the resource constraints on the pointers and virtual channel control tables.
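The idea of sharing the buffer slots of one input port among its virtual channels can be sketched in Python as follows. This is a generic shared-pool illustration, not the ViChaR design of [4.19]: the class `UnifiedBuffer`, its free list, and the per-channel slot queues are assumptions chosen for brevity.

```python
from collections import deque

class UnifiedBuffer:
    """Slots of one input port shared by all of its virtual channels."""
    def __init__(self, num_slots, num_vcs):
        self.slots = [None] * num_slots
        self.free = deque(range(num_slots))                 # unused slot indices
        self.vc_queues = [deque() for _ in range(num_vcs)]  # slot order per VC

    def enqueue(self, vc, flit):
        """Store a flit for `vc`; return False when the port has no free slot."""
        if not self.free:
            return False
        slot = self.free.popleft()
        self.slots[slot] = flit
        self.vc_queues[vc].append(slot)
        return True

    def dequeue(self, vc):
        """Remove and return the oldest flit of `vc`, or None if it is empty."""
        if not self.vc_queues[vc]:
            return None
        slot = self.vc_queues[vc].popleft()
        flit, self.slots[slot] = self.slots[slot], None
        self.free.append(slot)
        return flit

buf = UnifiedBuffer(num_slots=4, num_vcs=2)
# A single busy virtual channel may claim every slot of the port.
accepted = [buf.enqueue(0, f"flit{i}") for i in range(5)]
print(accepted)
# Freeing one slot on VC 0 makes room for a flit on VC 1.
print(buf.dequeue(0), buf.enqueue(1, "other"))
```

With a static split of two slots per virtual channel, channel 0 could have held only two flits here. The sketch also shows the limitation noted above: once the slots of this port are exhausted, the fifth flit is refused even if other input ports have free buffers.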
Fig. 4.6(b) shows the centralized shared buffer architecture, which maximizes buffer utilization [4.24]-[4.26]. Shared buffer architectures are implemented as centralized buffer organizations that dynamically alter the buffer size of different channels. Input packets from different ports can access all buffers without any head-of-line blocking. This architecture enhances OCIN performance regardless of traffic type.
In addition, shared buffering achieves the best buffer utilization with the fewest memory elements. Centralized shared buffer architectures enhance buffer utilization via allocation tables [4.25], [4.26]. Nevertheless, their control mechanisms are more complex than those of other buffer architectures and increase the number of pipeline stages. Hence, the newly proposed data-link two-level FIFO buffer architecture is used as the shared buffer architecture; it simplifies the shared buffer design and achieves better performance than other buffer architectures without increasing buffer size.