Chapter 4 Column-Based Low Power Design Techniques
4.4 Simulation Results and Analysis
4.4.2 Simulation Result for DAPC
The standby power with and without power gating is analyzed. Fig. 4.14 shows the standby power under different don’t-care patterns when Flag=1. To protect the stored data from disturbance, the gating transistors cannot be turned off unless the don’t-care state is true. Hence, only one side of the power source of the don’t-care cells floats during data-retention mode. As the figure shows, the leakage power increases slightly when the TCAM cells hold no don’t-care data, because of the additional power-switch circuits. Nevertheless, when half of the data are in the don’t-care state, the data-aware power control still reduces leakage power by 7.8% compared with conventional TCAM cells without power gating.
[Figure 4.14: leakage power (μW, axis range 600 to 2200) versus percentage of don’t-care data (0%, 25%, 50%, 75%, 100%), without gating and with DAWA power gating; Flag=1, without data disturb for read operation. Labeled reductions with gating: -3.0%, -7.8%, -12.7%, -17.5%.]
Fig. 4.14 Leakage power consumption under different don’t-care pattern when Flag=1.
On the other hand, when Flag=0, the TCAM macro does not perform the read operation. Therefore, the data-aware power control cuts off all the power sources of the storage cells to save standby power, regardless of whether the stored data are destroyed. Fig. 4.15 shows the standby power under different don’t-care patterns when Flag=0. Because the dynamic power-gating devices of both the storage cells and the don’t-care cells depend on the don’t-care bits of the bottom row in the bank, the leakage power decreases as the percentage of don’t-care data increases. Fig. 4.15 confirms this trend. When half of the data are don’t-care, the data-aware power control scheme yields 28.9% lower leakage power than the conventional TCAM. Moreover, when the proportion of don’t-care data is 75%, the leakage power reduction exceeds 40%. From Fig. 4.14 and Fig. 4.15, we conclude that the column-based data-aware power control saves standby power effectively whether the flag signal is high or low.
[Figure 4.15: leakage power (μW, axis range 600 to 2200) versus percentage of don’t-care data (0%, 25%, 50%, 75%, 100%), without gating and with DAWA power gating; Flag=0, with data disturb for non-read operation. Labeled reductions with gating: -13.5%, -28.9%, -44.7%, -59.8%.]
Fig. 4.15 Leakage power consumption under different don’t-care pattern when Flag=0.
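The trend in Fig. 4.14 and Fig. 4.15 can be checked with a few lines of arithmetic. The sketch below assumes that the four labeled reductions in each figure correspond to 25%, 50%, 75% and 100% don’t-care data (the text confirms only the 50% and 75% points); the names `REDUCTION` and `step_increments` are illustrative and not part of the original work.

```python
# Labeled leakage reductions (percent) as read from Fig. 4.14 and Fig. 4.15.
# Assumption: the four labels map to 25/50/75/100% don't-care data.
DONT_CARE_PCT = [25, 50, 75, 100]
REDUCTION = {
    "flag1": [3.0, 7.8, 12.7, 17.5],     # Fig. 4.14, read-safe gating
    "flag0": [13.5, 28.9, 44.7, 59.8],   # Fig. 4.15, storage cells also gated
}


def step_increments(values):
    """Change in reduction between consecutive don't-care levels."""
    return [round(b - a, 1) for a, b in zip(values, values[1:])]


def reduction_at(flag, pct):
    """Look up the labeled reduction for a flag setting and don't-care level."""
    return REDUCTION[flag][DONT_CARE_PCT.index(pct)]


if __name__ == "__main__":
    for flag, values in REDUCTION.items():
        print(flag, step_increments(values))
```

Under this assumed mapping the increments are nearly constant within each figure, which is consistent with the statement that leakage falls steadily as the share of don’t-care data grows.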
4.5 Summary
The column-based low-power design comprises the ripple bit-line scheme, the don’t-care-based ripple search-line scheme, and the column-based data-aware power control. For dynamic power, the ripple bit-line and ripple search-line reduce power without a performance penalty and, by connecting the banks in series, avoid the additional process cost of a global search-line. Furthermore, the don’t-care-based ripple search-line scheme exploits continuous don’t-care patterns to decrease the switching activity and switching capacitance of the local search-lines, which saves more power. As technologies advance, leakage currents increasingly dominate the overall power consumption of nano-scale designs. Accordingly, the column-based data-aware power control uses gating devices to reduce static power. Based on the don’t-care bits and the input data, the power control dynamically adjusts the voltages of the left and right half-cells of both the don’t-care cells and the storage cells. It therefore also improves write-ability and the SNM for read, search, and data-retention operations. In addition, the replica circuitry makes the timing of the power switching tolerant to PVT variation and VT scatter.
Chapter 5 Implementation of 256x40 and 256x144 Energy-Efficient TCAM Macro in UMC 40nm LP CMOS Process
As the shortcomings of the existing IP have become manifest, a new protocol, known as IPv6 (IP version 6), has been defined to ultimately replace IPv4 [5.1]. The addresses in the new Internet protocol are 128 or 144 bits long, whereas they are 32 bits long in the current IPv4-based Internet. Therefore, we have designed a 256x144 TCAM macro with power-saving techniques for IPv6 applications. These power-saving techniques are also applied to a 256x40 TCAM macro. In this chapter, the 256x40 and 256x144 energy-efficient TCAM macros are implemented in the UMC 40nm low-power (LP) CMOS process. The specification and floor-planning of the 256x40 TCAM macro are described in sections 5.1 and 5.2. For different TCAM array sizes, the match-line scheme requires some modifications to the butterfly connection. To further reduce power consumption, we also use the shared BL/DL and interleaving vertical global-line techniques. These design implementations are discussed in sections 5.3 and 5.4. Section 5.5 presents the simulation results and analysis for the 256x40 and 256x144 TCAM macros. Finally, section 5.6 draws the conclusions.
5.1 Specification of Energy-Efficient TCAM Macro
The size of the TCAM macro is 256 words x 40 bits, that is, 256x40 TCAM cells, and the TCAM array is divided into 16 banks. Each TCAM cell is composed of two SRAM cells: the storage cell and the don’t-care cell. In this TCAM macro, the 8-bit address signals, Addr [7:0], select one of the 256 entries in a read or write operation: Addr [7:4] selects one of the 16 banks in the TCAM array, and Addr [3:0] selects one of the 16 words in that bank. In addition, each TCAM entry contains 40 TCAM cells. Accordingly, the write-in data (In [39:0]), read-out data (DOUT [39:0]), and search data (Sin [39:0]) are all 40 bits wide. During the search operation, all 256 entries are compared with the search data within one cycle, and the 256 comparison results are generated simultaneously as the search outputs (SOUT [255:0]).
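The address split described above can be illustrated with a short behavioral sketch. It models only the bit-field decoding of Addr [7:0]; the function name `decode_address` and the constants are hypothetical and are not taken from the macro's actual decoder.

```python
BANKS = 16            # Addr[7:4] selects one of 16 banks
WORDS_PER_BANK = 16   # Addr[3:0] selects one of 16 words in the bank


def decode_address(addr):
    """Split an 8-bit entry address into a (bank, word) pair."""
    if not 0 <= addr < BANKS * WORDS_PER_BANK:
        raise ValueError("address must fit in 8 bits")
    bank = (addr >> 4) & 0xF   # upper nibble: bank index
    word = addr & 0xF          # lower nibble: word index inside the bank
    return bank, word


if __name__ == "__main__":
    print(decode_address(0xA7))   # bank 10, word 7
```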
The input and output pins of this TCAM macro are listed in Table 5.1 and Table 5.2.
Table 5.1 Descriptions of input pins.
Input Pin Name    Description
Vdd, Gnd          Power pins
Addr [7:0]        8-bit address signals for accessing one of the 256 entries (words) during the read operation or the write operation
MODE              MS/MD selection (accessing the storage cells or the don’t-care cells) in the selected entry during the read operation or the write operation
In [39:0]         Data input for the write operation
Sin [39:0]        Search input for the search operation
CEN               Chip enable. The three operations, read/write/search, are activated when CEN is high.
SEN               Search enable. The search operation is activated when SEN is high, and the read/write operations are activated when SEN is low.
WEN               Read/write selection. The write operation is activated when WEN is high, and the read operation is activated when WEN is low.
FLAG              Readout flag. If the flag is low, the data in the storage cell will be disturbed if the don’t-care cell on the lowest entry in the same bank (in the same column) is true. The readout data will be unknown while the flag is low.
Table 5.2 Descriptions of output pins.
Output Pin Name   Description
DOUT [39:0]       40-bit read-out data
SOUT [255:0]      256-bit search output while comparing 256 entries in a search operation
Because of the shared BL/DL, a read or write of a full TCAM entry cannot be completed within one cycle. An extra bit, Mode, selects whether the storage cells or the don’t-care cells of the entry are accessed. If Mode is high, the don’t-care cells are selected for the read/write operation; if Mode is low, the storage cells are selected. To suit different TCAM applications, our design has a further control signal, Flag. Relying on the continuous don’t-care X pattern and the prefix pattern, the Flag signal allows the stored data to be destroyed when the don’t-care datum is 1 and the stored data will not be read. When Flag is low, for applications without read operations, the datum in a storage cell is destroyed if the don’t-care cell of the lowest entry in the same bank (in the same column) is true. The destroyed storage datum does not affect the search functionality, because the continuous don’t-care X pattern implies that the don’t-care cell of that TCAM cell is also true. In contrast, when Flag is high, the data in the storage cells are protected from disturbance, so the read data can be propagated through the ripple bit-line scheme successfully.
Generally, the TCAM macro operates in three modes: Write Mode, Read Mode, and Search Mode. In write and read operations, the TCAM macro behaves like an ordinary memory; that is, data are manipulated in the TCAM array in the same way as in an SRAM array. Unlike SRAM, the TCAM array has an extra operation mode, Search Mode. In a search operation, the input data are sent into the TCAM array and compared with all the stored data simultaneously. All rows that match the input data are then sent to the address priority encoder. When multiple matched rows pass through the address priority encoder, the address corresponding to the longest prefix is sent to the output. Thus, in the TCAM architecture, a large number of comparison operations are active during a search operation in order to examine all the data stored in the TCAM array.
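The search operation can be summarized by a small behavioral model. This is a functional sketch only, not a description of the match-line circuit; it assumes that a don’t-care bit of 1 masks the corresponding storage bit, and the names `entry_matches` and `search` are illustrative.

```python
WIDTH = 40
MASK = (1 << WIDTH) - 1


def entry_matches(storage, dont_care, key):
    """True when every bit not marked don't-care equals the search key bit."""
    # XOR marks differing bits; bits whose don't-care cell is 1 are ignored.
    return ((storage ^ key) & ~dont_care & MASK) == 0


def search(entries, key):
    """Compare the key with all entries at once; one match flag per entry.

    entries is a list of (storage, dont_care) integer pairs.
    """
    return [entry_matches(s, d, key) for s, d in entries]


if __name__ == "__main__":
    table = [(0b1010, 0b0000), (0b1000, 0b0011), (0b0101, 0b0000)]
    print(search(table, 0b1010))
```

In the macro, the list returned by `search` corresponds to the 256-bit SOUT vector, with one flag per entry.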
The three operations are controlled by three signals (CEN, SEN, WEN). When CEN is low, the TCAM macro is in standby mode. The priority of the three control signals is CEN > SEN > WEN. Table 5.3 lists the truth table of the modes, and the timing diagrams of the corresponding signals for the different operations are shown in Fig. 5.1, Fig. 5.2 and Fig. 5.3.
Table 5.3 Truth table of three modes.
Operation CEN SEN WEN
Standby 0 X X
Search 1 1 X
Write 1 0 1
Read 1 0 0
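The truth table and the CEN > SEN > WEN priority can be expressed as a short decision function. This is a minimal sketch of the control decoding as the table states it; the function name `decode_mode` and the string labels are assumptions made for illustration.

```python
def decode_mode(cen, sen, wen):
    """Return the operation selected by CEN, SEN and WEN (Table 5.3)."""
    if not cen:
        return "standby"   # CEN low overrides SEN and WEN
    if sen:
        return "search"    # SEN high overrides WEN
    return "write" if wen else "read"


if __name__ == "__main__":
    for cen, sen, wen in [(0, 1, 1), (1, 1, 0), (1, 0, 1), (1, 0, 0)]:
        print(cen, sen, wen, decode_mode(cen, sen, wen))
```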
[Timing diagram, write cycle. External signals: CLOCK, WEN, MODE, ADDR[7:0], IN[39:0]. Internal signals: wen_in, mode_in, addr_in, data_in[39:0].]
[Timing diagram, read cycle. Signals: CLOCK, pre-charge BL, ADDR[7:0] (external and internal), DOUT[39:0].]
Fig. 5.2 Timing diagram of reading storage/don’t-care cells.
Fig. 5.3 Timing diagram of search operation.
5.2 Architecture & Floor-planning of TCAM Macro
In recent years, TCAMs have been widely used in network routers for packet forwarding and packet classification. Network routers forward data packets from an incoming port to an outgoing port using an address-lookup function [5.2]. Fig. 5.4 schematically depicts a simplified block diagram of the proposed TCAM macro for IP lookup tables. The search data are broadcast on the search-lines to the TCAM array. Each stored word has a match-line that indicates whether the search word is identical to the stored word (a match) or not (a mismatch, or “a miss”). The match-lines are fed to an encoder that generates the binary location of the match, which corresponds to the most direct route. In TCAM applications, where more than one word may match, a priority encoder is employed instead of a simple encoder. A priority encoder reports the matching location with the highest priority, where words at lower addresses have higher priority. The overall function of a TCAM is thus to take a search word and return the matching memory location. In our design, however, the address priority encoder is not included in the implementation of the 256x40 TCAM macro.
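The priority rule described above (lower address, higher priority) can be sketched as follows. The encoder is not part of the implemented 256x40 macro, so this is only an illustration of the function a downstream priority encoder would perform on the match-line outputs; the name `priority_encode` is hypothetical.

```python
def priority_encode(match_lines):
    """Return the lowest matching address, or None when nothing matches."""
    for address, matched in enumerate(match_lines):
        if matched:
            return address   # first hit wins: lower address, higher priority
    return None


if __name__ == "__main__":
    sout = [False, False, True, False, True]
    print(priority_encode(sout))
```

With this rule, the lowest-address match is the longest-prefix match only if the entries are stored in order of decreasing prefix length, which is an assumption about how the lookup table is loaded.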
[Block diagram. Blocks: Address Decoder; TCAM Array (256 words x 40 bits); Bit Line Pre-charge, Write Driver & Search Driver; Word Match Circuits; Address Priority Encoder; Read Sense Amps.; Control Circuits. Inputs: CLK, CEN, SEN, Wen, Mode, Flag, Address [7:0], IN [39:0], SIN [39:0]. Outputs: DOUT [39:0], SOUT [255:0], Address_out.]
Fig. 5.5 shows the floor-plan of the proposed 256x40 TCAM macro. With the continued scaling of CMOS technology, the growing contribution of parasitic capacitances and series resistances has become a challenge, since parasitic capacitances are charged and discharged whenever devices switch [5.3], [5.4]. Thus, the global control circuitry is placed in the center of the macro to shorten the global routing. In the ripple bit-line and ripple search-line schemes, the 256 entries are separated into 16 local banks. Every 16 bit-cells in the same column are grouped with a local evaluation circuit and ripple buffers to cope with leakage current and to support don’t-care-based power control. To reduce the propagation time of the ripple scheme, the search input data and write input data are driven from the middle of the array, which shortens the propagation distance.
Fig. 5.5 The floor-plan of energy-efficient 256x40 TCAM macro.
5.3 Butterfly Match-Line Design for 256x40 and 256x144
As mentioned in section 3.4, the basic concept of the butterfly match-line scheme is to increase the parallelism of the search operation. Both the number of stages and the number of cells per segment affect the critical delay of the comparison. Therefore, match-lines of different sizes require different butterfly connections to optimize search performance. For a 144-bit TCAM entry, the match-line is folded into four sub match-lines of six stages each, as shown in Fig. 5.6. Each circle denotes a TCAM segment, which contains six TCAM cells and a dynamic circuit. If one of the TCAM segments mismatches the search data, the mismatch signal propagates to turn off more TCAM segments than the conventional PF-CDPD match-line does, and all the search operations behind the mismatched segment are terminated. Accordingly, by interlacing the connections as in Fig. 5.6, the 144-bit TCAM match-line increases the dependence among the four parallel sub match-lines to reduce power consumption.
[Figure 5.6: butterfly connection of TCAM segments (6 TCAM cells each) across Stage-1 to Stage-6.]
Fig. 5.6 Butterfly match-line scheme for 144-bit TCAM cells.
For a 40-bit TCAM entry, there are two ways to achieve the same goal with the butterfly match-line scheme, shown in Fig. 5.7 (a) and Fig. 5.7 (b). Each TCAM segment of the 40-bit match-line contains five TCAM cells and a dynamic circuit. The match-line in Fig. 5.7 (a) uses four parallel segments in each stage and merges the segment outputs in a four-input NOR gate to generate the final matching result. Hence, the critical delay of the two-stage match-line is (2Tseg + 2TNOR4). On the other hand, the match-line in Fig. 5.7 (b) adopts a three-stage butterfly connection. Since all the match-lines in the same bank begin the search operation simultaneously, the degree of parallelism in the first stage is two instead of three, which reduces the load on the match-line pre-charge trigger signal. Even though the three-stage butterfly match-line has one more Tseg of delay than the two-stage match-line, the delays of both NOR2 and NOR3 are shorter than that of NOR4. Thus, the difference of the search delay of two types match-line
Fig. 5.7 (a) Two-stage (b) Three-stage butterfly match-line scheme for 40-bit TCAM cells.
However, the NOR gates of the butterfly connection must collect the mismatch information from the previous stage. The four-input NOR gate used in the two-stage match-line requires a large driving capability to trigger the four segments in the subsequent stage, especially in a low-power CMOS process; otherwise, the slew rate of NOR4 is too low and search performance degrades. Hence, the power and area overheads of NOR gates with four fan-ins and four fan-outs are larger than those of NOR2 and NOR3. Besides, if a TCAM segment in the first stage mismatches, the three-stage match-line turns off two more segments than the two-stage match-line does. Therefore, we adopt the three-stage butterfly match-line scheme in the 256x40 TCAM design, trading a little search delay for more power saving.
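The segment counts behind this comparison can be checked with simple arithmetic. The sketch below assumes a 2/3/3 split of segments across the three stages; the text states only that the first stage has two parallel segments, so the remaining split is an inference from the 40-bit width and the "two more segments" figure, not a value given in the source.

```python
CELLS_PER_SEGMENT = 5          # each 40-bit segment holds five TCAM cells

TWO_STAGE = [4, 4]             # four parallel segments in each stage
THREE_STAGE = [2, 3, 3]        # assumption: inferred, not stated in the text


def total_cells(stages):
    """Number of TCAM cells covered by all segments of a match-line."""
    return sum(stages) * CELLS_PER_SEGMENT


def segments_off_after_first_stage_miss(stages):
    """Later-stage segments that never evaluate when stage 1 mismatches."""
    return sum(stages[1:])


if __name__ == "__main__":
    for name, stages in (("two-stage", TWO_STAGE), ("three-stage", THREE_STAGE)):
        print(name, total_cells(stages), segments_off_after_first_stage_miss(stages))
```

Under this assumed split, both arrangements cover 40 cells, and a first-stage mismatch disables six segments in the three-stage case against four in the two-stage case.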
5.4 Design Implementation in UMC 40nm LP CMOS Process
Fig. 5.8 (a) Typical TCAM cell. (b) TCAM cell with shared BL/DL.
A binary CAM cell stores either a logic “0” or a logic “1”. Unlike a binary CAM, a ternary CAM has three possible states: logic “0”, logic “1”, and don’t-care X. To store a ternary value, a TCAM cell contains two bits of storage memory and a 1-bit comparison circuit. Fig. 5.8 (a) shows a typical AND-type TCAM cell. To write the stored datum and the don’t-care datum at the same time, the typical TCAM cell has, in addition to the bit-line and search-line pairs, complementary don’t-care lines that transfer the don’t-care data. Hence, each cell has three pairs of vertical lines.
To decrease the cell area overhead and save an additional metal layer for high-density design, the bit-line and don’t-care line are combined in our TCAM cell, as shown in Fig. 5.8 (b). At the same time, the word-line shared between the two memory cells is split into WL_c and WL_d for the storage cell and the don’t-care cell, respectively.
Other TCAM cell designs often merge the don’t-care line with the search-line, which worsens the propagation delay of the search data because of the increased capacitance. Although sharing also increases the capacitance of the bit-lines, search rather than read/write is the dominant operation of a TCAM. Besides, the numbers of vertical control lines and input registers are reduced. Hence, the shared BL/DL saves area not only in the TCAM cell but also in the global control circuits.
5.4.2 Interleaving Vertical Lines
Fig. 5.9 Coupling capacitance.
Moore’s law continues to drive technology scaling, delivering increased density and integration in CMOS technology. Interconnection noise grows accordingly, owing to coupling capacitance and other factors. A coupling capacitance between two conductors, as shown in Fig. 5.9, introduces noise that degrades signal integrity: it induces a spurious pulse on a neighboring wire that holds a static value, or delays the transition of a neighboring wire that is switching. Crosstalk is determined not only by the mutual capacitance itself but also by the ratio of the mutual capacitance to the sum of the self capacitance (to ground) and the mutual capacitance. As technology scales, the spacing between conductors decreases, so crosstalk and other sources of interconnection noise increase as the wires become more compact and closer to one another [i5]. This high-density TCAM design involves long interconnections and a large number of vertical lines, which can increase crosstalk. Crosstalk is a major source of timing uncertainty in circuits and is more prevalent than process variation.
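The capacitance ratio mentioned above is the standard capacitive-divider estimate of crosstalk on a floating victim wire. The sketch below applies that textbook first-order model; it is not taken from the thesis, and the function name and the example capacitance values are illustrative assumptions.

```python
def coupling_noise(v_swing, c_mutual, c_self):
    """First-order noise on a floating victim for an aggressor swing v_swing.

    c_mutual is the coupling capacitance between the two wires and c_self
    is the victim's capacitance to ground (same units for both).
    """
    if c_mutual < 0 or c_self < 0 or c_mutual + c_self == 0:
        raise ValueError("capacitances must be non-negative and not both zero")
    return v_swing * c_mutual / (c_mutual + c_self)


if __name__ == "__main__":
    # Hypothetical values: 1.1 V swing, 2 fF coupling, 8 fF to ground.
    print(coupling_noise(1.1, 2e-15, 8e-15))
```

The model shows why increasing the spacing between neighboring wires helps: a smaller mutual capacitance lowers the ratio and therefore the induced noise.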
Fig. 5.10 (a) Coupling Effect of conventional vertical lines. (b) Interleaving vertical lines.
Fig. 5.10 (a) illustrates the coupling effect in a conventional TCAM design. The cells in the same column have at least four long vertical wires: one bit-line pair and one search-line pair. The bit-line pair (BL and BLB) is implemented in one metal layer, and the search-line pair (SL and SLB) shares another metal layer. Referring to Fig. 5.10 (a), each SL lies close to the SLB of the neighboring column, which induces coupling. Because of this coupling capacitance, the switching of the search-line pairs causes functional degradation and extra power consumption.
Within a limited process cost, the technique of interleaving vertical lines is presented to decrease the coupling effect, as shown in Fig. 5.10 (b). Instead of staying in the same metal layer as SLB, the SL exchanges metal layers with BL, so that SL shares a metal layer with BLB. This reduces the capacitance of the search-line pairs without area overhead and mitigates interconnect noise, because the distance between neighboring wires increases.
5.4.3 Cell Layout
Fig. 5.11 shows the layout view of a 1-bit TCAM cell. A TCAM cell is composed of two SRAM cells and a comparison circuit. Because of the power-gating technique of the data-aware power control, each TCAM cell needs two additional metal layers to route the extra virtual power sources for the storage cell and the don’t-care cell, respectively. To reduce the power dissipation, the ripple bit-line and ripple search-line schemes propagate the write data and search data by local line pairs bank by bank