
Chapter 2 Overview of Low Power CAM/TCAM Design

2.5 Low Power Design Techniques for CAM/TCAM Macro

2.5.4 Variability-Tolerance CAM Cells with NOR-type Match-lines

Fig. 2.27 (a) NVT-BCAM cell with NOR-type match-line. (b) Read/Write timing sequence of the NVT-BCAM cell.

Within-chip variability has become a serious problem in modern nano-scale technologies, and this is particularly true for semiconductor memory designs. The variability-tolerant BCAM cell separates the read port from the write port so that the sizing for read static noise margin and the sizing for write trip voltage are decoupled [2.59]. Fig. 2.27 (a) shows the N-type variability-tolerant BCAM (NVT-BCAM) cell with a NOR-type match-line. A transistor, MN4, is added to the comparator to perform the read operation. Fig. 2.27 (b) shows the timing sequence of the read and write operations of the NVT-BCAM cell. For a write operation, the WWL is pulled high and Din is placed on the bit-lines, so that Din is stored in the SRAM storage. For a read operation, the bit-lines are first pre-charged to VDD and then the RWL is enabled.

If the SRAM storage holds logic 1, MN2 and MN4 are turned on, so the bit-line Bb is discharged to logic 0 through the path MN2, MN4, VSS and data 1 is read. Conversely, if the SRAM storage holds logic 0, the bit-line B is discharged to logic 0 through the path MN1, MN4, VSS and data 0 is read. As a consequence, separating the read port from the write port decouples the cell sizing for read static noise margin from the sizing for write trip voltage.

Moreover, by reusing the comparison logic of a BCAM cell as the read port, only one additional transistor and a read word-line are needed. Experimental results show that the NVT-BCAM cell provides good read static noise margin and write trip voltage at a lower area cost than the typical CAM cell.
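The read and write sequence described above can be summarized with a small logic-level sketch. This is only a behavioral illustration, not a circuit simulation: the class name `NvtBcamCell` and its method names are hypothetical, and only the signal names (WWL, RWL, Din, B, Bb, MN1, MN2, MN4) come from the text.

```python
VDD, VSS = 1, 0  # logic levels only; no analog behavior is modeled


class NvtBcamCell:
    """Hypothetical logic-level model of the NVT-BCAM read/write ports."""

    def __init__(self):
        self.q = 0  # state of the SRAM storage

    def write(self, wwl, din):
        # Write port: Din on the bit-lines is stored only while WWL is high.
        if wwl:
            self.q = din

    def read(self, rwl):
        # The bit-lines B and Bb are pre-charged to VDD before RWL is enabled.
        b, bb = VDD, VDD
        if rwl:
            if self.q == 1:
                bb = VSS  # Bb discharges through MN2, MN4, VSS: data 1 is read
            else:
                b = VSS   # B discharges through MN1, MN4, VSS: data 0 is read
        return b, bb


cell = NvtBcamCell()
cell.write(wwl=1, din=1)
b, bb = cell.read(rwl=1)  # B stays high and Bb is pulled low for a stored 1
```

Because the read path never drives the storage nodes in this model, a read cannot disturb the stored bit, which is the intuition behind decoupling the read and write sizing.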

Chapter 3 Energy-Efficient Match-Line Schemes

The NOR-type match-line scheme provides high search performance at the cost of large power dissipation, whereas the AND-type match-line scheme trades performance for reduced switching activity. As the range of CAM/TCAM applications grows, energy efficiency becomes the critical issue. We therefore propose a 16T AND-type TCAM cell with P-type comparison circuits that saves match-line power while maintaining good search speed, especially in a low power process.

On the other hand, leakage currents, charge sharing, and coupling noise all increase the soft-error rate of dynamic circuits as technology advances. They degrade not only the performance but also the functionality of the TCAM macro. To overcome these problems, our TCAM design employs XOR-based conditional keepers and a butterfly match-line scheme to achieve a noise-tolerant, high-speed, low power TCAM. In addition, we modify the NOR gate of the butterfly match-line scheme to improve the stability of the dynamic circuits. These designs are described in detail below.

3.1 Conventional NAND-Type Match-Line Schemes

In the typical NAND-type (AND-type) match-line structure, a number of cells are cascaded to form the match-line. In the case of a match, where all bits of the stored data match all bits of the search data, the match-line is discharged to ground. In the case of a mismatch, where the stored data differ from the search data in at least one bit, the match-line remains at VDD. Generally, a match is less probable than a mismatch. Therefore, the NAND-type scheme consumes little power but has a longer search time owing to its deep fan-in circuits. Many works have been presented to increase the search speed of the NAND-type match-line, including the pseudo-footless clock-data pre-charge dynamic (PF-CDPD) match-line scheme [3.1] shown in Fig. 3.1, the range matching scheme [3.2], and the tree-style NAND-type match-line scheme [3.3]. Their common concept is to separate the match-line into several segments. The details of PF-CDPD are reviewed in this section.

Fig. 3.1 PF-CDPD AND-type match-line scheme. (Figure labels: 12 TCAM segment, 6 TCAM cells, TCAM cell, critical path, ML, out.)

Fig. 3.2 illustrates how a four-input dynamic AND gate is transformed into clock-and-data pre-charge dynamic (CDPD) circuits [3.4], [3.5]. The n-stage pseudo-footless clock-and-data pre-charge dynamic (PF-CDPD) circuit, shown in Fig. 3.3, combines the operation and characteristics of CDPD with the AND-type match-line scheme to decrease power and search delay. A search operation is typically divided into two phases, pre-charge and evaluation. At the beginning of the pre-charge phase, the clock drives the floating node (C1) of the first stage high, rather than the global match-line. The output of the first stage then goes low to trigger the next floating node (C2). In this way, all floating nodes (C1 to Cn) are charged high during the pre-charge phase.

During the evaluation phase, the output of each stage depends only on the result of the previous stage. For instance, if the comparison result of the first stage is a match, all NMOS transistors of this stage are turned on, and the first output goes high to enable the second comparison stage. If, on the other hand, the comparison result of the first stage is a mismatch, the output of this stage stays low and the second comparison stage is disabled.
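The stage-by-stage enabling described above can be illustrated with a short behavioral sketch. It is a simplified model under stated assumptions, not the circuit itself: the function names, the string encoding of words, and the use of 'X' for a ternary don't-care are all hypothetical choices made for illustration.

```python
def cell_matches(stored, search):
    # Assumption: 'X' encodes a stored don't-care, which matches either search bit.
    return stored == "X" or stored == search


def pf_cdpd_search(stored_word, search_word, segment_size):
    """Return (match, stages_evaluated) for one segmented AND-type match-line.

    Each stage is evaluated only if every earlier stage matched, mirroring
    the way a mismatching stage keeps its output low and disables the next.
    """
    if len(stored_word) != len(search_word):
        raise ValueError("stored and search words must have equal length")
    stages_evaluated = 0
    for start in range(0, len(stored_word), segment_size):
        stages_evaluated += 1
        stored_seg = stored_word[start:start + segment_size]
        search_seg = search_word[start:start + segment_size]
        if not all(cell_matches(s, k) for s, k in zip(stored_seg, search_seg)):
            return False, stages_evaluated  # later stages never switch
    return True, stages_evaluated


# A mismatch in the first segment stops the search after one stage.
result = pf_cdpd_search("000011110000", "100011110000", segment_size=4)
```

The count of evaluated stages is a rough proxy for switching activity: the earlier a mismatch occurs, the fewer stages are exercised.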

Fig. 3.2 Transfer dynamic logic into clock-and-data pre-charge dynamic (CDPD) circuits. (Figure labels: inputs A, B, C, D; CLK; OUT.)

[Fig. 3.3: n-stage PF-CDPD match-line. Figure labels: MLpre, MLout, 1st stage, 2nd stage, 3rd stage, nth stage, floating nodes C1, C2, C3, Cn.]

Accordingly, a CAM/TCAM that adopts PF-CDPD circuits has several advantages. First, because the match-line is divided into many segments, the serial comparison transistors need not be made overly large to meet the same search-time target. Second, the switching capacitance is reduced effectively because of the smaller comparison transistors; the switching capacitance of the match-line is also decreased because the match-line is split into separate segments rather than forming one deep logic chain. Third, the evaluation of each PF-CDPD match-line stage is enabled or disabled by the output of the preceding stages. That is, if the stored data and the search data mismatch, the output disables all comparison operations in the following stages to avoid unnecessary switching. Consequently, PF-CDPD circuits shorten the search time and reduce power consumption.

3.2 AND-Type TCAM Cell with P-Type Comparison Circuits

As process technology scales down, the design of reliable circuits faces many challenges, including the charge sharing effect, increasing leakage current, and a decreasing Ion-Ioff ratio, all of which limit circuit operation. Especially in a low power process, the conventional AND-type CAM/TCAM cell is no longer suitable because the decreasing Ion-Ioff ratio destroys the functionality of the search operation.

Many works have been devoted to the design of CAM/TCAM cells that increase the voltage swing or reduce the search delay of the comparison circuits. The work in [3.4] uses a 4-transistor CMOS XNOR function to restore full voltage swing, but increases the capacitance of the storage node and of the single bit-line. Also aiming to drive the local match-line with full swing, the design in [3.6] uses an XOR CAM cell with transmission gates; swapping the inputs of the XOR gate improves the slopes on the SLs and gives a faster compare delay. However, the two extra PMOS transistors add area overhead, and the swapped XOR cell requires more complex wire routing. The novel 16T AND-type TCAM cell with P-type comparison circuits (shown in Fig. 3.4) provides a larger voltage swing in the comparison circuits without additional transistors. The following analysis is based on the UMC 40nm low power CMOS process.

Fig. 3.4 16T AND-type TCAM cell with P-type comparison circuits. (Figure labels: search-line pair, LML, BL/DL, WL_c, WL_d, Qi, Qj, QjB, M1 to M4; legend: regular-Vt MOSFET, low-Vt MOSFET.)

The 16T AND-type TCAM cell is composed of two traditional 6T SRAM cells, which are typically minimum-size to maintain high cell density, and four comparison transistors, M1 through M4, that implement the comparison between stored data and search data. In our design, PMOS transistors (M1 and M2) drive the pass transistor (M3) of the local match-line (LML). This change takes the decreasing Ion-Ioff ratio into account. Fig. 3.5 depicts the drain current versus gate voltage for the 65nm standard process (65nm SP) and the 40nm low power process (40nm LP); the shaded region shows the upper-bound and lower-bound currents of the keeper. For a reliable TCAM cell, the turn-on current of the LML should be larger than the upper bound, and the turn-off current of the LML should be below the lower-bound current of the keeper. In the standard process, the Ion-Ioff ratio is large enough to ensure robust search operation. In the low power process, however, the high threshold voltage decreases the drain current, and the shrinking Ion-Ioff ratio makes it hard to distinguish the match and mismatch states.
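The reliability condition stated above reduces to two inequalities, which the following sketch checks. The function name and the current values are hypothetical placeholders chosen only to exercise the check; they are not measurements from Fig. 3.5.

```python
def lml_is_reliable(i_on, i_off, keeper_lower, keeper_upper):
    """Check the two conditions on the local match-line (LML) currents.

    i_on must exceed the keeper's upper-bound current so a match can
    overpower the keeper; i_off must stay below the keeper's lower-bound
    current so a mismatch is held. All currents are in amperes.
    """
    return i_on > keeper_upper and i_off < keeper_lower


# Hypothetical numbers for illustration only (not taken from the thesis).
wide_margin = lml_is_reliable(i_on=50e-6, i_off=1e-9,
                              keeper_lower=1e-6, keeper_upper=10e-6)
narrow_margin = lml_is_reliable(i_on=5e-6, i_off=1e-9,
                                keeper_lower=1e-6, keeper_upper=10e-6)
```

In the second call the on-current falls inside the keeper's current range, which corresponds to the situation in which match and mismatch can no longer be separated reliably.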

From the current curves in Fig. 3.5, the NMOS transistor is seen to have a smaller matching current

Fig. 3.5 Drain current versus gate voltage for different technologies. (Figure annotations: current range of keeper under charge sharing for 40nm LP and for 65nm SP; MLg.)

Although this cell is similar to the conventional TCAM cell, the PMOS comparison circuits preserve correct functionality and reduce the search output delay by improving the Ion-Ioff ratio. During a matching search operation, the PMOS transistors provide full VDD to M3, which discharges the LML with a stronger drain current. In the mismatch state, the cell also suppresses the charge sharing effect on the AND-type match-line because large pass transistors (M3 and M4) are not needed. Consequently, the tolerance of the match-line keeper to noise and variation is increased. Compared with the conventional TCAM cell, the AND-type TCAM cell with P-type comparison circuits does not increase the cell area. It can also be applied to either a binary CAM or a ternary CAM.

3.3 XOR-based Conditional Keeper

3.3.1 Circuit Implementation

Fig. 3.6 AND-type match-line with XOR-based conditional keeper. (Figure labels: MLpre, floating node, TCAM cell, MLout, XOR-based conditional keeper.)

Table 3.1 Control mechanism of the XOR-based conditional keeper.

CLK    Floating node    Control signal on gate of keeper
Low    Low              Low, to speed up the pre-charge process
Low    High             High, to avoid the impact on performance at the very beginning of evaluation
High   Low              High, keeper should be off
High   High             Low, keeper should be activated to enhance noise immunity

conventional keepers perform more poorly in terms of propagation delay and power consumption. Accordingly, an XOR-based conditional keeper has been presented in [3.7]. The main idea of the XOR-based conditional keeper is to ensure that the keeper is not turned on in the dynamic circuit at the beginning of the evaluation phase. Fig. 3.6 and Table 3.1 present the control signals and their corresponding keeper states.
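The control behavior in Table 3.1 is the XOR of the clock and the floating node, as the following minimal sketch shows. The function names are hypothetical, logic levels are encoded as 0 (low) and 1 (high), and gate delay is ignored, so the "slightly turned on" interval at the start of evaluation is not modeled.

```python
def keeper_gate(clk, floating_node):
    """Control signal on the keeper gate: XOR of CLK and the floating node."""
    return clk ^ floating_node


def keeper_on(clk, floating_node):
    # Per Table 3.1, a low control signal activates the keeper.
    return keeper_gate(clk, floating_node) == 0


# Enumerate the four rows of Table 3.1 in order: (CLK, floating node).
rows = [(clk, fn, keeper_gate(clk, fn), keeper_on(clk, fn))
        for clk in (0, 1) for fn in (0, 1)]
```

In this static view the keeper conducts in the first and last rows only: while pre-charge is still in progress, and after an evaluation that leaves the floating node high.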

Fig. 3.7 Timing diagram of the XOR-based conditional keeper. (Waveform labels: CLK, input, floating node, XOR output; annotations: keeper on, slightly turned on.)

The match-line starts the pre-charge cycle by setting both the pre-charge signal and floating node to low voltage. Concurrently, the conditional keeper should be turned on to accelerate the pre-charge procedure. When the match-line pre-charge signal is low and the floating node goes high, the pre-charge process completes and the circuit is ready to be evaluated. Since the match-line is pre-charged high, the conditional keeper should be turned off, preventing any impact on the delay and any unnecessary power consumption.

Evaluation of the match-line starts when the pre-charge signal goes from low to high. At the beginning of the evaluation process, the floating node stays at the high voltage, but it will eventually settle to the appropriate voltage as long as the delay of the XOR gate exceeds the propagation delay of the dynamic circuits. Because the delay of the dynamic circuits is shorter than that of the XOR gate, the conditional keeper is only slightly turned on at the beginning of the evaluation process. At the end of the evaluation phase, the conditional keeper is fully turned on or off as determined by the final search output stored on the floating node. If the floating node is kept high, reflecting a mismatch state, the conditional keeper is turned on to help hold the voltage high. For a match-line in the match state, the pre-charge signal is high and the floating node is pulled toward ground, so when the evaluation completes the final value on the floating node is low and the conditional keeper is fully turned off. An XOR gate is required to generate the desired control signals. The timing diagram for the XOR-based conditional keeper is shown in Fig. 3.7.

3.3.2 Design Analysis

Fig. 3.8 (a) Search time and (b) power consumption versus UNG margin for different keepers.

We take the design of an 8-bit AND-type match-line as an example. Four different match-line schemes are compared. The first design is the match-line scheme with a conventional keeper, whose configuration is shown in Fig. 3.1. The second design applies a weak keeper to the match-line scheme. The third design reduces search delay by applying the twin-transistor technique to the match-line. The last one is the proposed AND-type match-line scheme with the XOR-based conditional keeper. In the noise tolerance comparison, what we are concerned with is not the actual size of the keeper device or of the twin transistors but the ability to resist noise. This ability is verified by the widely used Unity Noise Gain (UNG) margin [3.8].

Fig. 3.8 (a) and Fig. 3.8 (b) summarize the simulation results, showing the search time and the power consumption, respectively, versus the unity noise gain margin for the four types of AND-type match-line. At a UNG of 810mV, the XOR-based keeper achieves a 19.2% improvement in search time and a 3.5% reduction in power consumption compared with up-sizing the conventional keeper. Under the same condition, compared with the weak keeper, it achieves a 27.1% improvement in search time and an 8.9% reduction in power consumption. Although the twin-transistor technique is suitable for deep fan-in dynamic circuits, its performance is worse than that of the XOR-based keeper: at a UNG of 810mV, its search delay is 16.3% longer and its power consumption is 8.9% higher than those of the XOR-based keeper. The XOR-based conditional keeper is therefore a good tradeoff, because it costs only 1.8% and 1.0% area overhead compared with the conventional keeper and the weak keeper, respectively.

3.4 Butterfly Match-Line Scheme

In this section, the butterfly match-line scheme is presented. By increasing the parallelism of the search operation, the butterfly match-line scheme improves search performance. It also reduces power consumption through the interlaced pipeline, since the butterfly connection turns off more TCAM segments than the PF-CDPD match-line does. In addition, we enhance the noise tolerance of the dynamic circuits by adjusting the NOR gate in the butterfly match-line.

3.4.1 Organization

Fig. 3.9 Butterfly match-line scheme. (Figure labels: Stage-1 through Stage-6, TCAM.)

Fig. 3.9 shows the simplified butterfly match-line circuit, which is based on the PF-CDPD match-line scheme [3.1]. Each circle represents a TCAM segment, (6Tseg+5TNOR2+TNOR4) compared to the conventional PF-CDPD scheme. Tseg is the discharging time of a TCAM segment; note that Tseg is much larger than the delay of the NOR gates. To reduce the power consumption, a butterfly connection is made among the four independent sub match-lines by interlacing their connections, as shown in Fig. 3.9. The first stage of the match-line (Seg-1 to Seg-4) is active in every evaluation phase. When at least one TCAM segment mismatches the search data, the mismatch signal is propagated to the following sub match-lines through the butterfly connection, and the search operations behind the mismatched segment are terminated. Thus, the butterfly match-line scheme turns off more TCAM segments than the conventional PF-CDPD match-line scheme does.

The butterfly match-line scheme not only achieves high performance through a high degree of parallelism but also reduces power by exploiting the interlaced connections. Such a match-line could instead be implemented with full connections between two stages, feeding the mismatch information into the subsequent stages. However, this requires a four-input NOR gate to collect the information from the previous stage regardless of the state of the sub segment. Furthermore, this NOR gate must provide a large driving capability to trigger the four segments in the subsequent stage. Although full connection can turn off two more segments than the butterfly match-line scheme can, the power and performance overheads of NOR gates with four fan-ins and four fan-outs would dominate the critical path of the match-line. Accordingly, the butterfly connection turns off the segments behind the mismatching segment most efficiently.

The power analysis of the butterfly match-line scheme is as follows. Before the power formulas of the butterfly match-line schemes can be derived, some assumptions are made for simplicity.

• The power consumption of the search operation is the same in all segments (Pseg) when all of the TCAM cells in a TCAM segment match the search data.

• The matching probability of TCAM cell [i] is denoted pi (i = 1 to 144). The probability pi is defined as one when i < 1.

• The matching probability of segment-n, PSn, is the probability that TCAM segment-n matches the search data.

Each stage consists of four TCAM segments, and the segments in stage-1 are Seg-1 to Seg-4. The terms j and k refer to the odd and even stages of the butterfly match-line scheme, respectively. We see from Eq. (3.1) that the butterfly match-line scheme achieves higher power saving since more TCAM segments are not turned on in the evaluation period. For instance, if Seg-9 in Fig. 3.9 mismatches, the segments with the gray background in stages 4, 5 and 6 will not be activated.
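To give a feel for how the assumptions above translate into a power estimate, the following sketch computes the expected number of activated segments for two bounding cases. It does not reproduce Eq. (3.1) and does not model the butterfly wiring itself. The assumptions, none of which are stated in the text, are: cell matches are statistically independent; each segment holds 6 consecutive cells (144 cells over 6 stages of 4 segments); segments are numbered stage by stage; and every activated segment is charged the full Pseg. The two cases are four independent sub match-lines with no cross connection, and the full connection in which any mismatch in a stage disables all later stages. The discussion above suggests that the butterfly connection falls between these two.

```python
from math import prod


def segment_match_probs(p, cells_per_segment=6):
    """PSn for each segment: product of the cell match probabilities in it."""
    if len(p) % cells_per_segment:
        raise ValueError("cell count must be a multiple of cells_per_segment")
    return [prod(p[i:i + cells_per_segment])
            for i in range(0, len(p), cells_per_segment)]


def expected_active_segments(ps, lines=4, full_connection=False):
    """Expected number of segments evaluated in one search.

    ps is ordered stage by stage (Seg-1 to Seg-4 form stage 1). A segment is
    evaluated only if all segments it depends on matched: the earlier segments
    of its own sub match-line, or every earlier segment under full connection.
    """
    stages = len(ps) // lines
    total = 0.0
    for s in range(stages):
        for line in range(lines):
            if full_connection:
                prior = ps[:s * lines]
            else:
                prior = [ps[t * lines + line] for t in range(s)]
            total += prod(prior)  # prod of an empty list is 1: stage 1 is always active
    return total


p = [0.9] * 144                      # hypothetical uniform cell match probability
ps = segment_match_probs(p)
independent = expected_active_segments(ps)
full = expected_active_segments(ps, full_connection=True)
# Multiplying either value by Pseg gives a rough upper estimate of search power.
```

Charging the full Pseg to every activated segment overestimates the power of segments that mismatch, so the result is best read as a relative comparison between connection styles.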

3.4.2 Design Consideration

Although one match-line with the butterfly connection is divided into four sub match-lines, in the layout all of the sub match-lines are placed serially along the same line rather than in parallel on four different lines. Hence, the NOR gate of each segment suffers from a large capacitance because of the long interconnections, and its input signals differ in wire length. In the mismatch state, when the match-line pre-charge signal goes high, the evaluation phase starts and the floating node is discharged slowly due to the large capacitance. On the other hand, rather than full connections between two stages, the 144-bit TCAM adopts the butterfly connection since a NOR2 requires less driving capability than a NOR4. However, we implement the 40-bit TCAM with a NOR3 gate as a compromise between power and performance. With the NOR3 gate, the wider fan-in NMOS pull-down leaks the charge stored in the
