Chapter 1 Introduction
1.3 Thesis Organization
The organization of this thesis is as follows. An overview of CAM is introduced in Chapter 2, where the cell circuits and word schemes are presented. The applications and prior low-power methodologies of CAM are also described in that chapter. The noise-tolerant butterfly match-line scheme with an XOR-based conditional keeper is realized in Chapter 3, where the AND-type TCAM cell with a P-type comparison circuit is also described. Chapter 4 presents the ripple bit-line scheme, the don’t-care-based ripple search-line scheme and the data-aware power control scheme. By utilizing the regular table of don’t-care patterns, both dynamic and leakage power can be saved. The energy-efficient 256x40 and 256x144 ternary CAM arrays are implemented in Chapter 5. In that chapter, other layout considerations, including the shared BL/DL and interleaved vertical global-line techniques, are presented to reduce area overhead and coupling effects. Finally, the overall investigation results and conclusions are drawn in Chapter 6.
Chapter 2 Overview of Low Power CAM/TCAM Design
This chapter studies CAM design techniques at the circuit level and at the architectural level. Typical CAM/TCAM architectures and applications are described in Section 2.1. The basic operation, cell circuits and word schemes of CAM are presented in Section 2.2. The low-power match-line schemes and low-power search-line driving approaches are presented in Sections 2.3 and 2.4, respectively. At the architecture level, Section 2.5 reviews several design techniques for reducing the power consumption of CAM/TCAM macros.
2.1 Applications & Architecture of CAM/TCAM
2.1.1 Conventional CAM Architecture
A conventional CAM architecture is usually composed of data memories, address decoders, bit-line pre-charge circuits, word match schemes, read sense amplifiers, address priority encoders and so on [2.1]-[2.7]. Fig. 2.1 shows a simplified block diagram of a CAM. Generally, a CAM has three operation modes: write, read, and search. In write and read operations, a CAM behaves just like an ordinary memory; that is, data is manipulated in the CAM array in the same way as in an SRAM array. Different from SRAM, CAM has a special mode: the search mode. The input in Fig. 2.1, called the search word, is broadcast onto the search-lines to the table of stored data.
The number of bits in a CAM word is usually large, with existing implementations ranging from 36 to 144 bits. A typical CAM employs a table size ranging from a few hundred entries to 32K entries, corresponding to an address space ranging from 7 bits to 15 bits. Each stored word has a match-line that indicates whether the search word and the stored word are identical (the match case) or different (the mismatch case, or miss). The match-lines are fed to an encoder that generates a binary match location corresponding to the match-line that is in the match state. A simple encoder is used in systems where only a single match is expected. In CAM applications where more than one word may match, a priority encoder is used instead. A priority encoder selects the highest-priority matching location to map to the match result, with words in lower address locations receiving higher priority. The overall function of a CAM is to take a search word and return the matching memory location. One can think of this operation as a fully programmable arbitrary mapping of the large space of the input search word to the smaller space of the output match location.
Fig. 2.1 Conventional CAM architecture (blocks: address, data, word match circuits, address priority encoder).
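The search operation described above can be sketched behaviorally as follows. This is only a minimal software model, not a circuit description: the function name `cam_search` and the table contents are hypothetical and chosen for illustration. Each stored word drives one match-line, and a priority encoder returns the lowest matching address.

```python
from typing import List, Optional


def cam_search(stored_words: List[int], search_word: int) -> Optional[int]:
    """Return the lowest matching address, or None when every word misses."""
    # One match-line per stored word: True means stored word == search word.
    match_lines = [word == search_word for word in stored_words]
    # Priority encoder: words in lower address locations have higher priority.
    for address, matched in enumerate(match_lines):
        if matched:
            return address
    return None


# Hypothetical table contents, for illustration only.
table = [0b1010, 0b0111, 0b1010, 0b0001]
print(cam_search(table, 0b1010))  # two words match; the lower address wins
print(cam_search(table, 0b1111))  # no stored word matches
```

In hardware all comparisons happen in parallel in a single search cycle; the loop above only models the result.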
2.1.2 Applications of CAM/TCAM
CAMs have long been widely used in cache memory systems and in the translation look-aside buffers (TLBs) of virtual memory systems. The primary commercial application of CAMs today is to classify and forward Internet protocol (IP) packets in network routers [2.8]-[2.12]. In networks like the Internet, a message such as an e-mail or a Web page is transferred by first breaking up the message into small data packets of a few hundred bytes, then sending each data packet individually through the network. These packets are routed from the source, through the intermediate nodes of the network (called routers), and reassembled at the destination to reproduce the original message. The function of a router is to compare the destination address of a packet to all possible routes, in order to choose the appropriate one. A CAM is a good choice for implementing this lookup operation due to its fast search capability.
2.1.2.1 Cache Memory
In the memory hierarchy, the cache plays an important role [2.13], [2.14]. Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU. The term is also used to refer to any storage managed to take advantage of locality of access. A cache provides fast access to recently used portions of instructions or data. When the CPU finds a wanted data item in the cache, it is called a cache hit. Conversely, if the CPU does not find a needed data item in the cache, it is called a cache miss.
An example of a direct-mapped cache is illustrated in Fig. 2.2. The address has 32 bits and is divided into three parts. The first part is the byte offset. The second part is the index, which tells us the capacity of the cache: if there are N bits for the index, the cache has 2^N entries in which data items can be stored. The third part is the tag. A lookup first finds the entry selected by the index, and the tag stored in that entry is read out. This stored tag is compared with the tag part of the address. If they are the same and the valid bit is one, a hit signal and the corresponding data are sent out; if they are not the same, a miss occurs. The valid bit indicates whether an entry contains a valid address or not. The tag entries are composed of a CAM array.
Fig. 2.2 A simple cache memory.
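The lookup sequence of the direct-mapped cache can be sketched as below. The field widths (2-bit byte offset, 10-bit index, 20-bit tag) are assumptions made for this illustration, and the class and function names are hypothetical; the sketch only shows how the index selects an entry and how the stored tag and valid bit decide hit or miss.

```python
OFFSET_BITS = 2    # assumed byte-offset width
INDEX_BITS = 10    # assumed index width: 2**10 = 1024 entries
TAG_BITS = 32 - INDEX_BITS - OFFSET_BITS  # remaining 20 bits form the tag


def split_address(addr):
    """Split a 32-bit address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        entries = 1 << INDEX_BITS
        self.valid = [False] * entries
        self.tags = [0] * entries
        self.data = [None] * entries

    def fill(self, addr, value):
        """Store a data item in the entry selected by the index."""
        tag, index, _ = split_address(addr)
        self.valid[index] = True
        self.tags[index] = tag
        self.data[index] = value

    def lookup(self, addr):
        """Return (hit, data); hit requires a valid entry with an equal tag."""
        tag, index, _ = split_address(addr)
        hit = self.valid[index] and self.tags[index] == tag
        return hit, (self.data[index] if hit else None)
```

With these assumed widths, two addresses that differ only in bit 12 or above select the same entry but carry different tags, so the second one misses after the first has been filled.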
2.1.2.2 Translation Look-aside Buffer
The translation look-aside buffer (TLB) is widely used in virtual memory systems. A TLB is like a cache that holds only page table mappings [2.13], [2.14]. Its function is to provide fast translation from the virtual address to the physical address. When we get
the physical address, we can use it to access the data stored in memory (such as cache, DRAM or disk). Because the TLB speeds up address translation in processors with virtual memory, it also cuts down access time and lowers miss rates.
Fig. 2.3 A simple virtual memory system.
Fig. 2.3 shows a simple virtual memory system. The TLB contains a subset of the virtual-to-physical page mappings that are in the page table. Because the TLB is a cache, it must have a tag field, which consists of CAM cells. When a virtual page number (VPN) is sent to the TLB, this VPN is compared with all valid tags in the TLB. If the VPN matches a tag in the TLB, the corresponding physical page address is read from the matching entry. However, if there is no matching entry in the TLB for a page, the page table must be examined. The page table either supplies a physical page number for the page or indicates that the page resides on disk, in which case a page fault occurs. Since the page table has an entry for every virtual page, no tag field is needed.
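The translation flow above can be sketched in software as follows. The 12-bit page offset, the `TLB` class, the `translate` function and the use of a dictionary for the page table are all assumptions made for illustration; the sketch shows only the order of events: TLB tag compare first, page table on a TLB miss, page fault when the page table has no physical page.

```python
PAGE_BITS = 12  # assumed page-offset width (4 KiB pages)


class TLB:
    def __init__(self):
        self.entries = []  # each entry: (valid, VPN tag, physical page number)

    def insert(self, vpn, ppn):
        self.entries.append((True, vpn, ppn))

    def lookup(self, vpn):
        """Compare the VPN with all valid tags (done in parallel by the CAM)."""
        for valid, tag, ppn in self.entries:
            if valid and tag == vpn:
                return ppn
        return None


def translate(vaddr, tlb, page_table):
    """Translate a virtual address; page_table maps VPN -> PPN (hypothetical)."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    ppn = tlb.lookup(vpn)
    if ppn is None:
        # TLB miss: the page table must be examined.
        ppn = page_table.get(vpn)
        if ppn is None:
            raise LookupError("page fault: page resides on disk")
        tlb.insert(vpn, ppn)
    return (ppn << PAGE_BITS) | offset
```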
2.1.2.3 ATM Switches
For ATM switching network applications, a CAM can be adopted as a translation table. Virtual circuits are an important part of ATM networks, and they need to be set up across the network before any data transfer because ATM networks are connection-oriented. There are two types of ATM virtual circuits: the virtual path, identified by a virtual path identifier (VPI), and the virtual channel, identified by a virtual channel identifier (VCI). Each segment of the total connection has a unique VPI/VCI combination, and the VPI/VCI value of an ATM cell is changed into the value for the next segment of the connection as the cell goes through a switch.
Fig. 2.4 ATM switch with CAM.
CAM is applied to an ATM switch as an address translator and can quickly perform the VPI/VCI translation. In the translation process, the CAM uses the incoming VPI/VCI values in ATM cell headers to generate an address, which is used to access data in RAM. A CAM/RAM combination realizes multi-megabit translation tables with full parallel search capability. The VPI/VCI fields from the ATM cell header are compared with the list of current connections stored in the CAM array; as a result, the CAM generates an address that is used to access an external RAM where the VPI/VCI mapping data and other connection information are stored. The ATM controller uses the VPI/VCI data from the RAM to modify the cell header, and the cell is then sent to the switch, as depicted in Fig. 2.4.
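A minimal behavioral sketch of this CAM/RAM translation is given below. The function name `build_translator`, the tuple layout and the example connection values are hypothetical; the sketch only illustrates that the CAM returns an address and that the RAM at that address holds the outgoing VPI/VCI.

```python
def build_translator(connections):
    """connections: list of ((vpi_in, vci_in), (vpi_out, vci_out)) pairs."""
    cam = [key for key, _ in connections]      # CAM: current incoming VPI/VCI
    ram = [value for _, value in connections]  # RAM: mapping data per address

    def translate(vpi, vci):
        # The CAM compares the header fields with all stored connections
        # and produces the address of the matching entry.
        for address, key in enumerate(cam):
            if key == (vpi, vci):
                return ram[address]  # the RAM read supplies the new header
        return None                  # no such connection has been set up

    return translate


# Hypothetical connection list, for illustration only.
translate = build_translator([((1, 32), (7, 100)), ((2, 45), (3, 51))])
print(translate(2, 45))
```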
2.1.2.4 Packet Forwarding Using CAM
Fig. 2.5 Packet forwarding by an address-lookup table in network routers.
In recent years, TCAMs have been widely used in network routers for packet forwarding and packet classification. Network routers forward data packets from an incoming port to an outgoing port using an address-lookup function [2.17]-[2.20]. Fig. 2.5 schematically depicts a simplified block diagram of a TCAM macro. The address-lookup function examines the destination address of the packet and selects the output port associated with that address. The router maintains a list, called the routing table, which contains destination addresses and their corresponding output ports. The search data are broadcast onto the search-lines to the table of stored data. For example, the packet destination address 01101 is input to the TCAM. As indicated by the table, two entries match, and the priority encoder chooses the upper entry and generates the matching location 01.
This matching location is the address that is input to a RAM that contains a list of output ports, as shown in Fig. 2.5. A read operation of RAM outputs the port designation, port B, to which the incoming packet is forwarded. We can view the match location output of the CAM as a pointer that retrieves the associated word from the RAM. In the particular case of packet forwarding the associated word is the designation of the output port. This TCAM/RAM system fully implements an address-lookup engine for packet forwarding.
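The TCAM/RAM lookup can be sketched as follows. The routing-table contents below are hypothetical: they are chosen only so that the destination address 01101 matches two entries and the upper one sits at location 01 with port B, as in the example in the text. The function names are likewise illustrative.

```python
def ternary_match(pattern, key):
    """A stored bit matches when it is X (don't care) or equals the key bit."""
    return len(pattern) == len(key) and all(
        p in ("X", k) for p, k in zip(pattern, key)
    )


def tcam_lookup(tcam, ram, key):
    """Return (match location, associated RAM word) of the upper matching entry."""
    for address, pattern in enumerate(tcam):  # priority encoder: upper entry first
        if ternary_match(pattern, key):
            return address, ram[address]      # RAM read at the match location
    return None


# Hypothetical routing table: stored patterns and their output ports.
TCAM = ["101XX", "0110X", "011XX", "10011"]
RAM = ["A", "B", "C", "D"]

location, port = tcam_lookup(TCAM, RAM, "01101")
print(format(location, "02b"), port)
```

Placing the more specific pattern above the less specific one makes the priority encoder select the longer prefix when both match.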
2.2 Design of CAM/TCAM Cells
In this section, conventional CAM/TCAM cells are introduced. A CAM cell serves two basic functions: bit storage (as in RAM) and bit comparison (unique to CAM). Two types of CAM cells are introduced below: the binary CAM (BCAM) cell and the ternary CAM (TCAM) cell.
2.2.1 Binary CAM Cell
Depending on how they work in search mode, CAM cells are classified into two kinds: the NOR-type CAM cell and the AND-type CAM cell [2.21], [2.22]. The differences between them are described as follows.
2.2.1.1 NOR-type CAM Cell
Fig. 2.6 NOR-type binary CAM cell. (a) 9-transistor BCAM cell and (b) 10-transistor BCAM cell.
Table 2.1 Truth table of NOR-type binary CAM cell.
State      Qi   SL   ML
Zero (0)   0    0    floating
Zero (0)   0    1    0
One (1)    1    0    0
One (1)    1    1    floating
Fig. 2.6 depicts the NOR-type CAM cells, which have been widely used in CAM designs. The cell in Fig. 2.6 (a) is a 9-transistor (9T) structure and the cell in Fig. 2.6 (b) is a 10-transistor (10T) structure. Table 2.1 shows the truth table of a NOR-type CAM cell. The 9T CAM cell consists of a traditional 6T SRAM cell and a PTL-type compare circuit; the 10T CAM cell is composed of an ordinary 6T SRAM cell and a pull-down XOR comparison circuit. In a write operation, both the 9T and the 10T CAM cell work the same as an SRAM cell: while the word-line is active, the complementary data is forced onto the bit-lines and stored in the D-latch. In a read operation, the bit-lines are pre-charged first, and whether the bit-lines discharge to ground or not depends on the stored data. After passing the read sense amplifier, the correct data is sent to the output stage.
In the 9T CAM cell, the match-line is pre-charged high first in the search operation. If the search data is equal to the stored data, node X becomes low, the NMOS Mn is turned off, and the match-line remains floating. On the other hand, if the search data does not match the stored data, node X becomes high and turns on the NMOS Mn, so the match-line is discharged to ground. The 10T CAM cell follows the same principle. During the search operation, the match-line is pre-charged high first. If the search data is equal to the stored data, the match-line remains floating. Conversely, if the search data is not equal to the stored data, there is a path from the match-line to ground, and the match-line is discharged through this path.
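The search behavior of the NOR-type cell in Table 2.1 can be summarized by a small behavioral model. The function names are hypothetical, and the word-level function assumes that the cells of one word share a match-line in parallel, so that a single mismatching cell is enough to discharge it.

```python
def nor_cell_discharges(q, sl):
    """Table 2.1: the cell pulls the match-line low only on a mismatch."""
    return q != sl


def nor_match_line(stored_bits, search_bits):
    """Return 1 if the pre-charged match-line stays high (match), else 0."""
    ml = 1  # the match-line is pre-charged high before evaluation
    for q, sl in zip(stored_bits, search_bits):
        if nor_cell_discharges(q, sl):
            ml = 0  # any mismatching cell opens a path to ground
    return ml


print(nor_match_line([1, 0, 1, 1], [1, 0, 1, 1]))  # all bits equal
print(nor_match_line([1, 0, 1, 1], [1, 0, 0, 1]))  # one bit differs
```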
2.2.1.2 AND-type CAM Cell
Fig. 2.7 AND-type 9-transistor binary CAM cell.
An AND-type CAM cell works the same as the 9-transistor CAM cell in write and read operations. The only difference from the 9T CAM cell is the match-line scheme. Fig. 2.7 depicts an AND-type CAM cell and Table 2.2 gives its truth table. In the search operation of an AND-type CAM cell, the match-line is pre-charged high first. In contrast to the NOR-type cell, the match-line remains floating when the search data does not match the stored data, and is discharged to ground only when the search data matches the stored data.
Table 2.2 Truth table of AND-type binary CAM cell.
State      Qi   SL   ML
Zero (0)   0    0    0
Zero (0)   0    1    floating
One (1)    1    0    floating
One (1)    1    1    0
2.2.2 Ternary CAM Cell
In CAM circuit design, the ternary CAM (TCAM) performs a more powerful data search function [2.1]. Unlike the binary CAM cell, which has two states, one (1) and zero (0), the TCAM cell has an additional state: the don’t care (X) state. Like the binary CAM cell, the TCAM cell is classified into two kinds: the NOR-type TCAM cell and the AND-type TCAM cell.
2.2.2.1 NOR-type TCAM Cell
Fig. 2.8 Static NOR-type ternary CAM cell.
Table 2.3 State assignments and truth table for static TCAM cell.
State            Qi   Qj   SL   ML
Zero (0)         0    1    0    floating
Zero (0)         0    1    1    0
One (1)          1    0    0    0
One (1)          1    0    1    floating
Don’t care (X)   0    0    0    floating
Don’t care (X)   0    0    1    floating
Not allowed      1    1    0    -
Not allowed      1    1    1    -
Fig. 2.8 shows a static NOR-type TCAM cell. It consists of two SRAM cells and comparison circuits. This TCAM cell is designed to store three states, namely zero (0), one (1) and don’t care (X). These three states are set by Qi and Qj. Table 2.3 illustrates how the three states are stored in this TCAM cell and gives the truth table of the static NOR-type TCAM cell. When Qi is low and Qj is high, the TCAM cell is in the “zero” state. In the search operation, as in the BCAM cell, the match-line is pre-charged high first. If the search data is low, the NMOS M1 and M4 are not turned on, so the ML remains floating. On the other hand, when the search data is high, the NMOS M1 and M2 are turned on at the same time, and the match-line is discharged to ground. When the TCAM cell is in the “one” state, the match-line stays high if the search data is high and is discharged to ground if the search data is low. When Qi and Qj are both low, the TCAM cell is in the “don’t care” state: whether the search data is high or low, the NMOS M1 and M3 are not turned on, and the match-line remains floating. Note that Qi and Qj cannot be high simultaneously; this state is not allowed.
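The state assignment and truth table of Table 2.3 can be condensed into a Boolean sketch. The expression used below, (Qj AND SL) OR (Qi AND NOT SL), is only a summary derived from the table rows, not a transistor-level model, and the names are hypothetical. The word-level function assumes cells connected in parallel on one match-line.

```python
# (Qi, Qj) state assignment taken from Table 2.3.
STATES = {"0": (0, 1), "1": (1, 0), "X": (0, 0)}


def nor_tcam_cell_discharges(qi, qj, sl):
    """True when the cell pulls the pre-charged match-line to ground."""
    if qi and qj:
        raise ValueError("Qi = Qj = 1 is not an allowed state")
    return bool((qj and sl) or (qi and not sl))


def nor_tcam_match_line(stored, search):
    """stored: string over 0/1/X; search: string over 0/1. Returns ML level."""
    ml = 1  # pre-charged high
    for symbol, bit in zip(stored, search):
        if nor_tcam_cell_discharges(*STATES[symbol], int(bit)):
            ml = 0
    return ml


print(nor_tcam_match_line("01X", "010"))  # X matches either search bit
print(nor_tcam_match_line("01X", "110"))  # first bit mismatches
```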
There is also a dynamic NOR-type TCAM cell, called the dynamic TCAM cell [2.23]-[2.26], as shown in Fig. 2.9. The major difference between the static and the dynamic TCAM cell is that the two SRAM cells used for storage in the static TCAM cell are replaced by two capacitors in the dynamic TCAM cell. The dynamic TCAM cell works like the static TCAM cell, and Table 2.3 also applies to it, showing both how the three states are stored and the truth table.
Fig. 2.9 Dynamic NOR-type ternary CAM cell.
2.2.2.2 AND-type TCAM Cell
Fig. 2.10 AND-type ternary CAM cell.
Table 2.4 State assignments for TCAM cell.
State            Qi   Qj   SL   ML
Zero (0)         0    0    0    0
Zero (0)         0    0    1    floating
One (1)          1    0    0    floating
One (1)          1    0    1    0
Don’t care (X)   0    1    0    0
Don’t care (X)   0    1    1    0
Don’t care (X)   1    1    0    0
Don’t care (X)   1    1    1    0
Fig. 2.10 illustrates a 16-transistor AND-type TCAM cell, which includes two SRAM cells and a comparison circuit composed of three NMOS transistors. The state assignments and truth table of this TCAM cell are given in Table 2.4. The AND-type TCAM cell behaves like a 9-transistor AND-type BCAM cell when it is in the zero (0) or one (1) state. However, when this AND-type TCAM cell is in the don’t care (X) state (Qj is high), the match-line is discharged whether the search data is high or low.
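Table 2.4 can likewise be summarized behaviorally. In the sketch below the names are hypothetical, and the word-level function rests on an assumption not stated in this section: that AND-type cells are chained in series, so the match-line is discharged only when every cell in the word conducts.

```python
# (Qi, Qj) state assignment from Table 2.4; any entry with Qj = 1 is don't care.
STATES = {"0": (0, 0), "1": (1, 0), "X": (0, 1)}


def and_tcam_cell_conducts(qi, qj, sl):
    """True for the rows of Table 2.4 where ML is 0 (match or don't care)."""
    return bool(qj) or qi == sl


def and_tcam_word_matches(stored, search):
    """Assumed series chain: the word matches only if every cell conducts."""
    return all(
        and_tcam_cell_conducts(*STATES[symbol], int(bit))
        for symbol, bit in zip(stored, search)
    )


print(and_tcam_word_matches("1X0", "110"))  # X masks the middle bit
print(and_tcam_word_matches("1X0", "111"))  # last bit mismatches
```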
2.3 Low Power Match-line Schemes
The dynamic power consumed by a single match-line that misses is due to the rising edge during pre-charge and the falling edge during evaluation, and is given by Eq. (2.1), where C_ML is the match-line capacitance, V_DD is the supply voltage and f is the frequency of search operations. In the case of a match, the power consumption associated with a single match-line depends on the previous state of the match-line. Typically, only a small number of match-lines are in the match state, so this power consumption can be neglected. Accordingly, the overall match-line power consumption of a CAM block with w match-lines is given by Eq. (2.2).
$P_{miss} = C_{ML} \cdot V_{DD}^{2} \cdot f$ (2.1)
$P_{ML} = w \cdot P_{miss} = w \cdot C_{ML} \cdot V_{DD}^{2} \cdot f$ (2.2)
With the advance of technology, noise increasingly raises the soft-error rate of dynamic circuits. Therefore, a low-power, high-speed and noise-tolerant TCAM is desired. A large variety of techniques has been proposed to reduce the power consumption of match-lines; they are categorized as follows.
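As a rough numerical illustration of Eqs. (2.1) and (2.2), the sketch below evaluates the match-line power for one set of parameter values. All four values (256 match-lines, 100 fF per match-line, 1.2 V supply, 200 MHz search rate) are hypothetical and are not taken from this work; the function name is likewise illustrative.

```python
def match_line_power(w, c_ml, v_dd, f):
    """Eq. (2.2): w match-lines, each dissipating C_ML * V_DD^2 * f on a miss."""
    p_miss = c_ml * v_dd ** 2 * f  # Eq. (2.1), in watts
    return w * p_miss


# Hypothetical parameter values, for illustration only.
W = 256          # number of match-lines
C_ML = 100e-15   # match-line capacitance in farads
V_DD = 1.2       # supply voltage in volts
F = 200e6        # search frequency in hertz

p_ml = match_line_power(W, C_ML, V_DD, F)
print(f"P_ML = {p_ml * 1e3:.3f} mW")
```

Because the power scales with the square of V_DD and linearly with C_ML, lowering the match-line voltage swing or capacitance is a natural target for match-line schemes.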
2.3.1 Conventional Match-line Structure
In the conventional CAM architecture, the CAM word circuits adopt dynamic CMOS circuits to improve data-matching performance and reduce hardware cost. Based on dynamic CMOS circuit design, the conventional NOR-type and AND-type match-line schemes are shown in Fig. 2.11 and Fig. 2.12, respectively [2.27], [2.28].
2.3.1.1 NOR-type Match-line
Fig. 2.11 Structure of conventional NOR-type match-line.
Fig. 2.11 depicts, in schematic form, how NOR-type cells are connected in parallel to form a NOR-type match-line. Although CAM cells are shown in the figure, the description of match-line operation applies to both CAM and TCAM. A typical NOR search cycle operates in three phases: search-line pre-charge, match-line pre-charge, and match-line evaluation. First, the search-lines are discharged to disconnect the match-lines from ground by disabling the pull-down paths in each CAM cell. Second, with the pull-down paths disconnected, the match-lines are pre-charged by Mpre. Finally,