
Chapter 3 Low Power 2R2W Multi-Port 8Kb SRAM Design

4.5 This Work

4.5.5 Double Pump Operation

Based on Kuo’s work, this section proposes a more robust, lower-power design. The slot control circuit is shown in Fig. 4.25, and its operation waveform is shown in Fig. 4.26. The circuit checks whether the second pulse is enabled. When the “WEN_S1” signal is 0, only one write is performed in the cycle. After TS1 is triggered, TS2 (the write-finish signal) rises after it. In this low-power design, the SA is left floating when no SA pulse is triggered, and power gating significantly reduces leakage in sleep mode. To make the register file more robust, a write-followed-by-read control is added. Simulation data show that the read time is longer than the write time (see the next chapter). If the write pulse is too fast, the second write may disturb the first read slot. For this reason, the second write pulse is triggered only after the first read slot has finished. This keeps write and read operations free of mutual disturbance without degrading timing.
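The slot control rule described above can be summarized in a small behavioral model. This is only a sketch under stated assumptions, not the circuit of Fig. 4.25: the function name `second_write_start` and its time arguments are hypothetical, and times are in arbitrary units.

```python
from typing import Optional


def second_write_start(wen_s1: int,
                       t_write0_done: float,
                       t_read0_done: float) -> Optional[float]:
    """Return the earliest time the second write pulse may be triggered.

    wen_s1        -- 1 if a second write is requested this cycle, else 0
    t_write0_done -- time at which the first write slot finishes
    t_read0_done  -- time at which the first read slot finishes
    (argument names and time units are illustrative assumptions)
    """
    if wen_s1 == 0:
        # Only one write this cycle: no second pulse is generated.
        return None
    # The second write always waits for the first read slot, so a fast
    # write pulse cannot disturb a read that is still in progress.
    return max(t_write0_done, t_read0_done)
```

Because the read slot is the slower of the two, the `max` is normally decided by the read-finish time, which is why the extra wait does not lengthen the cycle.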

Fig. 4.25 Slot detection circuit

Fig. 4.26 Slot control waveform

Fig. 4.27 shows the replica design used in this chip. The double-pump pulse is generated by the replica, which is triggered twice in one cycle. Because each bank has four ports, each port needs its own replica bit. For the read test, the worst case is applied: only the top read dummy cell stores “0”, and the others store “1”. In this case, RBL can be pulled down only by the bit cell selected by RWL. To test the worst-case write time, the write cell is placed at the bottom of the column. The write path includes the column capacitive loading, and write “0” and write “1” operations are simulated at the same time. The W_OK signal is triggered when the worst-case write has finished. The replica operation waveform is shown in Fig. 4.28: the R_WP signal rises on the CLK rising edge, and the R_WP pulse is turned off when the R_W_OK pulse goes high. After a short wait, the replica read path operates again.

Fig. 4.27 Register files replica circuit


Fig. 4.28 Register files slot control

4.6 Summary

In this chapter, we first introduced previous register file designs and the conflict issues of conventional dual-port cells in write/read mode. Low-power techniques such as multi-banking and power gating of unused bit cells were presented in Chap. 4.2. Chap. 4.3 discussed multi-thread cell design. Thread switching can be supported by many techniques, at the circuit level as well as at the computer architecture level. Multi-threading provides high data bandwidth, which makes this design well suited to video processing and high-performance CPUs. Double-pump design and time-sharing techniques were discussed in Chap. 4.4. Operating at a lower CLK frequency reduces power consumption significantly. Generating two pulses in one cycle halves the port count and saves operating time. Chap. 4.5 presented my design work: a new shared-RBL structure is proposed for power reduction, and data slot switching improves the probability of a successful write.

The RBL-based multi-thread decoder reduces area by about 50~60% compared with IBM’s work without a large performance penalty. One/two-slot switch control is proposed to reduce power and make the cell more robust.


Chapter 5

Low VDD MIN Multi-thread 4R4W Register File Design in TSMC 40nm CMOS Process

5.1 Introduction

This chapter presents a low-power 2K 13T 4R4W register file with multi-thread switching. Many techniques for improving register file performance, power consumption, and data retention are discussed in Chap. 3, Chap. 4, and this chapter. The floor plan, pin count, pin definitions, and specification of the 4R4W register file are also introduced. Chap. 5.2 shows the cell layout and the register file structure, including the shared WBL and RBL structures and the thread read-out structure. Chap. 5.3 presents the assist circuits, such as the negative VVSS for low-VCC write assist and the feedback-loop cut-off during write operations; an extra Y_Cut NMOS is added to the unit cell to avoid floating nodes. Chap. 5.4 discusses the design implementation and test waveforms of the proposed 4R4W register file, together with the whole-chip floor plan. Finally, Chap. 5.5 presents the post-layout simulation and analysis based on the TSMC 40nm TN40G process, including power consumption, the bank layout, and an SNM comparison with the conventional 8T and dual-port cells. An iso-area concept is used there for a fairer comparison. The technology files were supplied by CIC, and the design was completed in the secrecy lab.

5.2 4R4W Register File Structure

5.2.1 2R2W Register File Unit Cell & Layout View

Fig. 5.1 shows the proposed 2R2W register file cell. The cell contains two subcells, one storing threads 0/1 and the other storing threads 2/3. The cell is composed of 28 NMOS/PMOS transistors and uses shared WBL and shared RBL techniques to reduce the transistor count. In this design, 8 row signals and 6 column signals operate the 2R2W register file cell. The RWL_A and RWL_B controls are shared between the upper and lower cells. In a write operation, the Xsel signal selects whether the upper or the lower cell is written. In read mode, the data of subcell “0” and subcell “1” are read out at the same time. A negative assist technique is added on the MNA VVSS node to improve writing. This assist is turned on only in write “1” mode; otherwise it would degrade write “0” performance. Compared with the 2R2W multi-port SRAM proposed in Chap. 3, the conventional read buffer is replaced by a new shared-RBL structure. The RBL sharing side is not shown in this figure because the drawing is already too large; the new structure is discussed in the next section.

Fig. 5.1 The 2R2W register file unit cell


In the layout, the number of metal layers increases from metal 3 to metal 5 with no area penalty. Fig. 5.2 shows the metal layers used in this cell, and the layout view is shown in Fig. 5.3. The Ysel and Y_Cut signals move from M2 to M4, and the row-based signals move from M3 to M5. M1 and M3 become inter-layer connections. Fig. 5.3 also shows the shared-RBL structure of two neighboring bit-cell layouts and the sub-cell architecture.

Fig. 5.2 The metal layer implement of 2R2W register file unit cell

Fig. 5.3 Layout view of 2R2W register file

(Figure labels: width 12.38um; tracks GND, WBL_A, WBL_B, RBL_A, RBL_B, Ysel_A, Ysel_B, Y_Cut, VCC; 0.09um dimension on each track)

5.2.2 Share WBL Structure

The write operation is the same single-ended write used in the 2R2W multi-port SRAM (Chap. 3), so it is not repeated here. The only drawback is that one extra WBL is needed on each of the two sides of the register file array. This design therefore requires a new switch circuit for WBL port switching in the bit-interleaving structure.

5.2.3 Share RBL Structure

To lower power consumption, a read buffer with a new structure is proposed. With it, the number of ports is halved and the dummy read problem disappears. This makes it well suited to low-power devices, where it extends battery life. Dummy reads normally occur when a conventional dual-port or 8T SRAM cell is used in a bit-interleaving structure. The new read-out structure reduces power by more than 30% compared with the conventional 8T dual-port cell. Besides eliminating dummy reads, keeping RWL high in standby also reduces leakage power. Fig. 5.4 shows a dummy read in a conventional SRAM design.

A register file is normally read more often than it is written, so this technique is very useful for this design. It needs only two extra column select signals (left or right) to control it. Note that the dummy read power consumption of the conventional design grows with the bit-interleaving number.

Fig. 5.4 Conventional dummy read operation in half select cell


Fig. 5.5 shows the proposed shared-RBL structure. The read buffer drawn in green is shared with the left bit cell, and the one drawn in black belongs to the right bit cell. Each bit cell has four ports (needed for thread switch detection), which feed the two output ports A and B. Fig. 5.6 shows the read buffer in read operation mode. Only the NMOS on one side is turned on at a time, so dummy read power consumption is no longer a concern.

Fig. 5.5 Share RBL structure in 4R4W register file

Fig. 5.6 Share RBL structure no dummy read power consumption


5.3 Register File Assist Technology

5.3.1 Negative VVSS Design

In this cell design, writing “1” is harder than writing “0”. To solve this problem, a negative VVSS is used to improve the write “1” case. The negative VVSS circuit is shown in Fig. 5.7. When the WEN signal arrives, the negative VVSS circuit first detects whether the operation is a write “0” or a write “1”. If both port A and port B write “0”, VVSS does not change state and stays connected to ground. If either of the two ports writes “1”, the negative VVSS circuit starts working. The capacitor size must be chosen very carefully: if it is too small, the write assist is insufficient; if it is too large, it disturbs the other standby cells in the same column. Table 5.1 lists the function of the NEG circuit, and Fig. 5.8(a) shows the negative VVSS level at different supply voltages from post-layout simulation. The second capacitor is turned on only when both ports write “1”. In that case the WBL loading is larger than for a single write “1”, so more capacitance is needed to assist the write. Fig. 5.8(b) shows the negative VVSS variation across corners at a 0.6 V supply voltage.
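The decision logic of the negative VVSS circuit described above can be sketched as a small truth function. This is a behavioral illustration only: the function name `negative_vvss_caps` is hypothetical, and it assumes active-high WEN signals and that a port whose WEN is low does not count as writing “1”. Neither assumption is confirmed by the text.

```python
def negative_vvss_caps(wen_a: int, data_a: int,
                       wen_b: int, data_b: int) -> int:
    """Return how many boost capacitors are engaged (0, 1 or 2).

    0 -- no port writes "1": VVSS stays connected to ground
    1 -- exactly one port writes "1": the first capacitor is used
    2 -- both ports write "1": the second capacitor is added because
         the WBL loading is larger than for a single write "1"
    Assumption: WEN is active high (not stated in the source).
    """
    writes_one_a = bool(wen_a) and data_a == 1
    writes_one_b = bool(wen_b) and data_b == 1
    return int(writes_one_a) + int(writes_one_b)
```

For example, `negative_vvss_caps(1, 0, 1, 0)` models a double write “0”, for which the assist stays off.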

Fig. 5.7 Negative VVSS circuit

(Fig. 5.7 labels: VVSS, WEN_A, WEN_B, DATA_A, DATA_B, PMOS (1u/1u)x4, PMOS (1u/1u)x2)

Table 5.1 Negative VVSS circuit function

Fig. 5.8 (a) Negative level in different voltage (b) Negative level in different corner

5.3.2 Single-end Write Cut-off & Y_Cut for Floating Issues Free

The write mode in this section is the same as in the 2R2W multi-port SRAM cell. Unlike differential write, single-ended write cannot use small-signal sensing, but it greatly reduces driver power consumption, peripheral circuitry, and cell size. Writing “1” is normally a major problem when the circuit operates at a low supply voltage. In this design, cutting off the feedback loop makes writing “1” easier, but the other cells on the same row then suffer from a floating-node issue, which the Y_Cut NMOS is added to remove.


5.4 Implementation of Multi-thread 4R4W RF

5.4.1 4R4W Register File Floor Plan

(Floor plan blocks: Register File Bank1 64*16, Register File Bank2 64*16, WEN Port Priority Circuit, REN Port Priority Circuit, Decoder, Conflict, D_IN_IO, R_OUT, Din_Driver, R_Switch, W_Switch, Replica)

Fig. 5.13 is a schematic view of the 4R4W multi-threading register file design. It includes two 1-Kbit banks and a hierarchical conflict detection circuit, so no conflict can occur. Each bank has its own control circuits, such as the R/W decoder, replica, R/W driver, and switch detection circuit. The CEN0 and CEN1 signals are assigned to the two banks respectively. If a CEN signal is kept low, the whole bank is in sleep mode. The peripheral circuits are power gated to reduce leakage power, which saves substantial power compared with a conventional always-on circuit. In Fig. 5.13, the write control circuit is on the top side and the read control circuit is on the bottom side. Replicas are placed on the left and right sides to detect the worst case. The WL pulse width includes not only the replica sensing delay but also a long inverter delay chain, which yields a robust WL pulse for successful writes and reads. The slot 0/1 switch circuits for read and write differ: a read does not disturb the cell node, so its priority is always set higher than that of a write, whereas a write flips the storage node and changes the data, so it must pass through the conflict detection circuit.

Item                 Value
Macro size           2K bits (1024*2*1)
Process technology   TSMC 40nm general purpose CMOS process
Data width           32 bit
Read power           98.63 nW/t (per bit cell, double read)
Write power          112.3 nW/t (per bit cell, double write)

Table 5.2 The specification of proposed


5.4.2 Design Implementation & Test-flow of Proposed 4R4W Register File

Fig. 5.10 shows the test function waveform. A single write “0” operation is performed in the first CLK cycle. In the second CLK cycle, the bottom bit is written and the bit written in the first cycle is read, which exercises the read and double-write functions. The written bits are placed at the nearest and the furthest positions to find the critical path. In the third cycle, two read operations are performed. The fourth cycle tests the W/R conflict circuit. Finally, the data switch is tested in the fifth CLK cycle.

This simulation test pattern makes it possible to trace the worst case in the chip; if a function fails, the waveform differs from the expected one. The test CLK frequency is 50 MHz in this work. The post-layout write simulation waveform is shown in Fig. 5.11, and Fig. 5.12 shows the read-mode waveform. In the fifth CLK cycle, the WChange signal is raised and the order of S0 and S1 is swapped.

Fig. 5.10 Test pattern and simulation waveform result


Fig. 5.11 Write post-layout simulation waveform result

Fig. 5.12 Read post-layout simulation waveform result


Fig. 5.13 Data transmission path in this 4R4W multi-threading register file

Fig. 5.11 shows the post-layout simulation of this chip design; the test pattern is given in Fig. 5.10. The read simulation waveform is shown in Fig. 5.12.

Fig. 5.13 shows the data transmission path of this 4R4W multi-threading register file design. There are two levels of conflict detection, introduced in Chap. 4.5 and Chap. 5.2. Input data must arrive before the CLK rising edge and pass the first-level port priority conflict detection. The CLK rising edge then triggers the DFF and starts the second-level address conflict detection. The WEN signal is asserted if no conflict is found. This signal is passed to the replica circuit, which generates a WWL pulse to control the write driver. REN follows the conventional read design and must also pass the port conflict detection circuit.
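The second-level address check can be illustrated with a short behavioral sketch. It is not the circuit used in this chip: the function name `resolve_write_conflict` is hypothetical, the XOR comparison reflects the XOR-based conflict circuit mentioned in the performance discussion, and giving port A priority on a conflict is purely an assumption made for the example.

```python
from typing import Tuple


def resolve_write_conflict(wen_a: int, addr_a: int,
                           wen_b: int, addr_b: int) -> Tuple[int, int]:
    """Return the (WEN_A, WEN_B) values passed on to the replica circuit.

    A conflict exists when both write ports are enabled and target the
    same address. The addresses are compared with a bitwise XOR, which
    is zero only when every address bit matches.
    Assumption: on a conflict port A keeps its write and port B is
    suppressed; the real priority order is not given in the text.
    """
    both_active = bool(wen_a) and bool(wen_b)
    same_address = (addr_a ^ addr_b) == 0
    if both_active and same_address:
        return 1, 0
    return int(bool(wen_a)), int(bool(wen_b))
```

With different addresses both enables pass through unchanged, which corresponds to the double-write case.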

5.5 Post-layout Simulation Result

Fig. 5.14 shows the layout view of one bank of the proposed 4R4W multi-thread register file. The proposed 2K 4R4W register file is implemented in the TSMC 40nm general purpose process. The bit-cell area is 3.095um x 1.8um = 5.571um2, and the bank size is 213.2um x 111.81um = 23,837.9um2 (about 0.0238mm2). Table 5.3 lists the techniques used to improve this design.

The design concepts and logic circuits of these techniques were introduced in Chap. 3, Chap. 4, and Chap. 5. This section presents the post-layout simulation results of the low-power, area-efficient, wide-operating-range register file.
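As a quick check of the area figures, the products of the quoted dimensions can be computed directly. The snippet below is only a worked calculation; the helper name `area_um2` is introduced here for illustration.

```python
UM2_PER_MM2 = 1_000_000  # 1 mm = 1000 um, so 1 mm^2 = 10^6 um^2


def area_um2(width_um: float, height_um: float) -> float:
    """Rectangle area in square micrometres."""
    return width_um * height_um


bit_cell = area_um2(3.095, 1.8)    # bit-cell dimensions from the text
bank = area_um2(213.2, 111.81)     # bank dimensions from the text
bank_mm2 = bank / UM2_PER_MM2

print(f"bit cell: {bit_cell:.3f} um^2")                  # 5.571 um^2
print(f"bank: {bank:.1f} um^2 = {bank_mm2:.4f} mm^2")    # 23837.9 um^2 = 0.0238 mm^2
```

The bank therefore occupies roughly 23,838 um2, which is about 0.0238 mm2.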

Table 5.3 Improve technology of 4R4W register file design

NO. Technology

1. Share RBL structure
2. Share WBL structure
3. Power gating sensing circuit
4. Negative VVSS for write assist
5. Transmission gate cut off write assist
6. Data slot switch control
7. Column base Multi-thread switch
8. Double pump timing sharing
9. Y_Cut NMOS for floating issue free
10. RWL VVSS reduce leakage current
11. Bit-interleaving structure
12. Hierarchical conflict detect structure
13. Wide range operation 0.4~1.2V


Fig. 5.14 4R4W multi-thread register file bank layout view

5.5.1 Performance

Based on the post-layout simulation results, the proposed 2K 4R4W register file array operates at 220 MHz at VDD = 0.9 V, TT corner, 25C. At Vmin (VDD = 0.4 V), it still operates at 6 MHz at the TT corner and 25C. Although two-slot operation needs more time for timing control and for waiting on data, it still saves 20% of the time compared with a conventional design that needs two cycles to finish the same operations. The timing reduction is shown below.

Fig. 5.15 Improve technologies of 4R4W register file design

(Plot: Time (ns) versus Supply Voltage (V), 0.4 V to 1.0 V, TT corner. Curves: Write 0 at -25C, 25C, 125C; Write Double at -25C, 25C, 125C)

In this post-layout simulation, the cycle time is dominated by the read “0” operation. The read “0” speed is limited by the read buffer discharging the shared read bit line and by the multi-thread decoder operation. This problem becomes more significant at low supply voltage (VDD below 0.5 V). To reduce the read “0” time in this cell design, short-channel devices are used to increase the read current, and a short RBL structure is used (Fig. 5.16).

Fig. 5.16 Read “0” performance for post-layout simulation

Fig. 5.17 Write “0” performance for post-layout simulation


Fig. 5.16 shows the one/double read simulation results at different operating voltages. The figure shows that operation at 125C is faster at high supply voltage and slower at low voltage. Fig. 5.17 shows the write “0” time simulation. In this cell, writing “0” is easier than writing “1”, because the NMOS access transistor favors the write “0” operation.

Fig. 5.18 shows the write “1” simulation results for this design. Write “1” is faster than write “0” here because the feedback cut-off and the negative VVSS write “1” assist are used.

The WEN address conflict detection time is included in these data, and the single-write and double-write timings move closer together at low voltage. This is because the XOR delay of the conflict circuit increases significantly at low voltage.

Fig. 5.18 Write “1” performance for post-layout simulation

Fig. 5.19(a) shows the delay of the address conflict detection circuit over the wide operating range, and (b) compares the worst-case write delay with the read “0” access time. The two delays are similar in the super-threshold region but diverge in the sub-threshold region.

(Plot: Time (ns) versus Supply Voltage (V), 0.4 V to 1.0 V, TT corner. Curves: Write 1 at -25, 25, 125; Write Double at -25, 25, 125)

Fig. 5.19 (a) Address conflict detect circuit delay under wide range

(b) Delay of write worst case compare with read “0” access time.


5.5.2 Power Consumption

Fig. 5.20 shows the power consumption of read/write operations and standby mode at different supply voltages, for one-slot and two-slot operation in one cycle. Write “0” power is lower than the others because the negative VVSS circuit is not turned on in this case, which saves considerable power. Another reason is that the read path is complex: RBL passes through the thread switch, and two sub-cells are read at the same time. These data assume that both banks are turned on and perform the same operation in one cycle.

Considering the power-delay product, the minimum energy point occurs at VDD = 0.65 V (Fig. 5.21). Although this cell can scale the supply voltage down to 0.4 V and pass at all corners, the read delay becomes too long and raises the energy consumption. The read Ion/Ioff ratio becomes very weak at sub-threshold voltages, and the read operation dominates the cycle time of the whole 4R4W register file.
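The minimum energy point follows from the fact that energy per cycle is the product of power and cycle time: lowering VDD reduces power but lengthens the cycle, so the product has a minimum in between. The helper below is a minimal sketch of how such a point could be located from simulated sweeps. The function name `minimum_energy_point` and its arguments are hypothetical, and no measured data from this work are embedded in it.

```python
from typing import Sequence, Tuple


def minimum_energy_point(vdd: Sequence[float],
                         power: Sequence[float],
                         cycle_time: Sequence[float]) -> Tuple[float, float]:
    """Return (VDD, energy) at the lowest energy per cycle.

    vdd        -- swept supply voltages
    power      -- average power at each voltage
    cycle_time -- minimum cycle time at each voltage
    Energy per cycle is power * cycle_time; units follow the inputs.
    """
    if not vdd or not (len(vdd) == len(power) == len(cycle_time)):
        raise ValueError("inputs must be non-empty and of equal length")
    energies = [p * t for p, t in zip(power, cycle_time)]
    # Index of the smallest power-delay product in the sweep.
    best = min(range(len(vdd)), key=energies.__getitem__)
    return vdd[best], energies[best]
```

Feeding it the voltage sweep behind Fig. 5.20 and the corresponding cycle times would reproduce the kind of curve plotted in Fig. 5.21.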


Fig. 5.20 Power consumption with different supply voltage


Fig. 5.21 Energy consumption with different supply voltage

In this design, there is no dummy read in the unselected bit-interleaved cells on the same row. Compared with the conventional design, read power is reduced by more than 50%. The power reduction depends on the bit-interleaving number and the data stored in the bit cells. If the stored data is “1”, the conventional read buffer NMOS is cut off and there is no dummy read current; if the data is “0”, dummy read current flows in the half-selected cell. Fig. 5.22 shows simulation data for different bit-interleaving numbers and for the worst case in which all nodes store “0”.
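The dependence on stored data and bit-interleaving number can be made concrete with a simple counting model of a conventional read buffer. This is a sketch under simplifying assumptions: one selected column per interleaved group, every half-selected cell storing “0” drawing the same dummy current, and a hypothetical function name `dummy_read_cells`. It does not model the proposed shared-RBL structure, in which no dummy read occurs.

```python
from typing import Sequence


def dummy_read_cells(row_data: Sequence[int], selected_col: int) -> int:
    """Count half-selected cells that draw dummy read current.

    row_data     -- bits stored in the cells of one interleaved group
                    on the asserted row (length = bit-interleaving number)
    selected_col -- index of the column actually being read
    A half-selected cell storing "0" discharges its RBL; a cell storing
    "1" keeps the read buffer NMOS cut off and draws no dummy current.
    """
    return sum(1 for col, bit in enumerate(row_data)
               if col != selected_col and bit == 0)
```

In the all-“0” worst case the count is the bit-interleaving number minus one, so the wasted read power grows with the interleaving factor.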

In standby mode, the read buffer of a conventional 8T SRAM is connected to ground. Leakage current flows through the two stacked NMOS transistors, and this leakage path consumes extra power. In this design RWL is kept high when there is no read operation, which saves about 28% of standby power compared with the conventional 8T SRAM read buffer (Fig. 5.23). Keeping RWL at VVSS also removes the concern about RBL sensing failures induced by RBL leakage.


Fig. 5.22 Read power consumption with different bit-interleaving structure

Fig. 5.23 Leakage power saving different voltage

(Fig. 5.22 labels: BINT=0, BINT=2, BINT=4, BINT=8, BINT=16; Store 1 & 0 Equivalent; All store 0)

(Fig. 5.23 labels: Leakage Power Saving (%), compared with conventional 8T read buffer, TT, 25C)

5.5.3 Iso-Area SNM Simulation and Comparison

Table 5.4 compares the area of the proposed 2R2W multi-port bit cell and 4R4W multi-thread bit cell with the conventional dual-port (DP) 8T and single-ended (SE) 8T cells. The cell layout uses TSMC 40nm technology, in which dummy poly is required between poly lines. For this reason, the bit cell of this work is not much larger than the conventional 8T and dual-port 8T designs.

Table 5.4 These works compare with conventional SRAM bit-cell
