Chapter 3 Transaction Level Modeling of H.264/AVC Decoder
3.4 Memory Controller in H.264 Decoder
Three main modules in the H.264 decoder access external memory to read or write data: the inter prediction (motion compensation) module, the de-blocking module, and the de-interlace module. The relationship between these modules and the external memory is shown in Fig. 3.7.
[Figure: block diagram. Blocks shown: AMBA bus; memory controller; inter prediction (MC), de-blocking, and de-interlace modules, each with a buffer; DRAM pad; external memory.]
Fig. 3.7 Relationship between H.264 decoder and external memory
The de-blocking module writes reconstructed blocks into the external memory; every decoded block is stored in the external memory through the de-blocking module.
The inter prediction module performs motion compensation for inter blocks. The data fetch module first obtains the motion vectors from the CABAC block, and then fetches the corresponding block of the reference frame, which is stored in the external memory. The fetched data passes through a first-in-first-out (FIFO) buffer before entering the inter prediction module.
To display the picture in field mode, a de-interlace module is needed to read the decoded fields stored in the external memory and de-interlace them.
The H.264 decoder needs a memory controller to communicate with the external memory. Because of the complexity of an H.264/AVC decoder at level 4, the memory controller must not only keep the data correct but also help the whole system achieve better performance.
Fig. 3.8 Architecture of the memory controller in H.264 decoder
Fig. 3.8 shows the architecture of my memory controller inside the H.264 decoder. This memory controller combines two layers. Layer 0 is the external memory interface (EMI), which controls the external memory. Layer 1 is the address translation machine (ATM), which is designed for the H.264 decoder. Layer 0 (EMI) receives the addresses translated by Layer 1 (ATM) and generates the appropriate commands to the external memory.
The purpose of Layer 0 (EMI) is to communicate with the external memory. The EMI is designed for use in various kinds of systems. As a result, the EMI is independent of the system but dependent on the external memory. Configurability is therefore essential so that the EMI can fit different types of external memory.
The details of the EMI are introduced in Chapter 4.
Layer 1 is an address translation machine (ATM). It exploits the characteristics of video processing to make the memory accesses more regular. Layer 1 is designed for the memory subsystem. The details of Layer 1 and the relationship between the memory subsystem and the H.264 decoder are described in Chapter 5.
Chapter 4
External Memory Interface
An SoC design usually needs an off-chip memory to store large amounts of data. The external memory interface is used by the system to communicate with the external memory, as shown in Fig. 4.1. To deal with the tremendous data transfer and storage in an H.264 decoder, the external memory must provide high data bandwidth to meet the real-time requirement. The bandwidth of the external memory is limited because the number of I/O pins is finite. Accordingly, the external memory interface must achieve high utilization of the available data bandwidth. The proposed external memory interface is introduced in this chapter.
Fig. 4.1 Architecture of my memory controller
4.1 Concept of EMI
The external memory interface (EMI) is one of the layers in this memory controller, namely Layer 0. The EMI receives physical addresses from the address translation machine or from a functional block, and uses these addresses to access the DRAM. The EMI is designed to control the external memory and connects directly to the DRAM, as shown in Fig. 4.2.
[Figure: block diagram. Address and write data enter Layer 0 (EMI); the EMI is connected to the DRAM by Addr/cmd and Data signals.]
Fig. 4.2 Connection of EMI
The external memory interface generates the appropriate commands to the external memory (DRAM). The external memory accepts only the commands defined in its specification; these commands were introduced in Chapter 2.
My EMI is configurable for different kinds of external memory. Users only have to set the specific parameters of the external memory that they use, and the EMI translates accesses into commands. Users can thus conveniently use this EMI to access the external memory (DRAM).
Another advantage of this EMI is that it improves the performance of the external memory, which is often a restriction on the design of an SoC. To increase the performance of the whole system, this EMI is designed to achieve higher performance. The whole architecture is introduced in the next section.
4.2 Architecture of EMI
The detailed architecture is shown in Fig. 4.3. The EMI contains five fundamental parts. The input data and addresses are stored in the FIFO, and the appropriate commands are generated through the mode control block and the FSM block. The commands are then sent to the DRAM. These parts are described in detail below.
Fig. 4.3 Architecture of EMI
4.2.1 FIFO
The FIFO (first-in, first-out register) in the EMI stores data and addresses temporarily. It is divided into a data FIFO and a command FIFO. The data FIFO holds the data to be written to the external memory in write operations. The command FIFO holds the addresses that the system wants to access.
The command FIFO is designed in a particular way to reduce the latency of getting data. In an H.264 decoder system, there is a large number of consecutive read operations to the external memory, so reducing the latency of receiving data is very important.
In the external memory, consecutive read (or write) operations that access the same row are in row-hit status. The next read operation can be issued right after the previous read operation, as shown in Fig. 4.4. Consecutive read (or write) operations to different rows in the same bank are in row-miss status. When a row miss occurs, the next read operation cannot be issued directly. The bank first has to be pre-charged to close its current row, and then an active command has to be issued to open the row that the next read operation will access, as shown in Fig. 4.5.
Fig. 4.4 Consecutive accesses to same row (burst length=4)
Fig. 4.5 Consecutive accesses to different row (burst length=4)
Comparing Fig. 4.4 with Fig. 4.5, we find that the row-miss status adds latency before the data is received, because it takes time to pre-charge the bank and open the next row. To decrease the latency of a row miss, the external memory supports an auto-pre-charge function: a read (or write) command can be issued with auto-pre-charge. The pre-charge time (tRP) can thus be hidden in the data transfer time, as shown in Fig. 4.6.
Fig. 4.6 Consecutive accesses to different row with auto-pre-charge
If auto-pre-charge is used appropriately, the latency of reading data decreases. Conversely, if a read command is issued with auto-pre-charge and the next command would have been a row hit, extra latency is spent re-activating the same row.
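The trade-off can be made concrete with a small latency model. The following is a minimal sketch and not the simulator used in this work: the function name `read_latency` and the default timing values (tRP = tRCD = CL = 3 cycles) are illustrative assumptions, and the model assumes that tRP is completely hidden whenever auto-pre-charge was used on the previous access.

```python
def read_latency(status, t_rp=3, t_rcd=3, t_cl=3, precharged_early=False):
    """Cycles from the first command of a read until its first data word.

    status: "row-hit" or "row-miss", relative to the previously open row.
    precharged_early: True if the previous access used auto-pre-charge,
    so the bank is already closed (tRP assumed fully hidden).
    All timing values are assumed, illustrative numbers.
    """
    if status not in ("row-hit", "row-miss"):
        raise ValueError(f"unknown status: {status}")
    if precharged_early:
        # The bank is idle in both cases: ACTIVE, then READ.
        # For a row hit this is the penalty of closing the row too early.
        return t_rcd + t_cl
    if status == "row-hit":
        # The row is still open: READ only.
        return t_cl
    # Row miss with the old row still open: PRECHARGE, ACTIVE, READ.
    return t_rp + t_rcd + t_cl


if __name__ == "__main__":
    print(read_latency("row-miss"))                         # 9
    print(read_latency("row-miss", precharged_early=True))  # 6
    print(read_latency("row-hit"))                          # 3
    print(read_latency("row-hit", precharged_early=True))   # 6
```

With these assumed timings, auto-pre-charge saves tRP on a row miss but costs tRCD on a row hit, which is why the choice should depend on the next command.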
My command FIFO provides a dynamic way to use auto-pre-charge, which improves the accuracy of predicting whether the row should be closed. Each address is stored in both the command FIFO and the previous-address register. The bank address and row address of the incoming address are compared with those in the previous-address register. If they are the same, the hit bit of the previous command is set to 1. Otherwise the hit bit remains 0. Fig. 4.7 shows the architecture of the command FIFO.
Fig. 4.7 Command FIFO
If the hit bit of a command is 1, the row address of the next command is the same as that of the current command, and the current command reads or writes the row buffer without auto-pre-charge. If the hit bit is 0, the access is performed with auto-pre-charge. This dynamic way of choosing auto-pre-charge is useful. Even when the current command does not know the address of the next command, pre-charging the bank is effective: having no information about the next command means that the master is about to change, and the addresses of other masters are very likely located elsewhere in the external memory.
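The hit-bit mechanism can be sketched in software as follows. This is a behavioural illustration only, not the hardware design: the class names `Command` and `CommandFifo` and their methods are hypothetical, and the comparison is assumed to be made on the bank and row fields of the address.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Command:
    bank: int
    row: int
    col: int
    hit: int = 0  # 1 if the following command targets the same bank and row


class CommandFifo:
    """Behavioural sketch of a command FIFO with a previous-address register."""

    def __init__(self):
        self.queue = deque()
        self.prev = None  # previous-address register (last command pushed)

    def push(self, bank, row, col):
        cmd = Command(bank, row, col)
        # Compare the incoming address with the previous-address register.
        if self.prev is not None and (self.prev.bank, self.prev.row) == (bank, row):
            self.prev.hit = 1
        self.queue.append(cmd)
        self.prev = cmd

    def pop(self):
        """Return (command, auto_precharge) for the oldest queued command."""
        cmd = self.queue.popleft()
        # hit == 0 also covers the case where no next command was known.
        return cmd, cmd.hit == 0


if __name__ == "__main__":
    fifo = CommandFifo()
    fifo.push(0, 5, 0)
    fifo.push(0, 5, 4)   # same bank and row: sets the hit bit of the first
    fifo.push(0, 6, 0)   # different row: second command keeps hit = 0
    for _ in range(3):
        cmd, auto_pre = fifo.pop()
        print(cmd, auto_pre)
```

The last command in the queue is issued with auto-pre-charge because nothing is known about its successor, matching the reasoning given above.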
This dynamic method offers two options. With the first option, the bank uses auto-pre-charge when it detects that the next command is a bank miss. This option is suitable when the next access to the original bank is likely to be a row miss. With the second option, the bank does not use auto-pre-charge when it detects that the next command is a bank miss. This option is useful when the next access to the original bank is likely to be a row hit.
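The decision rule of the traditional methods and the two dynamic options can be summarized in a short sketch. The function name, the method labels, and the encoding of the next access are assumptions made for illustration; "dynamic-1" is taken to mean that a bank miss does not trigger a pre-charge and "dynamic-2" that it does.

```python
def use_auto_precharge(method, next_access):
    """Decide whether the current command is issued with auto-pre-charge.

    method: "row-close", "row-open", "dynamic-1" or "dynamic-2" (assumed labels).
    next_access: relation of the next command to the current one:
      "same-row", "other-row" (same bank, different row),
      "other-bank", or "unknown" (no next command queued yet).
    """
    if next_access not in ("same-row", "other-row", "other-bank", "unknown"):
        raise ValueError(f"unknown next_access: {next_access}")
    if method == "row-close":
        return True          # always close the row after the access
    if method == "row-open":
        return False         # always leave the row open
    if method not in ("dynamic-1", "dynamic-2"):
        raise ValueError(f"unknown method: {method}")
    if next_access == "same-row":
        return False         # the next command is a row hit
    if next_access == "other-bank":
        # The two dynamic options differ only in this case.
        return method == "dynamic-2"
    # Row miss in the same bank, or nothing known about the next command.
    return True


if __name__ == "__main__":
    for m in ("row-close", "row-open", "dynamic-1", "dynamic-2"):
        print(m, use_auto_precharge(m, "other-bank"))
```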
Generally, auto-pre-charge is used for random-access processes and not used when the system's accesses are regular. A random-access process has more row misses than a regular-access process, so using auto-pre-charge in read operations is beneficial. In video processing, the accesses are usually regular. I used different methods to read an HD picture (1920x1080) from the external memory in the H.264 decoder. The simulation results of the different methods are listed in Table 4.1.
Table 4.1 Different methods for receiving a picture
Auto-pre-charge method                       Read time per picture (1920x1080)   Cycle count
Row-close method                             2025283 ns                          337548 cycles
Row-open method                              820431 ns                           136739 cycles
Dynamic method 1 (bank-miss no pre-charge)   612377 ns                           113442 cycles
Dynamic method 2 (bank-miss pre-charge)      680652 ns                           102063 cycles
4.2.2 Mode Control
The mode control block controls the finite state machine block. It records the state of each bank and the active row in each bank. A bank has two states. One is the idle state, in which no row is open in the bank. The other is the active state, in which a particular row is open in the bank. The row register of the bank records the row address of this particular row.
If the bank is in the idle state, the access is in bank-miss status. This incurs (ACTIVE + CAS) latency for a read access and (ACTIVE + DQSS) latency for a write access in DDR SDRAM. The access state diagrams are shown in Fig. 4.8(a) and Fig. 4.8(b).
Fig. 4.8 State diagram of bank-miss status
If instead the bank is in the active state and the row address in the row register is the same as that of the current command, the access is in row-hit status. The access state diagram of the row-hit status is shown in Fig. 4.9(a). Finally, when the row in the row register differs from that of the current command, the access is in row-miss status. The access state diagram of the row-miss status is shown in Fig. 4.9(b).
[Figure: timing diagram, panel (b) READ OPERATION of row-miss status. Signals: CLOCK (cycles 0 to 11), Command (PRE, ACT, READ), Data (A0 to A3). Timing intervals: tRP, tRCD, tCL.]
Fig. 4.9 State diagram of row-hit and row-miss status
The mode control block receives input signals from the finite state machine to change the state of each bank. When the current command is updated, the mode control block compares its address with the row register. The architecture is shown in Fig. 4.10.
Fig. 4.10 Mode control block
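The classification performed by the mode control block can be sketched as follows. This is a behavioural illustration under stated assumptions: the class `ModeControl`, its methods, and the helper `commands_for` are hypothetical names, and an idle bank is represented by an empty row register.

```python
class ModeControl:
    """Behavioural sketch: one row register per bank, None when the bank is idle."""

    def __init__(self, num_banks=4):
        self.row_register = [None] * num_banks

    def classify(self, bank, row):
        open_row = self.row_register[bank]
        if open_row is None:
            return "bank-miss"   # idle bank: no row is open
        if open_row == row:
            return "row-hit"     # the requested row is already open
        return "row-miss"        # another row is open in this bank

    def activate(self, bank, row):
        self.row_register[bank] = row   # bank enters the active state

    def precharge(self, bank):
        self.row_register[bank] = None  # bank returns to the idle state


def commands_for(status, access="READ"):
    """Command sequence needed for an access in the given status."""
    sequences = {
        "bank-miss": ["ACT", access],
        "row-hit": [access],
        "row-miss": ["PRE", "ACT", access],
    }
    return sequences[status]


if __name__ == "__main__":
    mc = ModeControl()
    print(mc.classify(0, 10))   # bank-miss
    mc.activate(0, 10)
    print(mc.classify(0, 10))   # row-hit
    print(mc.classify(0, 11))   # row-miss
    print(commands_for("row-miss"))
```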
4.2.3 Finite State Machine and Schedule Block
The finite state machine (FSM) produces the appropriate commands to access the external memory. These commands must not violate the timing restrictions specified for the memory. Most importantly, the commands must keep the data correct.
4.2.3.1 FSM
The basic state diagram of the FSM is shown in Fig. 4.11. The NOP state sends a NOP command to the external memory. A NOP command means no operation in the memory; the number of NOP commands issued depends on the timing restriction being satisfied. The state flow is controlled by the mode control block: the next state of the FSM is decided by the result of the mode control block.
Fig. 4.11 Basic state diagram of FSM
We use only one finite state machine to control the four banks inside the memory in order to reduce area. Because the external memory can accept only one command per clock cycle, it would be wasteful to use a separate controller to generate commands for each bank. The overall state diagram is shown in Fig. 4.12. Some states are designed for particular external memories; the detailed operation of the states is described as follows.
Fig. 4.12 Finite State Machine
Fig. 4.12(a) shows the initialization operation. The external memory needs a long latency to power up. After the power-up state is finished, one pre-charge command and two auto-refresh commands must be issued to complete the initialization. After initialization, the FSM enters the load-mode-register states. The mode registers define the specific mode of operation of the external memory. The LMR_1 state loads the standard mode register, which selects the burst length, burst type, CAS latency, and operating mode. The LMR_2 state loads the extended mode register for specific external memories such as mobile DRAM.
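The order of the initialization states described above can be written down as a short sketch. Only `LMR_1` and `LMR_2` are state names taken from the text; the other labels and the function `init_sequence` are assumed for illustration, and the NOP cycles between commands are omitted.

```python
def init_sequence(has_extended_mode_register=True):
    """Order of initialization steps, without the NOP cycles between them."""
    sequence = [
        "POWER_UP",      # long power-up latency
        "PRECHARGE",     # one pre-charge command
        "AUTO_REFRESH",  # two auto-refresh commands
        "AUTO_REFRESH",
        "LMR_1",         # load the standard mode register
    ]
    if has_extended_mode_register:
        sequence.append("LMR_2")  # extended mode register, e.g. mobile DRAM
    return sequence


if __name__ == "__main__":
    print(init_sequence())
    print(init_sequence(has_extended_mode_register=False))
```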
Fig. 4.12(b) shows the access operation, which generates the proper commands to access the DRAM. When the memory stays in the idle state for a long time, it enters the power-down state to save power. When the memory is accessed again, it returns to the idle state. The state flow is controlled by the mode control block and the schedule block.
4.2.3.2 Schedule block
The performance of external memory is sometimes still not enough to satisfy applications that demand short access latency and high bandwidth. The performance of a normal memory controller is constrained because it cannot process accesses addressed to different banks in parallel.
The schedule block in the FSM schedules the commands. To fully utilize the DRAM bandwidth, accesses addressed to different banks must be processed in parallel. This work is done by the schedule block, which can insert commands that belong to other banks, as shown in Fig. 4.13 and Fig. 4.14. By arranging commands that access different banks, the schedule block makes better use of the command bus. It not only improves the bandwidth but also reduces the latency.
Fig. 4.13 Command with schedule and without schedule (two bank-miss)
Fig. 4.14 Command with schedule and without schedule (two row-miss)
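The benefit of interleaving commands to different banks can be estimated with a simplified cycle model. This is a sketch under strong assumptions and not the schedule block itself: every request is a read to a different idle bank, only tRCD, CL and the burst duration are modelled (tRRD, tRAS and refresh are ignored), and the function name and default timing values are hypothetical.

```python
def schedule_reads(num_requests, interleave, t_rcd=3, t_cl=3, burst_cycles=2):
    """Return the cycle at which the last read data has been transferred.

    Each request needs ACT, then READ after tRCD; data appears CL cycles
    after READ and occupies the data bus for burst_cycles cycles.
    """
    used_slots = set()  # command-bus cycles that are already occupied
    data_free = 0       # first cycle at which the data bus is free
    next_start = 0      # earliest cycle for the next ACT command
    for _ in range(num_requests):
        act = next_start
        while act in used_slots:
            act += 1
        used_slots.add(act)
        # READ must respect tRCD and must not make data bursts overlap.
        read = max(act + t_rcd, data_free - t_cl)
        while read in used_slots:
            read += 1
        used_slots.add(read)
        data_free = read + t_cl + burst_cycles
        # With scheduling, the next ACT may use the very next command slot;
        # without it, the next access waits for the current one to finish.
        next_start = act + 1 if interleave else data_free
    return data_free


if __name__ == "__main__":
    print(schedule_reads(2, interleave=False))  # 16
    print(schedule_reads(2, interleave=True))   # 10
```

Under these assumed timings, two bank-miss reads finish in 10 cycles instead of 16 when the second ACT command is inserted while the first bank is still waiting for tRCD.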
4.2.4 Counter and Timing Checker
The counter tracks the latency required by each command. After a command is issued, the counter counts until the latency is satisfied. The memory can accept the next command once the count is complete. While the counter is running, the EMI issues NOP commands to the external memory, which ensures that no other command is issued during this latency.
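The NOP insertion performed by the counter can be illustrated with a small sketch. The function name `expand_with_nops` and the latency values in the usage example are assumptions; each latency is counted in clock cycles and includes the cycle in which the command itself is issued.

```python
def expand_with_nops(commands, latency):
    """Insert NOPs after each command until its required latency has elapsed.

    commands: list of command names in issue order.
    latency: dict mapping a command name to the number of cycles that must
             pass before the next command (1 means no NOP is needed).
    """
    stream = []
    for cmd in commands:
        stream.append(cmd)
        counter = latency.get(cmd, 1) - 1  # the command occupies one cycle
        while counter > 0:
            stream.append("NOP")
            counter -= 1
    return stream


if __name__ == "__main__":
    # Assumed timings: tRP = 3 cycles after PRE, tRCD = 3 cycles after ACT.
    timing = {"PRE": 3, "ACT": 3, "READ": 1}
    print(expand_with_nops(["PRE", "ACT", "READ"], timing))
    # ['PRE', 'NOP', 'NOP', 'ACT', 'NOP', 'NOP', 'READ']
```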
The timing checker stores the timing parameters and other settings that users can configure. Some of the important settings are listed as follows.
Settings
• Data rate:
Users can choose single data rate or double data rate according to the DRAM they use.
• Memory data bus width:
There are three selections: x8, x16, and x32.
• Memory size:
The number of rows per bank and the number of columns per row can be changed by users.
• Mode register settings:
The CAS latency, burst length, burst type, PASR, TCSR, and drive strength can be set by users.
• Buffer size:
The number of buffers inside the external memory can be decided by users.
• Auto-pre-charge method:
Four methods can be selected. The row-open method and the row-close method are the traditional methods. In addition, dynamic method 1 and dynamic method 2 are provided by this EMI. Users can choose the method that suits their process.
• Timing parameters:
The timing parameters differ between types of external memory. Users can set the timing parameters in the timing parameter file, so they can easily integrate this EMI with the external memory they use. Some important timing parameters are illustrated in Table 4.2.
Table 4.2 Timing parameters of Micron Mobile DDR DRAM
Micron DDR
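The configurable settings listed above could be grouped into a single configuration record, sketched below. The class `EmiConfig`, its field names, and all default values are hypothetical and are not taken from Table 4.2 or from any data sheet; PASR, TCSR and drive strength are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class EmiConfig:
    """Illustrative configuration record; every default is an assumed value."""
    data_rate: str = "DDR"            # "SDR" or "DDR"
    bus_width: int = 32               # x8, x16 or x32
    rows_per_bank: int = 8192
    columns_per_row: int = 512
    cas_latency: int = 3
    burst_length: int = 4
    burst_type: str = "sequential"
    buffer_count: int = 4
    auto_precharge: str = "dynamic-1"  # row-open, row-close, dynamic-1, dynamic-2
    timing_cycles: dict = field(default_factory=lambda: {"tRP": 3, "tRCD": 3})

    def validate(self):
        """Raise ValueError if a setting is outside the supported choices."""
        if self.data_rate not in ("SDR", "DDR"):
            raise ValueError("data_rate must be SDR or DDR")
        if self.bus_width not in (8, 16, 32):
            raise ValueError("bus_width must be 8, 16 or 32")
        methods = ("row-open", "row-close", "dynamic-1", "dynamic-2")
        if self.auto_precharge not in methods:
            raise ValueError("unknown auto-pre-charge method")
        return self


if __name__ == "__main__":
    cfg = EmiConfig(bus_width=16, auto_precharge="dynamic-2").validate()
    print(cfg)
```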
4.3 The external memory in H.264 decoder
In our H.264 decoder design, an 8x8 Y block must be processed every 165 cycles. Consequently, the latency of writing and reading data must be short enough to meet the real-time requirement.
We choose Micron Mobile-DDR SDRAM as our external memory model. The Mobile-DDR SDRAM is a high-speed CMOS dynamic random access memory that is internally configured as a quad-bank DRAM. It uses a double data rate architecture to achieve high-speed operation. The double data rate architecture is essentially a 2n-prefetch architecture with an interface designed to transfer two data words per clock cycle at the I/O pins. A single read or write access for the Mobile-DDR SDRAM effectively consists of a single 2n-bit wide, one-clock-cycle data transfer at the internal DRAM core and two corresponding n-bit wide, one-half-clock-cycle data transfers at the I/O balls.
Read and write accesses to the Mobile-DDR SDRAM are burst oriented. Accesses start at a selected location and continue for a programmed number of locations. The address bits registered coincident with the ACTIVE command select the bank and row to be accessed. The address bits registered coincident with the READ or WRITE command select the bank and the starting column location for the burst access.
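The way an address is presented to the memory in two steps can be illustrated with a short sketch. The bit layout (row, then bank, then column from most to least significant) and the field widths are assumptions chosen for illustration, as are the function names; the actual mapping depends on the memory device and on the address translation.

```python
def split_address(addr, col_bits=9, bank_bits=2, row_bits=13):
    """Split a word address into (bank, row, col) using an assumed bit layout."""
    col = addr & ((1 << col_bits) - 1)
    bank = (addr >> col_bits) & ((1 << bank_bits) - 1)
    row = (addr >> (col_bits + bank_bits)) & ((1 << row_bits) - 1)
    return bank, row, col


def access_commands(addr, access="READ"):
    """Commands for an access to an idle bank, with the fields each one carries."""
    bank, row, col = split_address(addr)
    return [
        ("ACTIVE", bank, row),  # ACTIVE selects the bank and the row
        (access, bank, col),    # READ or WRITE selects the bank and start column
    ]


if __name__ == "__main__":
    address = (5 << 11) | (2 << 9) | 7   # row 5, bank 2, column 7
    print(split_address(address))        # (2, 5, 7)
    print(access_commands(address))
```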
As with standard SDR SDRAM, the pipelined, multi-bank architecture of Mobile-DDR SDRAM enables concurrent operation, thereby providing high effective bandwidth by hiding row pre-charge and activation time.
Another advantage of Mobile-DDR SDRAM is that it supports deep power-down (DPD) mode. Deep power-down mode achieves maximum power reduction by eliminating the power draw of the memory array. Data is not retained when the device enters DPD mode. Other power-saving methods such as PASR and TCSR are also supported.
The latency of writing and reading an 8x4 block in the H.264 decoder is lower than with SDR DRAM. The timing diagrams of the read and write operations in SDR DRAM and DDR DRAM are shown in Fig. 4.15 and Fig. 4.16. As we can see, DDR takes fewer cycles than SDR. This latency improvement is the critical factor in our choice of DDR DRAM.
Fig. 4.15 Read operation of SDR and DDR
Fig. 4.16 Write operation of SDR and DDR
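The difference in data transfer cycles can be checked with simple arithmetic. The sketch below assumes one byte per sample, so that an 8x4 block occupies 32 bytes, and an x32 data bus; the function name `data_cycles` is hypothetical, and command latencies such as tRCD and CL are not included.

```python
def data_cycles(block_bytes, bus_width_bits, words_per_clock):
    """Clock cycles needed on the data bus to move block_bytes bytes."""
    bytes_per_clock = (bus_width_bits // 8) * words_per_clock
    return -(-block_bytes // bytes_per_clock)  # ceiling division


if __name__ == "__main__":
    block = 8 * 4  # 8x4 block, one byte per sample (assumed)
    print(data_cycles(block, 32, words_per_clock=1))  # SDR: 8 cycles
    print(data_cycles(block, 32, words_per_clock=2))  # DDR: 4 cycles
```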
4.4 Analysis
In this section, I present an experimental work used to evaluate the performance