Chapter 4 External Memory Interface
4.5 Summary
For multi-core SoC designs, the performance of the memory subsystem is even more important, because heterogeneous cores with different access requirements share the memory bus. Data transfer to off-chip memory is especially critical because off-chip bandwidth is a scarce resource. Although the tremendous progress in VLSI technology provides an ever-increasing number of transistors and routing resources on a single chip, and hence allows heterogeneous control and computing functions to be integrated into SoCs, the improvement in off-chip communication is limited by the number of available input/output (I/O) pins. As many recent studies have shown, the off-chip memory system is one of the primary performance bottlenecks in current systems.
In this chapter, we have presented an external memory interface (EMI) that users can conveniently integrate into their systems. Users only need to set the parameter for the type of external memory they use. The mode control methods and the buffer size suit different kinds of processing, and users can dynamically choose the settings that benefit their system. This configurability eases the burden on the system designer. In addition, the EMI improves performance in terms of latency, bandwidth, and power.
To support complex multimedia applications, the architecture of a multimedia system must provide high data bandwidth, high speed, and low power. Furthermore, a multimedia operating system should support real-time scheduling. To test performance and configurability, we choose a mobile-DDR DRAM as the external memory model and the H.264 decoder as the video process. In this chapter, the EMI is simulated only with the mobile-DDR DRAM. The simulated patterns are not complete and lack the characteristics of the video process. In the next chapter, we will integrate this EMI and the mobile-DDR DRAM into the H.264 decoder system. By simulating with real video-processing patterns, we can see the performance of our EMI.
Chapter 5 Address Translation Machine & Memory Subsystem for H.264 Decoder
The EMI (layer 0) of the memory controller was proposed in the previous chapter. This EMI can reduce the latency of the row-miss status and the bank-miss status, but that latency is still longer than in the row-hit and bank-hit status. For this reason, we add an address translation machine to the memory controller to increase the probability of the row-hit and bank-hit status. Combining the address translation machine with the EMI improves the performance of the memory controller in video processing. This chapter introduces the details of the address translation machine and the whole memory subsystem, and presents the experimental results of the memory subsystem with different granularities in the H.264 decoder.
5.1 Introduction
To improve memory bandwidth and power consumption in video applications, a new address translation machine is proposed for the H.264 decoder. Its advantage is that accesses to external memory become more regular. Since the translation minimizes the number of overhead cycles needed for row activations in synchronous DRAM (SDRAM), memory bandwidth and energy consumption improve significantly.
The features of SDRAM and the memory-access patterns of video-processing applications are considered together to find a suitable address translation that improves the performance of the whole system.
As the resolution of video-processing applications rises and H.264 provides high compression efficiency, video signal processors must handle a large amount of data within a tightly bounded time. Because of the large amount of data, video data are stored in external memory, which is usually slow, and thus the system performance strongly depends on the memory bandwidth between the processors and the external memories.
The data transfer in the video decoder is especially heavy because different levels and complex modes must be supported. To meet this requirement, we must exploit the characteristics of video-processing algorithms. Because these algorithms are deterministic, most memory access patterns are known at compile time. The regularity of memory-access patterns can be effectively used to reduce the number of clock cycles required in array accesses.
Besides high memory bandwidth, low power consumption has become an important factor in system design. As the power related to memory accesses dominates the whole system power in data-dominated systems, it is essential to reduce the memory power consumption. In the external memory, row-activation and pre-charge operations dominate the dynamic power consumption. When the accesses to memory become more regular, the number of active and pre-charge operations decreases. Therefore the dynamic power consumption is greatly reduced.
Using the address translation machine has another advantage. Since the switching of the address bus between the processors and external memory is also a major source of power consumption, the address translation machine saves the power spent on bus transitions. In addition, if the bus is shared by the address lines and the data lines, not having to send an address to memory on every access means the bus is not occupied by address transfers. More time is then available for transferring data without waiting for the bus.
In this chapter, we combine an address translation machine (ATM) with the external memory interface (EMI) introduced in Chapter 4 to form the memory controller of the H.264 decoder. The architecture of this memory controller is shown in Fig. 5.1. The EMI uses the characteristics of the external memory to improve performance, and the ATM uses the characteristics of video processing to improve performance.
Fig. 5.1 The memory controller in H.264 decoder
5.2 Memory Subsystem Architecture
The proposed memory subsystem is shown in Fig. 5.2. Three modules need to access the external memory via the AHB bus and the memory controller.
Specifically, the IIP module fetches the motion-compensated data from the external memory for prediction. The DB module writes back the reconstructed block and the DEI module reads the decoded fields for de-interlacing. The physical DRAM addresses are generated by the address translation unit with the motion vector as input.
The external memory interface, on the other hand, generates associated DRAM commands. In particular, we use the DDR DRAM for higher data bandwidth.
Fig. 5.2 Memory subsystem Architecture
In our design, all the data required by the IIP, DB, and DEI modules are first stored in the synchronization buffer so that DRAM access and video decoding can proceed concurrently. The synchronization buffer is implemented with two SRAMs whose size is determined by the block granularity.
Table 5.1 shows the cycle counts per macroblock (MB) when the synchronization buffer is designed at the levels of 4x4, 8x8, and 16x16. Note that the numbers are based on the worst-case assumption in which the sub-pel interpolation of a 4x4 block requires the most input data, no redundant data are fetched for different blocks, and the input data are distributed in different banks.
Table 5.1 DRAM requirement for the real-time requirement in the worst-case design
Granularity                      4x4      8x8      16x16
Cycles/MB with 1 DRAM            1488     992      579
Equivalent configuration for real-time requirement:
# of DRAMs                       8        4        1
Cycles/MB with N DRAMs           720      604      579
Bandwidth utilization (DRAM)     25%      42.4%    79.4%
Width of AHB bus (bit)           128      108.5    50.84
As shown, a smaller block size causes poor DRAM bandwidth utilization because of more frequent DRAM active and pre-charge operations. Equivalently, for the same real-time requirement and clock rate, a configuration with a smaller block size requires more DRAMs and a wider AHB bus. In Table 5.1, when one DRAM is used, only the 16x16 block size can fulfill the real-time requirement, i.e., 660 cycles/MB at a clock rate of 162 MHz. The 8x8 block size needs four DRAMs to meet the real-time requirement. The 4x4 block size cannot meet it even with eight DRAMs.
Because the 16x16 granularity costs a large synchronization buffer and the 4x4 granularity requires too many DRAMs, we finally choose the 8x8 block size as our granularity.
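The comparison behind this choice can be sketched in a few lines of Python. This is only an illustration: the cycle counts are copied from Table 5.1, the 660-cycle budget is the figure quoted in the text, and the names `CONFIGS`, `meets_real_time`, and `feasible_granularities` are hypothetical helpers, not part of the design.

```python
# Illustrative check of Table 5.1 against the real-time budget.
# All numbers are taken from the table and text; the names are hypothetical.
REAL_TIME_BUDGET = 660  # cycles per macroblock at 162 MHz, as quoted in the text

# granularity -> (number of DRAMs, cycles per MB with that many DRAMs)
CONFIGS = {
    "4x4": (8, 720),
    "8x8": (4, 604),
    "16x16": (1, 579),
}


def meets_real_time(cycles_per_mb, budget=REAL_TIME_BUDGET):
    """Return True when one macroblock finishes within the cycle budget."""
    return cycles_per_mb <= budget


def feasible_granularities(configs=CONFIGS):
    """List the granularities whose configuration meets the budget."""
    return [g for g, (_, cycles) in configs.items() if meets_real_time(cycles)]


if __name__ == "__main__":
    for name, (drams, cycles) in CONFIGS.items():
        status = "ok" if meets_real_time(cycles) else "too slow"
        print(f"{name:>5}: {drams} DRAM(s), {cycles} cycles/MB -> {status}")
```

Under these numbers only the 8x8 and 16x16 configurations pass, which leaves buffer size as the deciding factor between them.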
5.3 Data Arrangement
We use Mobile-DDR SDRAM (32-bit width) as our external memory. To increase the bandwidth of our H.264 decoder, we use one to four Mobile-DDR SDRAMs and an AMBA bus. The relationship among the H.264 decoder, the memory controller, and the external memory is shown in Fig. 5.3.
[Figure: four DDR memories (M0 to M3) connected through a 128-bit AMBA data bus to the IIP, DB, and DI FIFOs]
Fig. 5.3 Architecture of memory organization
5.3.1 Memory Mapping Method
There are one to four DRAMs in our subsystem and two memory mapping methods. If four external memories are used, we access the luma block and the chroma block simultaneously. If fewer than four memories are used, we interlace the luma block and the chroma block. The latency of the different methods is shown in Table 5.2. We can see that the interlaced method is more suitable for the 2-memory configuration.
Table 5.2 Different memory mapping methods for different memory configurations
Configuration              Luma (8x8) cycles   Chroma (8x4) cycles   Total cycles
1-memory (interlaced)      60                  30                    60+30=90
2-memory (interlaced)      60/2=30             30/2=15               30+15=45
2-memory (simultaneous)    60                  30                    60
4-memory (interlaced)      60/3=20             30/2.5=12             32
4-memory (simultaneous)    60/2=30             30/2=15               30
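The totals in Table 5.2 follow a simple pattern that can be sketched as below. The reading assumed here is that the interlaced method serializes the luma and chroma accesses (the cycles add), while the simultaneous method overlaps them (the longer access dominates). That interpretation, and the names `total_cycles` and `ROWS`, are assumptions made for illustration; the per-component cycle counts are copied from the table.

```python
# Illustrative reconstruction of the "Total cycles" column of Table 5.2.
# Assumption: interlaced = luma then chroma (sum); simultaneous = overlapped (max).


def total_cycles(luma, chroma, method):
    """Combine per-component cycle counts according to the mapping method."""
    if method == "interlaced":
        return luma + chroma
    if method == "simultaneous":
        return max(luma, chroma)
    raise ValueError(f"unknown method: {method}")


# (configuration, method, luma cycles, chroma cycles) as listed in Table 5.2
ROWS = [
    ("1-memory", "interlaced", 60, 30),
    ("2-memory", "interlaced", 30, 15),
    ("2-memory", "simultaneous", 60, 30),
    ("4-memory", "interlaced", 20, 12),
    ("4-memory", "simultaneous", 30, 15),
]

if __name__ == "__main__":
    for config, method, luma, chroma in ROWS:
        print(f"{config} ({method}): {total_cycles(luma, chroma, method)} cycles")
```

With this reading, the interlaced method wins for two memories (45 versus 60 cycles) and the simultaneous method wins narrowly for four (30 versus 32 cycles).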
Memory mapping simultaneous method: (four DRAMs)
We use two memories to store the luma block and another two memories to store the chroma block so that we can access the luma block and the chroma block at the same time. Taking the luma block as an example, Fig. 5.4 and Fig. 5.5 illustrate how it is stored in the memory.
As shown in Fig. 5.4, the frame is divided into four parts, and each part is stored in a different bank. The frame is stored in memory 0 and memory 1. In the enlargement of a single bank, we can see that the memory alternates every two pixels. The yellow part represents memory 0 and the orange part represents memory 1. The advantage of using two memories is that the latency of accessing data is reduced by almost half.
Each check (square) represents one particular row in that bank. As we can see, no consecutive rows of the same bank are placed next to each other. As a result, when we reference a block in the frame, the row-miss status does not appear; only the row-hit status and the bank-miss status occur. As mentioned in the previous chapter, the row-miss status causes the largest loss of bandwidth utilization and the longest latency. By decreasing the number of row-miss statuses, we make better use of the finite bandwidth and shorten the latency.
Fig. 5.5 Memory map to frame
Fig. 5.5 indicates the memory organization. One current frame and many reference frames need to be stored in the external memory, because this H.264 decoder supports multiple reference frames. There are eight banks in the two memories, and each frame is distributed equally over the eight banks. With this data arrangement, we can access data in memory 0 and memory 1 simultaneously. If the proportion of data in each memory differed greatly, a large part of the memory bandwidth would be lost.
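The checkerboard idea described for Fig. 5.4 can be illustrated with a small sketch. The actual tile size and bank numbering are not given in the text, so the 64-pixel tile edge, the 2x2 bank pattern, and the names `bank_of`, `memory_of`, and `banks_touched` are all hypothetical. The sketch only shows the property claimed above: a block that straddles tile boundaries lands in distinct banks, so it causes bank-miss accesses but never two different rows of the same bank.

```python
# Hypothetical checkerboard mapping; tile size and bank numbering are assumptions.
TILE = 64  # assumed tile edge in pixels (one tile = one DRAM row of one bank)


def bank_of(x, y, tile=TILE):
    """Assign banks in a 2x2 checkerboard so neighbouring tiles never share a bank."""
    tx, ty = x // tile, y // tile
    return (tx % 2) + 2 * (ty % 2)


def memory_of(x):
    """Alternate between memory 0 and memory 1 every two pixels, as in the text."""
    return (x // 2) % 2


def banks_touched(x, y, width, height, tile=TILE):
    """Return the set of banks covered by a block no larger than one tile."""
    tile_xs = {x // tile, (x + width - 1) // tile}
    tile_ys = {y // tile, (y + height - 1) // tile}
    return {bank_of(tx * tile, ty * tile, tile) for tx in tile_xs for ty in tile_ys}


if __name__ == "__main__":
    # A 9x9 window placed on a tile corner spans four tiles and four banks.
    print(sorted(banks_touched(60, 60, 9, 9)))
    # The same window inside one tile stays in a single bank (row-hit only).
    print(sorted(banks_touched(0, 0, 9, 9)))
```

Because every 2x2 group of neighbouring tiles uses four different banks, the number of banks touched always equals the number of tiles touched for blocks up to one tile in size.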
Memory mapping interlaced method: (one or two DRAMs)
If fewer than four memories are used, we interleave the luma block and the chroma block. Fig. 5.6 shows how the frame is mapped to memory. The green blocks are luma blocks and the blue blocks are chroma blocks. We interleave them because the chroma block is accessed after the luma block.
The active and pre-charge operations caused by the chroma block can thus be eliminated, which decreases the latency. Each 64 x 64 block corresponds to one particular row in the bank.
The probability of crossing a bank boundary increases, so the number of active and pre-charge operations for the luma block increases, but the active and pre-charge operations for the chroma block are all removed.
Fig. 5.6 Frame map to memory (1 or 2)
5.3.2 Latency Estimation
The architecture of our H.264 decoder is a pipelined design. Each module will process an 8 x 8 block in a stage. The amount of data each module will access will be described in this section. The cost of latency in each module will also be estimated here.
Inter prediction (motion compensation):
The Inter Prediction module decodes inter blocks by motion compensation. The Data Fetch module receives the motion vectors from the CABAC module. The block in the reference frame that the motion vector points to is fetched from external memory.
H.264 supports more flexible motion compensation. The following are a few examples:
1. Selection of motion compensation block sizes (with a minimum luma motion compensation block size as small as 4 x 4).
2. Quarter-sample-accurate motion compensation.
3. Multiple reference picture motion compensation.
4. Motion vectors over picture boundaries.
5. Weighted prediction.
6. Improved “skipped” and “direct” motion inference.
Due to the complexity of the motion compensation, the access latency to reference frame varies case by case. The worst case of accessing an 8 x 8 luma block for a P frame is shown in Fig. 5.7.
Fig. 5.7 Worst case of accessing a 4 x 4 luma block (P frame)
The worst case occurs when the block size is 4 x 4, so that four 4 x 4 blocks must be fetched for an 8 x 8 block. The motion vector of a 4 x 4 block has quarter-pixel accuracy, so a 9 x 9 block is required for interpolation. When this 9 x 9 block lies across four banks, extra latency is spent on activating the other three banks. The latency of fetching one 4 x 4 block is 21 cycles. As a result, it takes 21*4=84 cycles to read the reference frame in the worst case.
The worst case of accessing an 8 x 8 luma block for a B frame is shown in Fig. 5.8. Reading a 13 x 13 block takes 25 cycles. Because the B frame uses bi-directional prediction, two 13 x 13 reference blocks must be read for motion compensation, so the worst case takes 50 cycles.
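The block sizes and cycle totals quoted for the P-frame and B-frame worst cases can be checked with a short sketch. It relies on one fact from the H.264 standard, that luma half-sample interpolation uses a 6-tap filter, so a fractional motion vector needs five extra samples per dimension. The per-block latencies (21 and 25 cycles) are taken from the text; the function names are hypothetical.

```python
# Illustrative arithmetic for the worst-case reference fetches described above.
FILTER_TAPS = 6  # H.264 luma half-sample interpolation filter length


def reference_window(block_size, fractional=True, taps=FILTER_TAPS):
    """Edge length of the reference region needed to predict one square block."""
    return block_size + (taps - 1) if fractional else block_size


def p_frame_worst_case(cycles_per_block=21, blocks=4):
    """Four 4x4 partitions per 8x8 block, each fetched as a 9x9 window."""
    return blocks * cycles_per_block


def b_frame_worst_case(cycles_per_block=25, references=2):
    """One 8x8 partition fetched as a 13x13 window from each of two references."""
    return references * cycles_per_block


if __name__ == "__main__":
    print("4x4 window edge:", reference_window(4))
    print("8x8 window edge:", reference_window(8))
    print("P-frame worst case:", p_frame_worst_case(), "cycles")
    print("B-frame worst case:", b_frame_worst_case(), "cycles")
```

The P-frame case is the larger of the two, which is why it sets the motion-compensation figure used later in the bus schedule.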
Fig. 5.8 Worst case of accessing a 8 x 8 luma block (B frame)
This memory system reads memory 0 and memory 1 for the luma block and memory 2 and memory 3 for the chroma block. The structure of reading the reference frame for motion compensation is shown in Fig. 5.9. In total, it costs 84 cycles to read the reference frame for an inter-predicted block.
Fig. 5.9 Read operation for motion compensation
De-blocking:
De-blocking writes the decoded block into the external memory. The worst cases for the luma and chroma blocks are listed in Table 5.3. We write the luma block and the chroma block to the four memories at the same time. As shown in Table 5.3, the luma block dominates the cycle count.
De-interlace:
The de-interlace module de-interlaces the fields that are read from external memory. The cycle counts are shown in Table 5.3.
Table 5.3 Cycle count of each module in worst case
Module            Block size in worst case                 Cycles per block              Cycle count in worst case
Inter-Prediction  Luma: 4 (9x9) blocks                     (8+10+3)=21                   21*4=84 cycles
                  Chroma: 4 (6x3) blocks                   (8+4+3)=15                    15*4=60 cycles
De-blocking       Luma: 2 (4x4) blocks, 3 (8x4) blocks     (3+2+3)=8                     8*5=40 cycles
                  Chroma: 2 (2x2) blocks, 3 (4x2) blocks   (3+1+3)=7                     7*5=35 cycles
De-interlace      Luma: 1 (16x9) block, 1 (16x4) block     (4+10+3)=17, (3+4+3)=10       17+10=27 cycles
                  Chroma: 1 (8x5) block, 1 (8x2) block     (4+3+3)=10, (4+2+3)=9         10+9=19 cycles
5.3.3 Data Bus Schedule
In our H.264 decoder design, the video pipe contains seven modules, including CABAC, IQ/IDCT, Data Fetch (DF), Intra-Inter Prediction (IIP), De-Blocking, and De-Interlacer. CABAC is the first module in the video pipe; its function is to decode the bit-stream syntax below the slice header. Because our decoder processes luma and chroma components in parallel, and CABAC cannot decode chroma components until all luma coefficients of a macroblock have been decoded, it is more efficient to operate CABAC at the macroblock level. To save buffer size, all other modules operate on 8x8 blocks. The maximum macroblock processing rate is 245760 MB/s, and the operating frequency of our decoder is 162 MHz. To process one 8x8 block per stage, we have about 165 cycles for memory access. The cycle count can be derived as follows.
(1) 1/245760 = 4.06 us/MB
(2) 1/162 MHz = 6.17 ns/cycle
(3) 4060/6.17 = 658 cycles/MB
(4) 658/4 = 164.5 cycles per 8x8 block
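The same budget can be computed directly from the two given rates, without rounding the intermediate values. The sketch below is only a cross-check; the constant and function names are hypothetical. Exact division gives about 659 cycles per macroblock and about 164.8 cycles per 8x8 block, slightly above the 658 and 164.5 obtained from the rounded steps, and both round to the budget of roughly 165 cycles used in the text.

```python
# Cross-check of the per-block cycle budget from the stated rates.
MB_RATE = 245_760         # maximum macroblocks per second (from the text)
CLOCK_HZ = 162_000_000    # decoder operating frequency (from the text)
BLOCKS_PER_MB = 4         # a 16x16 macroblock holds four 8x8 blocks


def cycles_per_macroblock(clock_hz=CLOCK_HZ, mb_rate=MB_RATE):
    """Clock cycles available for one macroblock at the maximum rate."""
    return clock_hz / mb_rate


def cycles_per_block(clock_hz=CLOCK_HZ, mb_rate=MB_RATE, blocks=BLOCKS_PER_MB):
    """Clock cycles available for one 8x8 pipeline stage."""
    return cycles_per_macroblock(clock_hz, mb_rate) / blocks


if __name__ == "__main__":
    print(f"{cycles_per_macroblock():.2f} cycles/MB")
    print(f"{cycles_per_block():.2f} cycles per 8x8 block")
```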
The cycle counts that each module spends on accessing the external memory are listed in Table 5.3. We can see that motion compensation dominates the memory access time.
The total cycle count is given below. Fig. 5.10 shows how we schedule the data bus for each master module.
84 (MC) + 40 (DB) + 27 (DEI) = 151 (cycles)
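As a quick sanity check, the sum above can be compared with the per-block budget of about 165 cycles. The worst-case counts come from Table 5.3; the names `WORST_CASE`, `BUDGET`, and `bus_slack` are hypothetical and used only for illustration.

```python
# Compare the worst-case memory access total with the per-block cycle budget.
WORST_CASE = {"MC": 84, "DB": 40, "DEI": 27}  # luma worst cases from Table 5.3
BUDGET = 165  # approximate cycles available per 8x8 block (from the text)


def total_access_cycles(worst_case=WORST_CASE):
    """Sum the worst-case cycle counts of all bus masters."""
    return sum(worst_case.values())


def bus_slack(worst_case=WORST_CASE, budget=BUDGET):
    """Cycles left over in one 8x8 stage after all worst-case accesses."""
    return budget - total_access_cycles(worst_case)


if __name__ == "__main__":
    print("total:", total_access_cycles(), "cycles")
    print("slack:", bus_slack(), "cycles")
```

The remaining slack is small, so the schedule leaves little room for additional bus traffic within one 8x8 stage.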
Fig. 5.10 Read operation for motion compensation
The DB master first writes the control register; layer 1 (ATM) of the memory controller then takes a short latency to generate the addresses. When layer 0 (EMI) receives an address, it translates it into the appropriate commands and the DRAM starts to work. After DB is completed, bus mastership turns to the DEI module, which writes the control register, and then to DF, which also writes the control register. Mastership then returns to the DEI module to receive the data output from the DRAM. After that data transfer is completed, mastership turns to MC to read the data.
5.4 Address Translation
Layer 1 of our memory controller is the address translation machine. It generates the addresses according to the data arrangement method discussed in the previous section. The diagram of this layer is shown in Fig. 5.11.
Fig. 5.11 Layer 1 architecture
The data bus transfers the block-level control signals. These control signals include the initial location (X,Y), motion vector (MV), block size, frame or field, and read or write.
The control bus sends the slice-level information. These control signals include height, width, POC_1, and POC_2. The address translation machine generates the appropriate addresses based on the control signals stored in the control register.
The addresses are then sent to the EMI to generate commands.
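To make the role of the block-level control signals more concrete, the sketch below shows one way the reference region could be derived from the initial location, the motion vector, and the block size before it is mapped to DRAM addresses. This is not the ATM implementation: the `BlockControl` class, its field names, and `reference_region` are hypothetical, and the sketch ignores picture boundaries, fields, and chroma. It only relies on two facts from the H.264 standard: luma motion vectors are expressed in quarter-sample units, and the 6-tap interpolation filter needs two extra samples before and three after the block when the vector is fractional.

```python
from dataclasses import dataclass


@dataclass
class BlockControl:
    """Hypothetical block-level control register contents."""
    x: int            # initial location X (integer luma samples)
    y: int            # initial location Y (integer luma samples)
    mv_x: int         # motion vector, quarter-sample units
    mv_y: int         # motion vector, quarter-sample units
    block_size: int   # edge length of the square block
    is_field: bool    # frame or field access (unused in this sketch)
    is_write: bool    # read or write (unused in this sketch)


def _axis(position, mv, size):
    """Return (start, length) of the reference span along one axis."""
    integer, fraction = mv >> 2, mv & 3  # floor division by 4 and remainder
    if fraction == 0:
        return position + integer, size
    return position + integer - 2, size + 5  # room for the 6-tap filter


def reference_region(ctrl):
    """Return (left, top, width, height) of the region to fetch."""
    left, width = _axis(ctrl.x, ctrl.mv_x, ctrl.block_size)
    top, height = _axis(ctrl.y, ctrl.mv_y, ctrl.block_size)
    return left, top, width, height


if __name__ == "__main__":
    ctrl = BlockControl(x=16, y=16, mv_x=5, mv_y=-4, block_size=4,
                        is_field=False, is_write=False)
    print(reference_region(ctrl))
```

A fractional vector in both directions turns a 4 x 4 block into the 9 x 9 fetch used in the worst-case estimate of Section 5.3.2.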
5.5 Analysis & Simulation Result
5.5.1 Design for worst case
In this section, we present the experimental framework used to evaluate the considered DRAM controllers. We measure the hit rate and miss rate for different frame types, and also show the latency and bandwidth utilization for different pictures. In the previous section, we estimated that the 8x8 granularity needs four DRAMs in the worst case. Because this memory subsystem is designed for the worst case, we use the 8x8 block as our granularity and four external memories.
Fig. 5.12 Read operation for motion compensation
Fig. 5.12 shows the test-bed used to evaluate the considered memory controller. Its key components are described as follows.
The DF module and the DB module are functional blocks of the H.264 decoder. The DF module generates the patterns that read the reference frame for motion compensation. The DB module generates the patterns that write the reconstructed data into memory. Because this pipelined system operates on 8x8 blocks, the patterns switch module every 8x8 block.
The patterns that pass through address translation machine will be translated into external memory addresses. The external memory interface (EMI) will change these addresses to appropriate commands that are acceptable for the external memory.
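The address-to-command step can be pictured with a minimal sketch. It assumes a row-open policy and the following reading of the three statuses: row-hit means the requested row is already open in the bank, bank-miss means the bank has no open row and must be activated, and row-miss means a different row is open and must be pre-charged first. These definitions, the command tuples, and the names `classify` and `commands_for` are assumptions for illustration; they are not the EMI's actual interface.

```python
# Hypothetical sketch of turning (bank, row) addresses into DRAM commands
# under a row-open policy. Status definitions are assumptions (see lead-in).


def classify(open_rows, bank, row):
    """Classify an access given the currently open row of each bank."""
    current = open_rows.get(bank)
    if current == row:
        return "row-hit"
    if current is None:
        return "bank-miss"
    return "row-miss"


def commands_for(open_rows, bank, row, is_write):
    """Return (status, command list) and update the open-row table."""
    status = classify(open_rows, bank, row)
    commands = []
    if status == "row-miss":
        commands.append(("PRECHARGE", bank))    # close the wrong row first
    if status != "row-hit":
        commands.append(("ACTIVE", bank, row))  # open the requested row
        open_rows[bank] = row
    commands.append(("WRITE" if is_write else "READ", bank))
    return status, commands


if __name__ == "__main__":
    open_rows = {}
    for bank, row, is_write in [(0, 5, False), (0, 5, False), (0, 6, True)]:
        print(commands_for(open_rows, bank, row, is_write))
```

The row-miss case issues the most commands, which is consistent with it having the longest latency and with the data arrangement of Section 5.3 trying to avoid it.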
The important settings of my EMI are listed in Table 5.4.
Table 5.4 EMI setting
EMI Setting      Option
                 SDR       DDR
Buffer size      8-word    16-word   32-word
                 N         Y         Y
Burst length     2         4         8
                 Y         N         N
Row-close Row-open Dynamic 1 Dynamic 2
Row-close Row-open Dynamic 1 Dynamic 2