
Chapter 3 DATA MAPPING AWARE FRAME MEMORY CONTROLLER

3.2 Optimization of Memory Access

3.2.4 Inter-request Optimization

In intra-request optimization, we determined an optimized data mapping to reduce row misses. Furthermore, for successive requests, the requested data has a high probability of being stored in the same row because of the overlapped search range. Such an access gains the same benefit as the intra-request case without any row miss. However, a certain amount of data is still stored in different rows, and closing the unused rows by precharging the banks and opening the new rows causes row misses.

To reduce such row misses, we must consider when and how to close a row by precharging.

SDRAMs support auto precharge, which performs the same single-bank precharge function right after the last read/write command without requiring an explicit command. When we fetch data in z-scan order, the data locations may cross banks frequently: we may read data from row 0, then from row 1, and then from row 0 again. With auto precharge, an extra row break is introduced in such a pattern, so manual precharge leads to higher bandwidth efficiency. When we finish a request, we should check whether the opened rows are still needed by the next request. If the rows were already opened in the previous request, we can skip the row activate and precharge operations. In our design, since we open all required rows at the beginning of a request, we suffer no extra latency for row-crossing data accesses.
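
As a minimal illustration of this check, the C sketch below tracks the row currently open in each bank and skips the precharge and activate commands when every row needed by a new request is already open. The names (NUM_BANKS, open_row, ROW_NONE, and so on) are assumptions for illustration, not identifiers from the actual design.

#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS 4
#define ROW_NONE  0xFFFF                 /* sentinel: no row open / no row needed */

static uint16_t open_row[NUM_BANKS] = { ROW_NONE, ROW_NONE, ROW_NONE, ROW_NONE };

/* True when every row the new request needs is already open, i.e. the
 * precharge and activate commands for this request can be skipped.       */
static bool all_rows_already_open(const uint16_t needed_row[NUM_BANKS])
{
    for (int b = 0; b < NUM_BANKS; b++)
        if (needed_row[b] != ROW_NONE && needed_row[b] != open_row[b])
            return false;
    return true;
}

/* Open every row a request needs at the start of the request, so no extra
 * latency is paid when the z-scan access later crosses banks.            */
static void open_rows_for_request(const uint16_t needed_row[NUM_BANKS])
{
    if (all_rows_already_open(needed_row))
        return;                          /* row hit: skip precharge/activate */

    for (int b = 0; b < NUM_BANKS; b++) {
        open_row[b] = ROW_NONE;          /* precharge closes the old rows    */
        if (needed_row[b] != ROW_NONE)
            open_row[b] = needed_row[b]; /* activate the newly required rows */
    }
}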

There are two types of precharge command: precharge all banks, or precharge a single bank. Precharging each bank separately seems preferable for reducing row misses, because if not all rows are the same as in the previous request, we can still skip the row activate operations that were already performed. However, individual precharging has the overhead of issuing more explicit commands to close the corresponding rows, whereas only one command is needed to precharge all banks. Moreover, the benefit of going from two row breaks down to one is comparatively small next to the benefit of going from one break down to none. The overheads and benefits in the different situations are listed below.

Table 5 Overheads and benefits of separate-bank precharging compared with all-banks precharging. All numbers are in cycles.

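As a rough sketch of the command-count side of this trade-off, per-bank precharging issues one explicit command for every bank that must be closed, whereas all-banks precharging always costs a single command. The cycle cost per command below is an assumed placeholder, not a figure taken from the design.

/* Illustrative only: CMD_CYCLES is an assumed cost per issued command. */
#define CMD_CYCLES 1

/* One PRECHARGE-ALL command closes every bank at once. */
static int precharge_all_banks_cost(void)
{
    return 1 * CMD_CYCLES;
}

/* Per-bank precharging needs one explicit command per bank to be closed,
 * but banks whose rows are reused by the next request can stay open.     */
static int precharge_per_bank_cost(int banks_to_close)
{
    return banks_to_close * CMD_CYCLES;
}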

Combining these overheads and benefits with the earlier distribution of access cases, the actual gain found by simulation is only about 0.1% of the total memory access cycles. The benefit is small enough to neglect. Considering the hardware control cost and the bandwidth performance, we therefore choose all-banks precharging as our solution. Every time new data is requested, we check whether all rows required by the current block are already open. If they are, we skip the activate and precharge operations, and only the CAS latency is suffered. Otherwise, we must close the rows of the previous request and activate the new rows, so the latency for this case includes tRP (precharge), tRCD (activate), and the CAS latency. The average row miss rate and required cycles with all-banks precharging are listed in Table 6.
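
A small sketch of the resulting per-request overhead, using placeholder timing values; the actual tRP, tRCD, and CAS latency depend on the SDRAM device and are not taken from this design.

#include <stdbool.h>

/* Placeholder SDRAM timing parameters, in cycles (assumed values). */
#define T_RP  3   /* precharge                    */
#define T_RCD 3   /* activate to read/write       */
#define T_CL  3   /* CAS latency                  */

/* Overhead cycles paid before the data burst of a request can start. */
static int request_overhead_cycles(bool all_required_rows_open)
{
    if (all_required_rows_open)
        return T_CL;              /* row hit: only the CAS latency        */
    return T_RP + T_RCD + T_CL;   /* row miss: precharge + activate + CAS */
}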

Table 6 Average miss rate and required cycles in motion compensation
(Test sequences: crew, night, sailormen, and harbour, 525 SD frame size)

                                  QP 20       QP 32       QP 40
Miss rate without detection (%)   6.82        7.53        9.39
Miss rate with detection (%)      1.45        1.28        1.16
Enhancement (%)                   78.62       82.48       86.79
Required cycles                   79802862    75043239    58408725
Required frequency (MHz)          7.98        7.50        5.84

Fig.24 shows the memory operations under a row miss and under row-hit detection. With the detection mechanism, we reduce row misses by about 80% compared with the mapping without detection. To meet the real-time luma motion compensation requirement at SD frame size, the required frequency is about 8 MHz.
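
As a back-of-the-envelope check of the required-frequency figures, the sketch below divides the total cycle count by an assumed sequence duration. The 300-frame, 30 fps sequence length is only an assumption for illustration and is not stated in this section.

#include <stdio.h>

int main(void)
{
    /* QP 20 entry of Table 6; sequence length is assumed, not from the text. */
    const double total_cycles = 79802862.0;
    const double frames       = 300.0;   /* assumed number of decoded frames */
    const double fps          = 30.0;    /* assumed display rate for 525 SD  */

    const double duration_s   = frames / fps;
    const double required_hz  = total_cycles / duration_s;

    printf("required frequency ~ %.2f MHz\n", required_hz / 1e6);  /* ~7.98 */
    return 0;
}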

Fig.24 Memory operation under row miss and row hit

To take advantage of row-activate operation skipping, we should separate read and write requests into different memories, because the read data and write data lie in different frames. We store the reference-frame and current-frame data in different rows according to the result of the intra-request optimization. Thus we need a dual-channel environment such as conventional ping-pong structured memories, in which one memory stores the reference frame and the other stores the current frame. Fig.25 shows the read/write operation in an I-frame and a P-frame. Typically, the write latency is hidden by the read latency, since writing does not suffer from the row-crossing situation.
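
A minimal sketch of the ping-pong idea, assuming two physical frame memories whose roles swap every frame; the type and function names are illustrative, not taken from the design.

#include <stdint.h>

typedef struct {
    uint8_t *base;               /* base of one physical frame memory        */
} frame_mem_t;

typedef struct {
    frame_mem_t ch[2];           /* the two channels of the ping-pong pair   */
    int ref_idx;                 /* channel currently holding the reference  */
} ping_pong_t;

/* Motion-compensation reads fetch the reference frame from one channel ...  */
static frame_mem_t *read_channel(ping_pong_t *pp)  { return &pp->ch[pp->ref_idx]; }

/* ... while the reconstructed frame is written to the other channel, so the
 * write never disturbs the rows kept open for the read requests.            */
static frame_mem_t *write_channel(ping_pong_t *pp) { return &pp->ch[pp->ref_idx ^ 1]; }

/* When a frame finishes, the newly written frame becomes the next reference. */
static void end_of_frame(ping_pong_t *pp)           { pp->ref_idx ^= 1; }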

Fig.25 Read/write operation in I-frame and P-frame

Considering the interleaved decoding flow of the chroma components, the data should be arranged as shown in Fig.26 to gain the benefit of activate-operation skipping. Three rows form a set that stores four luma 32x32 blocks and their corresponding chroma data. However, this costs some memory data size utilization. For instance, at the tail of every image row we may only need to store Y2, Cb2, and Cr2, so the space reserved for Y3, Cb3, and Cr3 is wasted. With this data arrangement, a single request of the chroma interpolation may access data in different rows of the same bank, which the bank arrangement guarantees never happens in the luma interpolation; this may cause extra latency. However, since the block size and reference data access of chroma are much smaller than those of luma, the overall performance does not drop by much. Table 7 lists the data size utilization for different formats, and Table 8 lists the required frequency to support motion compensation of the luma and chroma components in the 525 SD format.

Fig.26 Chroma data arrangement

Table 7 Memory data size utilization

Format   Frame size   Data size utilization
QCIF     176x144      92.8%
CIF      352x288      91.7%
525SD    720x480      93.8%
720HD    1280x720     100%

Table 8 Performance analysis under 525 SD

Sequence (525SD, QP 20)   Miss rate   Required frequency
Crew                      1.54%       10.44 MHz
Night                     1.80%        7.26 MHz
Sailormen                 1.86%       15.05 MHz
Harbour                   1.93%       14.20 MHz
Average                   1.78%       11.74 MHz
