
Chapter 2 OVERVIEW OF H.264/AVC STANDARD

2.3 Profiles

A profile defines a set of coding tools or algorithms that can be used to generate a conforming bit stream. H.264/AVC defines three profiles: Baseline, Main, and Extended. The Baseline profile supports all features of H.264/AVC except the following two feature sets:

‧ Set 1: B slices, weighted prediction, CABAC, field coding, and picture or macroblock adaptive switching between frame and field coding.

‧ Set 2: SP/SI slices, and slice data partitioning.

The first set of additional features is supported by the Main profile. However, the Main profile does not support FMO, ASO, and redundant slices, which are supported by the Baseline profile. The Extended profile supports all features of the Baseline and Main profiles except CABAC. The Baseline profile is used for lower-cost applications with limited computational resources, such as videoconferencing, Internet multimedia, and mobile applications. The Main profile targets mainstream consumer applications such as broadcast systems and storage devices. The Extended profile is intended as the streaming-video profile, with extra coding tools for robustness to data loss and for server stream switching. Fig.6 shows the relationship among these three profiles.

Fig.6 Profiles

Chapter 3

DATA MAPPING AWARE FRAME MEMORY CONTROLLER

3.1 Background

Fig.7 Simplified architecture of typical 4-bank SDRAM

3.1.1 Features of SDRAM

A brief illustration of the 4-bank SDRAM architecture is shown in Fig.7. This kind of SDRAM is a three-dimensional structure of banks, rows, and columns. Each bank contains its own row decoder, column decoder, and sense amplifiers, while the four banks share the same command, address, and data buses. SDRAMs provide a programmable burst length, CAS latency, and burst type, and the mode register is used to define these operation modes. While the mode register is being updated, all banks must be idle, and the controller must wait for the specified time before initiating any subsequent operation; violating either requirement results in unspecified operation.
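As a small illustration of how these modes are programmed, the minimal sketch below composes a mode-register value. The field layout (A0-A2 burst length, A3 burst type, A4-A6 CAS latency) follows the common JEDEC SDR SDRAM convention and is our assumption rather than something specified in this chapter.

# Minimal sketch of mode-register encoding, assuming the common JEDEC
# SDR SDRAM field layout: A0-A2 burst length, A3 burst type,
# A4-A6 CAS latency (this layout is an assumption, not from the text).
BURST_LENGTH_CODE = {1: 0b000, 2: 0b001, 4: 0b010, 8: 0b011}

def mode_register(burst_length=4, interleaved_burst=False, cas_latency=3):
    # All banks must be idle when the LOAD MODE REGISTER command drives
    # this value onto the address bus.
    value = BURST_LENGTH_CODE[burst_length]
    value |= (1 if interleaved_burst else 0) << 3  # 0: sequential burst
    value |= cas_latency << 4
    return value

print(bin(mode_register(burst_length=4, cas_latency=3)))  # 0b110010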

Fig.8(a) Typical read access in SDRAM

Fig.8(b) Typical write access in SDRAM

Fig.8(a) and Fig.8(b) show a typical read access and a typical write access. A memory access consists of three steps. First, an active command is sent to open a row in a particular bank, which copies the row data into the sense amplifiers. Second, a read or write command is issued to initiate a burst read/write access to the active row. The starting column and the bank address are provided on the address bus, and the burst length and type are as defined in the mode register in advance. Any read/write burst may be truncated by a subsequent read/write command, as shown in Fig.9(a); the first data element of the new burst follows either the last element of a completed burst or the last desired element of a longer burst that is being truncated. Finally, a precharge command deactivates the open row in a particular bank, or the open rows in all banks. The bank(s) become available for a subsequent memory access once the precharge time has elapsed. Many SDRAMs provide an auto-precharge feature, which performs the individual-bank precharge function, without a separate command, right after the completion of the read/write burst. This is accomplished by setting a flag when the read/write command is sent. Another permitted command can then be issued during the cycle the precharge command would otherwise occupy, improving the utilization of the command bus.

Fig.9(a) Random read access

Fig.9(b) Random write access

Since each bank can operate independently, row-activation commands can be overlapped to reduce the number of cycles spent on row activation, as shown in Fig.10. Take a read access as an example. Assume 8 data words are to be read from the SDRAM, 4 lying in row 0 of bank 0 and the other 4 in row 1 of bank 1. Without bank alternating, 16 cycles are needed to get the 8 words, and the access occupies the command bus for 16 cycles. With alternating access, only 12 cycles are needed and the command bus is busy for only 8 cycles, so a subsequent command can be sent to pipeline the following operation. The more data is requested, the more cycles are saved; for 8 words, the read-access latency is reduced by 25%.

Fig.10 Bank alternating read access
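The saving can be sanity-checked with a toy cycle model. The sketch below assumes tRCD = 2 cycles, CAS latency = 2, and burst length = 4; these timing values are our own assumptions, chosen because they reproduce the 16-cycle and 12-cycle figures above.

# Toy cycle count for the example above: 8 data words in two rows,
# assuming tRCD = 2 cycles, CAS latency (CL) = 2, burst length (BL) = 4.
T_RCD, CL, BL = 2, 2, 4

def serial_cycles(n_rows):
    # Rows accessed strictly one after another: every row pays the full
    # activate-to-data latency before its burst appears.
    return n_rows * (T_RCD + CL + BL)

def interleaved_cycles(n_rows):
    # Activates to different banks are overlapped, so only the first row
    # pays the latency; later bursts follow back-to-back on the data bus.
    return T_RCD + CL + n_rows * BL

print(serial_cycles(2))       # 16 cycles without bank alternating
print(interleaved_cycles(2))  # 12 cycles with bank alternating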

Some important timing characteristics are listed below. The behavioral model used in our design is Micron's MT48LC8M32B2P 256 Mb SDRAM [7], organized as 4 banks by 4,096 rows by 512 columns by 32 bits.

Table 1 SDRAM timing characteristics

3.1.2 Related work

Targeting video codec applications, many designs have been proposed to improve SDRAM bandwidth utilization and achieve efficient memory access. Li [8] develops a bus arbitration algorithm optimized for different processing units to meet real-time performance requirements. Ling's [9] controller schedules DRAM accesses in a pre-determined order to lower the peak bus bandwidth. Kim's [10] memory interface adopts an array-translation technique to reduce power consumption and increase memory bandwidth. Park's [11] history-based memory controller reduces page breaks to achieve energy and memory-latency reductions.

For H.264 applications, Kang's [12] AHB-based scalable bus architecture and dual memory controller support 1080 HD decoding under 130 MHz. Zhu's [13] SDRAM controller applies the main idea of Kim's memory interface to HDTV applications. It focuses on data arrangement and memory mapping to reduce page-activation overheads, so it not only improves throughput but also lowers power consumption. However, it does not take memory-operation scheduling into consideration. With careful scheduling, some of the bus bandwidth lost to page-activation operations can be recovered. We combine data mapping and operation scheduling in our design to decrease the bandwidth requirement for real-time decoding.

3.1.3 Problem definition

Each time a closed row is activated during a memory access, we suffer the latency introduced by the SDRAM's inherent structure. For a read access, the latency consists of tRP (precharge), tRCD (activate), and the CAS latency; for a write access, it includes tRP (precharge) and tRCD (activate). There are two ways to ease this overhead. One is to reduce the number of active commands required: either the requested data should lie in as few rows as possible within a single request, or the probability of a row miss between successive requests must be made as small as possible. Since the number of row openings is decreased, the latency suffered is shortened. Fig.11 shows the effect of row-opening reduction on access latency. The other is to apply bank alternating to hide the latency. The total number of required operations remains the same, but by taking advantage of the multi-bank architecture, a free bank can start another operation while the other banks service the requested work, so the latency of the current access is overlapped, as shown in Fig.12. In this case, the required data should lie in different banks when a row miss happens, or the bank-interleaving technique will fail to improve data-bus utilization.

Fig.11 Memory access under different row miss count

Fig.12 Cycles overlapped by bank interleaving technique

For video applications, a memory request usually fetches a fixed-size rectangular image region from frame memory, as in motion compensation, intra prediction, and the deblocking-filter process. These data are spatially contiguous, and the regions that may be requested by two consecutive blocks have a high probability of overlapping. For instance, in motion compensation the range of data that may be needed is bounded by the block size and the search range set during encoding. If the search range is L and the block length is N, a 2L by (2L+N) rectangle overlaps between adjacent blocks, and data in this region have a high probability of having been opened by the previous block access. Thus a mapping can be found that avoids the row miss. Fig.13 illustrates an example of this characteristic.

Fig.13 Possible required area between adjacent blocks

According to these characteristics of video data, the translation between physical location in memory and pixel coordinates in the spatial domain affects the probability of a row miss. We can analyze the statistics of video sequences to find an optimized translation that minimizes the number of row openings, and the controller can then schedule memory operations to enhance efficiency. As a result, the bandwidth utilization of the same SDRAM is increased.

3.1.4 Estimation of bandwidth requirement

H.264 achieves higher video-compression efficiency than previous standards such as MPEG-2 and MPEG-4. This is attributed to advanced features like fractional-pixel interpolation, variable block sizes, multiple reference frames, etc. However, these features lead to a high bandwidth requirement in implementation. Below we briefly discuss the bandwidth requirements of the different processing units. Assume the frame width is W, the height is H, the frame rate is F, and all sequences are in 4:2:0 format.

Reference frame storage

In an H.264 decoder, each processed frame must be stored in memory for reference by following frames. The required bandwidth is

BW_RFS = W * H * (1_Y + 0.25_Cb + 0.25_Cr) * F

Loop filter

A deblocking filter is adopted in H.264 to improve subjective quality. For luma, a 16x4 block and a 4x16 block adjacent to the current macroblock are referenced while deblocking the current macroblock; for chroma, an 8x4 block and a 4x8 block are referenced for each component. As a result, the required bandwidth is

BW_LP = (W/16) * (H/16) * ((16*4*2)_Y + (8*4*2)_Cb + (8*4*2)_Cr) * F

Fig.14 illustrates the reference blocks for the loop filter.

Fig.14 Reference blocks for deblocking filter

Motion compensation

H.264 supports variable block size motion compensation to enhance coding efficiency as shown in Fig.15.

Fig.15 Block types in H.264

Besides, sub-pixel interpolation is applied to further improve coding performance, but extra bandwidth is required to meet real-time decoding.

Table 2 lists the maximum reference block size for each block type. Consider the worst case, in which all blocks are broken into the smallest 4x4 size with maximum reference blocks, and all frames except the first are P-frames. The required bandwidth is then

BW_MC = (W/16) * (H/16) * ((9*9*16)_Y + (3*3*16)_Cb + (3*3*16)_Cr) * F

The effect of the first frame is neglected in the above formula.

Table 2 Maximum area of reference blocks

luma block type    max. size of reference block

Summing up the above three cases, we obtain a rough bandwidth requirement for an H.264 decoder. The bandwidth for different frame sizes is listed in Table 3, assuming a frame rate of 30 fps in all cases.

Table 3 Rough estimation of required bandwidth in different frame size

format     width   height   BW_RFS   BW_LP   BW_MC    BW_all   unit
1080 HD    1920    1080     93.31    62.21   384.91   540.43   MBps
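The three estimates can be reproduced with a short script implementing the formulas above for 4:2:0 video; the 1080 HD row of Table 3 serves as a check.

# Rough H.264 decoder bandwidth estimation (4:2:0), per the formulas above.
def bw_rfs(w, h, f):
    # Reference frame storage: each pixel written once, 1.5 bytes in 4:2:0.
    return w * h * (1 + 0.25 + 0.25) * f

def bw_lp(w, h, f):
    # Loop filter: per macroblock, 16x4 + 4x16 luma blocks plus
    # 8x4 + 4x8 blocks for each chroma component.
    return (w / 16) * (h / 16) * (16*4*2 + 8*4*2 + 8*4*2) * f

def bw_mc(w, h, f):
    # Motion compensation worst case: sixteen 4x4 partitions per
    # macroblock, each fetching a 9x9 luma and two 3x3 chroma blocks.
    return (w / 16) * (h / 16) * (9*9*16 + 3*3*16 + 3*3*16) * f

w, h, f = 1920, 1080, 30
total = bw_rfs(w, h, f) + bw_lp(w, h, f) + bw_mc(w, h, f)
print(bw_rfs(w, h, f) / 1e6, bw_lp(w, h, f) / 1e6,
      bw_mc(w, h, f) / 1e6, total / 1e6)
# 93.312 62.208 384.912 540.432 MBps, matching the 1080 HD row of Table 3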

With the growth of frame size, the required bandwidth increases rapidly, and the motion-compensation unit dominates the demand for memory bandwidth. However, SDRAM bandwidth cannot achieve 100% utilization, so performance must be improved through optimized data arrangement and careful operation scheduling.

3.2 Optimization of memory access

3.2.1 Intra-request optimization

In this section we discuss the memory access operations within a single request and build a mathematical model to describe this behavior. From the model, we can find an optimized mapping between physical location and position in the spatial domain. With the optimized translation, the probability of a row miss is reduced, and at most 3 row breaks occur. Employing the bank-interleaving mechanism, some latency can be hidden to further improve bus utilization.

In an H.264 decoder, the most bandwidth-consuming part is motion compensation, and within it, luma interpolation takes most of the bandwidth. With a data-arrangement optimization based on luma-interpolation behavior, we can improve the common case of bandwidth usage. For luma interpolation, the referenced data range from 21x21 down to 4x4 rectangles, and each request can be characterized by its starting point, data width, and data height.

Fig.16 1-D row break probability model

To ease analysis without loss of generality, we reduce this 2-D problem to the 1-D domain. Assume an SDRAM row contains L pixels and N continuous data are requested. If the starting point lies in the first L-N+1 positions of the row, no row miss occurs; if it lies in the last N-1 positions, a row miss happens. The last N-1 positions form an alert interval, as shown in Fig.16. Assuming the starting-point position is uniformly distributed, the probability of a row miss is

P_miss = (N - 1) / L

For a constant data length N, a larger row means fewer row misses; for a fixed row size, a longer data length leads to a higher row-miss probability. For example, a 21-pixel request in a 2048-pixel row misses with probability 20/2048, slightly under 1%.

Extending the above conclusion to the 2-D domain, the total row miss consists of horizontal and vertical row misses. Approximating the total row-miss probability as the sum of the two directions gives

P_miss = (N_X - 1) / L_X + (N_Y - 1) / L_Y

where L_X and L_Y denote the width and height of the memory-row window, and N_X and N_Y the width and height of the requested data. However, the row size is fixed for a given SDRAM, so the width and height of the row window constrain each other: L_X * L_Y = Ω, where Ω is the number of pixels stored in one row. The row-miss probability should therefore be adjusted as follows:

P_miss(L_X) = (N_X - 1) / L_X + (N_Y - 1) * L_X / Ω

Thus, the row-miss probability has its minimum value when L_X is equal to sqrt(Ω * (N_X - 1) / (N_Y - 1)).
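A quick numerical check of this minimization is sketched below. Since the exact Table 4 averages did not survive extraction, the data lengths used here are illustrative assumptions; the point is that for a 2048-pixel row and similar average lengths in both directions, the optimum window is close to square, consistent with the 46x44 rectangle derived below.

import math

OMEGA = 2048  # pixels per SDRAM row (16384 bits at 8 bits per pixel)

def miss_prob(lx, nx, ny):
    # Row-miss probability for an lx-by-(OMEGA/lx) window and an
    # nx-by-ny request, assuming uniformly distributed starting points.
    return (nx - 1) / lx + (ny - 1) * lx / OMEGA

def best_lx(nx, ny):
    # Zero of the derivative of miss_prob with respect to lx.
    return math.sqrt(OMEGA * (nx - 1) / (ny - 1))

nx = ny = 10.0  # illustrative average request lengths, not Table 4 data
print(round(best_lx(nx, ny), 1))        # 45.3 -> near-square window
print(round(miss_prob(45, nx, ny), 3))  # ~0.398 near the optimum
print(round(miss_prob(64, nx, ny), 3))  # ~0.422 for a 64-pixel-wide window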

In H.264, the data lengths that may be requested for motion compensation are 4, 8, 9, 13, 16, and 21 pixels. The length is determined by the block mode and the sub-pixel motion vector. The distribution of data lengths is shown below.

Table 4 Data length distribution

Direction    Data Length    Avg. Probability (%)

As the quantization parameter increases, the encoder tends to choose larger block modes, which means long data lengths are requested more frequently. The test sequences are crew, night, sailormen, and harbour at 525 SD frame size with QP = 32.

However, the starting point is not uniformly distributed in many video sequences. Positions divisible by 4, i.e., the 0th, 4th, 8th, 12th, ..., 4k-th pixels of the row window vertically or horizontally, appear with higher probability. Generally speaking, p_4k is 1.5 to 2.5 times larger than the other position probabilities, where p_4k denotes the probability of the 0th, 4th, 8th, ..., 4k-th positions. This is because the smallest block length is 4 and because of the zero-motion-vector effect: blocks with zero motion vectors are usually referenced for background image. For larger quantization parameters, this effect becomes more significant. Thus the row-miss probability of H.264 motion compensation becomes

P_miss = Σ_N P_X(N) * w(N, L_X) + Σ_N P_Y(N) * w(N, L_Y)

where P_X(N) and P_Y(N) are the probabilities of data length N in the horizontal and vertical directions from Table 4 (for example, P_X(4) is the probability that the horizontal data length equals 4 pixels), and w(N, L) is the weighted fraction of starting positions that fall in the alert interval of length N-1, with the positions divisible by 4 weighted twice as heavily as the others for simplification.

Applying the statistics in Table 4, we can evaluate the row-miss probability as a function of Ω and the window shape. A row of an SDRAM can store 16384, 8192, or 4096 bits, that is, 2048, 1024, or 512 pixels.

The SDRAM behavioral model we use contains 2048 pixels in a row, and the optimized window size turns out to be a 46x44 rectangle. However, the translation is hard to implement with a 46x44 window, so we adjust the window size to 64x32. Because 32 and 64 are powers of 2, the translation can be implemented with simple bit shifts. Fig.17 shows the row-miss statistics for different window sizes; the test sequences are crew, night, sailormen, and harbour at 525 SD frame size.

Compared with a linear translation such as a 1x2048 or 2048x1 window, the 64x32 mapping reduces the miss rate by about 84%. Larger rows decrease the probability of a row break, so the 32x32 window has a higher row-miss count than 64x32. Due to the characteristics of video sequences, horizontal breaks occur more frequently than vertical ones, so a window with slightly larger width gives better performance.


Fig.17 Miss rate in different row windows

Fig.18 Translation of physical location and image position

Fig.18 illustrates the mapping between physical location in memory and image position in the spatial domain. The latency of a single request can be reduced with bank-alternating operation. Considering the SDRAM architecture, the banks touched by a single request should all be different, since this leads to the greatest latency reduction. It is easy to find a bank arrangement in which all adjacent windows lie in different banks, as shown in Fig.19. If the requested data size is less than twice the DRAM window size in both the vertical and horizontal directions, the maximum number of row-break occurrences is 4. For example, with a 64x32 window, the requested data must be smaller than 128x64.
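A minimal sketch of this translation is given below, assuming a 64x32 window, a 2x2 tiling of the four banks, a frame width padded to a multiple of 128 pixels, and 4 pixels per 32-bit word; the function name and exact bit packing are illustrative, not taken from the thesis.

# Sketch of the pixel-to-SDRAM address translation for a 64x32 row
# window with a 2x2 bank tiling (illustrative layout; assumes the frame
# width is padded to a multiple of 128 pixels).
def translate(x, y, frame_width):
    wx, wy = x >> 6, y >> 5             # window coordinates: x/64, y/32
    bank = (wx & 1) | ((wy & 1) << 1)   # 2x2 tiling: all neighbors differ
    # Each bank holds every second window in each direction.
    wins_per_bank_row = (frame_width >> 6) >> 1
    row = (wy >> 1) * wins_per_bank_row + (wx >> 1)
    # Pixel offset inside the 64x32 window; with 4 pixels per 32-bit
    # word, the physical column address is the offset divided by 4.
    offset = ((y & 31) << 6) | (x & 63)
    return bank, row, offset >> 2

print(translate(100, 40, frame_width=768))  # (3, 0, 137)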

Fig.19 Bank arrangement with optimization

3.2.2 Memory operations in a single request

With the data arrangement described above, requests can be classified into 3 kinds, as shown in Fig.20. We open all required rows at the beginning of every request; this reduces the control overhead and eases the hardware design.

Case 1: all data of a single access are contained in one row.

This case clearly introduces no row miss, since all requested data are stored in one row. The memory operation consists of row activation, data reading, and precharging. Fig.21 shows the operations for case 1: L+4 cycles are needed to complete the access, where L denotes the number of accessed data.

Case 2: all data of a single access are contained in two rows. The data may be discontinuous horizontally or vertically, as illustrated in Fig.20.

In this case we suffer two row misses, since the accessed data lie in different rows. However, the rows are in different banks, so the latency can be shortened with bank-alternating access. Fig.21 shows the operations for case 2: we open the rows to be accessed, read the data in a determined order, and then precharge the opened rows. The total cycle count is L+5.

Fig.20 Request classification

Case 3: all data of a single access are stored in four rows. The data break both horizontally and vertically, as illustrated in Fig.20.

Four row breaks are encountered in this case. Due to the tRRD (activate-to-activate) timing limitation, one extra cycle of latency is introduced to meet the timing requirement. Fig.21 illustrates the operations; the total cycle count is L+7.

Fig.21 Request operations in different cases

Distribution of request kinds (%):

          qp20     qp32     qp40
case 1    74.68    78.45    84.36
case 2    23.56    20.01    14.55
case 3     1.76     1.54     1.09

Fig.22 Distribution of request kinds

The distribution of the cases mentioned above is shown in Fig.22. The test sequences are crew, night, sailormen, and harbour at 525 SD frame size. As the quantization parameter increases, the bank-crossing cases decrease rapidly: the encoder tends to choose zero motion vectors at high QP, so the data for the motion-compensation process usually lie in the same row within a single request. According to the statistics, case 1 accounts for most accesses, which means we usually need to open only one row per request. Since our data arrangement reduces the number of row misses, the usable bandwidth is higher than with other mappings.
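Combining the per-case cycle counts with the qp32 distribution from Fig.22 gives a back-of-the-envelope average overhead per request; the arithmetic below is only an illustrative check.

# Expected overhead cycles per request (beyond the L data cycles),
# weighted by the qp32 case distribution from Fig.22.
case_prob = {1: 0.7845, 2: 0.2001, 3: 0.0154}
overhead  = {1: 4, 2: 5, 3: 7}  # case 1: L+4, case 2: L+5, case 3: L+7

expected = sum(case_prob[c] * overhead[c] for c in case_prob)
print(round(expected, 2))  # ~4.25 cycles, barely above the best case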

3.2.3 Burst length selection

SDRAMs support burst read access to enhance bus utilization. However, this feature introduces extra overhead with the data arrangement discussed above. We place all data required by a single request in one row as far as possible, but the locations within a request are usually discontinuous, and every discontinuity can waste cycles reading unnecessary data. Taking a 9 by 9 block as an example, we waste 9 cycles with BL = 2, 9 cycles with BL = 4, and 45 cycles with BL = 8, as shown in Fig.23. Many motion-compensation units prefer to receive data in z-scan order to ease data reuse, which worsens this over-access problem.

Fig.23 Wasted data in different data burst length
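The wasted-cycle figures above can be reproduced by counting fetched words, assuming a 32-bit data bus (4 pixels per word) so that an aligned 9-pixel line occupies 3 words; the alignment assumption is ours.

import math

# Wasted data-bus cycles when reading a 9x9 luma block, assuming a
# 32-bit bus (4 pixels per word) and aligned lines of 3 words each.
def wasted_cycles(words_per_line, lines, burst_length):
    bursts = math.ceil(words_per_line / burst_length)
    return (bursts * burst_length - words_per_line) * lines

for bl in (2, 4, 8):
    print(bl, wasted_cycles(words_per_line=3, lines=9, burst_length=bl))
# BL=2 -> 9, BL=4 -> 9, BL=8 -> 45, matching Fig.23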

Though the read burst can be truncated by another read or burst terminate

