1.1 THE SCENE
Pervasive, seamless, high quality digital video has been the goal of companies, researchers and standards bodies over the last two decades. In some areas (for example broadcast television and consumer video storage), digital video has clearly captured the market, whilst in others (such as videoconferencing, video email and mobile video), it is perhaps still too early to judge market success. However, there is no doubt that digital video is a globally important industry which will continue to pervade businesses, networks and homes. The continuous evolution of the digital video industry is being driven by commercial and technical forces. The commercial drive comes from the huge revenue potential of persuading consumers and businesses to:
1. Replace analogue technology and older digital technology with new, efficient, high quality digital video products.
2. Adopt new communication and entertainment products that have been made possible by the move to digital video.
The technical drive comes from continuing improvements in processing performance, the availability of higher capacity storage and transmission mechanisms and research and development of video and image processing technology.
Getting digital video from its source (a camera or a stored clip) to its destination (a display) involves a chain of components or processes. Key to this chain are the processes of compression (encoding) and decompression (decoding), in which bandwidth-intensive ‘raw’ digital video is reduced to a manageable size for transmission or storage, then reconstructed for display. Getting the compression and decompression processes ‘right’ can give a significant technical and commercial edge to a product, by providing better image quality, greater reliability and more flexibility than competing solutions. There is therefore a keen interest in the continuing development and improvement of video compression and decompression methods and systems. The interested parties include entertainment, communication and broadcasting companies, software and hardware developers, researchers and holders of potentially lucrative patents on new compression algorithms.
The early successes in the digital video industry (notably broadcast digital television and DVD-video) were underpinned by international standard ISO/IEC 13818 [1], popularly known as ‘MPEG-2’ (after the working group that developed the standard, the Moving Picture Experts Group). Anticipation of a need for better compression tools has led to the development of two further standards for video compression, known as ISO/IEC 14496 Part 2 (MPEG-4 Visual) [2] and ITU-T Recommendation H.264 / ISO/IEC 14496 Part 10 (H.264) [3]. MPEG-4 Visual and H.264 share the same ancestry and some common features (they both draw on well-proven techniques from earlier standards) but have notably different visions, seeking to improve upon the older standards in different ways. The vision of MPEG-4 Visual is to move away from a restrictive reliance on rectangular video images and to provide an open, flexible framework for visual communications that uses the best features of efficient video compression and object-oriented processing. In contrast, H.264 has a more pragmatic vision, aiming to do what previous standards did (provide a mechanism for the compression of rectangular video images) but to do it in a more efficient, robust and practical way, supporting the types of applications that are becoming widespread in the marketplace (such as broadcast, storage and streaming).
1.2 VIDEO COMPRESSION
Network bit rates continue to increase (dramatically in the local area and somewhat less so in the wider area), high bit rate connections to the home are commonplace and the storage capacity of hard disks, flash memories and optical media is greater than ever before. With the price per transmitted or stored bit continually falling, it is perhaps not immediately obvious why video compression is necessary (and why there is such a significant effort to make it better). Video compression has two important benefits. First, it makes it possible to use digital video in transmission and storage environments that would not support uncompressed raw video. For example, current internet throughput rates are insufficient to handle uncompressed video in real time (even at low frame rates or small frame size). A Digital Versatile Disk (DVD) can only store a few minutes of raw video at television quality resolution and frame rate, so DVD video storage would not be practical without video and audio compression. Second, video compression enables more efficient use of transmission and storage resources. If a high bit rate transmission channel is available, then it is a more attractive proposition to send high resolution compressed video or multiple compressed video channels than to send a single, low resolution, uncompressed stream. Even with constant advances in storage and transmission capacity, compression is likely to be an essential component of multimedia services for many years to come.
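The storage argument can be checked with simple arithmetic. The sketch below assumes, purely for illustration, a 720x576 frame at 25 frames per second with an average of 12 bits per pixel (8-bit samples with 4:2:0 chroma subsampling) and a 4.7 GB single-layer disc. None of these figures comes from the text, and the function names are hypothetical.

```python
def raw_bit_rate(width, height, fps, bits_per_pixel):
    """Bits per second needed to carry uncompressed video."""
    return width * height * fps * bits_per_pixel


def seconds_on_disc(capacity_bytes, bit_rate):
    """Playing time that fits on a disc at the given bit rate."""
    return capacity_bytes * 8 / bit_rate


# Assumed figures (not from the text): standard-definition frame, 4:2:0 sampling.
rate = raw_bit_rate(width=720, height=576, fps=25, bits_per_pixel=12)
duration = seconds_on_disc(capacity_bytes=4.7e9, bit_rate=rate)

print(f"raw bit rate: {rate / 1e6:.1f} Mbit/s")
print(f"disc holds:   {duration / 60:.1f} minutes of raw video")
```

Under these assumptions the raw stream needs well over 100 Mbit/s, and the disc is exhausted long before a feature-length programme would finish.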
An information carrying signal may be compressed by removing redundancy from the signal. In a lossless compression system statistical redundancy is removed so that the original signal can be perfectly reconstructed at the receiver. Unfortunately, at the present time lossless methods can only achieve a modest amount of compression of image and video signals. Most practical video compression techniques are based on lossy compression, in which greater compression is achieved with the penalty that the decoded signal is not identical to the original. The goal of a video compression algorithm is to achieve efficient compression whilst minimizing the distortion introduced by the compression process.
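The difference between the two approaches can be sketched with the Python standard library. The example below is only an illustration: the synthetic image, the quantization step and the use of zlib are all assumptions, and real video codecs quantize transformed data rather than raw samples.

```python
import random
import zlib

WIDTH, HEIGHT, STEP = 64, 64, 16  # hypothetical image size and quantization step

# Synthetic 8-bit "image": a horizontal gradient plus mild noise (made-up data).
rng = random.Random(0)
original = bytes(x * 3 + rng.randrange(8) for _ in range(HEIGHT) for x in range(WIDTH))

# Lossless: every byte is recovered exactly after decompression.
lossless = zlib.compress(original)
restored = zlib.decompress(lossless)

# Lossy (illustrative): round each sample down to a multiple of STEP, then compress.
quantized = bytes((v // STEP) * STEP for v in original)
lossy = zlib.compress(quantized)
decoded = zlib.decompress(lossy)

# Distortion introduced by the lossy path, measured as the largest sample error.
max_error = max(abs(a - b) for a, b in zip(original, decoded))

print("original bytes:  ", len(original))
print("lossless bytes:  ", len(lossless), "exact:", restored == original)
print("lossy bytes:     ", len(lossy), "max error:", max_error)
```

The lossy path produces a smaller output at the cost of a bounded error per sample, which is the trade-off the paragraph above describes.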
Video compression algorithms operate by removing redundancy in the temporal, spatial and/or frequency domains. The human eye and brain (Human Visual System) are more sensitive to lower frequencies, so higher-frequency detail can often be discarded with little visible effect. By removing different types of redundancy (spatial, frequency and/or temporal) it is possible to compress the data significantly at the expense of a certain amount of information loss (distortion). Further compression can be achieved by encoding the processed data using an entropy coding scheme such as Huffman coding or arithmetic coding.
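As a rough illustration of entropy coding, the following sketch computes Huffman code lengths for a made-up symbol stream. The function name and the data are hypothetical, and the entropy coders actually specified by the standards are considerably more elaborate.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol in them gains one bit.
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


data = "aaaaaaaabbbbccd"  # made-up stream: frequent symbols should get short codes
lengths = huffman_code_lengths(data)
total_bits = sum(lengths[s] for s in data)
fixed_bits = 2 * len(data)  # four distinct symbols need 2 bits each at fixed length

print(lengths, total_bits, fixed_bits)
```

Frequent symbols receive shorter codes, so the variable-length encoding uses fewer bits in total than a fixed-length one.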
Image and video compression has been a very active field of research and development for over twenty years and many different systems and algorithms for compression and decompression have been proposed and developed. In order to encourage inter-working, competition and increased choice, it has been necessary to define standard methods of compression encoding and decoding to allow products from different manufacturers to communicate effectively. This has led to the development of a number of key International Standards for image and video compression, including the JPEG, MPEG and H.26X series of standards.
1.3 MPEG-4 AND H.264
MPEG-4 Visual and H.264 (also known as Advanced Video Coding) are standards for the coded representation of visual information. Each standard is a document that primarily defines two things: a coded representation (or syntax) that describes visual data in a compressed form and a method of decoding the syntax to reconstruct visual information. Each standard aims to ensure that compliant encoders and decoders can successfully inter-work with each other, whilst allowing manufacturers the freedom to develop competitive and innovative products. The standards specifically do not define an encoder; rather, they define the output that an encoder should produce. A decoding method is defined in each standard but manufacturers are free to develop alternative decoders as long as they achieve the same result as the method in the standard.
MPEG-4 Visual and H.264 have related but significantly different visions. Both are concerned with compression of visual data but MPEG-4 Visual emphasizes flexibility whilst H.264’s emphasis is on efficiency and reliability. MPEG-4 Visual provides a highly flexible toolkit of coding techniques and resources, making it possible to deal with a wide range of types of visual data including rectangular frames (traditional video material), video objects (arbitrary-shaped regions of a visual scene), still images and hybrids of natural (real-world) and synthetic (computer-generated) visual information. MPEG-4 Visual provides its functionality through a set of coding tools, organized into ‘profiles’, recommended groupings of tools suitable for certain applications. Classes of profile include ‘simple’ profiles (coding of rectangular video frames), object-based profiles (coding of arbitrary-shaped visual objects), still texture profiles (coding of still images or texture), scalable profiles (coding at multiple resolutions or quality levels) and studio profiles (coding for high quality studio applications).
In contrast with the highly flexible approach of MPEG-4 Visual, H.264 concentrates specifically on efficient compression of video frames. Key features of the standard include compression efficiency (providing significantly better compression than any previous standard), transmission efficiency (with a number of built-in features to support reliable, robust transmission over a range of channels and networks) and a focus on popular applications of video compression. Only three profiles are currently supported (in contrast to nearly 20 in MPEG-4 Visual), each targeted at a class of popular video compression applications. The Baseline profile may be particularly useful for ‘conversational’ applications such as video conferencing, the Extended profile adds extra tools that are likely to be useful for video streaming across networks and the Main profile includes tools that may be suitable for consumer applications such as video broadcast and storage.
1.4 INTRODUCTION
With modern advances in computer processing and multimedia applications have come corresponding improvements in image processing and video compression.
Video compression reduces high-resolution video to a more compact representation, thereby reducing the storage and video processing resources needed during playback. Reduced memory requirements allow lengthy video segments to be stored on portable media and improve the mobility and transferability of large files. Bandwidth requirements also fall during file transfers, so quicker download and upload times are achieved over the Internet and other transfer protocols.
Videos are produced by playing a series of frames (or images) in sequence. Video compression therefore reduces to specialized forms of image compression with specific consideration for video playback. Video compression techniques fall into one of two categories: lossless compression and lossy compression. Lossy compression sacrifices certain finer image details to save additional bandwidth or storage space. Lossless compression, on the other hand, compresses data such that the decompressed data is an exact replica of the original. For many types of binary data, such as documents and various programs, lossless compression is required because the integrity of the original data needs to be preserved. Many types of multimedia, on the other hand, need not be reproduced exactly. An approximation of the original image is usually sufficient for most purposes, as long as the error between the original and the compressed image is tolerable.
In performing lossy compression, a common technique is to remove redundant information between adjacent frames to reduce memory and bandwidth requirements. This technique is referred to as motion estimation (ME), and it is used by current standards such as H.264 and MPEG-4. These standards exploit and remove temporal redundancies between successive frames or, more simply, select a reference frame and predict subsequent frames based on the reference frame. Motion estimation assumes that the objects in the scene possess only translational motion. This assumption holds as long as there is no pan, zoom, change in luminance or rotational motion. Motion estimation is an intensive process which generally consumes 60-90% of the computation time of the encoder or micro-controller performing it.
The ME process begins by dividing the current frame into macroblocks. A macroblock is typically 16x16 pixels, but its size can vary between ME techniques according to the desired tradeoff between resolution and computational cost. Each macroblock of the current frame is compared with candidate macroblocks of a reference frame by calculating a cost value at selected search points. A current macroblock and a sufficiently similar reference macroblock are then paired. For each pair, a vector denoting the displacement between the matching reference macroblock and the current macroblock is determined. These vectors are known as motion vectors, and represent the displacement of matching macroblocks from the reference frame to the current frame for use in the prediction process.
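The pairing step described above can be sketched as an exhaustive (full) search. This is a minimal illustration rather than the method of any particular encoder: the function names, the 4x4 block size, the search range of 2 and the synthetic frames are all assumptions, and the motion vector is taken here to point from the current block position to its match in the reference frame.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, cx, cy, n, search_range):
    """Return ((dx, dy), cost) of the best reference match for the block at (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, ref, cx, cy, cx, cy, n)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Made-up 12x12 reference frame in which every pixel value is distinct.
SIZE = 12
ref = [[x + 16 * y for x in range(SIZE)] for y in range(SIZE)]
# Current frame: the same content moved 2 pixels right and 1 pixel down.
cur = [[ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(SIZE)]
       for y in range(SIZE)]

mv, cost = full_search(cur, ref, cx=4, cy=4, n=4, search_range=2)
print(mv, cost)
```

Because the content moved right and down, the matching block lies up and to the left in the reference frame, so the vector found has negative components.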
Using the reference frame and motion vectors, one can now reconstruct an approximation of the current frame (now the reconstructed frame) by copying the matching reference macroblock of the reference frame to the location noted by the corresponding motion vectors. This form of image reconstruction is also known as motion compensation. In this manner, subsequent frames can be continually predicted, without having to store redundant macroblocks from a current frame into memory.
Certain macroblocks of the reconstructed frame are thus produced simply by copying a matching macroblock from a reference frame according to a motion vector. This process reduces video size by omitting the storage of redundant macroblocks. The level of compression varies with the number of macroblocks replaced from frame to frame and with the desired image resolution.
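Motion compensation as described above amounts to a block copy driven by the motion vectors. The sketch below is illustrative only: the function name, the dictionary layout of the vectors, the 4x4 block size and the frame contents are assumptions, and it assumes every vector keeps its block inside the reference frame.

```python
def motion_compensate(ref, motion_vectors, n):
    """Build a predicted frame by copying n x n blocks from the reference frame.

    motion_vectors[(bx, by)] = (dx, dy) gives, for the block whose top-left
    corner is (bx, by) in the predicted frame, the offset of its matching
    block in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    predicted = [[0] * width for _ in range(height)]
    for (bx, by), (dx, dy) in motion_vectors.items():
        for j in range(n):
            for i in range(n):
                predicted[by + j][bx + i] = ref[by + dy + j][bx + dx + i]
    return predicted


# Made-up 8x8 reference frame and one hypothetical vector per 4x4 block.
ref = [[x + 8 * y for x in range(8)] for y in range(8)]
vectors = {
    (0, 0): (0, 0),    # unchanged block
    (4, 0): (-1, 0),   # content came from one pixel to the left
    (0, 4): (0, -2),   # content came from two pixels above
    (4, 4): (-2, -1),
}
predicted = motion_compensate(ref, vectors, n=4)
print(predicted[4][4], predicted[0][4])
```

Only the reference frame and four short vectors are needed to rebuild the whole predicted frame here, which is where the saving comes from.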
The matching process in ME compares selected pixels from a current macroblock with the corresponding pixels of a reference macroblock using a cost function. A search algorithm selects the search points, which indicate the pixels to be used for comparison in the matching process. The cost function returns a value indicating the degree of similarity between the compared search points. One of the more common cost functions for measuring the similarity between two input images is the sum of absolute differences (SAD): the greater the similarity between the two inputs, the smaller the resulting SAD value. The matching process therefore applies the cost function to the search points of a current macroblock and of a reference macroblock to determine how similar the two macroblocks are. If the cost value is sufficiently low, the reference macroblock is suitable to replace the current macroblock in motion estimation.
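A small numeric example may make the cost function concrete. In the sketch below the blocks, the acceptance threshold and the checkerboard choice of search points are all hypothetical, chosen only to show how SAD behaves for similar and dissimilar inputs.

```python
def sad(cur_block, ref_block, points=None):
    """Sum of absolute differences over the chosen pixel positions.

    points: iterable of (row, col) positions to compare; None means every pixel.
    """
    if points is None:
        points = [(r, c) for r in range(len(cur_block))
                  for c in range(len(cur_block[0]))]
    return sum(abs(cur_block[r][c] - ref_block[r][c]) for r, c in points)


cur = [[10, 12, 14, 16],
       [11, 13, 15, 17],
       [12, 14, 16, 18],
       [13, 15, 17, 19]]
similar = [[10, 12, 15, 16],
           [11, 13, 15, 17],
           [12, 13, 16, 18],
           [13, 15, 17, 20]]
different = [[90, 80, 70, 60],
             [90, 80, 70, 60],
             [90, 80, 70, 60],
             [90, 80, 70, 60]]

THRESHOLD = 8  # hypothetical acceptance threshold
# Hypothetical search points: compare only every other pixel to save work.
checkerboard = [(r, c) for r in range(4) for c in range(4) if (r + c) % 2 == 0]

print(sad(cur, similar), sad(cur, similar) <= THRESHOLD)      # 3 True
print(sad(cur, different), sad(cur, different) <= THRESHOLD)  # 968 False
print(sad(cur, similar, checkerboard))                        # 2
```

Comparing only a subset of pixels lowers the cost of each evaluation, at the risk of missing differences that fall on the skipped positions.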
1.5 MOTIVATION
According to the published literature, motion estimation is the most time-consuming part of encoding. To examine this process further, we divide it into two parts: integer motion estimation and fractional motion estimation. With the original algorithm unchanged, integer motion estimation takes most of the time, mainly because the search window is too large. Our idea is therefore simple: shrink the search window. Reducing the search range is the most effective way to do so, and it also saves a significant number of memory accesses. This is the main reason we choose this approach over other methods such as search pattern rearrangement. Under the original condition, fractional motion estimation has no obvious effect on encoding time. However, once a fast algorithm is applied to integer motion estimation, the portion of encoding time due to fractional motion estimation grows. Based on the assumption of a uni-modal error surface, we use the results of the half-pixel step to predict the slope of the error surface. We also apply an early termination technique. Because the processing order of the system is unchanged, we use information from the integer part to predict the threshold for the fractional part. Exploiting hardware parallelism for speed-up is also a common method in the H.264 research field. To trade off speed against area, we use a certain degree of parallelism and decompose variable block sizes into 4x4 blocks. For speed-up, we reach the goal by applying the early termination technique.
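The effect of the search range on workload can be estimated with a short calculation. The sketch below assumes an exhaustive search over a square window of plus or minus R integer positions and a 16x16 block; the ranges shown are arbitrary examples and do not represent the algorithm developed in this thesis.

```python
def full_search_candidates(search_range):
    """Number of integer-pixel candidates in a square +/-search_range window."""
    return (2 * search_range + 1) ** 2


def sad_operations(search_range, block_size=16):
    """Absolute-difference operations per block for an exhaustive search."""
    return full_search_candidates(search_range) * block_size ** 2


for r in (16, 8, 4):  # example ranges only
    print(f"range +/-{r}: {full_search_candidates(r)} candidates, "
          f"{sad_operations(r)} absolute differences per block")
```

Because the candidate count grows with the square of the range, halving the range removes roughly three quarters of the candidate positions.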
1.6 THESIS ORGANIZATION
In this thesis, we introduce the H.264 standard and some published algorithms in Chapter 2. For integer motion estimation, we develop a fast algorithm, dynamic search range prediction, which is detailed in Chapter 3. For fractional motion estimation, a fast algorithm named adaptive search pattern prediction is described in Chapter 4. The co-simulation results of applying both fast algorithms from Chapters 3 and 4 are shown in Chapter 5. We then present the hardware architecture and result comparisons in Chapter 6. Finally, a conclusion is given in Chapter 7.