Chapter 1 Introduction
1.5 Thesis Organization
In the remainder of this thesis, related work on motion detection, video data hiding, video authentication, privacy protection in surveillance videos, and the H.264 standard is reviewed in Chapter 2. In Chapter 3, the proposed method of motion detection and its application to video-content search are described. In Chapter 4, the proposed video authentication system for surveillance videos is described. In Chapter 5, the proposed method of privacy protection of surveillance videos is presented.
Finally, conclusions and some suggestions for future work are given in Chapter 6.
Chapter 2
Review of Related Works and H.264 Standard
2.1 Review of Techniques for Motion Detection
Many motion detection techniques have been proposed to detect moving objects in a video [1-5]. The techniques can be classified into two categories: those used in the pixel domain [1-2] and those used in the compressed domain [3-4]. Generally speaking, the approaches used in the pixel domain need to fully decode a compressed video bitstream first, but they can be applied to videos coded in different video coding standards. On the other hand, the approaches used in the compressed domain can perform motion detection by only partially decoding a compressed video bitstream, but they can only be applied to videos coded in specific standards.
Haritaoglu et al. [1] proposed a motion detection method based on background subtraction in the pixel domain. They built a statistical model of the background scene that allows moving objects to be detected even when the background scene is not completely stationary. Lipton et al. [2] proposed another approach based on temporal differencing in the pixel domain, that is, the computation of pixel-wise differences between consecutive video frames. The basic idea of the approach is to compare video frames separated by a constant time interval to find moving objects. Zeng et al. [3] proposed another approach in the compressed domain. They employed a block-based Markov random field (MRF) model in a field formed with motion vectors to segment moving objects during a decoding process. The methods mentioned above detect motion by common properties of videos, such as pixel values and motion vectors, but they do not use special features of a specific standard as the main clues.
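The temporal differencing idea reviewed above can be illustrated with a minimal sketch. It is not the implementation of [2]; the function name, the threshold value, and the representation of a frame as a nested list of gray levels are assumptions made only for illustration.

```python
def temporal_difference_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose absolute gray-level change exceeds the threshold.

    Frames are nested lists of gray levels; the threshold is an assumed value.
    """
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# Two tiny 2x3 frames separated by a constant time interval.
prev_frame = [[10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 200, 10], [10, 12, 90]]
mask = temporal_difference_mask(prev_frame, curr_frame)
print(mask)
```

Pixels marked 1 in the mask are candidates for moving objects; small changes such as the one from 10 to 12 are suppressed by the threshold.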
2.2 Review of Techniques for Video Data Hiding
Video data hiding can be used in many video applications, such as video watermarking and video authentication. As a result, many techniques for video data hiding have been introduced [6-10]. Mobasseri et al. [10] embedded data into the code space of CAVLC, one of the entropy coding techniques of H.264. This method is applied directly to a bitstream without video decoding or partial decompression. Huang and Tsai [7] proposed a video data hiding method based on the use of prediction modes and the tree structured macroblock motion compensation of H.264. In addition, they used the Lagrange optimization technique to minimize the image distortion yielded by the data hiding process. Noorkami and Mersereau [6] proposed a robust data hiding algorithm for H.264. The basic idea is to embed data by modifying the DC coefficients in luminance residual blocks of the video and to employ a human visual model adapted for a 4×4 discrete cosine transform block to increase the payload and robustness while limiting visual distortion. Gong and Lu [8] also proposed a data hiding method in the frequency domain. They employed a texture-masking-based perceptual model to adaptively choose the hiding strength of each block. In an H.264 encoding process, if someone re-encodes a video, the prediction modes chosen for the resulting video may differ from those of the original video. As a consequence, if re-encoding is unavoidable, videos protected by robust data hiding methods based on the frequency domain face a critical problem: because the intra prediction modes change, the frequency coefficients carrying the hidden data may differ from the original ones.
This problem may cause a loss of hidden information, so we introduce a robust data hiding method which can endure an H.264 re-encoding process.
2.3 Review of Techniques for Video Authentication
Video authentication plays an important role in a digital rights management system, so many different methods have been proposed to solve the problem [10-13].
Zhang and Ho [11] introduced a video authentication method which makes use of the tree structured motion compensation, motion estimation, and Lagrange optimization of the H.264 standard. As mentioned in the paper, authentication information is embedded based on the best mode decision strategy, in the sense that if a video undergoes any spatial or temporal attack, the scheme can detect the tampering through the sensitive mode change. Pröfrock et al. [12] proposed a method using skipped macroblocks of an H.264 video to embed authentication data.
The data are embedded as a fragile, blind, and erasable watermark with low video quality degradation. In contrast with other authentication methods, the embedding process is done after the H.264 compression process, while the others are done during the process. The methods mentioned above usually use additional authentication information to authenticate videos. How to authenticate videos without external information is an interesting research topic and is investigated in this study.
2.4 Review of Techniques for Privacy Protection in Videos
Privacy protection has become more and more important along with the rise of video surveillance systems. Many different approaches have been introduced in recent years [14-16]. Meuel et al. [14] introduced a method to protect faces in surveillance videos. In this method, any visible information of faces in a video is deleted and embedded in the video, which allows later reconstruction of the faces if needed. In Dufaux et al. [15], the regions containing personal information are scrambled. As a consequence, the scene remains visible, but the privacy-sensitive information is not identifiable. Zhang et al. [16] proposed another method to protect authorized persons, who are not only removed from a surveillance video but also embedded into the video. The above-mentioned methods are based on the concept of protecting the privacy of authorized persons, so the protected persons must first be recognized manually. If there are many persons requiring recognition, this becomes a tedious job. We call this kind of method object-based privacy protection.
In order to solve this problem, we introduce a region-based privacy protection method to avoid recognizing authorized persons by hand. Besides, an authorized user can define the protected region easily.
2.5 Review of H.264 Standard
2.5.1 Structure of H.264 standard
The H.264 standard defines three profiles: Baseline, Main, and Extended, which provide different sets of coding functions and different components required by an encoder or decoder. Because of the individual features of each profile, there are many different potential applications of these three profiles. The applications of the baseline profile include video telephony, video conferencing, and wireless communications; the applications of the main profile include television broadcasting and video storage; and the extended profile may be particularly useful for streaming media applications.
The relation between these profiles is illustrated in Figure 2.1. H.264 videos have a hierarchical structure. A video sequence consists of consecutive video images (frames). A video image is composed of at least one slice. There are five slice types: intra slice (I), predicted slice (P), bi-predictive slice (B), switching P slice (SP), and switching I slice (SI). Generally speaking, the first three slice types are the main slice types which are widely used in H.264 videos. A slice is composed of macroblocks, which can be categorized into four types: I, P, B, and skipped. I macroblocks are predicted from previously encoded and reconstructed blocks in the same slice. P macroblocks are predicted from previously encoded samples before the current frame in temporal order. B macroblocks are predicted from encoded samples before or after the current frame. Skipped macroblocks of a P slice are transmitted with motion vectors but not with frequency coefficients. Skipped macroblocks of a B slice are transmitted with neither motion vectors nor frequency coefficients. How the macroblocks comprise the slices is illustrated in Table 2.1.
Table 2.1 The way that macroblocks comprise slices
          I macroblocks   P macroblocks   B macroblocks   Skipped macroblocks
I slice   √
P slice   √               √                               √
B slice   √                               √               √
Figure 2.1 Relation between the Baseline, Main, Extended profiles.
2.5.2 Process of Encoding
Before describing the process of encoding of H.264 videos, we first introduce the concept of "prediction." Because a video sequence is formed by consecutive similar images, there is a lot of coding redundancy. How to use the high correlation between these similar images to reduce the redundancy has become a main topic in the video compression research field. The basic idea of prediction is to find a block which is the most similar to the current block and to save the difference between the two. There are two modes of prediction: intra mode and inter mode. The intra mode uses the similarity between pixel samples in the same frame, and the inter mode uses the similarity between different frames.
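The basic idea of prediction described above, finding the most similar block and saving only the difference, can be sketched for the inter mode as follows. This is an illustrative exhaustive search over a tiny reference frame, not the search strategy of any particular H.264 encoder; the use of the sum of absolute differences (SAD) as the similarity measure and all function names are assumptions.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame given its top-left corner."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(current_block, reference, size):
    """Exhaustively search the reference frame; return (cost, top, left)."""
    best = None
    for top in range(len(reference) - size + 1):
        for left in range(len(reference[0]) - size + 1):
            cost = sad(current_block, get_block(reference, top, left, size))
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best


def residual(current_block, prediction):
    """Difference between the current block and its prediction."""
    return [[c - p for c, p in zip(cr, pr)]
            for cr, pr in zip(current_block, prediction)]


reference = [
    [0, 0, 0, 0],
    [0, 50, 60, 0],
    [0, 70, 80, 0],
    [0, 0, 0, 0],
]
current_block = [[52, 60], [70, 79]]

cost, top, left = best_match(current_block, reference, 2)
prediction = get_block(reference, top, left, 2)
diff = residual(current_block, prediction)
print(cost, (top, left), diff)
```

Only the position of the best match and the small residual need to be stored, which is much cheaper than storing the block itself when the match is good.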
The process of encoding of H.264 videos is illustrated in Figure 2.2. The figure contains a forward path and a reconstruction path. In the forward path, an input frame Fn is processed in units of a macroblock. Each macroblock is encoded in intra or inter mode and can be partitioned into sub-macroblocks. For each sub-macroblock in the macroblock, a prediction P is formed based on previously encoded, decoded, and reconstructed samples. In the intra mode, P is formed from the samples in the current slice. In the inter mode, P is formed from the samples in past or future frames, which are also called reference frames. The difference between P and the current sub-macroblock is used to produce a residual sub-macroblock that is DCT-based transformed and quantized. The resulting frequency coefficients are reordered and entropy encoded. The coefficients after entropy coding and other information required in a decoding process (prediction modes, motion vectors, etc.) form the compressed bitstream, which is passed to a Network Abstraction Layer (NAL) for transmission or storage. In the reconstruction path, the encoder decodes (reconstructs) the previously encoded data to provide a reference for further predictions. The quantized coefficients are rescaled and inverse transformed to produce a difference sub-macroblock, and the prediction block P is added to the difference sub-macroblock to create a reconstructed sub-macroblock, which is a decoded version of the original sub-macroblock.
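The relation between the forward path and the reconstruction path can be illustrated with a toy sketch. To keep it short, the transform, reordering, and entropy coding stages are omitted and a uniform scalar quantizer with an assumed step size stands in for the H.264 quantizer; all names are hypothetical. The point shown is only that the encoder reconstructs exactly what a decoder would obtain from the same quantized data and the same prediction P.

```python
def quantize(residual, step):
    """Uniform scalar quantization, rounding half away from zero (assumed)."""
    def q(r):
        return int(r / step + 0.5) if r >= 0 else -int(-r / step + 0.5)
    return [[q(r) for r in row] for row in residual]


def rescale(levels, step):
    """Inverse of quantize up to the quantization error."""
    return [[level * step for level in row] for row in levels]


def add_blocks(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encode_block(current, prediction, step):
    """Forward path (residual, quantize) plus reconstruction path."""
    residual = [[c - p for c, p in zip(cr, pr)]
                for cr, pr in zip(current, prediction)]
    levels = quantize(residual, step)
    reconstructed = add_blocks(prediction, rescale(levels, step))
    return levels, reconstructed


def decode_block(levels, prediction, step):
    """Decoder side: rescale the levels and add the same prediction."""
    return add_blocks(prediction, rescale(levels, step))


current = [[52, 60], [70, 79]]
prediction = [[50, 60], [70, 80]]
levels, reconstructed = encode_block(current, prediction, step=2)
decoded = decode_block(levels, prediction, step=2)
print(levels, reconstructed, decoded)
```

Because quantization is lossy, the reconstructed block differs slightly from the original block, but it is identical on the encoder and decoder sides, which is why the encoder predicts from reconstructed samples rather than from the original ones.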
2.5.3 Process of Decoding
The decoder receives a compressed bitstream from the NAL and entropy decodes the data elements to produce quantized coefficients. These coefficients are then rescaled and inverse transformed to yield a difference sub-macroblock. Using the decoding information retrieved from the bitstream, the decoder creates a prediction sub-macroblock P which is identical to the original prediction sub-macroblock in the encoder. P is added to the difference sub-macroblock to produce a decoded sub-macroblock. The flow chart of the decoding process is illustrated in Figure 2.3.
Figure 2.2 Flow chart of H.264/AVC encoding process.
Figure 2.3 Flow chart of H.264/AVC decoding process.
2.5.4 Tree Structured Motion Compensation
Motion compensation is the process of finding the best prediction block in inter mode. In most earlier video coding standards, the processing unit of motion compensation is a whole macroblock or a small, fixed set of block sizes. H.264 introduces a more flexible feature: tree structured motion compensation. The basic concept is that a macroblock can be divided into sub-macroblocks, each of which is motion compensated individually. Macroblocks can be partitioned in different modes for different video contents.
Each 16×16 macroblock may be partitioned and motion compensated in one of the following ways: one 16×16 macroblock partition, two 16×8 partitions, two 8×16 partitions, or four 8×8 partitions, as illustrated in Figure 2.4. If the 8×8 mode is chosen, each of the four 8×8 sub-macroblocks in the macroblock may be further partitioned in four ways: one 8×8 sub-macroblock partition, two 8×4 sub-macroblock partitions, two 4×8 sub-macroblock partitions, or four 4×4 sub-macroblock partitions, as illustrated in Figure 2.5. This method of partitioning macroblocks into motion compensated sub-macroblocks of varying sizes gives rise to a large number of possible combinations in each macroblock.
Each sub-macroblock requires a separate motion vector. Choosing a large partition size (16×16, 16×8, 8×16) means that a small number of bits are needed to transmit the motion vector(s) and the partition mode(s), but the motion compensated residual may be large in detailed frame areas. Choosing a small partition size (8×4, 4×4, etc.) may result in a small residual but requires a larger number of bits to transmit the motion vectors and the partition modes. In general, a large partition size is appropriate for smooth frame areas while a small partition size is suitable for detailed areas, as illustrated in Figure 2.6.
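The trade-off between residual size and side-information bits described above can be sketched as a cost comparison between candidate partition modes. The sketch assumes a Lagrangian cost of the form J = D + λR, where D is the distortion, R is the number of bits, and λ is a weighting factor; the candidate values of D and R below are made-up numbers chosen only to show the behavior, and the function names are hypothetical.

```python
def rd_cost(distortion, rate_bits, lam):
    """Assumed Lagrangian cost J = D + lambda * R."""
    return distortion + lam * rate_bits


def choose_partition(candidates, lam):
    """Pick the partition mode with the lowest cost.

    candidates maps a mode name to a (distortion, rate_bits) pair.
    """
    return min(candidates, key=lambda name: rd_cost(*candidates[name], lam))


# Made-up numbers: in a smooth area, small partitions barely reduce the
# residual but cost many more bits for motion vectors and modes.
smooth_area = {"16x16": (100, 10), "8x8": (90, 40), "4x4": (85, 160)}
# In a detailed area, small partitions reduce the residual substantially.
detailed_area = {"16x16": (900, 10), "8x8": (300, 40), "4x4": (120, 160)}

smooth_choice = choose_partition(smooth_area, lam=1.0)
detailed_choice = choose_partition(detailed_area, lam=1.0)
print(smooth_choice, detailed_choice)
```

With these numbers the large partition wins in the smooth area and the small partition wins in the detailed area, which matches the general tendency stated in the text.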
Figure 2.4 Macroblock partitions: 16×16, 8×16, 16×8, and 8×8.
Figure 2.5 Sub-macroblock partitions: 8×8, 4×8, 8×4, and 4×4.
Figure 2.6 An example of tree structured motion compensation.
2.5.5 Intra Prediction Modes
Within an intra macroblock of an H.264 video, a 4×4 sub-macroblock is the unit of processing in intra prediction. Thirteen pixel samples are taken from previously encoded and reconstructed neighboring blocks. As illustrated in Figure 2.7, A, B, C, and D are from the upper neighboring block; E, F, G, and H are from the upper-right neighboring block; I, J, K, and L are from the left neighboring block; and M is from the upper-left neighboring block. There are nine prediction modes that use the thirteen pixel samples to form the prediction block of the 4×4 block currently being processed. The nine modes of intra prediction are illustrated in Figure 2.8. The prediction block is subtracted from the current block to create the residual block. The encoder selects the intra prediction mode that yields the lowest coding cost.
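As an illustration of how the neighboring samples form a prediction block, the following sketch implements three of the nine modes (vertical, horizontal, and DC) and selects among them. Only samples A to D and I to L are used by these three modes; the remaining six directional modes are omitted. Using the SAD of the residual as the coding cost is an assumption made for brevity, since a real encoder may also account for the bits needed to signal the mode.

```python
def predict_vertical(above, left):
    """Copy the upper samples A..D down each column."""
    return [list(above) for _ in range(4)]


def predict_horizontal(above, left):
    """Copy the left samples I..L across each row."""
    return [[left[r]] * 4 for r in range(4)]


def predict_dc(above, left):
    """Fill the block with the rounded mean of A..D and I..L."""
    dc = (sum(above) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


def sad(block_a, block_b):
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def choose_intra_mode(block, above, left):
    """Return the mode with the lowest SAD (assumed cost) and all costs."""
    modes = {
        "vertical": predict_vertical,
        "horizontal": predict_horizontal,
        "dc": predict_dc,
    }
    costs = {name: sad(block, fn(above, left)) for name, fn in modes.items()}
    return min(costs, key=costs.get), costs


above = [10, 20, 30, 40]   # samples A, B, C, D
left = [10, 10, 10, 10]    # samples I, J, K, L
block = [[10, 20, 30, 40] for _ in range(4)]  # vertical stripes

best_mode, costs = choose_intra_mode(block, above, left)
print(best_mode, costs)
```

For a block with vertical stripes that continue the upper neighbors, the vertical mode gives a zero residual and is therefore selected.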
Figure 2.7 The prediction block and the thirteen samples.
Figure 2.8 The nine modes of intra prediction.
Chapter 3
Searches of Video Contents for Scene Surveillance by Novel Uses of H.264 Coding Features
3.1 Introduction
Since a surveillance video system usually monitors a space for a long period, it may record lots of suspicious people or activities. If someone wants to check whether a surveillance video contains illegal activities, it will often take him/her a very long time to search the whole video for the specific activities or involved people. In this chapter, we describe the proposed method of video-content search by a novel use of H.264 coding features to avoid tedious search on recorded videos. With such a video-content search method, it will become much easier and faster to check any suspicious activity or people in recorded videos.
In Section 3.1.1, some definitions related to the video-content search problem are described, and the proposed idea and system configuration are given in Section 3.1.2.
In Section 3.2, a motion detection algorithm based on the proposed idea is introduced.
In Section 3.3, the method used to embed the motion region information is described. In Section 3.4, the process of extracting the motion region information is presented. Some experimental results are shown in Section 3.5. In Section 3.6, the last section of this chapter, some discussions and a summary are given.
3.1.1 Problem Definition
In the video-content search problem dealt with in this study, the activities recorded in an input video are detected, and information about them is embedded back into the video for later search. Two issues are involved in this problem. The first is how to detect motion regions correctly in a video taken by a real-time surveillance system with a stationary camera. Motion detection is a very popular research topic in video analysis and can be implemented in many different ways, as described in Chapter 2. The second issue is how to embed information about the detected motion regions into an H.264 compressed bitstream during an encoding process and how to extract it during a decoding process.
3.1.2 Proposed Idea
In the proposed method, each frame captured from a stationary camera is encoded into a compressed bitstream in the H.264 encoding process. During the encoding process, a novel motion detection technique is used to detect suspicious activities in the currently-processed frame. Once the motion regions are detected, the location information of the motion regions is embedded into the quantized frequency domain of the compressed H.264 bitstream. Therefore, if someone wants to know whether a specific region of a video contains suspicious activities or not, the data extraction process in the proposed method can be utilized to search the video contents and output the video clips that the user is interested in.
3.2 Detection of Motion Regions by H.264/AVC Coding Features
In the proposed system, we introduce a novel motion detection technique by use of H.264 coding features. In Section 3.2.1, the idea of the proposed technique is stated.
In Section 3.2.2, the detailed process of the proposed motion detection technique is described.
3.2.1 Proposed Idea of Motion Detection Technique
As mentioned in Chapter 2, in an encoding process of a P or B slice of a compressed video stream, an H.264 encoder needs to find the best partition mode of the currently processed macroblock. Each 16×16 macroblock may be partitioned and motion compensated in one of the following ways: one 16×16 macroblock partition, two 16×8 partitions, two 8×16 partitions, or four 8×8 partitions, as illustrated in Figure 3.1(a). If the 8×8 mode is chosen, each of the four 8×8 sub-macroblocks in the macroblock may be further partitioned in four ways: one 8×8 sub-macroblock partition, two 8×4 sub-macroblock partitions, two 4×8 sub-macroblock partitions, or four 4×4 sub-macroblock partitions, as illustrated in Figure 3.1(b).
For the same macroblock, different partition modes produce different coding costs in the compressed video stream. If the encoder does not choose the best partition mode for the current macroblock, more bits will be included in the stream. Therefore, the encoder chooses the partition mode according to the video content of the currently-processed macroblock to obtain the lowest coding cost. Generally speaking, the partition modes with large partition sizes (16×16, 16×8, 8×16) are suitable for smooth areas, while the modes with small partition sizes (8×8, 8×4, 4×8, 4×4) are appropriate for detailed areas.
Video contents of motion regions usually change greatly both in the time domain and in the spatial domain. In the time domain, the movements in the motion regions result in a lot of information of changes, which needs to be described. In the spatial domain, the motion regions may contain some moving objects which might be humans, cars, etc. These moving objects might contain lots of details that need to be encoded.
Changes in the time domain generally cause the encoder to choose small partition sizes in the motion regions in order to reduce the coding cost calculated in the motion compensation process. Therefore, the partition modes of the motion regions mostly have small partition sizes. Moreover, changes in the spatial domain make the partition modes of the macroblocks within the motion regions variable, because the video contents of these macroblocks are quite different from each other.
Based on the clues mentioned above, we choose to use small partition sizes and variable partition modes as features of motion regions, and use them and motion vectors to detect motion regions in surveillance videos.
In addition, after motion regions are detected by these features, some noise caused by lights and shadows might be included in the detected motion regions and appear on the fringes of the regions. We call macroblocks containing such noise in the detected motion regions noise macroblocks. Partition modes of these noise macroblocks usually include large partition sizes; therefore, we also use this characteristic to eliminate the noise. An example of noise macroblocks is illustrated in Figure 3.2.
3.2.2 Process of Detection of Motion Regions
The proposed motion detection method is applied to frames composed of P slices only. During the encoding process of an input frame, the length of every motion vector of a sub-macroblock of the input frame is compared with a pre-defined threshold in order to filter out motionless sub-macroblocks. The remaining sub-macroblocks are called motion blocks.
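The filtering of motionless sub-macroblocks by motion vector length can be sketched as follows. The sketch assumes that the length is the Euclidean norm of the vector and that a sub-macroblock is kept when its length is strictly greater than the threshold; the text does not specify either detail, and the threshold value, the data layout, and the function name are hypothetical.

```python
import math


def find_motion_blocks(motion_vectors, threshold):
    """Keep sub-macroblocks whose motion vector length exceeds the threshold.

    motion_vectors maps a (row, col) sub-macroblock position to its (dx, dy)
    motion vector. Euclidean length and a strict comparison are assumptions.
    """
    return [
        position
        for position, (dx, dy) in motion_vectors.items()
        if math.hypot(dx, dy) > threshold
    ]


motion_vectors = {
    (0, 0): (0, 0),    # stationary background
    (0, 1): (3, 4),    # length 5
    (1, 0): (1, 0),    # length 1, treated as motionless
    (1, 1): (-6, 8),   # length 10
}
motion_blocks = find_motion_blocks(motion_vectors, threshold=2.0)
print(motion_blocks)
```

The positions that survive the filter correspond to the motion blocks that are passed on to the subsequent region growing step.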
We obtain candidate motion regions by applying a region growing algorithm to the motion blocks. The basic concept of the region growing algorithm is to check the eight neighboring 16×16 macroblocks M1 through M8 around each 16×16 macroblock where the motion blocks are located. If any of M1 through M8, say Mi, contains