Thesis Organization - A Study on Information Hiding via WebM Videos and Its Applications

Chapter 1 Introduction

1.5 Thesis Organization

In the remainder of this thesis, a detailed review of related works on motion detection, video data hiding, video authentication, and privacy protection in surveillance videos, as well as of the WebM standard, is given in Chapter 2. In Chapter 3, the proposed method for data hiding via WebM videos for covert communication is described. In Chapter 4, the proposed video authentication system for surveillance videos is described. In Chapter 5, the proposed method for privacy protection in surveillance videos is presented. Finally, conclusions and some suggestions for future work are given in Chapter 6.

Chapter 2 Review of Related Works and WebM Standard

In this chapter, we give a survey of related works about data hiding, motion detection, video authentication, and privacy protection in videos in Sections 2.1 through 2.4, respectively. Then, we give a review of the standard of the WebM video in Section 2.5.

2.1 Review of Techniques for Data Hiding via Videos

Many data hiding techniques have been developed in the past decade for hiding secret data in various media and documents. In this way, secret data can be transmitted covertly or kept securely for various applications. Because the capacity for hiding data in videos is usually larger than that in images or documents, many data hiding techniques via videos have been proposed [1-4]. Hu et al. [1] proposed a method for hiding data in H.264/AVC videos based on the use of intra-prediction modes. The basic idea is to modify 4×4 intra-prediction modes based on a mapping between 4×4 intra-modes and hidden bits. Their method uses only intra-coded macroblocks to hide data. Hussein [2] proposed a method for embedding data in motion vectors based on their associated prediction error. Yang and Bourbakis [3] proposed a method for embedding data in the DCT coefficients by means of vector quantization. Kapotas et al. [4] proposed a method for embedding data into encoded video sequences, in which the partition size is modulated to hide the secret data. This method can only be used for embedding information in inter-coded macroblocks.

2.2 Review of Techniques for Motion Detection

A lot of motion detection techniques have been proposed to detect moving objects in videos [5-9]. The techniques can be classified into two categories. One is for use in the pixel domain [5-6] and the other in the compressed domain [7-9].

Generally speaking, the approaches used in the pixel domain have to fully decode a compressed video bitstream first, but they can be employed for videos coded according to different video coding standards. On the other hand, the approaches used in the compressed domain can perform motion detection by only partially decoding a compressed video bitstream, but each of them can only be employed for videos coded according to a specific standard, such as H.264/AVC or WebM.

Specifically, Haritaoglu et al. [5] proposed a motion detection method based on background subtraction in the pixel domain. They built a statistical model for a background scene that allows them to detect moving objects even when the background scene is not completely stationary. Lipton et al. [6] proposed another approach based on temporal differencing in the pixel domain, which computes pixel-wise differences between consecutive video frames separated by a constant time to find moving objects. Zeng et al. [7] proposed another approach in the compressed domain by employing a block-based Markov random field (MRF) model in a field formed with motion vectors to segment moving objects during a decoding process.

Babu et al. [8] proposed an automatic video object segmentation algorithm for MPEG videos. They first estimated the number of independently moving objects in the scene using a block-based affine clustering method. Object segmentation is then accomplished by an expectation maximization (EM) clustering algorithm. Spyridon et al. [9] proposed a method for automatic direct detection of moving objects in the H.264 compressed domain. Different blocks/sub-blocks are combined with their associated motion vectors in order to denote a moving object. Their method works in the compressed domain because the block sizes and the motion vectors can be found by partially decoding the H.264 bitstream.
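As an illustration of the pixel-domain temporal differencing idea reviewed above, the following is a minimal sketch and not the implementation of Lipton et al. [6]. The function name, the list-of-rows frame representation, and the threshold value of 25 are assumptions made here for illustration only.

```python
def temporal_difference_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose intensity changed by more than `threshold`.

    Frames are lists of rows of 8-bit luma values. The threshold of 25
    is an illustrative assumption, not a value taken from the literature.
    """
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# Two tiny 2x3 frames separated by a constant time interval.
prev_frame = [[10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 200, 10], [10, 10, 90]]
mask = temporal_difference_mask(prev_frame, curr_frame)
```

Pixels marked with 1 in the mask are candidates for moving objects; a practical detector would normally follow this step with noise filtering and region grouping.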

2.3 Review of Techniques for Video Authentication

Video authentication plays an important role in a digital-rights-management system, so many different methods have been proposed to solve the problem [10-12].

Zhang and Ho [10] introduced a video authentication method which makes accurate use of the tree-structured motion compensation, motion estimation, and Lagrangian optimization of the H.264 standard. As mentioned in the paper, authentication information is embedded according to a best-mode decision strategy, in the sense that if a video undergoes any spatial or temporal attack, the scheme can detect the tampering through the sensitive mode change. Pröfrock et al. [11] proposed a method using skipped macroblocks of an H.264 video to embed authentication data. The data are embedded as a fragile, blind, and erasable watermark with low video quality degradation. In contrast with other authentication methods, the embedding process is done after the H.264 compression process, while the others embed during it. The methods mentioned above usually use additional authentication information to authenticate videos. Ait Saadi et al. [12] proposed a method that uses content-based digital signatures from the transform domain as fragile watermarks and embeds them in motion vectors with the best partition mode in tree-structured motion compensation.

2.4 Review of Techniques for Privacy Protection in Videos

Privacy protection has become an important issue with the spread of video surveillance systems. Many different approaches have been introduced in recent years [13-16].

Dufaux et al. [13] introduced a method to protect personal privacy by scrambling regions containing personal information. As a consequence, the scene remains visible, but the privacy-sensitive information is not identifiable. Meuel et al. [14] introduced a method to protect faces in surveillance videos. Any visible information of faces in a video is deleted and embedded in the video in a way that allows later reconstruction of the faces if needed. Zhang et al. [15] proposed a method to protect authorized persons, who are not only removed from a surveillance video but also embedded into the video. Yu et al. [16] proposed another method which protects individuals’ privacy by controlling the disclosure of individuals’ private visual information. A set of visual abstraction operators, such as silhouette and transparency, is applied to gradually control the disclosure of individuals’ private visual information.

2.5 Review of WebM Standard

In this study, all the proposed information hiding, video authentication, and privacy protection techniques employ WebM videos as carriers for hiding information.

The WebM project, founded by Google Inc., documents the details of the WebM standard, which can be found at the WebM project website [17]. We give a brief review of the WebM standard in this section. In Section 2.5.1, the structure of the WebM standard is described. In Sections 2.5.2 and 2.5.3, the encoding and decoding processes in the WebM standard are described, respectively. In Sections 2.5.4, 2.5.5, and 2.5.6, related WebM features are described.

2.5.1 Structure of WebM standard

WebM is an open media file format designed for the web, which was opened to the public by Google Inc. in May 2010. Each WebM file consists of video streams compressed with the VP8 video codec and audio streams compressed with the Vorbis audio codec. The WebM file structure is based on the Matroska media container. All of them are offered under royalty-free patent licenses, so developers can build on them or conduct research on them without concern about patent suits.

The VP8 video codec works exclusively with an 8-bit YUV 4:2:0 image format. Each 8-bit chroma pixel in the two chroma planes (U and V) corresponds to a 2×2 block of 8-bit luma pixels in the luma plane (Y), and the coordinates of the upper-left corner of the Y block are exactly twice the coordinates of the corresponding chroma pixel. The pixels are simply a large array of bytes stored in rows from top to bottom, each row being stored from left to right. This “left to right” then “top to bottom” raster-scan order is reflected in the layout of the compressed data.
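The 4:2:0 layout described above can be made concrete with a short sketch. The function names are hypothetical, and the restriction to even frame dimensions is a simplifying assumption made here, not a requirement stated in the text.

```python
def yuv420_plane_sizes(width, height):
    """Return (y_bytes, u_bytes, v_bytes) for an 8-bit YUV 4:2:0 frame."""
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even width and height")
    y_bytes = width * height
    chroma_bytes = (width // 2) * (height // 2)  # one chroma pixel per 2x2 luma block
    return y_bytes, chroma_bytes, chroma_bytes


def luma_block_origin(cx, cy):
    """Upper-left luma coordinate of the 2x2 block covered by chroma pixel (cx, cy)."""
    return 2 * cx, 2 * cy


def raster_offset(x, y, stride):
    """Byte offset of pixel (x, y) in a plane stored row by row, left to right."""
    return y * stride + x


sizes = yuv420_plane_sizes(640, 480)
origin = luma_block_origin(3, 5)
offset = raster_offset(2, 1, 640)
```

For a 640×480 frame, each chroma plane holds one quarter as many bytes as the luma plane, so the whole frame occupies 1.5 bytes per pixel.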

Also, each frame is decomposed into an array of macroblocks. A macroblock is a square array of pixels whose Y dimensions are 16×16 and whose U and V dimensions are 8×8. The macroblock-level data in a compressed frame are also processed in a raster-scan order. The macroblocks are further decomposed into 4×4 subblocks, so every macroblock has sixteen Y subblocks, four U subblocks, and four V subblocks.
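The macroblock decomposition can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names; rounding the macroblock grid up for frame sizes that are not multiples of 16 is an assumption made here.

```python
def macroblock_grid(width, height):
    """Number of macroblock columns and rows covering a frame.

    Rounding up for dimensions that are not multiples of 16 is an
    assumption of this sketch.
    """
    return (width + 15) // 16, (height + 15) // 16


def subblock_counts():
    """4x4 subblocks per macroblock: 16x16 luma, 8x8 for each chroma plane."""
    return {"Y": (16 // 4) ** 2, "U": (8 // 4) ** 2, "V": (8 // 4) ** 2}


def macroblock_index(mb_col, mb_row, cols):
    """Position of a macroblock in raster-scan order."""
    return mb_row * cols + mb_col


grid = macroblock_grid(640, 480)
counts = subblock_counts()
index = macroblock_index(2, 1, grid[0])
```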

Like other video codecs, the VP8 video codec also has a transform process which converts pixels in the spatial domain into coefficients in the frequency domain. In the VP8 video codec, the discrete cosine transform (DCT) and the Walsh-Hadamard transform (WHT) always operate at the 4×4 resolution. The DCT is used for the sixteen Y, four U, and four V subblocks. The WHT is used to encode a 4×4 array comprising the average intensities of the sixteen Y subblocks of a macroblock. These average intensities are, up to a constant normalization factor, nothing more than the zeroth DCT coefficients of the Y subblocks. The VP8 video codec considers this 4×4 array as a second-order subblock called Y2.
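To illustrate how the second-order subblock Y2 is formed and transformed, the following sketch collects sixteen DC coefficients into a 4×4 array and applies a textbook, unnormalized Walsh-Hadamard transform. It is not the bit-exact integer transform of the VP8 video codec, and the function names are hypothetical.

```python
# Sequency-ordered 4x4 Hadamard matrix (textbook form, unnormalized).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul(a, b):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def wht4x4(block):
    """Unnormalized 2-D Walsh-Hadamard transform: H4 * block * H4^T."""
    return matmul(matmul(H4, block), transpose(H4))


def collect_y2(dc_coeffs):
    """Arrange sixteen DC coefficients (raster order of the Y subblocks) as 4x4."""
    return [dc_coeffs[r * 4:(r + 1) * 4] for r in range(4)]


# A macroblock whose sixteen Y subblocks all have the same DC value.
y2 = collect_y2([50] * 16)
coeffs = wht4x4(y2)
```

When all sixteen DC values are equal, the transform concentrates all the energy in a single coefficient, which is the reason this second-order step helps compression in smooth regions.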

There are two frame types in the VP8 video codec, namely, the intra-frame and the inter-frame. Intra-frames (also called key frames or I-frames) are decoded without reference to any other frame in a sequence. Key frames provide random access points in a video stream. Inter-frames (also called prediction frames or P-frames) are encoded with reference to prior frames, specifically all prior frames up to and including the most recent key frame. The VP8 video codec uses three types of reference frames for prediction frames: the prior frame, the golden reference frame, and the alternate reference frame. The golden reference frame and the alternate reference frame, which are features of the VP8 video codec, are described further in Section 2.5.5.
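The role of key frames as random access points can be sketched as follows. The sketch assumes, for simplicity, that every inter-frame depends on the whole chain of frames back to the most recent key frame; the function names and the 'I'/'P' labels are hypothetical.

```python
def nearest_key_frame(frame_types, target):
    """Index of the last key frame ('I') at or before `target`."""
    for i in range(target, -1, -1):
        if frame_types[i] == "I":
            return i
    raise ValueError("no key frame at or before the target frame")


def frames_to_decode(frame_types, target):
    """Frames that must be decoded to display `target`.

    Assumes each inter-frame ('P') depends on all frames back to the most
    recent key frame, which is a simplification made for this sketch.
    """
    start = nearest_key_frame(frame_types, target)
    return list(range(start, target + 1))


frame_types = ["I", "P", "P", "I", "P", "P"]
needed = frames_to_decode(frame_types, 5)
```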

2.5.2 Process of Encoding

The process of encoding WebM videos is illustrated in Figure 2.1. There are two data flow paths, forward and reconstruction. In the forward path, a macroblock is encoded in the intra-mode or the inter-mode. In the intra-mode, the encoder calculates the best intra-prediction mode, which uses already encoded blocks of the current frame as references. In the inter-mode, the encoder calculates the best inter-prediction mode from the last frame or the golden reference frame. After deciding the prediction mode, the encoder generates prediction blocks/buffers. In the intra-mode, the encoder subtracts 128 from each pixel which needs to be encoded. In the inter-mode, the encoder subtracts values of pixels of the current block from those of corresponding pixels of a block which is selected by the motion vector. Both the intra-mode and the inter-mode produce a residual block.

Also, each 16×16 macroblock is divided into sixteen 4×4 DCT blocks, each of which is transformed by a bit-exact DCT approximation. The DC coefficients of these DCT blocks are then collected into another group, and the DC coefficients in the original blocks are set to zero. The Walsh-Hadamard transform is applied to this group in order to increase the compression rate. After that, the transformed coefficients of these blocks are quantized. Then, each resulting block is scanned in a zig-zag order and entropy encoded. Here, entropy coding is the process of taking all information from all the other processes (DCT coefficients, prediction modes, motion vectors, and so forth) and compressing them losslessly into the final output file.
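The quantization and zig-zag scanning steps described above can be illustrated with a minimal sketch. The uniform quantizer with a single step size is a simplifying assumption, and the scan table is the 4×4 zig-zag order as we understand it from the VP8 bitstream guide (RFC 6386); it should be checked against the specification before being relied upon. The function names are hypothetical.

```python
# 4x4 zig-zag order: entry k gives the raster index of the k-th scanned
# coefficient. Believed to match RFC 6386; verify against the specification.
ZIGZAG = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def quantize(coeffs, step):
    """Uniform quantization that truncates toward zero (a simplification)."""
    return [int(c / step) for c in coeffs]


def zigzag_scan(block16):
    """Reorder sixteen raster-order coefficients into zig-zag order."""
    return [block16[i] for i in ZIGZAG]


# Transformed 4x4 block in raster order, with energy in the low frequencies.
block = [80, 24, 0, 0,
         16, 8, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0]
levels = quantize(block, 8)
scanned = zigzag_scan(levels)
```

After scanning, the nonzero levels are grouped at the front of the sequence and the trailing run of zeros can be coded cheaply by the entropy coder.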

In the reconstruction path, the encoder decodes (reconstructs) each block in a macroblock, which is regarded as a reference for further prediction. The quantized coefficients are scaled and inverse-transformed to produce a difference block, and then the prediction is added to the difference block to produce a reconstructed block. Finally, a loop filter is used to reduce the effects of blocking distortion, and the reconstructed reference picture is created from a series of blocks.
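The reconstruction path can be sketched as scaling, inverse transformation, and addition of the prediction. In the sketch below, the inverse transform is passed in as a function, and an identity function is used as a stand-in in the example so that the arithmetic is easy to follow; a real codec would use the inverse DCT/WHT there. Clamping to the 8-bit range and all names are assumptions of this sketch.

```python
def dequantize(levels, step):
    """Scale quantized levels back to coefficient magnitudes."""
    return [lv * step for lv in levels]


def clamp_pixel(value):
    """Limit a value to the 8-bit pixel range (an assumption of this sketch)."""
    return max(0, min(255, value))


def reconstruct_block(prediction, levels, step, inverse_transform):
    """Prediction plus the inverse-transformed, dequantized difference block."""
    difference = inverse_transform(dequantize(levels, step))
    return [clamp_pixel(p + d) for p, d in zip(prediction, difference)]


def identity(coeffs):
    """Stand-in for the inverse transform, used only to keep the example simple."""
    return coeffs


prediction = [120, 130, 250, 5]
levels = [1, -2, 3, -4]
recon = reconstruct_block(prediction, levels, 4, identity)
```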

Figure 2.1 Flow diagram of WebM encoding process.

2.5.3 Process of Decoding

The decoder receives a compressed bitstream. First, the frame header (the beginning of the first data partition) is decoded. Then, the macroblock data follow in raster-scan order. These data come in two parts. The first part is the prediction mode information, which comes in the remainder of the first data partition. The other part comprises the data partition(s) for the DCT/WHT coefficients of the residue signal. Figure 2.2 shows the top-level hierarchy of the WebM video bitstream. For each macroblock, the prediction data must be processed before the residue. Each macroblock is predicted using one (and only one) of four possible frames, namely, the current frame, the immediately previous reconstructed frame, the most recent golden reference frame, and the most recent alternate reference frame.

Regardless of the prediction method, the residue DCT signal is decoded, dequantized, reverse-transformed, and added to the prediction buffer to produce the reconstructed value of the macroblock, which is stored in the correct position of the current frame buffer. After all the macroblocks have been generated (predicted and corrected with the DCT/WHT residue), a filtering step is applied to the entire frame.

The purpose of the loop filter is to reduce blocking artifacts at the boundaries between macroblocks and between subblocks of the macroblocks. Figure 2.3 shows the flow diagram of the WebM decoding process.

[Figure omitted. Labels: Frame Header information; Uncompressed data chunk; Compressed data chunk; Per-macroblock information; Prediction information; Residue Signals (Coefficients information); 16×16 Y; 8×8 U; 8×8 V.]

Figure 2.2 Top-level hierarchy of WebM video bitstream.

[Figure omitted. Labels: Encoded Frame; Entropy Decode; Dequantize; Reverse-transform; Prediction Buffer; 16×16 Macroblock; Sixteen 4×4 Subblock; Reconstructed; Loop Filter; Output Frame.]

Figure 2.3 Flow diagram of WebM decoding process.

2.5.4 Region of Interest maps

Region of interest (ROI) maps allow an application to assign each macroblock in a frame of a WebM video to a region, and then to set custom parameters, such as quantization levels and filtering parameters, for that region. The VP8 video codec uses segment-based adjustments to support changing the quantizer level and the loop filter level for a macroblock. It supports up to four different maps in each frame. Each macroblock has its own map index, and these indexes are also encoded into the bitstream by tree coding. Figure 2.4 shows an example of ROI maps, where each block is a unit of a map and different colors denote different maps in the frame.
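The idea of assigning each macroblock a map index and looking up per-map parameters can be sketched as follows. The data layout, the function name, and the parameter values are hypothetical and do not reproduce the API of any encoder library.

```python
MAX_SEGMENTS = 4  # up to four maps per frame, as described above


def apply_roi_map(segment_map, quantizer_by_segment, filter_by_segment):
    """Resolve (quantizer level, loop filter level) for every macroblock.

    `segment_map` holds one map index per macroblock in raster order,
    given as a list of rows. All names and values are illustrative.
    """
    params = []
    for row in segment_map:
        out_row = []
        for seg in row:
            if not 0 <= seg < MAX_SEGMENTS:
                raise ValueError("map index must be in the range 0..3")
            out_row.append((quantizer_by_segment[seg], filter_by_segment[seg]))
        params.append(out_row)
    return params


# A frame of 2x3 macroblocks using all four maps.
segment_map = [[0, 0, 1],
               [2, 3, 1]]
quantizers = [40, 20, 30, 10]    # hypothetical quantizer level per map
loop_filters = [32, 16, 24, 8]   # hypothetical loop filter level per map
params = apply_roi_map(segment_map, quantizers, loop_filters)
```

An application could, for example, give a lower quantizer level to the map covering a region of interest so that the region is coded with higher quality than the rest of the frame.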

Figure 2.4 An example of ROI maps of a frame.

2.5.5 Reference Frames

The VP8 video codec uses three types of reference frames for inter prediction: the prior frame, a golden reference frame, and an alternate reference frame. Overall, this design has a much smaller memory footprint on both the encoder and the decoder than designs with many more reference frames. More details of the golden reference frame and the alternate reference frame are given below.

(A) Golden Reference Frame 

The VP8 video codec was designed to use one reference frame buffer to store a video frame from an arbitrary point in the past. This buffer is known as the golden reference frame. The VP8 encoder can use the golden reference frame in many ways to improve coding efficiency. One use is to maintain a copy of the background image when there are objects moving in the foreground; by using the golden reference frame, the region uncovered when a foreground object moves away can be easily and cheaply reconstructed. Another example is the encoding of back-and-forth cuts between two scenes, where the golden reference frame buffer can be used to maintain a copy of the second scene. Finally, the golden reference frame can also be used for error recovery in a real-time video conference, or even for scalability in a multi-party video conference. Figure 2.5 shows an example of using the golden reference frame. In Figure 2.5, Frame 0 is a key frame and also a golden reference frame. Frame 1 through Frame 4 each build a predictor using the prior frame. Frame 5 uses only Frame 0 as a reference. If any of Frame 1 through Frame 4 are lost, the VP8 video codec can still decode Frame 5 because it refers only to Frame 0.
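The error-recovery example of Figure 2.5 can be reproduced with a small dependency check. The sketch below assumes that each frame has exactly one reference, given as the index of an earlier frame; the function name and data layout are hypothetical.

```python
def decodable_frames(references, lost):
    """Return the set of frames that can still be decoded.

    `references` maps a frame index to the index of its reference frame
    (None for a key frame). A frame is decodable if it was received and
    its reference is decodable. References must point to earlier frames.
    """
    ok = set()
    for idx in sorted(references):
        if idx in lost:
            continue
        ref = references[idx]
        if ref is None or ref in ok:
            ok.add(idx)
    return ok


# Frame 0 is a key frame and the golden reference frame; Frames 1 to 4
# predict from the prior frame; Frame 5 predicts from Frame 0 only.
references = {0: None, 1: 0, 2: 1, 3: 2, 4: 3, 5: 0}
survivors = decodable_frames(references, lost={2})
```

Losing Frame 2 breaks the chain for Frames 3 and 4, while Frame 5 remains decodable because its only reference is the golden reference frame.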

[Figure omitted. Labels: Frame 0 through Frame 5.]

Figure 2.5 An example of the use of the golden reference frame.

(B) Alternate Reference Frame 

The VP8 alternate reference frame differs considerably from other types of reference frames used in video compression. While reference frames are usually displayed to the user by the decoder, the VP8 alternate reference frame is decoded normally but may or may not be shown by the decoder. Because the alternate reference frames have the option of not being displayed, the VP8 encoder can use them to transmit any data that are helpful to compression. The flexibility in the VP8 specification allows many types of usage of the alternate reference frame for improving coding efficiency. For example, the VP8 video codec lacks B-frames, which has led to discussions in the research community about its ability to achieve high compression efficiency. The VP8 video codec therefore uses the golden reference frame and the alternate reference frames together to compensate for this.

2.5.6 VP8 Intra Prediction and Inter Prediction

To encode a video frame, a block-based video codec, such as the VP8 video codec, first decomposes the frame into smaller segments called macroblocks. For each macroblock in the VP8 video codec, the encoder predicts redundant motion and color information based on previously processed macroblocks. The redundant information can be subtracted from the macroblock before transformation, resulting in more efficient compression. The VP8 encoder uses two prediction types: intra prediction and inter prediction. Intra prediction uses data from already encoded macroblocks within the current frame, so it does not reference any previously encoded frames. Inter prediction uses data from previously encoded frames. The residual signal data are then encoded using other techniques, such as transform coding.

(A) VP8 Intra Prediction Modes 

The VP8 video codec uses three types of blocks in intra prediction: 4×4 luma, 16×16 luma, and 8×8 chroma. Five intra prediction modes are shared by these blocks. The first is H_PRED (horizontal prediction), which fills each column of the block with a copy of the left column. The second is V_PRED (vertical prediction), which fills each row of the block with a copy of the row above. The third is DC_PRED (DC prediction), which fills the block with a single value using the average of the pixels in the row above, A, and the column to the left, L (see Figure 2.6). The fourth is B_PRED, which divides a macroblock into sixteen blocks, each block having its own prediction mode. The last is TM_PRED (TrueMotion prediction), which is a compression prediction technique developed by On2 Technologies. We illustrate more details about TrueMotion prediction below.
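The three simplest modes can be written down directly from the descriptions above. This is a minimal sketch for square blocks, with hypothetical function names; the rounding used in the DC average is an assumption of the sketch.

```python
def h_pred(left):
    """H_PRED: every column of the block is a copy of the left column L."""
    size = len(left)
    return [[left[r]] * size for r in range(size)]


def v_pred(above):
    """V_PRED: every row of the block is a copy of the row above, A."""
    size = len(above)
    return [list(above) for _ in range(size)]


def dc_pred(above, left):
    """DC_PRED: fill the block with the average of the pixels in A and L.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    size = len(above)
    dc = (sum(above) + sum(left) + size) // (2 * size)
    return [[dc] * size for _ in range(size)]


A = [10, 20, 30, 40]   # reconstructed row above the block
L = [50, 60, 70, 80]   # reconstructed column to the left of the block
horizontal = h_pred(L)
vertical = v_pred(A)
dc = dc_pred(A, L)
```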

In addition to the row A and the column L, TrueMotion prediction uses the pixel C above and to the left of the block. Horizontal differences between pixels in A (starting from C) are propagated using the pixels from L to start each row. As mentioned above, the TM_PRED mode is unique to the VP8 video codec. Figure 2.6 uses an example 4×4 block of pixels to illustrate how the TM_PRED mode works, where C, A_x, and L_x (x = 0, 1, 2, 3) represent reconstructed pixel values from previously encoded blocks, and X_00 through X_33 represent predicted values for the current block. The TM_PRED mode uses the following equation to calculate X_ij:

$$X_{ij} = L_i + A_j - C, \qquad i, j = 0, 1, 2, 3.$$

Figure 2.6 An example of 4×4 block of pixels.

Although the above example uses a 4×4 block, the TM_PRED mode for 8×8 and 16×16 blocks works in the same way. The TM_PRED prediction mode is one of the more frequently used intra prediction modes in the VP8 video codec. Generally speaking, together with other intra prediction modes, the TM_PRED prediction mode helps the VP8 video codec to achieve very good compression efficiency, especially for key frames, which can only use intra modes.
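As a worked illustration of the TM_PRED rule, in which each predicted pixel is X_ij = L_i + A_j - C, the following sketch computes the prediction for a block of any square size. Clamping the result to the 8-bit range is an assumption added here, and the function names are hypothetical.

```python
def clamp_pixel(value):
    """Limit a value to the 8-bit pixel range (an assumption of this sketch)."""
    return max(0, min(255, value))


def tm_pred(above, left, corner):
    """TrueMotion prediction: X[i][j] = L[i] + A[j] - C, clamped to 0..255.

    `above` is the row A, `left` is the column L, and `corner` is the
    pixel C above and to the left of the block.
    """
    size = len(above)
    return [[clamp_pixel(left[i] + above[j] - corner) for j in range(size)]
            for i in range(size)]


C = 100
A = [102, 104, 106, 108]
L = [98, 96, 94, 92]
X = tm_pred(A, L, C)
```

In this example the first predicted row is 100, 102, 104, 106: the horizontal differences found in A are carried over unchanged, and the row simply starts from a value set by L_0. The same function applies to 8×8 and 16×16 blocks because the block size is taken from the length of A.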
