Chapter 1 Introduction
1.5 Thesis Organization
In the remainder of this thesis, related works on privacy protection in video surveillance, data hiding in images, and 3D steganography via KINECT images are reviewed in Chapter 2. The proposed methods for protecting privacy-sensitive regions and privacy-sensitive motion activities in surveillance videos are described in Chapters 3 and 4, respectively. The proposed method for 3D steganography via KINECT images is introduced in Chapter 5. Finally, conclusions and some suggestions for future work are presented in Chapter 6.
Chapter 2
Review of Related Works
2.1 Review of Techniques for Privacy Protection in Video Surveillance Applications
As global security concerns escalate, important video surveillance solutions have been proposed for applications such as national security, law enforcement, and public transportation. These systems not only monitor people in various environments, but also unintentionally expose private personal information. For this reason, privacy protection has become indispensable in video surveillance. In this section, we review techniques proposed for privacy protection in video surveillance.
Paruchuri et al. [2] gave a survey of data hiding techniques for protecting privacy-sensitive contents in surveillance videos. To protect the privacy of selected individuals in videos, the usual way is to erase, blur, or re-render the image parts or frames containing the individuals. Such modifications, however, destroy the authenticity of the original content of the surveillance video concerned. They therefore proposed a rate-distortion-based video data hiding algorithm for storing the privacy-sensitive information in the compression domain. The algorithm embeds the privacy-sensitive information in optimal locations that minimize the perceptual distortion and bandwidth expansion caused by data embedding. Both reversible and irreversible embedding techniques were considered within the proposed framework, and extensive experiments were performed to demonstrate the
effectiveness of the techniques.
Elaine et al. [3] used a face recognition technique to automatically identify known people, for example by matching against a database of driver-license photos. Moreover, they tracked people regardless of suspicion, while guaranteeing that face recognition software would not reliably recognize de-identified faces, even though many facial characteristics were preserved. Also, the system obliterated relevant information, for example, object tracks or suspicious activities, from videos.
Dufaux et al. [4] introduced a method to protect personal privacy by scrambling image regions containing personal information. As a result, the scene remained visible, but the privacy-sensitive information was no longer identifiable.
2.2 Review of Information Hiding Techniques
With the advance of computer technology, information hiding has become an indispensable part of our lives. In this field, videos, pictures, and digital audio are furnished with distinguishing but imperceptible marks, which may contain a hidden patent or other information, or even help to prevent modification of the media themselves.
Many data hiding methods have been proposed for still images. They can be classified into two main categories: (1) spatial-domain and (2) frequency-domain. In a spatial-domain method, the secret message is usually embedded by using LSB, statistical, feature, or block-based techniques. These techniques work with the pixel values directly, and images are generally manipulated by altering one or more bits of each byte of the image. On the other hand, in a frequency-domain method, a secret message is hidden in the coefficients of the image in the transform domain.
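As an illustration of the spatial-domain category, the following minimal sketch shows plain LSB replacement on a list of 8-bit pixel values. The function names and the sample data are hypothetical and chosen only for this example; it is not the method of any work cited in this chapter.

```python
def embed_lsb(pixels, bits):
    """Return a copy of `pixels` whose least significant bits carry `bits`."""
    if len(bits) > len(pixels):
        raise ValueError("message is longer than the cover capacity")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit of the pixel, then set it to the message bit.
        stego[i] = (stego[i] & ~1) | bit
    return stego


def extract_lsb(pixels, n_bits):
    """Read back the first `n_bits` least significant bits."""
    return [p & 1 for p in pixels[:n_bits]]


cover = [120, 121, 200, 37, 54, 255, 0, 18]   # hypothetical 8-bit pixel values
message = [1, 0, 1, 1, 0, 0, 1, 0]            # hypothetical secret bits
stego = embed_lsb(cover, message)
recovered = extract_lsb(stego, len(message))
```

Each pixel changes by at most one gray level, which is why the embedding is hard to perceive, but the original lowest bits are overwritten, so this basic form is not reversible.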
Generally speaking, spatial-domain methods are sensitive to attacks such as image compression, which cause quality degradation and content distortion, whereas frequency-domain methods are more robust. Data hiding in the frequency domain is therefore less prone to attacks, but spatial-domain modification is often preferred because it can hide more data.
For example, Ni et al. [5] proposed a reversible data hiding algorithm for embedding data in the spatial domain by using the zero or the minimum point of the histogram and slightly modifying the pixel values. Also, Xuan et al. [6] proposed an approach to hiding data in the middle bit-planes of the integer wavelet transform coefficients in the middle and high sub-bands of the frequency domain.
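The histogram-based idea can be sketched as follows. This is a simplified illustration, not the full algorithm of [5]: it assumes that a truly empty histogram bin exists to the right of the peak, that the receiver knows the peak, the zero point, and the message length, and it ignores the overhead bookkeeping needed when only a minimum (non-empty) bin is available. All function names are hypothetical.

```python
from collections import Counter


def find_peak_and_zero(pixels, levels=256):
    """Pick the most frequent gray level and the first empty bin to its right."""
    hist = Counter(pixels)
    peak = max(range(levels), key=lambda v: hist[v])
    zeros = [v for v in range(peak + 1, levels) if hist[v] == 0]
    if not zeros:
        raise ValueError("no empty bin to the right of the peak")
    return peak, zeros[0]


def hs_embed(pixels, bits, peak, zero):
    """Shift the bins strictly between peak and zero by +1, then hide bits at the peak."""
    if len(bits) > sum(1 for p in pixels if p == peak):
        raise ValueError("message exceeds the capacity of the peak bin")
    it = iter(bits)
    stego = []
    for p in pixels:
        if peak < p < zero:
            stego.append(p + 1)            # make room next to the peak
        elif p == peak:
            stego.append(p + next(it, 0))  # bit 1 -> peak + 1, bit 0 -> peak
        else:
            stego.append(p)
    return stego


def hs_extract(stego, n_bits, peak, zero):
    """Recover the hidden bits and restore the original pixel values."""
    bits, restored = [], []
    for p in stego:
        if p == peak:
            bits.append(0)
            restored.append(peak)
        elif p == peak + 1:
            bits.append(1)
            restored.append(peak)
        elif peak + 1 < p <= zero:
            restored.append(p - 1)         # undo the shift
        else:
            restored.append(p)
    return bits[:n_bits], restored


cover = [3, 3, 3, 4, 5, 5, 3, 7, 2, 3]     # hypothetical tiny "image"
message = [1, 0, 1, 1]
peak, zero = find_peak_and_zero(cover)
stego = hs_embed(cover, message, peak, zero)
bits, restored = hs_extract(stego, len(message), peak, zero)
```

The embedding capacity equals the number of pixels in the peak bin, and every pixel changes by at most one gray level.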
Besides, the most popular hiding technique may be least-significant-bit (LSB) replacement. This method has been improved, extended, and revisited for years because the traditional LSB replacement method has many weaknesses, such as the distortion it incurs and its inherent fragility. Even reversible LSB replacement methods have been developed, which are very useful for visible watermarking and covert communication.
For example, Celik et al. [7] proposed a method for data embedding using lossless generalized LSB replacement. This method modifies the lowest levels, instead of the bit plane, of raw pixel values, and can recover the original image by compressing and transmitting the modified level values.
2.3 Review of Techniques for Image Steganography
Steganography is the art of writing hidden messages in such a way that no one, apart from the sender and intended recipient, suspects the existence of the message. It
may be regarded as a form of security through obscurity.
Generally, messages will appear to be something else: images, articles, shopping lists, or some other cover texts. The advantage of steganography over cryptography alone is that messages do not attract attention to themselves. Plainly visible encrypted messages, no matter how unbreakable, will arouse suspicion and may in themselves be incriminating in countries where encryption is illegal. Therefore, whereas cryptography protects the contents of a message, steganography can be said to protect both messages and communicating parties.
Lee et al. [8] proposed a watermarking method which embeds an autostereogram into a cover image by the discrete cosine transform (DCT). This improves not only the low bit capacity but also the watermark efficiency because the random dots are distributed in the image space.
In [9], Wang and Chen presented an image steganography method that utilizes a two-way block-matching procedure to search for the most similar block for each block of the secret image. The indexes of these secret blocks are obtained in the block-matching procedure and recorded in the least significant bits of the cover image. Recently, Tsuda et al. [10] proposed a modified secure, high-capacity steganography scheme for hiding a large-size secret image in a small-size cover image. Their results show that the modified scheme is highly secure and has good perceptual invisibility.
2.4 Previous Studies on Applications of KINECT Devices
The KINECT device incorporates several types of advanced sensing hardware. Most notably, it contains a depth sensor, a color camera, and a four-microphone array, providing a full-body 3D motion capture device with face and voice recognition capabilities. The KINECT device is shown in Figures 2.1 and 2.2.
In more detail, the depth sensor in the KINECT device emits an IR pattern and captures the returned signals as an IR image with a traditional CMOS camera fitted with an IR-pass filter.
Figure 2.1 The outer appearance of a KINECT sensor.
Figure 2.2 The KINECT device is composed of an infrared (IR) projector, an IR camera, and an RGB camera.
The image processor of the KINECT device uses the relative positions of the dots in the pattern to calculate the depth displacement at each pixel position in the image. It is noted that the actual depth values are the distances from the camera-laser plane rather than from the sensor itself. As such, the depth sensor can be seen as a device that returns the coordinates of 3D objects.
The hardware specifications of the KINECT device are described in many documents. The main ones are shown in Table 2.1.
Table 2.1 Main hardware specifications of the KINECT device.
Property                                        Value
Angular field of view                           57° horz., 43° vert.
Frame rate                                      approx. 30 Hz
Nominal spatial range                           640 x 480 (VGA)
Nominal spatial resolution (at 2 m distance)    3 mm
Nominal depth range                             0.8 m - 3.5 m
Nominal depth resolution (at 2 m distance)      1 cm
Device connection type                          USB (+ external power)
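The figures in Table 2.1 can be related to one another through an ideal pinhole model. The sketch below is only an approximation under stated assumptions: it ignores lens distortion, places the principal point at the image center, and uses hypothetical function names. It derives a focal length in pixels from the field of view and image size, and then back-projects a depth pixel to 3D coordinates.

```python
import math


def focal_length_px(size_px, fov_deg):
    """Focal length in pixels of an ideal pinhole camera with the given field of view."""
    return (size_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


def back_project(u, v, z, fx, fy, cx, cy):
    """Map pixel (u, v) with depth z (meters) to camera coordinates (x, y, z)."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z


# Values taken from Table 2.1; the principal point at the image center is an assumption.
fx = focal_length_px(640, 57.0)
fy = focal_length_px(480, 43.0)
cx, cy = 320.0, 240.0

# Width covered by a single pixel at a distance of 2 m, in millimeters.
pixel_width_mm = 2.0 / fx * 1000.0

center_point = back_project(320, 240, 2.0, fx, fy, cx, cy)
```

Under these assumptions one pixel covers roughly 3.4 mm at a distance of 2 m, which is of the same order as the nominal spatial resolution of 3 mm listed in the table.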
Three software frameworks are available for software development using the KINECT device: the Microsoft SDK [11], OpenNI [12], and OpenKinect [13]. The Microsoft SDK is available only for Windows 7, whereas the other two frameworks are multi-platform and open-source.
With the development of depth camera technology, it is now feasible to acquire high-quality color and depth images synchronously in real time using the KINECT device. Zhao et al. [16] compared the performances of different ways of
extracting interest points, and showed that the best performance could only be achieved by extracting interest points solely from the RGB channels, and then computing RGB-based descriptors and depth map-based descriptors together with those interest points.
In [17], Sung et al. proposed a method of using the KINECT device to extract human motions. The method was based on a hierarchical maximum-entropy Markov model (MEMM). From the RGBD images acquired with a Kinect sensor, the method extracts features and feeds them as input to a learning algorithm to train a two-layered Markov model which can capture different properties of human activities, including the correspondence between sub-activities and human skeletal features. The method considers a person’s activity as composed of a set of sub-activities, and infers a two-layered graph structure from it by using a dynamic programming approach, as illustrated in Figure 2.3.
Figure 2.3 An example of extracting human activities from the RGBD data acquired with the Kinect sensor conducted by Sung et al. [17].
Finally, Wang et al. [18] proposed a new feature descriptor, Pyramid Depth Self-Similarities (PDSS), for depth images. It was based on the idea that the depth information of people has high local self-similarities. Pedestrian detection was restricted to single images and involved three key aspects: feature, classifier, and detection strategy. To yield a better performance, it was suggested to
look for better features, as people present different somatotypes and appearances, as illustrated in Figure 2.4.
Figure 2.4 An example of experimental results of feature extraction conducted by Wang et al. [18].
2.5 Review of Techniques for Motion Detection
Many motion detection techniques have been proposed to detect moving objects in videos [19-24], and some of them are reviewed in this section.
The main purpose of motion detection is to identify motion areas in a video, or to segment moving objects out of a video. Many content-based applications, such as smart signal processing and computer vision analysis, have been developed. Their common first step is to detect moving objects, followed by identifying the properties of the objects for various applications.
Related techniques include human-face detection and recognition, motion-object tracking, content-based video retrieval, etc. Existing motion detection techniques can
be classified into two categories: one for use in the pixel domain [25-26] and the other in the compressed domain [27-29]. Generally speaking, the approaches used in the pixel domain have to fully decode a compressed video bitstream first, but they can be employed for videos coded according to different video coding standards. On the other hand, the approaches used in the compressed domain can perform motion detection by only partially decoding a compressed video bitstream, but they can only be employed for videos coded according to specific standards.
Lipton et al. [19] proposed an approach based on temporal differencing in the pixel domain, where temporal differencing means taking the pixel-wise value differences between consecutive video frames. The basic idea of this approach is to compare video frames that are separated by a constant time in order to find moving objects. Haritaoglu et al. [20] proposed a motion detection method based on background subtraction in the pixel domain. Both methods built a statistical model for a background scene, and used the model to detect moving objects even though the background scene was not completely static.
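The two pixel-domain ideas can be contrasted with a minimal sketch. It is not the algorithm of [19] or [20]: the background model used here is a simple running average, swapped in for the statistical models of those works, and the threshold, the learning rate, and all function names are hypothetical choices for illustration.

```python
def difference_mask(reference, frame, threshold):
    """Mark pixels whose absolute difference from the reference exceeds the threshold."""
    return [
        [1 if abs(f - r) > threshold else 0 for r, f in zip(ref_row, frame_row)]
        for ref_row, frame_row in zip(reference, frame)
    ]


def update_background(background, frame, alpha):
    """Running-average background model: blend the new frame into the background."""
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(bg_row, frame_row)]
        for bg_row, frame_row in zip(background, frame)
    ]


# Two hypothetical 3x3 grayscale frames; an object appears at the center.
frame0 = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
frame1 = [[10, 10, 10], [10, 90, 10], [10, 10, 10]]

# Temporal differencing: compare consecutive frames.
motion_mask = difference_mask(frame0, frame1, threshold=20)

# Background subtraction: compare the current frame with a background model.
background = [[float(v) for v in row] for row in frame0]
foreground_mask = difference_mask(background, frame1, threshold=20)
background = update_background(background, frame1, alpha=0.1)
```

Temporal differencing only responds while an object keeps moving, whereas background subtraction continues to flag an object that has stopped until the model absorbs it.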
Chapter 3
Protection of Privacy-sensitive Regions in Surveillance Videos Acquired by KINECT Device
3.1 Introduction
With the increasing public concern about personal privacy protection issues, it is desired to develop privacy protection methods for use in video surveillance systems.
Besides, with the rapid development of stereo-vision technology, 3D imaging devices have become more and more popular. Many types of such devices have been invented to acquire 3D multimedia information under various conditions. One famous example is the KINECT device manufactured by Microsoft.
Accordingly, in video surveillance applications, information such as the presence of intruding persons and their heights and thicknesses can easily be obtained from 3D image data. Because of these characteristics and the usefulness of 3D image data, the issue of protecting 3D image data in various applications becomes more and more important.
In this chapter, we describe the proposed method for privacy protection of selected privacy-sensitive regions in image frames of surveillance videos taken with the KINECT device.
In Section 3.1.1, the related problem definitions are given. In Section 3.1.2, the related ideas are reviewed and the basic idea of the proposed method is described. The
principle behind the proposed method is based on the concept of merging of processed color and depth images to display 3D information, which we describe in Section 3.2. Detailed algorithms for privacy-sensitive region concealment and recovery based on the principle are presented in Section 3.3. Finally, some experimental results showing the feasibility of the proposed method are given in Section 3.4.
3.1.1 Problem Definition
As security surveillance is extensively applied in our living environment, there is a growing concern that the systems pose threats to personal privacy. Since the feeling of privacy is highly subjective and varying across cultures and individuals, the method of privacy protection should be adapted as much as possible to suit individual requirements. With regard to this demand, in the proposed method we allow an authorized user to specify a privacy-sensitive region R in a surveillance video in advance. The image content in R is defined as a privacy-sensitive image part which is not to be revealed to unauthorized people. The goal of the proposed method is to disguise the pre-selected privacy-sensitive image part as a corresponding background image part to conceal privacy-sensitive information in the image frames of the surveillance video. In addition, it is hoped that the protected image frames can be restored to include the original privacy-sensitive image part if a secret key is given as input.
Using traditional data hiding techniques to hide the privacy-sensitive image part may achieve this goal, but such techniques are usually time-consuming and demand large spaces for data embedding. Therefore, we instead design a general method for concealing the privacy-sensitive image part imperceptibly and recovering the original content of this image part from the resulting protected image losslessly. In
addition, even if a person knows the algorithms implementing the method, he/she still cannot retrieve the privacy-sensitive image part without the secret key. The security of the protected privacy in the surveillance video is thus ensured.
On the other hand, the KINECT sensor can acquire depth information, so its data can be used in a wider range of applications than 2D data alone.
This kind of information is very important, so its correctness must be guaranteed. It is noted that the range of the depth values provided by the KINECT device differs from that of ordinary color-image pixel values. In addition, a color image is taken by the KINECT device simultaneously with each depth image. Therefore, the depth image and the color image taken by the KINECT device at an identical instant of time should be protected together to keep their relation in time.
3.1.2 Review of Ideas of a Previous Study
Camouflaging is a method of hiding a secret. It allows a secret to be disguised by an object and so remain unnoticed. The major idea of the proposed method was inspired by the concepts of camouflaging and privacy protection in surveillance videos proposed by Lin and Tsai [25], as well as by the characteristics of KINECT images. The proposed method aims to produce a protected image by disguising a privacy-sensitive image part as a pre-selected background image part in a surveillance video, and to protect as a whole the color and depth images taken by a KINECT device at an identical instant. The proposed method utilizes the features of KINECT images and the range of pixel values in the depth image to achieve the goal of protecting the color and depth images together.
Specifically, the proposed method produces a camouflage image by using a prediction-based mapping, which is a deterministic one-to-one (reversible) compound mapping function proposed originally by Liu and Tsai [1]. Following the function
proposed originally by Liu and Tsai [1], Lin and Tsai [25] integrated a new prediction technique into the prediction-based mapping to make the resulting camouflage image closely resemble the background image in appearance. The technique they proposed can estimate more effectively pixel values for use in the prediction-based mapping.
Specifically, it uses the pixel-value similarity among adjacent pixels and employs a simple edge detection technique coming from the JPEG-LS standard [31].
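For reference, the edge-detecting predictor commonly associated with the JPEG-LS standard is the median edge detector (MED). The sketch below shows that standard predictor only; whether the technique in [25] uses exactly this form is an assumption, and the function name is hypothetical.

```python
def med_predict(a, b, c):
    """Median edge detector (MED) prediction for the current pixel.

    a = left neighbor, b = upper neighbor, c = upper-left neighbor.
    """
    if c >= max(a, b):
        return min(a, b)   # an edge is likely: follow the smaller neighbor
    if c <= min(a, b):
        return max(a, b)   # an edge is likely: follow the larger neighbor
    return a + b - c       # smooth area: planar prediction


# Hypothetical neighbor values illustrating the three cases.
edge_low = med_predict(100, 50, 100)
edge_high = med_predict(50, 100, 40)
smooth = med_predict(60, 70, 65)
```

Because the prediction depends only on already-processed neighbors, the same value can be recomputed during recovery, which is what a reversible prediction-based mapping requires.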
3.2 Proposed Techniques of Synchronized Removals of Corresponding Areas in Color and Depth Images
In this section, we introduce the method proposed in this study which synchronously removes corresponding areas in color and depth images for privacy-sensitive image part concealment. The principle behind the proposed method is based on Lin and Tsai [25] as mentioned. In Section 3.2.1, the idea of the proposed method is described. And in Section 3.2.2, the detailed algorithms implementing the method are described.
3.2.1 Idea of Proposed Method
In the proposed method, at first a scheme is proposed to generate a new 3D image from the color and depth images acquired with a KINECT device. It is known that the KINECT device has a color camera and a laser sensor which are not aligned, leading to a displacement between the coordinates of the color image and those of the depth image. Therefore, we must correct this displacement to merge the two images to
produce a single 3D image. We conduct this correction based on the use of a pinhole camera model.
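A correction of this kind can be sketched with a generic pinhole model as follows. This is only an illustrative assumption of how such a registration works, not the procedure used in this study: the intrinsic parameters, the rotation R, the translation t, and the function name are all hypothetical values chosen for the example.

```python
def depth_pixel_to_color_pixel(u, v, z, depth_intr, color_intr, R, t):
    """Map a depth pixel (u, v) with depth z (meters) to color-image coordinates.

    depth_intr and color_intr are (fx, fy, cx, cy) tuples; R is a 3x3 rotation and
    t a 3-vector giving the pose of the depth camera relative to the color camera.
    """
    fx_d, fy_d, cx_d, cy_d = depth_intr
    fx_c, fy_c, cx_c, cy_c = color_intr

    # 1. Back-project the depth pixel to a 3D point in the depth-camera frame.
    p = ((u - cx_d) * z / fx_d, (v - cy_d) * z / fy_d, z)

    # 2. Transform the point into the color-camera frame.
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

    # 3. Project the point onto the color image plane.
    return fx_c * q[0] / q[2] + cx_c, fy_c * q[1] / q[2] + cy_c


# Hypothetical calibration: identical intrinsics, no rotation, 2.5 cm horizontal offset.
intr = (580.0, 580.0, 320.0, 240.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.025, 0.0, 0.0]

far_u, far_v = depth_pixel_to_color_pixel(320, 240, 2.0, intr, intr, R, t)
near_u, near_v = depth_pixel_to_color_pixel(320, 240, 1.0, intr, intr, R, t)
```

With these hypothetical values, the same depth pixel lands about 7 pixels to the right in the color image at 2 m and about 14 pixels at 1 m, which shows that the displacement depends on depth and cannot be removed by a constant shift.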
The next major step of the proposed method is to disguise the privacy-sensitive image part as a pre-selected background image part to conceal the privacy-sensitive information in the image frames in the surveillance video. For this, we allow an authorized user to specify a privacy-sensitive region R in a surveillance video in advance, followed by the action of disguising the image content as mentioned above.
The third major step is to synchronously remove the corresponding areas in the color and depth images. The result is finally displayed in a 3D fashion.
More details are described in the following.
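Before the details, the synchronized nature of the removal can be illustrated with a deliberately simplified sketch. It assumes that the color and depth frames are already registered pixel to pixel and that background frames of the same scene are available, and it simply copies background values into the region R. This plain copy is a stand-in and is not reversible, unlike the camouflage mapping of the proposed method; all names and values are hypothetical.

```python
def conceal_region(color, depth, bg_color, bg_depth, region):
    """Replace region R with background content in both images at once.

    region = (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = region
    out_color = [row[:] for row in color]
    out_depth = [row[:] for row in depth]
    for y in range(top, bottom):
        for x in range(left, right):
            # The same coordinates are modified in both images, so the color
            # and depth data stay consistent with each other.
            out_color[y][x] = bg_color[y][x]
            out_depth[y][x] = bg_depth[y][x]
    return out_color, out_depth


# Hypothetical 3x3 frames: a person (color 200, depth 1200 mm) stands at the center.
bg_color = [[50] * 3 for _ in range(3)]
bg_depth = [[3000] * 3 for _ in range(3)]
color = [row[:] for row in bg_color]
depth = [row[:] for row in bg_depth]
color[1][1], depth[1][1] = 200, 1200

safe_color, safe_depth = conceal_region(color, depth, bg_color, bg_depth, (1, 1, 2, 2))
```

If only the color image were modified, the depth image would still reveal the shape of the concealed person, which is why both images are processed over the same region.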
3.2.2 Merging of Processed Color and Depth Images
To merge the depth and color images to produce a 3D image, a calibration of the camera parameters in advance is necessary. In this process, a rotation problem and a displacement problem in the 3D space should be solved at first. A solution to the rotation problem is to correct related parameters before mapping the KINECT images (including color and depth images) to produce an integrated 3D image. For this, some functions and parameters provided by the KINECT device and the Kinect-for-Windows SDK [11] can be utilized. In more detail, we tilt the field of view of each KINECT device to the zero-angle position using the tilt motor in the KINECT device before acquiring KINECT images; and to solve the displacement problem between the color image and the depth image, we use certain functions provided by