Chapter 2 Review of Related Works
2.5 Review of Techniques for Motion Detection
Many motion detection techniques have been proposed to detect moving objects in videos [19-24], and some of them are reviewed in this section.
The main purpose of motion detection is to identify motion areas in a video, or to segment moving objects out of a video. Many content-based applications have been developed in areas such as smart signal processing and computer vision analysis. A common first step in these applications is to detect moving objects, after which the properties of the objects are identified for use in the application.
Related techniques include human-face detection and recognition, motion-object tracking, content-based video retrieval, etc. Existing motion detection techniques can
be classified into two categories: one for use in the pixel domain [25-26] and the other in the compressed domain [27-29]. Generally speaking, the approaches used in the pixel domain have to fully decode a compressed video bitstream first, but they can be employed for videos coded according to different video coding standards. On the other hand, the approaches used in the compressed domain can perform motion detection by only partially decoding a compressed video bitstream, but each of them can be employed only for videos coded according to a specific standard.
Lipton et al. [19] proposed an approach based on temporal differencing in the pixel domain, where temporal differencing means taking the pixel-wise value differences between consecutive video frames. The basic idea of this approach is to compare video frames that are separated by a constant time in order to find moving objects. Haritaoglu et al. [20] proposed a motion detection method based on background subtraction in the pixel domain. Both methods built a statistical model for a background scene, and used the model to detect moving objects even though the background scene was not completely static.
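As a minimal sketch of the temporal differencing idea described above, the following Python function marks the pixels whose values differ between two frames by more than a threshold. The function name, the representation of a grayscale frame as a list of rows, and the threshold value of 25 are illustrative assumptions and are not taken from [19].

```python
def temporal_difference_mask(prev_frame, curr_frame, threshold=25):
    """Return a boolean mask that is True where the pixel value changed
    by more than `threshold` between two equally sized grayscale frames.

    The frame layout (list of rows of integers) and the default
    threshold are assumptions made for this illustration only.
    """
    return [
        [abs(curr - prev) > threshold for prev, curr in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# Two tiny 2x3 frames: only the middle pixel of the first row changes a lot.
previous = [[10, 10, 10], [10, 10, 10]]
current = [[10, 200, 10], [10, 12, 10]]
motion_mask = temporal_difference_mask(previous, current)
```

A real detector would normally post-process such a mask, for example by grouping connected pixels into regions, which this sketch does not attempt.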
Chapter 3
Protection of Privacy-sensitive Regions in Surveillance Videos Acquired by KINECT Device
3.1 Introduction
With the increasing public concern about personal privacy protection issues, it is desired to develop privacy protection methods for use in video surveillance systems.
In addition, with the rapid development of stereo-vision technology, 3D imaging devices have become more and more popular. Many types of such devices have been invented to acquire 3D multimedia information under various conditions. One well-known example is the KINECT device manufactured by Microsoft.
Accordingly, in video surveillance applications, information such as the presence of intruding persons and their heights and thicknesses can be obtained easily from 3D image data. Because 3D image data are useful in this way, the issue of protecting such data in various applications becomes more and more important.
In this chapter, we describe the proposed method for privacy protection of selected privacy-sensitive regions in image frames of surveillance videos taken with the KINECT device.
In Section 3.1.1, the related problem definitions are given. In Section 3.1.2, the related ideas are reviewed and the basic idea of the proposed method is described. The
principle behind the proposed method is based on the concept of merging of processed color and depth images to display 3D information, which we describe in Section 3.2. Detailed algorithms for privacy-sensitive region concealment and recovery based on the principle are presented in Section 3.3. Finally, some experimental results showing the feasibility of the proposed method are given in Section 3.4.
3.1.1 Problem Definition
As security surveillance is extensively applied in our living environment, there is a growing concern that the systems pose threats to personal privacy. Since the feeling of privacy is highly subjective and varying across cultures and individuals, the method of privacy protection should be adapted as much as possible to suit individual requirements. With regard to this demand, in the proposed method we allow an authorized user to specify a privacy-sensitive region R in a surveillance video in advance. The image content in R is defined as a privacy-sensitive image part which is not to be revealed to unauthorized people. The goal of the proposed method is to disguise the pre-selected privacy-sensitive image part as a corresponding background image part to conceal privacy-sensitive information in the image frames of the surveillance video. In addition, it is hoped that the protected image frames can be restored to include the original privacy-sensitive image part if a secret key is given as input.
Using traditional data hiding techniques to hide the privacy-sensitive image part may achieve this goal, but such techniques usually are time-consuming and demand large spaces for data embedding. Therefore, we design alternatively a general method for concealing the privacy-sensitive image part imperceptibly and recovering the original content of this image part from the resulting protected image losslessly. In
addition, even if a person knows the algorithms implementing the method, he/she still cannot retrieve the privacy-sensitive image part without the secret key. The security of the protected privacy in the surveillance video is thus ensured.
On the other hand, the KINECT sensor can acquire depth information, so its data can be used in a wider range of applications than 2D data alone.
This kind of information is very important, so its correctness must be guaranteed. It is noted that the range of the depth values provided by the KINECT device is different from that of ordinary color-image pixel values. Moreover, a color image is taken by the KINECT device simultaneously with each depth image. Therefore, the depth image and the color image taken at an identical instant of time should be protected together to preserve their correspondence in time.
3.1.2 Review of Ideas of a Previous Study
Camouflaging is a method of hiding a secret. It allows a secret to be disguised as another object and so remain unnoticed. The major idea of the proposed method was inspired by the concepts of camouflaging and privacy protection in surveillance videos proposed by Lin and Tsai [25], as well as by the characteristics of KINECT images. The proposed method aims to produce a protected image by disguising a privacy-sensitive image part as a pre-selected background image part in a surveillance video, and to protect as a whole the color and depth images taken by a KINECT device at an identical instant. The proposed method utilizes the features of KINECT images and the range of pixel values in the depth image to achieve the goal of protecting the color and depth images together.
Specifically, the proposed method produces a camouflage image by using a prediction-based mapping, which is a deterministic one-to-one (reversible) compound mapping function proposed originally by Liu and Tsai [1]. Following the function
proposed originally by Liu and Tsai [1], Lin and Tsai [25] integrated a new prediction technique into the prediction-based mapping to make the resulting camouflage image closely resemble the background image in appearance. The technique they proposed can estimate more effectively pixel values for use in the prediction-based mapping.
Specifically, it uses the pixel-value similarity among adjacent pixels and employs a simple edge detection technique coming from the JPEG-LS standard [31].
3.2 Proposed Techniques of Synchronized Removals of Corresponding Areas in Color and Depth Images
In this section, we introduce the method proposed in this study which synchronously removes corresponding areas in color and depth images for privacy-sensitive image part concealment. The principle behind the proposed method is based on Lin and Tsai [25] as mentioned. In Section 3.2.1, the idea of the proposed method is described. And in Section 3.2.2, the detailed algorithms implementing the method are described.
3.2.1 Idea of Proposed Method
In the proposed method, at first a scheme is proposed to generate a new 3D image from the color and depth images acquired with a KINECT device. It is known that the KINECT device has a color camera and a laser sensor which are not aligned, leading to a displacement between the coordinates of the color image and those of the depth image. Therefore, we must correct this displacement to merge the two images to
produce a single 3D image. We conduct this correction based on the use of a pinhole camera model.
The next major step of the proposed method is to disguise the privacy-sensitive image part as a pre-selected background image part to conceal the privacy-sensitive information in the image frames in the surveillance video. For this, we allow an authorized user to specify a privacy-sensitive region R in a surveillance video in advance, followed by the action of disguising the image content as mentioned above.
The third major step is to remove the corresponding areas in the color and depth images synchronously. The result is finally displayed in a 3D fashion.
More details are described in the following.
3.2.2 Merging of Processed Color and Depth Images
To merge the depth and color images to produce a 3D image, a calibration of the camera parameters in advance is necessary. In this process, a rotation problem and a displacement problem in the 3D space should be solved at first. A solution to the rotation problem is to correct related parameters before mapping the KINECT images (including color and depth images) to produce an integrated 3D image. For this, some functions and parameters provided by the KINECT device and the Kinect-for-Windows SDK [11] can be utilized. In more detail, we tilt the field of view of each KINECT device to the zero-angle position using the tilt motor in the KINECT device before acquiring KINECT images; and to solve the displacement problem between the color image and the depth image, we use certain functions provided by the Kinect-for-Windows SDK [11].
After the above problems are solved, the original color and depth images are aligned in the same image coordinate system. But these 3D image data are just the 3D depth coordinates combined with the 2D image coordinates, so they must be
transformed into a single 3D space coordinate system integrally. For this purpose, we apply the principle of the pinhole camera model to conduct the transformation of the image coordinates into the 3D space coordinates.
Figure 3.1 The pinhole camera model.
As illustrated in Figure 3.1, the pinhole model may be considered to be a simple camera with its center of projection (i.e., its lens center) located at O and its optical axis taken to be the Z-axis of the camera coordinate system. The focus point is on the image plane with a focal length f. A 3D space point P = (X, Y, Z) in the camera coordinate system is projected onto an image point Pa on the image plane at image coordinates (u, v), where the image plane may be that of the depth image or the color image. The depth value d of the space point P is provided by the KINECT device, but we do not have its correct coordinates (X, Y, Z) in the 3D space (i.e., in the camera coordinate system). We can calculate them according to the similar-triangle principle.
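The similar-triangle computation can be sketched in Python as follows. This is only an illustration under stated assumptions: the image coordinates (u, v) are taken relative to the image center, the measured depth value is treated as the distance from the lens center O to the space point P along the projection ray, and the function and variable names are hypothetical. If the device instead reports the Z-coordinate of P, the scale factor would simply be the depth divided by f.

```python
import math


def back_project(u, v, depth, f):
    """Recover camera coordinates (X, Y, Z) of a space point P.

    Assumptions (not confirmed by the text): (u, v) are measured from
    the image center, `depth` is the distance from the lens center O
    to P along the ray through the image point, and `f` is the focal
    length in the same units as u and v.
    """
    # Length of the ray from O to the image point (u, v) on the image plane.
    ray_length = math.sqrt(u * u + v * v + f * f)
    # Similar triangles: P lies on the same ray, scaled by depth / ray_length.
    scale = depth / ray_length
    return (u * scale, v * scale, f * scale)


# A point seen at image offset (3, 4) with focal length 12 and depth 26.
point = back_project(3, 4, 26, 12)
```

In this example the ray to the image point has length 13, so the space point is obtained by doubling the vector (3, 4, 12).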
In more detail, following the direction vector which starts from the center C of the image plane, goes through the lens center O of the camera, and finally reaches a 3D space point G' which is the projection of the space point P on the vector, we can see two similar triangles on the two sides of the lens center along the direction vector. We can calculate the distance d between the image point Pa on the image plane and the lens center O by the following equation:
$d = \sqrt{u^2 + v^2 + f^2}$. (3.1)
Then, according to the principle of similar triangles, we can derive the following equalities: acquired by the KINECT device. The detailed algorithm is described in Algorithm 3.1 below.
Algorithm 3.1: merging of color and depth images.
Input: a depth image D, a color image C.
Output: a 3D image J constructed from the color and depth images.
Steps:
Step. 1 Perform the process of Equations 3.1 through 3.3 to “calibrate” the pixels in color image C and those in depth image D, with a pixel Cp in the resulting C corresponding pixel Cp from the color image C.
Step. 4 Knowing the extrinsic rotation R and translation T between the color and depth camera, express the mapping between color image pixel Cp and depth
image pixel Dp by following equations: R T D
In the above algorithm, the values of R and T are given by following equations:
T as extrinsic rotation R and translation T.
3.3 Proposed Method for Protecting Selected Privacy-sensitive Regions in Surveillance Videos
In this study, we adopt the processes for reversible prediction-based mapping from the previous studies of Liu and Tsai [1] and Lin and Tsai [25]. In Section 3.3.1, we review these processes, discuss the problems encountered in applying the mapping together with their solutions, and describe how the mapping is applied to privacy protection in video surveillance. The complete processes of privacy-sensitive region concealment and recovery, with their detailed algorithms, are presented in Sections 3.3.2 and 3.3.3, respectively.
3.3.1 Review of Application of Reversible Prediction-based Mapping to Privacy Protection in Surveillance Videos
The method for privacy protection in surveillance videos is based on the use of the reversible prediction-based mapping proposed by Liu and Tsai [1], which is a deterministic one-to-one compound mapping of values. The principle of this mapping and its use by Lin and Tsai [25] for protection of pre-selected privacy-sensitive regions are described in this section.
As proposed in Liu and Tsai [1], a one-to-one mapping $F_c$ with the function $F_c(p) = p - c$ is created, where c is a parameter to be determined. As is well known, the values of the pixels adjacent to a pixel P in an image are usually close to the value p of P because of contextual dependency. So, if we compute the average value a of them, a will usually be close to p and can be regarded as a prediction value of p. Therefore, we can take a as the parameter c of the function $F_c$ above so that $F_c(p) = F_a(p) = p - c = p - a$ will usually be small because a is close to p in value. This function value will be called the prediction residue of p and denoted as r subsequently, i.e., $F_a(p) = p - a = r$.
Next, a second one-to-one mapping $F_b^{-1}(r) = r + b$ is performed, which adds r to the target value b, where $F_b^{-1}$ is the inverse of $F_c$ with parameter c = b. The resulting value is denoted as q. The overall two-step mapping results in a compound one-to-one mapping function f with the following effect:
$f(p) = F_b^{-1}(r) = F_b^{-1}(F_a(p)) = r + b = (p - a) + b = q$.
As stated above, $r = p - a$ is usually close to 0, so the value $q = f(p)$ will be close to the target value b, creating an effect of steganography. Therefore, we will call q a stego-value. Also, it is obvious that the smaller the prediction residue value r, the closer the stego-value q is to the target value b.
If it is desired to recover p from q, then the inverse $f^{-1}$ of the compound one-to-one mapping function f can be used and the recovery is lossless, as can be seen from the following derivation:
$f^{-1}(q) = [F_b^{-1} \circ F_a]^{-1}(q) = F_a^{-1}(F_b(q)) = F_a^{-1}(q - b) = F_a^{-1}(p - a + b - b) = F_a^{-1}(p - a) = p - a + a = p$.
Accordingly, to recover p from q, first we retrieve the prediction residue value $r = p - a$ by computing the value of the inverse $F_b$ of the second mapping function $F_b^{-1}$ with the stego-value q as input. This results in $F_b(q) = q - b = p - a + b - b = p - a = r$.
Then, we use the same prediction scheme to compute the prediction value a from the values of the pixels adjacent to pixel P. Finally, we use the inverse $F_a^{-1}$ of the first mapping function $F_a$ to compute the original pixel value p of pixel P by $F_a^{-1}(r) = r + a = p - a + a = p$.
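The two-step mapping and its inverse can be illustrated with a short Python sketch. The function names are hypothetical, and the prediction value a is simply passed in as a number here rather than computed from adjacent pixels.

```python
def simple_conceal(p, a, b):
    """Compound mapping f(p) = (p - a) + b, returning the stego-value q."""
    r = p - a      # prediction residue r = F_a(p)
    return r + b   # stego-value q = F_b^{-1}(r)


def simple_recover(q, a, b):
    """Inverse mapping: recover p from q using the same a and b."""
    r = q - b      # prediction residue r = F_b(q)
    return r + a   # source value p = F_a^{-1}(r)


# Source value 120, prediction 118, target 30: the residue is 2.
q = simple_conceal(120, 118, 30)
p = simple_recover(q, 118, 30)
```

Here q is 32, which is close to the target value 30, and p is recovered as 120. Nothing in this simple form keeps q inside the pixel-value range of 0 through 255; for instance, a source value of 250 with prediction 10 and target 200 gives 440.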
It is noted that in the above recovery process, we have to compute the same prediction value a first, and then use it to recover the original value of the source pixel P. Based on this principle, Lin and Tsai [25] did not use the original value of the pixel P, but used only the values of the pixels adjacent to P, to compute a, so that the identical prediction value can be reproduced during recovery.
In the previous method, converting a source value into a stego-value by the two simple mapping functions $F_a(p) = p - a = r$ and $F_b^{-1}(r) = r + b = q$ causes a problem: the computed stego-value q might fall outside the range of valid pixel values shared by a, b, and p, which is called the wrap-around problem.
To solve this problem, Lin and Tsai [25] adopted another one-to-one mapping function $F_c$ proposed in [1], with c = a or b, such that the compound mapping $q = F_b^{-1}(F_a(p))$ does not exhibit the out-of-range (wrap-around) problem. Based on this new function $F_c$, we describe how the corresponding mappings $F_a(p)$ and $F_b^{-1}(r)$ work by Algorithms 3.2 and 3.3 below, respectively.
Algorithm 3.2: computing the value of a new mapping function Fa(p) which does not cause the wrap-around problem.
Input: a prediction a and a source pixel value p.
Output: the prediction residue r of the mapping function Fa(p) without causing the wrap-around problem.
Steps:
Step 1. Initialize r to be zero.
Step 2. Create a set S with 256 initial elements 0 through 255.
Step 3. Find a value p′ in S such that |a − p′| is the minimum, preferring the smaller p′ in case of ties.
Step 4. If p′ is not equal to p, then remove p′ from S, increment r by one, and go to Step 3; otherwise, take the final r as the output.
Algorithm 3.3: computing the value of the inverse $F_b^{-1}(r)$ of the mapping $F_a(p)$ described by Algorithm 3.2.
Input: a target value b and a prediction residue value r.
Output: a stego-value q of the inverse mapping function $F_b^{-1}(r)$ of $F_a(p)$.
Steps:
Step 1 Create a set S with 256 initial elements 0 through 255.
Step 2 Find a value q in S such that |b – q| is the minimum, preferring a smaller q in case of ties.
Step 3 If r is larger than 0, then remove q from S, decrement r by 1, and go to Step 2; otherwise, take the final q as the output.
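Algorithms 3.2 and 3.3 can be transcribed almost step by step into Python, as in the sketch below. The function names are hypothetical, and the range check on p is an addition made only so that the illustration cannot loop on invalid input.

```python
def residue(a, p):
    """Algorithm 3.2: prediction residue r of p with respect to a."""
    if not 0 <= p <= 255:
        raise ValueError("p must be a pixel value in 0..255")
    r = 0                                   # Step 1
    s = set(range(256))                     # Step 2
    while True:
        # Step 3: closest remaining value to a, smaller value on ties.
        candidate = min(s, key=lambda x: (abs(a - x), x))
        # Step 4: stop when the candidate is p, else discard and count.
        if candidate == p:
            return r
        s.remove(candidate)
        r += 1


def stego_value(b, r):
    """Algorithm 3.3: value q whose residue with respect to b is r."""
    s = set(range(256))                     # Step 1
    while True:
        # Step 2: closest remaining value to b, smaller value on ties.
        q = min(s, key=lambda x: (abs(b - x), x))
        # Step 3: discard candidates until r of them have been skipped.
        if r > 0:
            s.remove(q)
            r -= 1
        else:
            return q


r_example = residue(100, 101)        # order is 100, 99, 101, so r is 2
q_example = stego_value(200, 1)      # order is 200, 199, so q is 199
```

Because both functions enumerate the same 256 values in order of closeness to their first argument, the residue always lies in 0 through 255 and the stego-value is always a valid pixel value.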
Based on the above two algorithms, the ideas for converting the source value p into a stego-value q and recovering p from q losslessly now can be described integrally by Algorithms 3.4 and 3.5, respectively, below.
Algorithm 3.4: converting a source pixel value to a stego-value which is close to a target pixel value.
Input: a source value p, a target value b, and the mapping $F_c$ and its inverse $F_c^{-1}$ described by Algorithms 3.2 and 3.3, respectively, where c is a parameter.
Output: a stego-value q.
Steps:
Step 1. Compute a prediction value a by a prediction technique.
Step 2. Perform Algorithm 3.2 to compute the prediction residue value r = Fa(p).
Step 3. Perform Algorithm 3.3 to compute the output stego-value $q = F_b^{-1}(r)$.
Algorithm 3.5: recovering a source pixel value from a stego-value.
Input: the stego-value q produced by Algorithm 3.4 and the target b used there.
Output: the source value p.
Steps:
Step 1. Compute the prediction value a by the same technique as used in Step 1 of Algorithm 3.4.
Step 2. Regard the input values b and q as a and p, respectively, and take them as inputs to Algorithm 3.2 to compute a prediction residue value r.
Step 3. Regard a obtained in Step 1 as b, and take it and the value r obtained in Step 2 as inputs to Algorithm 3.3 to compute the source value p as the output.
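The concealment and recovery procedures of Algorithms 3.4 and 3.5 can be sketched in Python as follows. As an assumption made for brevity, the mappings of Algorithms 3.2 and 3.3 are expressed here through a sorted list of the 256 pixel values, which gives the same results as removing elements one by one from the set S. All function names are hypothetical, and the prediction value a is passed in directly instead of being computed by a prediction technique.

```python
def closeness_order(c):
    """Values 0..255 ordered by closeness to c, smaller value on ties."""
    return sorted(range(256), key=lambda x: (abs(c - x), x))


def forward(c, p):
    """Residue of p with respect to c (same result as Algorithm 3.2)."""
    return closeness_order(c).index(p)


def inverse(c, r):
    """Value with residue r with respect to c (as in Algorithm 3.3)."""
    return closeness_order(c)[r]


def conceal(p, a, b):
    """Algorithm 3.4, Steps 2 and 3: source value p to stego-value q."""
    r = forward(a, p)
    return inverse(b, r)


def recover(q, a, b):
    """Algorithm 3.5, Steps 2 and 3: stego-value q back to source p."""
    r = forward(b, q)      # b and q play the roles of a and p
    return inverse(a, r)   # a plays the role of b


q = conceal(120, 118, 30)      # residue 4 around 118, so q is 32
p = recover(q, 118, 30)        # recovers the source value 120
```

Unlike the plain arithmetic form p − a + b, the stego-value produced this way always stays within 0 through 255, and the recovery is lossless as long as the same a and b are available.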
The principle of reversible prediction-based mapping described above was applied to accomplish privacy protection in a surveillance video in the previous study. Lin and Tsai [25] regarded the source value p as a pixel value in a privacy-sensitive image part Is to be protected, which appears within a region R, called the privacy-sensitive region, in every surveillance video frame V other than a pre-selected video frame Vo taken as a background image.
Also, Lin and Tsai [25] used a background image part Ib in Vo, which corresponds to Is in position in R but without privacy information, to replace Is using reversible prediction-based mapping described above. Each pixel value in Ib is just a target value b mentioned previously. Then, as illustrated in Figure 3.2, we may use Algorithm 3.4 to find a prediction value a for each pixel in Is by Step 1 of the algorithm, yield a prediction-residue image part Ir consisting of the prediction residue values r in region R by Step 2, and finally generate a camouflage image V′ with stego-values q in region R by Step 3.
[Figure: a source value p in region R of the surveillance image is mapped by $F_a(p) = r$ to a residue value r in the prediction-residue image (block), which is then mapped by $F_b^{-1}(r) = q$, using the target value b from the background image, to the mapped value q in the camouflage image.]
Figure 3.2 Illustration of concealment of a privacy-sensitive image part.