e-Fovea: A Multi-Resolution Approach with Steerable Focus to Large-Scale and High-Resolution Monitoring

Kuan-Wen Chen¹, Chih-Wei Lin¹, Mike Y. Chen¹,², and Yi-Ping Hung¹,²

¹Dept. of Computer Science and Information Engineering, ²Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan

{mikechen, hung}@csie.ntu.edu.tw

ABSTRACT

This paper presents e-Fovea, a system that combines multi-resolution camera input with multi-resolution steerable projector output to support large-scale, high-resolution visual monitoring. e-Fovea follows a design similar to the human eye, which provides peripheral vision together with a steerable fovea of higher resolution. e-Fovea is implemented using a steerable telephoto camera and a wide-angle camera. The telephoto image is displayed using a projector with a steerable mirror and is overlaid on the wide-angle image, which is displayed using a second projector.

We have deployed e-Fovea in two installations to demonstrate its feasibility. We have also conducted two user studies, with a total of 36 participants, to compare e-Fovea to two existing multi-resolution visual monitoring designs. The user study results show that for visual monitoring tasks, our e-Fovea design with steerable focus is significantly faster than existing approaches and preferred by users.

Categories and Subject Descriptors

H5.2 [Information Interfaces and Presentation]: User Interfaces – Evaluation/methodology, User-centered design.

General Terms

Design, Human Factors, Verification.

Keywords

e-Fovea, multi-resolution, steerable focus, user study, visual monitoring, hybrid dual-camera system.

1. INTRODUCTION

The human visual perception system is a multi-resolution mechanism [11][16][24]. Only the fovea region of the human eye affords sharp vision with fine visual detail. In contrast, the peripheral region perceives the world only roughly, with coarse visual detail. When a person is monitoring an environment and an intrusion occurs, the eyes move so that the scene of interest is projected onto the fovea.

Visual surveillance applications that cover medium to large areas have similar needs. For example, when monitoring a traffic intersection, we need to observe where traffic incidents occur in the entire view and, at the same time, need sufficiently high-resolution details to determine what the incident is.

We propose e-Fovea, a system that utilizes the same design concept as the human visual system to support these types of visual monitoring applications. As shown in Figure 1, e-Fovea comprises a multi-resolution input system and a multi-resolution output system. The multi-resolution input system is a hybrid dual-camera system [14][15][26], which is composed of a static wide-angle camera and a steerable telephoto camera (or a pan-tilt-zoom (PTZ) camera).

The multi-resolution output system includes a fixed projector and a second, steerable projector for the fovea region. The fixed projector projects onto a large wall surface, providing peripheral vision at low pixel density. The fovea projector uses a steerable mirror to project onto a small, embedded region at a much higher pixel density [9][21].

Compared to current approaches that use a single high-resolution camera with a high-resolution display, our approach provides better resolution at the fovea region, scales better to larger areas, and costs less. To our knowledge, the largest image captured by a surveillance camera currently on the market comes from an 11-megapixel camera with a resolution of 4000x2656 pixels at just 3 frames per second, and the largest display is 56 inches with a resolution of 3840x2160 pixels. In comparison, our e-Fovea installation has a high fovea pixel density, which effectively provides a resolution of 9600x7200 pixels for the entire display.

Figure 1. Overview of the e-Fovea System.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

MM’10, October 25–29, 2010, Firenze, Italy.

To evaluate the effectiveness of e-Fovea, we have conducted two user studies, each with 18 participants. We compared user performance and preference among three multi-resolution visual monitoring designs in hybrid dual-camera systems: overview plus detail (O+D) [1][18], focus plus context (F+C) [1][2], and steerable F+C [9][21]. The O+D interface displays two camera images in two separate windows. The F+C screen is a multi-resolution display wall with a fixed high-resolution LCD screen embedded in a large low-resolution display. The steerable F+C interface, the output system of e-Fovea, is similar to F+C but uses a steerable projector instead of the fixed LCD screen. From the results, we demonstrate that e-Fovea’s multi-resolution display with steerable fovea is both faster and preferred by users.

In the following sections, we first describe the e-Fovea system design and implementation, including camera calibration, projector calibration, and projector-camera integration. Then, we describe installations of the system in two environments to demonstrate its feasibility for visual monitoring. Finally, we present details of our user evaluation. For simplicity, in the remainder of this paper, the image captured by the wide-angle camera is termed the “overview image,” and the image captured by the steerable telephoto camera is termed the “detail image.”

2. RELATED WORK

In computer vision research, the hybrid dual-camera system is a well-known surveillance setup, and there have been several research efforts [14][15][26] and commercial applications [17] that use such a system for visual monitoring. These designs displayed the two camera images in two separate windows and focused on improving the performance of computer automation and saving manual effort. This kind of display is also called overview plus detail (O+D) visualization [1][18]. The overview typically contains a visual marker that represents the corresponding region of the steerable telephoto camera view. The marker helps users see more easily where the steerable telephoto camera is focusing in the overview image. e-Fovea embeds the detail image directly in the overview image, so users no longer need to switch between two separate windows.

In human-computer interaction research, there have been many multi-resolution display systems. Feiner and Shamash [7] combined heterogeneous display technologies, including a screen and a head-mounted display, with interaction device technologies to produce a hybrid user interface. Baudisch et al. [2] proposed the focus plus context (F+C) screen, which is a multi-resolution display wall with a fixed high-resolution LCD screen embedded in a large low-resolution display. Staadt et al. [21] projected high-resolution images onto different regions of the wall by using a pan-tilt unit with a mirror. Sanneblad et al. [23] and Geisler et al. [8] used a tablet PC for showing high-resolution images. Two hand-held projectors were used and integrated by Cao et al. [4]. Hu et al. [9] proposed the i-m-Top system, which is a tabletop system with a mirror-mounted steerable projector. These works focused on various multi-resolution display interaction techniques and assumed that the source images are of uniform, high resolution. e-Fovea combines a multi-resolution display with multi-resolution video capture, such as a hybrid dual-camera system, for visual monitoring.

Baudisch et al. [1] compared the F+C screen with O+D visualization and found that the F+C screen is more effective. Although they experimented on both static and dynamic scenes, the high-resolution region was in a fixed area, and target tracking, a common surveillance task, was not evaluated.

Hsiao et al. [10] designed a comparative study between steerable and fixed high-resolution displays and found that the steerable focus approach is preferred, especially for situations that require visiting different regions on tabletop displays. However, the two interfaces were manipulated quite differently in the experiments. With the F+C screen, users were asked to drag the region of interest into the fixed high-resolution region. With the steerable focus interface, in contrast, the high-resolution region was automatically moved to where users touched.

To understand how effectively the different multi-resolution designs apply to visual monitoring, we conducted two user studies with surveillance-related tasks to evaluate three interfaces: O+D, F+C, and steerable high-resolution display (steerable F+C). The first study evaluated single moving target tracking and the second evaluated multiple moving target identification. In addition, we unified the manipulation of the three interfaces so that all use a mouse click to identify the area of interest, which is the most common operation in a hybrid dual-camera system [14][15][17][26].

3. DESIGN AND IMPLEMENTATION OF E-FOVEA

As shown in Figure 1, the hardware architecture of the e-Fovea system comprises a multi-resolution input system and a multi-resolution output system. In addition, it includes three main technical components: camera calibration, projector calibration, and projector-camera integration. Camera calibration and projector calibration are both completed offline, during the setup phase. Camera calibration estimates the relationship between the coordinate system of the overview image and the pan/tilt angle of the steerable telephoto camera, and also calculates the geometrical transform matrix used to stitch the detail image and the overview image seamlessly. Projector calibration is used to integrate the images projected from the two different projectors. Projector-camera integration describes the end-to-end video processing and how the steerable camera and the steerable projector are controlled.

3.1 System Architecture

The multi-resolution input system is a hybrid dual-camera system which includes a static wide-angle camera (ACTI Fixed_1311) and a steerable telephoto camera (ACTI SpeedDome_6510). The image resolutions of both cameras are set to 640x480 pixels. To make it possible to spatially align the detail image in the overview image seamlessly, the lens centers of both cameras should be made as close as possible. Fortunately, this installation is usually acceptable in real surveillance applications [14][15][17][26].

Compensating for images that are not taken from the same place is a long-standing challenge in image stitching, and so far it can be handled only when the camera positions lie within a small area or the scene is large in scale. When the lens centers of the two cameras are close together and the monitored scene is far enough away, we can treat the two cameras as almost concentric. For two concentric cameras, the images can be stitched easily by warping with a 3x3 perspective transform matrix, or homography [3][20].
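To make the warping step concrete, the following is a minimal sketch of how a 3x3 homography maps one pixel, including the division by the homogeneous coordinate. The function name and the example matrix are illustrative assumptions, not values from the e-Fovea calibration.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography H given as nested lists."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return xh / w, yh / w


# Hypothetical matrix: shrink by a factor of 5 and shift by (100, 80).
H_example = [[0.2, 0.0, 100.0],
             [0.0, 0.2, 80.0],
             [0.0, 0.0, 1.0]]
```

Warping a whole image applies the same mapping (usually its inverse, with interpolation) to every pixel.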

The multi-resolution output system includes a fixed projector (JVC DLA-SX21) with a wide-angle lens for the large display and a steerable projector [6][9]. The steerable projector [6] is composed of a projector (JVC DLA-SX21) with a telephoto lens (Schneider CINE DIGITAR D-ILA 70MM) and a computer-controlled pan-tilt unit (Directed Perception PTU-46-17) with a mirror (6.7 inches in diameter) mounted on it. The image resolutions of both projectors are set to 1024x768 pixels.

3.2 Camera Calibration

There are two calibration issues between the static wide-angle camera and the steerable telephoto camera. The first is to calibrate the relationship between the coordinate system of the overview image and the pan/tilt angle of the steerable telephoto camera, which makes it possible to turn the view center of the steerable telephoto camera to the point the user clicks in the overview image. The second is that, given a pan/tilt angle, we need to estimate the corresponding homography that warps the detail image so that it aligns spatially and seamlessly with the overview image.

First, we estimate the intrinsic parameters of each camera separately by Zhang’s method [25], and then transform the images to compensate for lens distortion. The second step is to estimate the relationship between the wide-angle camera and the steerable telephoto camera. We set a pan/tilt angle θ₀ and a suitable zoom factor for the steerable telephoto camera so that its view is similar to the overview image, and we call the image it captures the reference image.

Then, the corresponding feature points between the overview image and the reference image are found with the SIFT algorithm [3][13], as shown in Figure 2. With at least four corresponding points, we can calculate the homography $H^{OI}_{RI}$ between the overview image and the reference image. Then, we have

$s\,[x^r \;\; y^r \;\; 1]^T = H^{OI}_{RI}\,[x^o \;\; y^o \;\; 1]^T$, (1)

where s is a scale factor, and $(x^r, y^r)$ and $(x^o, y^o)$ are the coordinates in the reference image and the overview image, respectively.
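As an illustration of how a homography such as the one in (1) can be computed from point correspondences, here is a minimal least-squares sketch that fixes the bottom-right matrix entry to 1 and solves the normal equations. This is an assumed textbook formulation, not the paper's implementation; the function names are hypothetical, and a practical system would typically normalize coordinates and reject SIFT outliers with a robust estimator such as RANSAC.

```python
def solve_linear(M, b):
    """Solve M h = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    h = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * h[c] for c in range(r + 1, n))
        h[r] = (aug[r][n] - s) / aug[r][r]
    return h


def estimate_homography(src, dst):
    """Least-squares homography mapping src points to dst points."""
    if len(src) != len(dst) or len(src) < 4:
        raise ValueError("need at least four correspondences")
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, with h33 fixed to 1.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    # Normal equations (A^T A) h = A^T b.
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(8)]
           for i in range(8)]
    Atb = [sum(row[i] * rhs for row, rhs in zip(A, b)) for i in range(8)]
    h = solve_linear(AtA, Atb)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]
```

Passing overview-image points as `src` and reference-image points as `dst` yields a matrix that plays the role of the overview-to-reference homography in (1).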

Third, the steerable telephoto camera is calibrated on its own in front of a wall, with an additional fixed projector projecting the calibration patterns. As shown in Figure 3, two kinds of pattern images are used. The center of each circle in the circle image serves as a feature point. To identify the corresponding feature points between the reference image with pan/tilt angle θ₀ and the detail image with pan/tilt angle θ, we use a coding technique with Gray code patterns [22]. For θ₀ and each pan/tilt angle θ of the steerable telephoto camera, the circle image and the Gray code pattern images are projected by the fixed projector sequentially. After estimating the corresponding points, we can calculate the homography $H^{DI(\theta)}_{RI}$ between the reference image and each detail image with pan/tilt angle θ. For each θ, we transform the center of the detail image into the coordinate system of the reference image, $(x^r_\theta, y^r_\theta)$, and then estimate the relationship between θ and $(x^r_\theta, y^r_\theta)$.
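The Gray code labeling of feature points can be sketched as follows. The sketch assumes that each of the 256 feature points is assigned an index from 0 to 255 and that 8 binary patterns are projected, one per bit; the paper does not state the number of patterns, so that count and all function names are assumptions.

```python
def gray_encode(index):
    """Convert a binary index to its reflected Gray code."""
    return index ^ (index >> 1)


def gray_decode(code):
    """Recover the binary index from a reflected Gray code."""
    index = 0
    while code:
        index ^= code
        code >>= 1
    return index


def pattern_bits(index, n_bits=8):
    """Bits shown at one feature point over n_bits patterns, MSB first."""
    code = gray_encode(index)
    return [(code >> k) & 1 for k in range(n_bits - 1, -1, -1)]


def decode_bits(bits):
    """Turn the observed on/off sequence at a point back into its index."""
    code = 0
    for bit in bits:
        code = (code << 1) | bit
    return gray_decode(code)
```

Because neighboring indices differ in exactly one bit, a single misread pattern near a boundary shifts the decoded label by at most one position, which is the usual reason for preferring Gray codes over plain binary codes in structured-light calibration.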

Finally, the relationship between $(x^o, y^o)$ and the pan/tilt angle θ of the PTZ camera can be calculated by (1). The homography that warps the detail image with pan/tilt angle θ so that it aligns spatially with the overview image is

$H^{DI(\theta)}_{OI} = (H^{OI}_{RI})^{-1}\, H^{DI(\theta)}_{RI}$. (2)

After camera calibration, once a point in the overview image is selected, the pan/tilt angle θ of the steerable telephoto camera and the corresponding homography $H^{DI(\theta)}_{OI}$ can be generated.
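The composition in (2) chains two mappings: detail image to reference image, then reference image back to overview image. A minimal sketch with plain 3x3 lists follows; the helper names and the example matrices are hypothetical.

```python
def matmul3(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def inverse3(M):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def detail_to_overview(H_oi_to_ri, H_di_to_ri):
    """Compose detail -> reference with the inverse of overview -> reference."""
    return matmul3(inverse3(H_oi_to_ri), H_di_to_ri)


# Hypothetical calibration results for one pan/tilt angle.
H_oi_to_ri = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
H_di_to_ri = [[0.5, 0.0, 10.0], [0.0, 0.5, 20.0], [0.0, 0.0, 1.0]]
H_di_to_oi = detail_to_overview(H_oi_to_ri, H_di_to_ri)
```

The same two helpers cover the other homography products in the paper, since each is a product of one calibrated matrix with another matrix or its inverse.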

3.3 Projector Calibration

The multi-resolution output system is composed of two projectors. To integrate the images projected from the different projectors, we calibrate the geometrical relationship between the two projectors for each pose of the PTU (pan-tilt unit) in advance. We refer to the coordinate system of the projection plane associated with the fixed projector as the fixed projector (FP) image plane, and to the coordinate system of the projection plane associated with the steerable projector as the steerable projector (SP) image plane. For each pose φ of the PTU, we want to obtain the transformation matrix $H^{FP}_{SP(\varphi)}$ from the FP image plane to the SP image plane and use it for image warping.

To calibrate automatically, we use an additional PTZ camera. A diagram of the projector calibration is shown in Figure 4. The PTZ camera increases the measurement resolution so that feature points can be located more accurately. The zoom factor is set manually so that enough feature points remain in the field of view to estimate the homography, i.e., at least four feature points.

Then we can calculate $H^{FP}_{SP(\varphi)}$ by first estimating two homographies, $H^{CI(\rho)}_{SP(\varphi)}$ and $H^{CI(\rho)}_{FP}$, and combining them with the following equation:

$H^{FP}_{SP(\varphi)} = H^{CI(\rho)}_{SP(\varphi)}\,(H^{CI(\rho)}_{FP})^{-1}$, (3)

where $H^{CI(\rho)}_{SP(\varphi)}$ is the homography transforming coordinates from the camera image plane (CI) to the SP image plane, with a PTU angle φ and a pan/tilt angle ρ of the additional PTZ camera, and $H^{CI(\rho)}_{FP}$ is the homography transforming coordinates from CI to the FP image plane.

Figure 2. The corresponding feature points estimated by the SIFT algorithm between (a) the overview image and (b) the reference image.

Figure 3. The calibration pattern images with 256 feature points: (a) circle image, and (b) Gray code pattern images.

To calculate $H^{CI(\rho)}_{FP}$, the calibration process is similar to the steerable telephoto camera calibration. For each pan/tilt angle ρ of the PTZ camera, the 256-circle image and the Gray code pattern images are projected by the fixed projector sequentially. Then, the homography matrix $H^{CI(\rho)}_{FP}$ for each pan/tilt angle ρ can be calculated.

For each pose φ of the PTU, the steerable projector projects the 5-circle image (Figure 5). Then, the PTZ camera turns automatically to a suitable angle ρ at which the camera view covers the whole area of the five circles projected by the steerable projector. Similarly, we capture an image and find the center positions of the circles. These centers form a set of point correspondences between the SP image plane and CI, from which the homography matrix $H^{CI(\rho)}_{SP(\varphi)}$ is calculated.

After estimating $H^{CI(\rho)}_{SP(\varphi)}$ and $H^{CI(\rho)}_{FP}$, we can calculate the transformation matrix $H^{FP}_{SP(\varphi)}$ for each pose φ of the PTU by (3). The images projected by the different projectors can then be integrated seamlessly by a warping process.

3.4 Projector-Camera Integration

After calibrating both multi-resolution input and output systems separately, the last problem is how to combine these two systems.

To integrate the input and output systems, we use the fixed projector to project the overview image and the steerable projector to project the detail image, as shown in Figure 1. For seamless display, the detail image should be pre-warped before being projected. Denote the original detail image and the warped detail image as $I_{ori}$ and $I_{warped}$, respectively. Because the overview image is displayed full screen by the fixed projector, the coordinate system of the FP image plane can be considered the same as that of the overview image. With the pre-calibrated geometrical relationships $H^{DI(\theta)}_{OI}$ and $H^{FP}_{SP(\varphi)}$, $I_{warped}$ is calculated with the following equation:

$I_{warped} = H^{FP}_{SP(\varphi)}\, H^{DI(\theta)}_{OI}\, I_{ori}$, (4)

where $H^{DI(\theta)}_{OI}$ and $H^{FP}_{SP(\varphi)}$ are the transformation matrices corresponding to the pan/tilt angle θ of the steerable telephoto camera and the pose φ of the PTU for the current detail image, respectively.

When a user clicks in the view of the static camera, θ can be obtained directly from the results of camera calibration, but which pose φ of the PTU is suitable remains to be determined. Because the coordinate system of the overview image is the same as that of the FP image plane, we know the coordinate selected by the user in the FP image plane. Therefore, from the results of projector calibration, we can estimate the pose φ of the PTU automatically by choosing the pose whose projection area center is closest to the point the user selected in the FP image plane.
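The pose selection described above amounts to a nearest-neighbor lookup over the calibrated PTU poses. A minimal sketch follows, assuming the projection center of each calibrated pose has been stored in FP image plane coordinates; the dictionary layout, pose values, and function name are hypothetical.

```python
import math


def choose_ptu_pose(click_xy, pose_centers):
    """Return the calibrated PTU pose whose projection center is nearest the click.

    pose_centers maps a pose (pan, tilt) to the center (x, y) of the steerable
    projector's projection area in the FP image plane.
    """
    if not pose_centers:
        raise ValueError("no calibrated poses available")
    cx, cy = click_xy
    return min(
        pose_centers,
        key=lambda pose: math.hypot(pose_centers[pose][0] - cx,
                                    pose_centers[pose][1] - cy),
    )


# Hypothetical calibration table with three poses.
calibrated = {
    (-10.0, 0.0): (200.0, 384.0),
    (0.0, 0.0): (512.0, 384.0),
    (10.0, 0.0): (820.0, 384.0),
}
```

A linear scan is adequate when the number of calibrated poses is small; a spatial index would only matter for a very dense pose grid.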

4. APPLICATIONS

We have installed the e-Fovea system in two environments for visual monitoring. In the first environment, we installed the cameras on the third floor to monitor a square, as shown in Figure 1. The field of view (FOV) of the steerable telephoto camera is about 5 times smaller in scale than that of the wide-angle camera, i.e., one pixel in the view of the wide-angle camera corresponds in size to approximately 25 pixels in the view of the steerable telephoto camera. The targets of interest are the people and vehicles passing through the square. The fixed projector is mounted on the ceiling at a height of about 8 feet and placed 13 feet away from the projection wall. Its projection area is 6 feet high and 8 feet wide. The steerable projector is also mounted on the ceiling and placed 6.5 feet away from the wall. Its projection area is about 25 times smaller than that of the fixed projector.

In the second environment, the cameras are installed at a height of about 26 feet in front of a building to monitor the license plates of passing vehicles and the people entering the building, as shown in Figure 6. The fixed projector is mounted on the ceiling at a height of about 8 feet and placed 13 feet away from the projection wall. Its projection area is 6 feet high and 8 feet wide. The steerable projector is also mounted on the ceiling and placed 3 feet away from the wall. In this application, we want to recognize vehicle license plates, so the FOV of the steerable telephoto camera is set about 15 times smaller in scale than that of the wide-angle camera, i.e., one pixel in the view of the wide-angle camera corresponds in size to approximately 225 pixels in the view of the steerable telephoto camera. Notice that e-Fovea’s multi-resolution input and output systems are necessary in this application, because currently there are no cameras or displays with a resolution of more than 9600x7200 pixels. A demo video of this system can be viewed at http://www.youtube.com/watch?v=CbmzbYXoQhs .
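The pixel ratios and the 9600x7200 figure are consistent with scaling the 640x480 camera resolution by the linear FOV ratio. The short check below assumes that reading; the function names are illustrative and do not come from the paper.

```python
def effective_resolution(base_width, base_height, fov_scale):
    """Resolution the overview would need to match the detail pixel density."""
    return base_width * fov_scale, base_height * fov_scale


def detail_pixels_per_overview_pixel(fov_scale):
    """A linear FOV ratio of k means one overview pixel covers k*k detail pixels."""
    return fov_scale ** 2


# Second installation: 640x480 cameras, FOV scale of about 15.
second_site = effective_resolution(640, 480, 15)
```

With a scale of 5 the ratio is 25 detail pixels per overview pixel, and with a scale of 15 it is 225, matching the two installations described above.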

Figure 4. Diagram of projector calibration.

Figure 5. The 5-circles image.


5. EVALUATION

We have designed two user studies to compare three types of multi-resolution displays: O+D interface [1][18], F+C screen [1][2], and steerable high-resolution display [9][21] when applied to surveillance tasks. For simplicity, the steerable high-resolution display will be termed as “steerable F+C interface” in the following sections.

The two studies correspond to two of the most common surveillance tasks. The first is a single moving target tracking study, where the control mode of the high-resolution region is a smooth and continuous pursuit of the target. In the context of surveillance, when a suspicious person enters the area being monitored, the security personnel may need to control the PTZ camera to track the target continuously and see what the person is doing.

The second study is multiple moving target identification. When monitoring a private environment, the security personnel may need to identify each and every visitor to determine whether any unauthorized person has entered the premises. This control mode is called saccade. Smooth pursuit and saccade are the two most common control modes of active cameras [19].

5.1 User Study 1: Single Moving Target Tracking

In the first study, we compare the three interfaces on a simulated task of tracking a single target continuously. The participants are asked to keep tracking a target by controlling the simulated steerable telephoto camera with mouse clicks on the overview image. While tracking the target, participants need to identify the symbol on it, either “3” or “E”, which can only be distinguished in the detail image, and then press a key to report which symbol they identified.

5.1.1 Interfaces and Apparatus

The simulated experiments are run on a PC with an extended desktop setup for dual displays. Two LCD monitors are used. One is a 22” LCD monitor (ViewSonic VG2230wm) with a resolution of 1024x768 pixels, and the other is a 17” LCD monitor (Samsung SyncMaster 172T) with a resolution of 1024x768 pixels.

In our hybrid dual-camera system, the video resolutions of both cameras are configured to be 640x480 pixels, and the FOV of the steerable telephoto camera is about 5 times smaller in scale than that of the wide-angle camera in the first environment. To simulate the difference in resolution between the input sources, we construct a 1024x768 image for each simulated video frame and scale it down by a factor of 5 to produce an overview image of 205x153 pixels. The detail image is also 205x153 pixels and is produced by cropping the corresponding 205x153 region from the original 1024x768 image.

All interfaces use a mouse click in the overview image window to move the center of the high-resolution region, and participants use the keyboard to input the symbols shown on the targets. The moving speed of the high-resolution region is unified across the three interfaces at 1000 pixels/second in the overview image. This moving speed is similar to the turning speed of the steerable telephoto camera in our system.
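The relationship between the simulated overview and detail images can be sketched with simple coordinate arithmetic. The sizes below are taken from the text; clamping the crop to the frame and the rounding used to map a click back to the full frame are assumptions made for this sketch, as are the function names.

```python
FRAME_W, FRAME_H = 1024, 768   # full-resolution simulated frame
VIEW_W, VIEW_H = 205, 153      # size of both overview and detail images


def overview_to_frame(ox, oy):
    """Map a click in the 205x153 overview to full-frame coordinates."""
    return round(ox * FRAME_W / VIEW_W), round(oy * FRAME_H / VIEW_H)


def detail_crop_box(cx, cy):
    """Crop box (left, top, right, bottom) of the detail image around (cx, cy).

    The box is clamped so it stays inside the frame (an assumption).
    """
    left = min(max(cx - VIEW_W // 2, 0), FRAME_W - VIEW_W)
    top = min(max(cy - VIEW_H // 2, 0), FRAME_H - VIEW_H)
    return left, top, left + VIEW_W, top + VIEW_H
```

A click is first mapped with `overview_to_frame`, and the resulting point is passed to `detail_crop_box` to obtain the region shown at full resolution.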

The O+D interface uses both LCD monitors placed side by side, as shown in Figure 7(a). The left monitor is the 22” LCD displaying the overview image, and the right one displays the detail image. Both images are displayed full screen at a resolution of 1024x768 pixels.

Figure 6. The e-Fovea system installed in front of a building. The bottom right shows the zooming view of the multi-resolution display.

Figure 7. The setup of three interfaces: (a) O+D interface, (b) F+C screen, and (c) steerable F+C interface. The yellow bounding box represents the corresponding region of detail image. The upper windows show the zooming views of the pointed

The F+C screen and the steerable F+C interface use the 22” LCD monitor only. The overview image, with a resolution of 205x153 pixels, is displayed full screen, and the detail image is embedded in the corresponding area of the overview image at its original size and resolution of 205x153 pixels, as shown in Figure 7(b),(c). Because the high-resolution region of the F+C screen is fixed, a mouse click moves the overview background image rather than moving the detail image directly, as happens in the O+D interface and the steerable F+C interface. When the target moves to the border of the overview image, parts of the view are occluded, and hence black borders appear in the view of the F+C screen, as shown in Figure 7(b).

5.1.2 Task and Stimuli

As shown in Figure 7 and Figure 8, a static image is selected as the background, and the target carries one of two kinds of symbols, either “3”s or “E”s, in each of its squares. The symbols “3” and “E” represent “LEFT” and “RIGHT,” respectively, and can only be recognized in the detail image, not in the overview image. The yellow bounding box in the overview image represents the corresponding region of the detail image.

During each trial, the participant’s task is to control the high-resolution region to keep tracking the moving target. When two flags appear on the sides of the target, as shown in Figure 7, the symbol changes randomly at the same time, and the participant is asked to read the symbol and enter it by pressing the “LEFT” or “RIGHT” key on the keyboard. The flags disappear after the participant enters the correct answer or when no correct answer has been given within 7 seconds. Here, 7 seconds represents the period during which the target is doing something that needs to be noticed.

To prevent participants from guessing the symbol without actually reading it, and to make sure they track the target constantly while keeping an eye on the detail image, we set two rules. First, keyboard input is accepted only when the target is within the high-resolution region. Second, the two flags can be seen only when the target is within the region of the detail image.

At the beginning of each trial, the target appears at a random position and moves at a constant speed along a randomly generated path. An unrecorded period of 3 seconds is given for the participant to locate the target. Then, the test begins. The symbol changes randomly, with the two flags appearing, every 8 to 12 seconds. The time interval is also chosen randomly so that participants cannot count and predict when to read the symbol. The symbol changes 3 times in total in each trial.
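The timing of symbol changes within one trial can be sketched as below. The text gives only the 8 to 12 second range and the count of 3 changes; drawing each interval from a uniform distribution is an assumption, and the function name is hypothetical.

```python
import random


def symbol_change_times(n_changes=3, lo=8.0, hi=12.0, rng=None):
    """Times (seconds after the test begins) at which the symbol changes."""
    rng = rng or random.Random()
    elapsed, times = 0.0, []
    for _ in range(n_changes):
        # Each gap is drawn independently so the next change cannot be predicted.
        elapsed += rng.uniform(lo, hi)
        times.append(elapsed)
    return times
```

Passing a seeded `random.Random` instance makes a schedule reproducible, which is convenient when the same trial has to be replayed.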

5.1.3 Design, Procedure, and Participants

The study design is 3x3 [Interface x Target’s Moving Speed] with 5 repetitions for each condition. The testing order of the interfaces is counter-balanced, i.e., each testing order is assigned to the same number of participants. The target’s moving speed is set to 40, 80, or 120 pixels/second in a randomized order. For each trial, we record the response time, which is the time interval between the symbol changing and the symbol being identified.
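One way to obtain equal group sizes is full counterbalancing: the three interfaces have six possible orders, so 18 participants give three participants per order. The sketch below assumes that scheme; the paper states only that the numbers per order are identical, and the function name is hypothetical.

```python
from itertools import permutations

INTERFACES = ("O+D", "F+C", "steerable F+C")


def assign_orders(n_participants):
    """Assign each participant one interface order, cycling through all orders."""
    orders = list(permutations(INTERFACES))
    if n_participants % len(orders):
        raise ValueError("participants must divide evenly among the orders")
    return [orders[p % len(orders)] for p in range(n_participants)]
```

Calling `assign_orders(18)` returns one order per participant, with each of the six orders used by three participants.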

At the beginning of the experiment, the participant is given a verbal explanation of the whole testing procedure. Before testing each interface, the participant completes a training session of 3 trials with that interface in order to become familiar with the interface, the manipulation, and the tasks.

At the beginning of testing, a “Welcome” screen is shown, and the participant presses the “SPACE” key to start the experiment. After each trial, a blank black screen is shown and the participant can take a break. Once rested, the participant presses the “SPACE” key and the next trial starts. A “Thanks” screen appears at the end of the experiment.

After each interface task, the participant is asked to provide subjective feedback on a seven-point Likert scale, rating their satisfaction with the interface and the difficulty of using it to finish the task. After completing all three interface tasks, the participant is given a questionnaire to indicate their preference among the interfaces.

The study takes about 45 to 55 minutes per participant. A total of 18 volunteers (3 female) between the ages of 20 and 35 participated in this experiment.

5.1.4 Hypothesis

We propose five hypotheses for this experiment: (Those in bold are confirmed.)

[H1] The response time with multi-resolution approaches, either F+C screen or steerable F+C interface, is shorter than that with O+D interface, because users with O+D interface need to switch views to keep tracking the target and looking at its detail.

[H2] The response time with F+C screen and that with steerable F+C interface are similar.

[H3] As the target’s moving speed increases, the difference between the response time with multi-resolution approaches and that with the O+D interface increases.

[H4] From the questionnaire of participants’ subjective feedback, O+D interface is the worst.

[H5] From the questionnaire of participants’ subjective feedback, F+C screen is slightly better than steerable F+C interface, because the tracked target is always in the center of the screen and the subjects never need to pay attention to other regions of the overview image.

5.1.5 Results

Figure 9 shows participants’ average response time with the three interfaces. We perform a repeated-measures ANOVA and find a significant main effect for Interface (F(1.41, 23.97) = 17.886, p < 0.0001) but not for the Interface x Target’s Moving Speed interaction (F(4, 68) = 0.894, p > 0.05). Thus, H3 is not confirmed.

Figure 8. Two kinds of symbols: (a) “3” and (b) “E”.

H1: Paired t-tests between interface conditions at the three target moving speeds are all significant. For the 40/80/120 pixels/second conditions, the average response time with the F+C screen is significantly less than that with the O+D interface (p<0.002, p<0.0002, p<0.007). For the 40/80/120 pixels/second conditions, the average response time with the steerable F+C interface is significantly less than that with the O+D interface (p<0.0002, p<0.001, p<0.001). Hence, the hypothesis is confirmed.

H2: With a repeated-measures ANOVA, there is no significant main effect between the two interfaces (F(1, 17) = 1.266, p > 0.05).

H4, H5: Figure 10 shows the survey results in Study 1. A Wilcoxon signed-rank test is performed. The steerable F+C interface receives the highest satisfaction (p<0.002), the lowest difficulty (p<0.001), and the strongest preference (p<0.001). Surprisingly, however, the results for the O+D interface and the F+C screen are similar. The reasons will be discussed in Section 5.3. From the results, H4 is confirmed, but H5 is not.

5.2 User Study 2: Multiple Moving Target Identification

In the second study, we compare the three interfaces on a simulated task of identifying multiple targets. The participants are asked to steer the high-resolution region by clicking the mouse in the overview image window and to identify the symbols on the targets until all targets in the scene have been checked.

The interfaces, apparatus, simulated environment, appearance of moving targets, symbols on the targets, and testing procedure are all the same as those in User Study 1, as shown in Figure 11.

5.2.1 Task

During each trial, the participant’s task is to move the high-resolution region to the locations of the unidentified targets by clicking the mouse in the overview image window, and then to identify the symbols on the targets until all targets in the scene have been checked. When the participant clicks, two flags appear on the sides of the unidentified target closest to the click position as a reminder that the participant can proceed to identify that target. To identify the target, the participant needs to read the symbol and enter it by pressing the “LEFT” or “RIGHT” key on the keyboard. After the participant enters the correct answer, the symbol turns blue or green, as shown in Figure 11(a), to indicate that it has been identified as “LEFT” or “RIGHT,” respectively. The symbol on an unidentified target differs clearly in color from those already identified, so participants can easily distinguish them in the overview image and avoid identifying the same target repeatedly.

At the beginning of each trial, a number of targets are produced. Each target, with a randomized symbol on it, appears at a random position and moves at a constant speed along a randomly produced path. Once all targets have been identified, the trial is completed.
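The selection and identification logic of the task can be sketched as below. This is an illustrative reconstruction, not the study software: the `Target` class, the function names, and the straight-line motion (a simplification of the randomly produced path) are assumptions, and each target stores the key expected for its symbol because the mapping between symbols and keys is not specified here.

```python
import math
from dataclasses import dataclass


@dataclass
class Target:
    x: float
    y: float
    vx: float  # horizontal velocity in pixels/second
    vy: float  # vertical velocity in pixels/second
    expected_key: str  # "LEFT" or "RIGHT" (hypothetical encoding of the symbol)
    identified: bool = False

    def step(self, dt):
        """Advance the target at constant velocity (straight-line simplification)."""
        self.x += self.vx * dt
        self.y += self.vy * dt


def nearest_unidentified(targets, click_x, click_y):
    """Return the unidentified target closest to the click, or None if all are done."""
    candidates = [t for t in targets if not t.identified]
    if not candidates:
        return None
    return min(candidates, key=lambda t: math.hypot(t.x - click_x, t.y - click_y))


def answer(target, key):
    """Mark the target as identified only if the pressed key is correct."""
    if key == target.expected_key:
        target.identified = True
    return target.identified


def trial_complete(targets):
    return all(t.identified for t in targets)


targets = [Target(0, 0, 40, 0, "LEFT"), Target(100, 0, 0, 0, "RIGHT")]
picked = nearest_unidentified(targets, 90, 0)  # the flags would appear on this target
answer(picked, "RIGHT")
```

Restricting the search to unidentified targets mirrors the rule that the flags appear only on targets that still need to be checked.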

5.2.2 Design and Participants

The study design is 3x3x3 [Interface x Target’s Moving Speed x Number of Targets] with 3 repetitions for each condition. Interface order is counter-balanced. Target’s moving speeds are set to 40, 80, or 120 pixels/second in a randomized order. The number of targets is set to 4, 8, or 12 in a randomized order. For each trial, we record the task completion time.
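A trial schedule for this design could be generated along the following lines. The sketch assumes that trials are blocked by interface, that interface order is counter-balanced over all six permutations, and that the nine speed/count conditions with their repetitions are shuffled within each block; the paper does not spell out these details, and the function name and seed handling are hypothetical.

```python
import itertools
import random

INTERFACES = ["O+D", "F+C screen", "steerable F+C"]
SPEEDS = [40, 80, 120]       # pixels/second
TARGET_COUNTS = [4, 8, 12]
REPETITIONS = 3


def schedule(participant, seed=0):
    """Return the ordered list of (interface, speed, n_targets) trials."""
    orders = list(itertools.permutations(INTERFACES))  # 6 possible orders
    order = orders[participant % len(orders)]          # assumed counter-balancing
    rng = random.Random(seed * 1000 + participant)     # reproducible per participant
    trials = []
    for interface in order:
        block = [(s, n) for s in SPEEDS for n in TARGET_COUNTS] * REPETITIONS
        rng.shuffle(block)                             # randomized speed/count order
        trials.extend((interface, s, n) for s, n in block)
    return trials


trials = schedule(participant=0)
print(len(trials))  # 3 interfaces x 9 conditions x 3 repetitions
```

With 18 participants and six possible orders, each interface order would be used by three participants under this scheme.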

For each interface, participants receive a training session and are asked to complete three 8-target trials before testing. The study takes about 40 to 55 minutes per participant. A total of 18 volunteers (4 female) between the ages of 20 and 45 participate in this experiment. None of them participated in Study 1.

Figure 9. The average response time in Study 1. (+/- standard error of the mean)

Figure 10. Survey results in Study 1.

Figure 11. Study 2: (a) The symbols of targets before and after being identified, (b) F+C screen, (c) steerable F+C interface.

5.2.3 Hypothesis

We propose five hypotheses for this experiment: (Those in bold are confirmed.)

[H6] The task completion time with F+C screen is shorter than that with O+D interface.

[H7] The task completion time with steerable F+C interface is shorter than that with F+C screen, because of the occlusion problem of F+C screen.

[H8] As the number of targets increases, the difference in task completion time between each pair of interfaces increases.

[H9] As the target’s moving speed increases, the difference in task completion time between each pair of interfaces increases.

[H10] From the questionnaire of participants’ subjective feedback, steerable F+C interface is preferred, and O+D interface and F+C screen are similar, as suggested by the results of Study 1.

5.2.4 Results

Figure 12 shows participants’ task completion time. We perform a repeated-measures ANOVA and find significant effects for Interface ($F_{2,\,34} = 53.999$, $p < 0.000001$), for Interface x Number of Targets ($F_{2.677,\,45.501} = 24.418$, $p < 0.000001$), and for Interface x Target’s Moving Speed ($F_{4,\,68} = 8.233$, $p < 0.00002$).

H6: Paired t-tests between the two interfaces are significant (p<0.026) in almost all conditions (Target’s Moving Speed x Number of Targets), except the condition (N, V) = (4, 40) (p>0.05). Despite this exception, the trend still holds. We conjecture that the number of participants is too small to reach significance in this condition. Hence, we consider the hypothesis confirmed.

H7: Paired t-tests between the two interfaces are significant (p<0.025) in almost all conditions (Target’s Moving Speed x Number of Targets), except the condition (N, V) = (4, 120) (p>0.05). For the same reason as in H6, we consider the hypothesis confirmed.

H8: Figure 13 shows the difference in task completion time between each pair of interfaces under different numbers of targets. A repeated-measures ANOVA is performed for each pair of interfaces. The results are all significant (p<0.005).

H9: Figure 14 shows the difference in task completion time between each pair of interfaces under different target moving speeds. A repeated-measures ANOVA is performed for each pair of interfaces. The results between O+D interface and F+C screen and between O+D interface and steerable F+C interface are both significant (p<0.000001, p<0.00047). However, the result between F+C screen and steerable F+C interface is not significant (p>0.05).

H10: Figure 15 shows the survey results in Study 2. A Wilcoxon signed-rank test is performed. The steerable F+C interface receives the highest satisfaction (p<0.0003), the lowest difficulty (p<0.001), and the strongest preference (p<0.001). The results of O+D interface and F+C screen are similar.

5.3 Discussion

In this section, we discuss the advantages and drawbacks of each interface, drawing on the subjects’ feedback and our observations, to analyze the study results.

Figure 12. The task completion time in Study 2.

Figure 13. The difference of task completion time under different number of targets conditions in Study 2.

Figure 14. The difference of task completion time under different target’s moving speed conditions in Study 2.

Figure 15. Survey results in Study 2.


5.3.1 O+D Interface

The main advantage of O+D interface is its large display window for the detail image. Two participants explained that they like this interface most for this reason.

It also has several drawbacks, which is why it consistently shows the worst user performance (H1 and H6) and the worst subjective ratings (H4 and H10) in both studies.

• Switching effort: when manipulating O+D interface, the participants need to switch views. This causes several difficulties. First, it makes users uncomfortable after long-term monitoring. Second, when targets move faster, it is hard to keep tracking the target and watch its detail at the same time. When participants track the target in the overview image and then switch their view to the detail image, the target may already have left the region of the detail view. Third, when there are multiple targets and the colors of their clothes are similar, it is difficult to tell which target in the current detail image is the one that was selected in the overview image before switching (H8).

• Content switch: when switching views from the overview image to the detail image and then back to the overview image, users need to re-familiarize themselves with the scene and the targets’ positions. In Study 1, because the participants focus on only one target, the current position of the target is close to where it was before switching, even when the target moves fast. Thus H3 is not confirmed. In Study 2, the participants identify one target by switching views to the detail image, but they need to search for another unidentified target when switching back to the overview image. Because all targets are moving, the content-switch effect occurs, especially when the targets move faster (H9). The content of the overview image can differ considerably before and after switching views, so users need to search the whole scene again.

5.3.2 F+C Screen

The first advantage is the embedded high-resolution region, so neither switching effort nor content switch occurs. Therefore, the user performance is usually better than that with O+D interface (H1 and H6). Second, the tracked or identified targets are always in the center of the screen, so users can usually keep their focus within a fixed area.

Some drawbacks give it worse user performance than steerable F+C interface (H7 and H8) and the worst subjective ratings from participants (H5 and H10).

• Loss of global information: some subjects explained that when tracking a single target with F+C screen, they usually focus on the high-resolution region in the center of the screen and on the tracked target around that region. It is easy to miss what happens in other areas.

• Losing perception: some subjects explained that they always forget where the detail view is located within the overview image after manipulating F+C screen for a long time. They usually need to look at the black borders to work out where it is.

• Occlusion problem: when the target moves to the border of the overview image, parts of the view are occluded. This explains H8, because more targets fall into the blind region when there are more targets in the scene. In the multi-target identification task, the user cannot even know whether any unidentified targets remain, as shown in Figure 11(b). In addition, some participants do not like the black borders in the view of F+C screen.

• Feeling dizzy: because the high-resolution region is fixed, the background image moves when the user clicks the mouse. This makes users dizzy, because the background image moves in the direction opposite to where users select. In our experiments, more than half (20/36) of the participants reported feeling dizzy. These participants even like O+D interface more than F+C screen. This is why H10 is confirmed.

5.3.3 Steerable F+C Interface

The steerable F+C interface is more intuitive and similar to human eyes. In particular, users can always keep global information while focusing on a target of interest. As with F+C screen, neither switching effort nor content switch occurs.

The only drawback mentioned in participants’ feedback is that the symbol is too small in our simulated studies, but this problem would not arise in practice, because the real implementation of the e-Fovea system uses a wall-size display.
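The geometric difference between F+C screen and steerable F+C interface discussed in Sections 5.3.2 and 5.3.3 can be illustrated with a small sketch. It is a simplified model under stated assumptions: the overview fills a screen of hypothetical size 1024x768, F+C screen shifts the whole overview so that the clicked point lands under the fixed central focus, and the steerable interface moves the focus instead. The function names and the returned fields are hypothetical.

```python
def fc_screen_view(click, size=(1024, 768)):
    """F+C screen: the focus stays at the screen centre and the overview is shifted."""
    w, h = size
    # Shift that brings the clicked point to the centre; its sign is opposite
    # to the click offset, which is the background motion users found dizzying.
    dx, dy = w / 2 - click[0], h / 2 - click[1]
    # Part of the shifted overview falls off-screen; black borders fill the gap.
    visible = (w - abs(dx)) * (h - abs(dy)) / (w * h)
    return {"focus_at": (w / 2, h / 2),
            "overview_shift": (dx, dy),
            "visible_fraction": visible}


def steerable_view(click, size=(1024, 768)):
    """Steerable F+C: the overview stays put and the focus moves to the click."""
    return {"focus_at": click,
            "overview_shift": (0.0, 0.0),
            "visible_fraction": 1.0}


edge_click = (1024, 384)  # a target at the right border of the overview
print(fc_screen_view(edge_click)["visible_fraction"])
print(steerable_view(edge_click)["visible_fraction"])
```

In this model, following a target at the border with F+C screen hides a large share of the overview, whereas the steerable interface always keeps the whole overview visible.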

5.3.4 Summary

The following is a brief summary of the studies:

• Compared with the traditional O+D interface, the multi-resolution approaches, both F+C screen and steerable F+C interface, do improve user performance for visual monitoring.

• Although F+C screen is always better than O+D interface in user performance, the same does not hold for participants’ subjective opinion.

• For the single-target tracking task, F+C screen and steerable F+C interface have similar user performance. For the multi-target identification task, steerable F+C interface is significantly better.

• The steerable F+C interface is the best display for visual monitoring in both quantitative and qualitative tests. This supports the design of our e-Fovea system.

6. CONCLUSION AND FUTURE WORK

We have proposed e-Fovea, a multi-resolution approach with steerable focus for large-scale and high-resolution monitoring. It comprises a multi-resolution input system and a multi-resolution output system. The multi-resolution input system is a hybrid dual-camera system. The multi-resolution output system is a wall-size low-resolution display with an embedded steerable focus region.

Compared with a full high-resolution approach, our setting is much more economical and enables users to focus on a region of interest in very high resolution while remaining aware of the peripheral information in low resolution.

Furthermore, a novel experimental evaluation is presented. We design two user studies to compare user performance and participants’ subjective opinion among three existing multi-resolution designs: O+D interface, F+C screen, and steerable F+C interface. The results show that the steerable F+C interface, which is the one applied in our e-Fovea system, is preferred.

The studies not only support the design of e-Fovea, but also demonstrate that e-Fovea system is significantly better than traditional O+D interface for large-scale and high-resolution monitoring. In our experiments, the improvement in task completion time is up to 26% for single target tracking tasks and 30% for multiple target identification tasks.
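One common way to express such an improvement is as a relative reduction with respect to the baseline interface. The paper does not state the formula it uses, so the expression below is an assumption, and the symbols $T_{\text{O+D}}$ and $T_{\text{e-Fovea}}$ (mean times under each interface in a given condition) are introduced here for illustration only.

```latex
\[
\text{improvement} \;=\;
\frac{T_{\text{O+D}} - T_{\text{e-Fovea}}}{T_{\text{O+D}}} \times 100\%
\]
```

Under this reading, "up to" refers to the condition with the largest relative reduction.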

Several extensions are possible in the future. To better simulate the human eye, e-Fovea with variable zoom will be considered. The solution is to design a projector whose zoom factor and projection area can be adjusted programmatically. In addition, the focus region in our current system is determined by the user with a mouse click. To reduce this manual effort, automatic control can be added, such as PTZ camera control by computer vision techniques or an eye-tracking system. Finally, the current design of e-Fovea is for a single user only. A straightforward extension is to use multiple steerable projectors to support multiple users.

ACKNOWLEDGMENTS

This work was supported in part by the Excellent Research Projects of National Taiwan University, under grant 98R0062-04, and the Ministry of Economic Affairs, Taiwan, under grant 98-EC-17-A-02-S1-032.

REFERENCES

[1] Baudisch, P., Good, N., Bellotti, V., and Schraedley, P. Keeping things in context: A comparative evaluation of focus plus context screens, overviews, and zooming. In Proc. of CHI’02, pages 259-266, 2002.

[2] Baudisch, P., Good, N., and Stewart, P. Focus plus context screens: combining display technology with visualization techniques. In Proc. of UIST’01, pages 31-40, 2001.

[3] Brown, M. and Lowe, D.G. Recognising Panoramas, In Proc. of ICCV, 2003.

[4] Cao, X., Forlines, C., and Balakrishnan, R. Multi-User Interaction using Handheld Projectors, In Proc. of UIST’07, pages 43-52, 2007.

[5] Chen, I.H., and Wang, S.J. An Efficient Approach for Dynamic Calibration of Multiple Cameras, IEEE Trans. on Automation Science and Engineering, 4(2), 2007.

[6] Chan, L.W., Ye, W.S., Liao, S.C., Tsai, Y.P., Hsu, J., and Hung, Y.P. A Flexible Display by Integrating a Wall-Size Display and Steerable Projectors, In Proc. of UIC’06, 2006.

[7] Feiner, S. and Shamash, A. Hybrid user interfaces: Breeding virtually bigger interfaces for physically smaller computers, In Proc. of UIST’91, pages 9-17, 1991.

[8] Geisler, J., Eck, R., Rehfeld, N., Peinsipp-Byma, E., Schütz, C., and Geggus, S. Fovea-Tablet®: A New Paradigm for the Interaction with Large Screens, In Proc. of HCI’07, 8, pages 278-287, 2007.

[9] Hu, T.T., Chia, Y.W., Chan, L.W., Hung, Y.P., and Hsu, J. i-m-Top: An Interactive Multi-Resolution Tabletop System Accommodating to Multi-Resolution Human Vision, IEEE International Workshop on Tabletops and Interactive Surfaces, 2008.

[10] Hsiao, C.H., Chan, L.W., Chen, M.C., Hsu, J., and Hung, Y.P. To Move or Not to Move: A Comparison between Steerable and Fixed Regions of High-Resolution Projection in Multi-Resolution Tabletop Systems, In Proc. of CHI’09, 2009.

[11] Kandel, E., Schwartz, J., and Jessell, T. Principles of neural science, 4th ed., McGraw-Hill, 2000.

[12] Krahnstoever, N., Yu, T., Lim, S.N., Patwardhan, K., and Tu, P. Collaborative Real-Time Control of Active Cameras in Large Scale Surveillance Systems, Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications, 2008.

[13] Lowe, D. Object Recognition from Local Scale-Invariant Features. In Proc. of ICCV’99, pages 1150–1157, 1999.

[14] Lalonde, M., Foucher, S., Gagnon, L., Pronovost, E., Derenne, M., and Janelle, A. A System to Automatically Track Humans and Vehicles with a PTZ Camera, In Proc. of SPIE Defense & Security: Visual Information Processing XVI (SPIE #6575), 2007.

[15] Marchesotti, L., Marcenaro, L., and Regazzoni, C. Dual Camera System for Face Detection in Unconstrained Environments, In Proc. of ICIP’03, 2003.

[16] Palmer, S. Vision Science: Photons to Phenomenology, MIT Press, 1999.

[17] PENPOWER, TrackIN iDVR, Auto PTZ Tracking System, http://www.trackinvideo.com/iDVR/main3_3.html/.

[18] Plaisant, C., Carr, D., and Shneiderman, B. Image-Browser Taxonomy and Guidelines for Designers, IEEE Software, 12(2): 21–32, 1995.

[19] Rivlin, E. and Rotstein, H. Control of a Camera for Active Vision: Foveal Vision, Smooth Tracking and Saccade, IJCV, 39(2): 81–96, 2000.

[20] Szeliski, R. Image Alignment and Stitching: A Tutorial, Technical Report, MSR-TR-2004-92, 2004.

[21] Staadt, O., Ahlborn, B., Kreylos, O., and Hamann, B. A foveal Inset for Large Display Environment, In Proc. of IEEE VR’06, pages 281-282, 2006.

[22] Sansoni, G., Carocci, M., and Rodella, R. Three-Dimensional Vision Based on a Combination of Gray-Code and Phase-Shift Light Projection: Analysis and Compensation of the Systematic Errors, Applied Optics, 38, pages 6565-6573, 1999.

[23] Sanneblad, J. and Holmquist, L. Ubiquitous Graphics: Combining Hand-Held and Wall-Size Displays to Interact with Large Images, In Proc. of AVI’06, pages 373-377, 2006.

[24] Wandell, B., Foundations of Vision, Sinauer Associates, 1995.

[25] Zhang, Z. A flexible new technique for camera calibration, IEEE Trans. on PAMI, 22(11):1330-1334, 2000.

[26] Zhang, C., Liu, Z., Zhang, Z., Zhao, Q. Semantic saliency driven camera control for personal remote collaboration, In Proc. of MMSP’08, pages 28-33, 2008.