Chapter 1 Introduction
1.5 Thesis Organization
The remainder of this thesis is organized as follows. In Chapter 2, the
configuration of the proposed system and the system process are introduced in detail.
In Chapter 3, the proposed process for learning an indoor environment, including the data used in the proposed system, is described in detail. In Chapter 4, the proposed user identification method, which analyzes color images of the multi-color edge marks attached to the top of the users’ mobile devices, is described. In Chapter 5, the proposed multi-user localization method for indoor environments and the proposed method for detecting the user’s viewing orientation are described. In Chapter 6, the proposed AR process for merchandise shopping guidance and merchandise recognition is presented. In Chapter 7, the proposed AR technique, a method for performing the perspective transformation for information display on each user’s mobile device, and the adopted technique for rendering augmentations on real images are described. In Chapter 8, experimental results showing the feasibility of the proposed techniques for indoor navigation for merchandise shopping or similar activities are presented. Finally, conclusions and suggestions for future work are given in Chapter 9.
Chapter 2
System Design and Processes
2.1 Ideas of Proposed System
2.1.1 Application Environment and Possible Scenarios
People often need a navigation system to help guide them when they visit an unfamiliar place. In this study, we propose an indoor AR-based multi-user navigation system for merchandise shopping and similar activities. It can be applied in indoor environments such as shopping malls, supermarkets, and grocery stores. The aim is a more intuitive mall or market navigation system that guides customers to the locations of desired merchandise items.
When the system proposed in this study is applied in the real world, a customer uses the mobile device to navigate a shopping mall in the following way.
First, the customer browses information about nearby places through the AR function on the mobile device screen. Next, if the customer cannot find the desired place, he/she can search for it in a database using the proposed system. The path is then displayed on the screen to guide the customer to the destination. The customer can also search for a desired merchandise item, and the system will guide him/her to the shelf on which the item is placed. Finally, when the customer is in front of the shelf, he/she can view information about the merchandise, such as the brand and the price, on the mobile device screen. In short, the proposed AR-based navigation system provides a more convenient and easy-to-use interface for merchandise shopping and similar activities.
2.1.2 Main Techniques for Indoor Localization and Visual Search
With the indoor localization technique, the proposed system analyzes the omni-images captured by the fish-eye cameras affixed to the ceiling and finds the users’ foot points in each omni-image. Once the users’ foot points are obtained, their coordinates in the fisheye-camera image coordinate system (FICS) are transformed into coordinates in the global coordinate system (GCS) to get the actual positions of the users in the indoor environment.
Next, the users’ orientations must be detected. Based on the Hsieh and Tsai method [17], three techniques may be used to compute the users’ orientations. The first technique is to track each user’s locations in consecutively acquired images and to use the resulting motion vectors of the user’s foot points to compute the user’s orientation. However, when the user is not walking, this method does not work because no motion vector is available. Therefore, the second technique is to use the orientation sensor installed in the user’s device. The orientation sensor measures the azimuth angle of the device by detecting changes and disturbances in the magnetic field of the surrounding environment. However, our experiments show that the detected azimuth values are not stable enough for the application because of indoor magnetic interference from various sources. Hence, the third technique, which improves the stability of the detected orientations, is to attach a color edge mark to the top edge of the user device and to detect this linear mark in the omni-image to compute a more accurate orientation of the user at each visited spot.
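As a rough illustration of the first technique, the sketch below computes a heading angle from two consecutive foot-point positions and returns no result when the displacement is too small, which is the situation that motivates the other two techniques. The function name, the angle convention (degrees measured counterclockwise from the +x axis of the GCS), and the `min_step` threshold are assumptions made for illustration; they are not taken from the method in [17].

```python
import math

def heading_from_motion(prev_pt, curr_pt, min_step=0.05):
    """Return a heading in degrees in [0, 360), or None if the user barely moved.

    prev_pt, curr_pt: (x, y) foot-point positions in the GCS (assumed unit: meters).
    min_step: assumed threshold below which the motion vector is treated as unreliable.
    """
    dx = curr_pt[0] - prev_pt[0]
    dy = curr_pt[1] - prev_pt[1]
    if math.hypot(dx, dy) < min_step:
        # The user is (almost) standing still, so no usable motion vector exists.
        return None
    return math.degrees(math.atan2(dy, dx)) % 360.0
```

A caller would fall back to the orientation sensor or the edge mark whenever the function returns `None`.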
In this study, to identify the multiple users in the environment, we propose a new user localization method that detects a multi-color edge mark attached to the top of the mobile device. Combining multiple colors yields more distinguishable marks than a single color does. That is, we detect the users holding differently marked client devices in the acquired images, and identify them by a color classification scheme.
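The idea of identifying users by color combinations can be sketched as follows. This is only a minimal illustration: the reference colors, the nearest-color rule, and the mapping from color sequences to user IDs are all hypothetical and stand in for the color classification scheme described in Chapter 4.

```python
# Assumed reference colors (RGB); the actual mark colors are not specified here.
REFERENCE_COLORS = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
}

# Hypothetical registry: ordered color sequence on the edge mark -> user ID.
USER_MARKS = {
    ("red", "green"): "user_1",
    ("green", "blue"): "user_2",
    ("blue", "red"): "user_3",
}

def classify_color(rgb):
    """Return the name of the reference color nearest to rgb (squared Euclidean distance)."""
    return min(
        REFERENCE_COLORS,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, REFERENCE_COLORS[name])),
    )

def identify_user(segment_colors):
    """Map the mean RGB values sampled along a mark to a user ID, or None if unknown."""
    sequence = tuple(classify_color(rgb) for rgb in segment_colors)
    return USER_MARKS.get(sequence)
```

With k reference colors and n ordered segments, up to k^n sequences can be registered, compared with only k marks when a single color is used.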
In addition, to provide a more convenient interface for merchandise shopping, the proposed system analyzes the images captured by the mobile-device camera and matches them with the merchandise patterns using SURFs. The field of view of the mobile-device camera must also be estimated to obtain a perspective projection matrix for later use. Furthermore, to improve performance, the server-side system is designed to match the images only with the merchandise patterns at the detected heights rather than with all the patterns at all heights. If the matching succeeds, the server-side system calculates the position of the matched merchandise and sends the corresponding information to the client-side system.
When the client-side system receives the navigation information sent from the server, it displays the information on the device screen. With the perspective projection matrix described above, the 3D points of the navigation information can be transformed onto the 2D screen plane. The navigation information can then be overlaid onto the real places or objects in the image taken of the current scene, so that the user can easily understand the surrounding environment and the merchandise information, which achieves the major goal of AR-based indoor environment guidance in this study.
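The role of the estimated field of view in mapping 3D points to the screen can be illustrated with a simple pinhole-camera sketch. It uses the scalar pinhole equations rather than an explicit projection matrix, and the axis convention (X right, Y up, Z forward in camera coordinates) and the function name are assumptions for illustration only.

```python
import math

def project_point(point_cam, fov_x_deg, width, height):
    """Project a 3D point in camera coordinates onto the 2D screen (pinhole model).

    point_cam: (X, Y, Z) with X right, Y up, Z forward (assumed convention).
    fov_x_deg: horizontal field of view of the device camera, in degrees.
    Returns (u, v) in pixels, or None if the point is not in front of the camera.
    """
    X, Y, Z = point_cam
    if Z <= 0:
        return None
    # Focal length in pixels, derived from the horizontal field of view.
    f = (width / 2.0) / math.tan(math.radians(fov_x_deg) / 2.0)
    u = width / 2.0 + f * X / Z
    v = height / 2.0 - f * Y / Z  # screen v grows downward
    return (u, v)
```

A point on the optical axis lands at the screen center, and points farther from the camera move toward the center as Z grows, which is the foreshortening an AR overlay needs.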
2.2 System Configuration
In this section, we will introduce the configuration of the proposed system. In Section 2.2.1, we will introduce the hardware configuration. The hardware of the proposed system includes omni-cameras which we use for localization detection of the multiple users, and the smart device which we use as the client-side device. In Section 2.2.2, we will introduce how to connect the hardware over a 4G/LTE network, and how it operates. Finally, we will describe in Section 2.2.3 the software development environment and the operating system we use both in the server-side system and in the client-side system.
2.2.1 Hardware Configuration
The hardware configuration for this study can be decomposed into three parts: the fish-eye cameras, the mobile devices, and a specially customized LTE router. The camera we use is an Axis 207MW made by Axis Communications, whose original lens is replaced with a fisheye lens in this study to expand its field of view. The Axis 207MW camera measures 85 × 55 × 40 mm (3.3” × 2.2” × 1.6”, not including the antenna) and weighs 190 g (0.42 lb, not including the power supply). Its appearance is shown in Figure 2.1. The maximum resolution of the captured images is 1280 × 1024 pixels. For performance efficiency, we use a resolution of 640 × 480 pixels in our system, and the frame rate is up to 15 fps. The cameras can be accessed through wireless networks (IEEE 802.11g/b), but for higher speed we access them over Ethernet.
We build an experimental environment in the Computer Vision Research Center at National Chiao Tung University by installing several fisheye cameras on the ceiling of the center (see Figure 2.2). The images captured from the cameras are analyzed by
a virtual machine (VM) in the cloud server to detect the user’s location and orientation.
The server sends the navigation information to the users’ mobile devices so that the users can begin the navigation.
Figure 2.1 The camera used in the proposed system.
The mobile device we use in the experiment is an HTC Flyer tablet made by HTC Corporation. Its appearance is shown in Figure 2.3. The HTC Flyer measures 195 × 122 × 13.2 mm (7.7” × 4.8” × 0.5”) and weighs 420 g (0.93 lb). It has a 7-inch screen, a camera acquiring 5-megapixel images, and an e-compass that can detect the device orientation in a magnetic field, among other features. For performance efficiency, we use an image resolution of 533 × 426 pixels in our system. The user uses the HTC Flyer as the client device and connects it to the cloud server.
Because no commercial 4G/LTE network service is available in Taiwan at present, the 4G/LTE experimental network in this study is provided through a cooperative project with the Broadband Mobile Lab (BML) at National Chiao Tung University. Mobile devices that support the 4G/LTE network are not yet available on the market, so we use the mobile devices described above together with a specially customized LTE router as a bridge between the 4G/LTE and Wi-Fi networks. The LTE router is a notebook with a 4G/LTE USB Dongle LU221 made by Quanta, Inc. Its appearance is shown in Figure 2.4. The notebook uses the Internet Connection Sharing (ICS) feature of Windows to share the LTE connection with the Wi-Fi local area network (LAN) by means of Network Address Translation (NAT). The notebook can be connected to the Evolved Packet Core (EPC) via the 4G/LTE network, so the LTE router can transmit data to a cloud server. The LTE router also acts as a Wi-Fi access point. As a result, a mobile device such as a tablet can access the LTE connection through the Wi-Fi LAN.
Figure 2.2 The camera installed on the ceiling in the indoor environment.
2.2.2 Network Configuration
The configuration of the network used in this study is shown in Figure 2.5, where the fish-eye cameras and the cloud server are connected through an Ethernet. The cloud server can access the images captured by the fish-eye cameras more reliably through the Ethernet, which ensures that the system always accesses correct and up-to-date images and messages.
Figure 2.3 The HTC Flyer used as the client device in this study.
Figure 2.4 The Quanta 4G/LTE USB Dongle used in this study.
For the purpose of user mobility, we propose the use of a network configuration with 4G/LTE in this study. LTE is a standard for high-speed wireless data communication for mobile phones and data terminals. The client device we use is a mobile device, but as mentioned previously, no mobile device that can connect through 4G/LTE is currently available on the Taiwan market. We therefore use Customer Premise Equipment (CPE), namely an LTE router reached via the Wi-Fi network, to access the service via the E-UTRAN in the 4G/LTE network. A cloud server of a service provider is connected to an Evolved Packet Core (EPC) through the Internet.
In this work, we apply a private network instead of the Internet to separate possible interfering traffic from the public network.
In short, the client device used in this study can access the cloud server and receive the navigation information through the 4G/LTE network in the environment.
Figure 2.5 illustrates the network configuration described previously and used in this study.
[Figure 2.5 diagram labels: Cloud Server, Camera, Camera, LAN, EPC, eNB, LTE Network, LTE Router, Wi-Fi Network, Client]
Figure 2.5 The network architecture of the proposed system.
2.2.3 Software Configuration
The server-side system operates on the Windows 7 Operating System in the VM of the cloud server, and the system is written using the C# programming language in the Microsoft Visual Studio 2010 development environment. The server-side system accesses the cameras by the AXIS Media Control SDK (AMC SDK), which provides the application programming interface (API) for developers to access the camera images or control the cameras using C# and C++ programming languages. It is
provided by the manufacturer of the cameras, Axis Communications, Inc.
The client-side system is written in the Java programming language and operates on the Android 2.3.4 Operating System. It uses Qualcomm’s Augmented Reality (QCAR) platform, which provides many useful functions for AR development on mobile devices. In our system, however, we use the QCAR platform only to handle the capturing of camera images. The rendering of 3D augmented objects is done with the Android OpenGL API.
2.3 System Design
In this section, we describe in detail the design of the proposed network system. As mentioned, it has a client-server architecture composed of a server side and a client side. In Section 2.3.1, we introduce the server-side system, which runs on a VM in the cloud server and carries out the computationally heavy tasks. In Section 2.3.2, we introduce the client-side system, which runs on a user’s smart device, obtains navigation information from the server-side system, and displays it on the screen of the device. Finally, in Section 2.3.3, we introduce the cooperation between the client and server sides.
2.3.1 Server-side System
The server-side system runs on a VM of the cloud server as mentioned, and it is connected to the fish-eye cameras on the ceiling through the Ethernet. In the learning stage, we build an environment map, which includes environment and merchandise information such as the locations and titles of the target places, the camera locations, the feature, price, and brand of each merchandise item, and so on. In the navigation stage, the server accesses the omni-images captured by the fish-eye cameras and the
image acquired from the mobile-device camera. The server analyzes the omni-images to detect the users’ locations and orientations at each visited spot and identify them by a color classification scheme. After the server detects the multiple users via the images acquired by the cameras, it sends the corresponding users’ locations, orientations, and the information of nearby places to the users’ client-side systems.
Meanwhile, the server extracts the features of the images acquired by each mobile-device camera and matches them with the features of the merchandise patterns by the SURF algorithm [14]. If the matching with the merchandise pattern seen by a user succeeds, the server sends the information about the merchandise to the user’s client-side system. All such information is updated as the user moves.
When the user wants to reach a certain destination, the server will receive a request from the client, and then plan a path from the user’s location to the destination, and send a set of intermediate points of the path to the user’s client-side system to display.
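To make the request-and-path exchange concrete, the sketch below plans a path on a small occupancy grid and returns the list of intermediate points. Breadth-first search on a 4-connected grid is used here purely as a stand-in; it is not the path planning method of this study, and the grid encoding and function name are assumptions.

```python
from collections import deque

def plan_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid; returns a list of (row, col) or None.

    grid: list of lists, 0 = free cell, 1 = obstacle (assumed map encoding).
    start, goal: (row, col) cells, assumed to be free.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through the parents to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # the goal is unreachable
```

The returned list plays the role of the set of intermediate points that the server sends to the client for display.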
Overall, the server is designed mainly for localizing multiple users, planning paths, and matching merchandise items, all of which are computationally heavy tasks. Because the client-side system runs on the user’s mobile device, which has less power and weaker computational capability than the cloud server, carrying out these tasks on the server, as done in this study, increases computational performance and reduces the battery usage of the client-side system.
2.3.2 Client-side System
The client-side system runs on each user’s mobile device. It transmits the images captured by the mobile-device camera to the server-side system. Because the mobile device has less power and weaker computational capability than a laptop or desktop computer, the client-side system must be assigned as few tasks as possible to reduce power consumption and increase computational performance. Therefore, most tasks of the client-side system concern information display only, such as view projection, display rendering, and creation of the geometric shape of the navigation path (arrows, thick line segments, etc.).
When a user enters the environment, the user’s client-side system is connected to the cloud server through a 4G/LTE network, and sends the images captured by the camera to the cloud server; meanwhile, it receives relevant information from the server. Then, the client-side system displays the information on the screen of the user’s mobile device.
2.3.3 Cooperation between Client and Server Sides
The server and client side systems are described in Sections 2.3.1 and 2.3.2. Here we describe the cooperation between the client-side and server-side systems in more detail. An illustration of the cooperation between the two systems is shown in Figure 2.6.
When the client is connected to the server, the server detects the multiple users’
locations and orientations, analyzes the images received from the client, and sends the messages, such as location coordinates, the orientation vectors, the nearby environment information, and the matched merchandise information, to the user. The information is updated continuously to make sure that the user can receive correct and immediate messages.
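The kind of message listed above can be pictured as a small record that is serialized on the server and restored on the client. The sketch below is an assumption throughout: the field names, the `NavigationUpdate` type, and the use of JSON are illustrative only and do not describe the message format of the actual system.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class NavigationUpdate:
    """Hypothetical per-user message sent from the server to the client."""
    user_id: str
    location: tuple      # (x, y) position in the GCS
    orientation: tuple   # orientation vector
    nearby_places: list = field(default_factory=list)
    matched_merchandise: Optional[dict] = None

def encode_update(update):
    """Serialize an update to a JSON string (tuples become JSON arrays)."""
    return json.dumps(asdict(update))

def decode_update(text):
    """Rebuild an update from JSON, restoring the tuple-valued fields."""
    data = json.loads(text)
    data["location"] = tuple(data["location"])
    data["orientation"] = tuple(data["orientation"])
    return NavigationUpdate(**data)
```

Sending one self-contained record per update keeps the client logic simple: each received message fully replaces the previously displayed state.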
When the user wants to reach a certain destination, the client-side system sends a request, which includes the name of the desired destination or the name of the desired product, to the server. After the server receives the request, it plans a path starting from the user’s location and ending at the destination, and sends a set of intermediate points of the path to the client.
Finally, the client displays all of the information described above in an AR manner.
Figure 2.6 Cooperation between client and server sides.
2.4 System Processes
2.4.1 Learning Process
In the learning stage of the proposed system, the goal is to establish an environment map, which includes information about the places available for visits, fish-eye cameras, merchandise patterns, magnetic fields, and obstacle orientations.
The entire learning process is shown in Figure 2.7. Only a brief description of the process is given here. More details of it will be described in Chapter 3.
First, we establish an environment map in the form of a floor plan drawing. The floor plan is drawn at a specific ratio relative to the actual size of the environment. After specifying the ratio, we compute the corresponding size in pixels. This scaling ratio is necessary for the transformation between the FICS (fisheye-camera image coordinate system) and the GCS (global coordinate system).
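The use of the scaling ratio can be illustrated with a short sketch. The ratio of 50 pixels per meter in the usage note and the function names are hypothetical; only the idea of converting between real-world lengths and map pixels comes from the text.

```python
def meters_to_pixels(length_m, pixels_per_meter):
    """Convert a real-world length in meters to a length on the floor plan in pixels."""
    return length_m * pixels_per_meter

def map_to_world(px, py, pixels_per_meter):
    """Convert a floor-plan point in pixels to real-world coordinates in meters."""
    return (px / pixels_per_meter, py / pixels_per_meter)
```

For example, at an assumed ratio of 50 pixels per meter, a 2 m wide aisle is drawn 100 pixels wide, and the map point (100, 250) corresponds to the position (2 m, 5 m).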
Next, the target places for visits in the environment are specified on the environment map. Furthermore, we specify as well the installation information of the fisheye cameras, which includes the locations and heights of the cameras for use in computing the transformations between the FICS and the map coordinate system (MCS).
After the environment map is established, the learning process is conducted, which can be decomposed into three phases: learning for path planning, learning for user localization, and learning for merchandise recognition. The two phases of learning for path planning and learning for user localization are based on the Hsieh and Tsai method [17]. The goal of learning for path planning is to “understand” the obstacle information. In the user localization phase, the camera calibration process and the magnetic field learning process are conducted. The camera calibration process maps points between different coordinate systems, and the magnetic field learning process establishes an azimuth map, which keeps a record of four