
(1)

Wireless Communication Systems

@CS.NCTU

Lecture 7: MobileHCI

Instructor: Kate Ching-Ju Lin (林靖茹)

(2)

Traditionally, wireless signals are used for …

Data communication among devices

(3)

Now we have the Internet of Things

More and more sensing/wearable devices,

wireless signals everywhere

(4)

Can we use wireless signals to create

human-centric applications,

not just for data communication?

(5)

Why Device-Free?

(6)

Limitation of Cameras

• Privacy issues

• Line of sight limitation

• Lighting requirement


(7)

Limitation of Wearable Devices

• Inconvenient

• High deployment cost

• Feedback overhead

(8)

Device-Free MobileHCI Apps

[MobiCom’13]

[MobiCom’15]

Gesture recognition

Handwriting

Keystroke

Figure 1: WiKey System

Based on this observation, we design a keystroke extraction algorithm that uses the CSI streams of all transmit-receive (TX-RX) antenna pairs to determine the approximate start and end points of individual keystrokes in a given CSI-waveform. It does so by continuously matching the trends in the CSI time series against experimentally observed trends using a sliding-window approach.

The second technical challenge is to extract distinguishing features for generating classification models for each of the 37 keys (10 digits, 26 letters, and the space bar). Because the keys on a keyboard are closely placed, conventional features such as maximum peak power, mean amplitude, root mean square deviation of signal amplitude, second/third central moments, rate of change, signal energy or entropy, and number of zero crossings cannot be used: their values for adjacent keys are almost identical. To address this challenge, we use the CSI-waveform shapes of each key from each TX-RX antenna pair as features. Because the waveform of each key contains a large number of samples, we apply the Discrete Wavelet Transform (DWT) to reduce the number of samples while preserving the shape-related time- and frequency-domain information. We use the waveforms resulting from the DWT of individual keystrokes as their shape features.
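To make the sample-reduction step concrete, here is a minimal sketch of a multi-level DWT that keeps only the approximation coefficients. The excerpt does not say which wavelet family or decomposition level WiKey uses; the Haar wavelet, `levels=3`, the function name `haar_approximation`, and the synthetic CSI stream are assumptions for illustration. Each level halves the sample count, so the output keeps the coarse shape of the waveform (scaled by √2 per level) with far fewer samples.

```python
import numpy as np

def haar_approximation(waveform, levels=3):
    """Return the Haar DWT approximation coefficients after `levels` levels.

    Each level combines adjacent sample pairs as (a + b) / sqrt(2), halving the
    number of samples while keeping the coarse shape of the waveform.
    Odd-length inputs are padded by repeating the last sample.
    """
    approx = np.asarray(waveform, dtype=float)
    for _ in range(levels):
        if approx.size % 2:
            approx = np.append(approx, approx[-1])
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2)
    return approx

# Hypothetical CSI amplitude stream of one TX-RX pair around a keystroke
t = np.linspace(0, 1, 400)
csi = 10 + np.sin(2 * np.pi * 3 * t) * np.exp(-((t - 0.5) ** 2) / 0.02)
shape_feature = haar_approximation(csi, levels=3)
print(csi.size, "->", shape_feature.size)  # 400 -> 50
```

In a multi-antenna setting, one such feature vector would be computed per TX-RX pair.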

The third technical challenge is to compare the shape features of any two keystrokes. The midpoints of the extracted CSI-waveforms of different keystrokes rarely align with each other because the start and end points determined by the extraction algorithm are never exact. Moreover, the lengths of different keystroke waveforms also differ because the duration of pressing a key often varies. Consequently, neither the midpoints nor the lengths of the shape features match. Another issue is that the shapes of different keystroke waveforms of the same key are often distorted versions of each other, because the hands and fingers take slightly different formations and move in slightly different directions each time that key is pressed. Thus, two shape features cannot be compared using standard measures such as the correlation coefficient or Euclidean distance. To address this challenge, we use Dynamic Time Warping (DTW) to quantify the distance between two shape features. DTW finds the minimum-distance alignment between two waveforms of different lengths.
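A minimal sketch of the classic DTW recurrence shows why it tolerates the length and timing mismatches described above. The absolute-difference cost, the function name `dtw_distance`, and the toy waveforms are assumptions; the excerpt does not specify WiKey's exact DTW variant (for example, window constraints or how distances from several antenna pairs are combined).

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # a[i-1] also matches an earlier b sample
                                 cost[i, j - 1],      # b[j-1] also matches an earlier a sample
                                 cost[i - 1, j - 1])  # advance in both waveforms
    return cost[n, m]

# A keystroke shape and a slower (time-stretched) copy of it
fast = [0, 1, 3, 1, 0]
slow = [0, 0, 1, 1, 3, 3, 1, 0]
print(dtw_distance(fast, slow))  # 0.0
```

Here `slow` is `fast` with some samples held longer, so DTW reports zero distance, whereas Euclidean distance is not even defined for two vectors of unequal length. A nearest-neighbor rule could then label an unknown keystroke with the key whose stored shape features give the smallest DTW distance; that classifier choice is an assumption, not something stated in the excerpt.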

The key novelty of this paper lies in proposing the first WiFi-signal-based keystroke recognition approach. Some recent work uses CSI values to recognize various macro aspects of human movements, such as falling down [6], household activities [7], detection of human presence [8], and estimating the number of people in a crowd [9]. These schemes extract coarse-grained information from the CSI values to recognize macro-movements such as falling down or full-body/limb gestures. They cannot be directly adapted to recognize keystrokes because such coarse-grained information does not capture the minor variations in the CSI values caused by human micro-movements, such as those of the hands and fingers while typing. Other recent work, namely WiHear, uses CSI values to extract the micro-movements of the mouth to recognize 9 syllables in spoken words [10].

However, WiHear uses special hardware, including directional antennas and stepper motors, to direct WiFi beams towards the speaker’s mouth and extract the micro-movements.

We implemented the WiKey system using COTS devices, i.e., a TP-Link TL-WR1043ND WiFi router and a Lenovo X200 laptop with an Intel 5300 WiFi NIC. In the evaluation, we built a keystroke database of 10 human subjects with IRB approval. WiKey achieves a detection rate of more than 97.5% for keystrokes and a recognition accuracy of 96.4% for classifying single keys. In real-world experiments, WiKey recognizes keystrokes in a continuously typed sentence with an accuracy of 93.5%.

In this paper, we have shown that fine-grained activity recognition is possible using COTS WiFi devices. Thus, the techniques proposed in this paper can be used for several HCI applications. Examples include zoom-in, zoom-out, scrolling, sliding, and rotating gestures for operating personal computers, gesture recognition for gaming consoles, in-home gesture recognition for operating various household devices, and applications such as writing and drawing in the air. Beyond its potential use as an attack, WiKey could also be used to build virtual keyboards, where users type on a printed keyboard.

2. RELATED WORK

2.1 Device Free Activity Recognition

Device-free activity recognition solutions use the variations in wireless channel to recognize human activities in a given environment. Existing solutions can be grouped into three categories: (1) Received Signal Strength (RSS) based, (2) CSI based, and (3) Software Defined Radio (SDR) based.

RSS Based: Sigg et al. proposed activity recognition schemes that use the RSS values of WiFi signals to recognize four activities: crawling, lying down, standing up, and walking [11, 12]. They achieved recognition rates of over 80% for these four activities. To obtain RSS values from WiFi signals, they used USRPs, which are specialized hardware devices, unlike the COTS WiFi devices used in our work. While RSS values can be used to recognize macro-movements, they are not suitable for recognizing micro-movements, such as those of the fingers and hands in keyboard typing, because RSS values only provide coarse-grained information about channel variations and do not capture fine-grained information about the small-scale fading and multipath effects caused by these micro-movements.

CSI Based: CSI values obtained from COTS WiFi network interface cards (NICs), such as the Intel 5300 and Atheros 9390, have recently been proposed for activity recognition [6–10, 13] and localization [14–16]. Han et al. proposed WiFall, which detects the fall of a human subject in an indoor environment using CSI values [6]. Zhou et al. proposed a passive human detection scheme that exploits multipath variations to detect human presence in an indoor environment using CSI values [8]. Zou et al. proposed Electronic Frog Eye, which counts the number of people in a crowd using


[MobiCom’15]

[Figures 27-32 (plots not reproduced)]

Figure 27: Tracking error of different materials (distribution of tracking error in mm for pen, marker, and pencil).

Figure 28: Error map of APA (locating error in cm over the X/Y coordinates of the writing region, in cm).

Figure 29: Error map of phase tracking (tracking error in cm over the X/Y coordinates, in cm).

Figure 30: Detection accuracy for different users (accuracy vs. user index).

Figure 31: mTrack example of letter and word (trajectory in X/Y coordinates, cm).

Figure 32: Character and word recognition accuracy (accuracy vs. user index).

peaks can thus be reliably distinguished through fine-grained scanning. However, with coarser steering granularity, their peaks tend to merge, leading to increasing error. Fortunately, background subtraction effectively mitigates the impact of background reflection, reducing the estimation error by 50%.

Figure 24 shows the APA performance with human movement as a dynamic background. Human movement 2 m away from the receiver does not affect the positioning error, since the RSS reflected from the human body is much weaker than that from the pen. The estimation error without background subtraction increases to 2.8 when a human stands close to the receive antenna. However, background subtraction still consistently reduces the positioning error even under human movement.

8.1.3 Joint Performance of Tracking and APA.

Recall that APA facilitates phase tracking through opportunistic calibration. In this experiment, we verify the effectiveness of this joint execution. The target pen moves along a circular trajectory with a radius of 7 cm. mTrack continuously runs phase tracking and performs the k/θ-test (Section 6) every 2 seconds, invoking APA calibration if the test dictates so. Figure 25 shows the tracking error at every 2-second checkpoint. Without APA calibration, the phase-tracking error steadily accumulates over time and reaches 46 cm after the pen moves 150 cm continuously along the circle. In contrast, APA calibration caps the phase-tracking error below 10 mm across 90% of the trajectory.

8.2 Performance on a Trackpad

We now evaluate mTrack’s performance in a real trackpad application. The experiments are conducted in an office environment with a natural background (drywall, a metal cabinet, a user, and occasional people walking by). A 50 cm × 50 cm writing region is created on a wooden table. To test the precision of APA in locating anchoring points, the user rests the pen tip on 40 random locations, ensuring that the bottom part of the pen is exposed to the antennas. mTrack steers the antennas with a granularity of 8°. To test phase tracking, the user draws 10 circles and 10 triangles (with a 20 cm perimeter), following printed trajectories at normal handwriting speed. Since deformation of the human hand affects phase tracking, testers hold the middle portion of the pen, while the directional antennas point to the bottom portion. Due to the lack of timing synchronization between the user’s writing trajectory and the tracking estimates, we approximate the tracking error as the minimum projection distance from mTrack’s location estimate to the trajectory.
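This error metric can be sketched as a point-to-polyline distance: project the estimate onto every segment of the printed trajectory and keep the smallest distance. This is a minimal sketch assuming the printed shape is given as a list of vertices in cm; the function name `min_projection_distance` and the triangle are hypothetical, and mTrack’s own implementation is not shown in the excerpt.

```python
import numpy as np

def min_projection_distance(point, trajectory):
    """Smallest distance from `point` to the polyline `trajectory` (K x 2 array of vertices)."""
    p = np.asarray(point, dtype=float)
    traj = np.asarray(trajectory, dtype=float)
    a, b = traj[:-1], traj[1:]                                     # endpoints of each segment
    ab = b - a
    seg_len_sq = np.maximum(np.einsum("ij,ij->i", ab, ab), 1e-12)  # guard zero-length segments
    # Position of the orthogonal projection along each segment, clamped onto the segment
    t = np.clip(np.einsum("ij,ij->i", p - a, ab) / seg_len_sq, 0.0, 1.0)
    closest = a + t[:, None] * ab
    return float(np.min(np.linalg.norm(closest - p, axis=1)))

# Printed triangle (cm) and one hypothetical location estimate just below its base
triangle = [(0, 0), (6, 0), (3, 5), (0, 0)]
err = min_projection_distance((3, -0.4), triangle)
print(round(err, 3))  # 0.4
```

Clamping `t` to [0, 1] matters near corners: without it, the projection could land on the extension of a segment beyond the drawn shape and underestimate the error.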

Types of writing objects. We evaluate APA for 3 writing objects of different reflectivity: a metal-surfaced pen, a plastic marker, and a wooden pencil (Figure 15(b)). Our benchmark measurement shows that, at 40 away from the transmit/receive antenna, the SNRs of the signals reflected by these objects are 12.3 dB, 10.1 dB, and 4.7 dB, respectively. Figure 26 plots the APA error distribution, which shows 90th-percentile errors of 2 cm, 4 cm, and 16 cm, respectively. As expected, an object with strong reflectivity lets APA easily combat noise and thus achieve higher precision. Note that the APA precision for the pen is lower than in the benchmark test in Section 8.1.2, mainly because the presence of the user’s hand creates more uncertainty.

Remarkably, mTrack’s phase-tracking algorithm achieves high accuracy in this trackpad application (Figure 27). The 90th-percentile errors for the pen, marker, and pencil are 8 mm, 11 mm, and 4.8 cm, respectively.

Localization/tracking error across a large region. The distance between the target and the receiver determines the reflected signal strength and hence may affect mTrack’s accuracy. The transmitter and the two receivers are placed at coordinates (100, 100) cm, (50, 100) cm, and (100, 50) cm, respectively. To quantify this location-dependent error, we partition the writing area into 10 cm × 10 cm squares and repeat the previous precision test on each square. Figures 28 and 29 plot the APA and phase-tracking errors across all squares within a 90 cm × 90 cm region.

When the pen is close to both receivers (distance < 60 cm), mTrack achieves high accuracy, with APA/tracking errors of <1.5 cm and <8 mm, respectively. Accuracy starts to degrade when the target moves more than 70 cm away from the receiver, as the SNR drops.

Nonetheless, the tracking error is still within 1.5 cm even when the target is 90 cm from the receiver. The mmWave signal attenuates almost to the noise floor at 100 cm owing to its high path loss.

We expect at least two ways of scaling the writing region: increasing the transmit power and placing more receivers along the x- and y-axes. We leave such exploration for future work.

(9)

Device-Free HealthCare Apps

[NSDI’14]

Smart Homes that Monitor Breathing and Heart Rate

Fadel Adib Hongzi Mao Zachary Kabelac Dina Katabi Robert C. Miller Massachusetts Institute of Technology

32 Vassar Street, Cambridge, MA 02139

{fadel,hongzi,zek,dk,rcm}@mit.edu

ABSTRACT

The evolution of ubiquitous sensing technologies has led to intelligent environments that can monitor and react to our daily activities, such as adapting our heating and cooling systems, responding to our gestures, and monitoring our elderly.

In this paper, we ask whether it is possible for smart environments to monitor our vital signs remotely, without instrumenting our bodies. We introduce Vital-Radio, a wireless sensing technology that monitors breathing and heart rate without body contact. Vital-Radio exploits the fact that wireless signals are affected by motion in the environment, including chest movements due to inhaling and exhaling and skin vibrations due to heartbeats. We describe the operation of Vital-Radio and demonstrate through a user study that it can track users’ breathing and heart rates with a median accuracy of 99%, even when users are 8 meters away from the device, or in a different room. Furthermore, it can monitor the vital signs of multiple people simultaneously. We envision that Vital-Radio can enable smart homes that monitor people’s vital signs without body instrumentation, and actively contribute to their inhabitants’ well-being.

Author Keywords Wireless; Vital Signs; Breathing; Smart Homes; Seeing Through Walls; Well-being

Categories and Subject Descriptors H.5.2. Information Interfaces and Presentation: User Interfaces - Input devices and strategies. C.2.2. Network Architecture and Design: Wireless Communication.

INTRODUCTION

The past few years have witnessed a surge of interest in ubiquitous health monitoring [22, 25]. Today, we see smart homes that continuously monitor temperature and air quality and use this information to improve the comfort of their inhabitants [46, 32]. As health-monitoring technologies advance further, we envision that future smart homes would not only monitor our environment, but also monitor our vital signals, like breathing and heartbeats. They may use this information to enhance our health-awareness, answering questions like

“Do my breathing and heart rates reflect a healthy lifestyle?”

They may also help address some of our concerns by answering questions like “Does my child breathe normally during sleep?” or “Does my elderly parent experience irregular heartbeats?” Furthermore, if non-intrusive in-home continuous monitoring of breathing and heartbeats existed, it would enable healthcare professionals to study how these signals correlate with our stress level and evolve with time and age, which could have a major impact on our healthcare system.

CHI 2015, April 18-23, 2015, Seoul, Republic of Korea

ACM 978-1-4503-3145-6/15/04...$15.00 http://dx.doi.org/10.1145/2702123.2702200

(a) Inhale Motion (b) Exhale Motion. Figure 1: Chest motion changes the signal reflection time. (a) shows that when the person inhales, his chest expands and moves closer to the antenna, decreasing the time it takes the signal to reflect back to the device. (b) shows that when the person exhales, his chest contracts and moves away from the antenna, increasing the distance between the chest and the antenna and hence the reflection time.

Unfortunately, typical technologies for tracking vital signs require body contact, and most of them are intrusive. Specifically, today’s breath monitoring sensors are inconvenient: they require the person to attach a nasal probe [19], wear a chest band [43], or lie on a special mattress [3]. Some heart-rate monitoring technologies are equally cumbersome, since they require users to wear a chest strap [18] or place a pulse oximeter on their finger [21]. The more comfortable technologies, such as wristbands, do not capture breathing and have lower accuracy for heart-rate monitoring [12]. Additionally, for some sections of the population, wearable sensors are undesirable. For example, the elderly typically feel encumbered or ashamed by wearable devices [20, 37], and those with dementia may forget to wear them. Children may remove and lose them, and infants may develop skin irritation from wearable sensors [40].

In this paper, we ask whether it is possible for smart homes to monitor our vital signs remotely, i.e., without requiring any physical contact with our bodies. While past research has investigated the feasibility of sensing breathing and heart rate without direct contact with the body [17, 16, 15, 34, 27, 48, 14], the proposed methods are appropriate for controlled settings but unsuitable for smart homes: they fail in the presence of multiple users or extraneous motion, and they typically require the user to lie still on a bed facing the device. Furthermore, they are accurate only when the device is in close proximity to the user’s chest.

[CHI’15, MobiCom’16]

Fall detection

Breathing and heart-rate monitoring

Emotion detection

Sleep Apnea Diagnosis

Figure 13: The various phone positions used for the results in Fig. 12. We experiment with four different positions along a semicircle centered at the subject.

described in §4, the breathing frequency on the phone is relatively stable compared to the amplitude variations that are necessary for detecting sleep apnea events.

• As the distance increases beyond a meter, the accuracies decrease. This is because the strength of the reflections due to breathing decreases with distance, making the breathing signal noisy. The one-meter range is, however, large enough to enable contactless breath monitoring that is non-intrusive, as demonstrated in our clinical study. It also limits the negative effects that environmental changes farther away than a meter have on ApneaApp’s accuracies.

• The accuracies are unaffected by audible noise in the environment from vehicular and foot traffic on the street. Introducing human conversations in the vicinity of the experiments also does not affect these accuracies. This is because we use a high-pass filter to filter out audible signals below 18 kHz.
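As a rough illustration of the 18 kHz high-pass filtering described above, the sketch below removes everything under 18 kHz with an FFT brick-wall mask. This is a swapped-in technique: the excerpt does not describe ApneaApp’s actual filter design, and the 48 kHz sampling rate, the 1 kHz “speech” tone, the 19 kHz tone standing in for the inaudible sonar signal, and the name `fft_highpass` are assumptions.

```python
import numpy as np

def fft_highpass(x, fs, cutoff_hz):
    """Zero every FFT bin below cutoff_hz (a brick-wall high-pass filter)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[freqs < cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=len(x))

fs = 48_000                                    # assumed microphone sampling rate (Hz)
t = np.arange(fs) / fs                         # one second of audio
speech = np.sin(2 * np.pi * 1_000 * t)         # audible interference at 1 kHz
sonar = 0.1 * np.sin(2 * np.pi * 19_000 * t)   # hypothetical inaudible sonar tone
filtered = fft_highpass(speech + sonar, fs, cutoff_hz=18_000)
print(np.max(np.abs(filtered - sonar)) < 1e-9)  # True
```

Both tones fall exactly on FFT bins here, so the mask removes the audible tone cleanly; real recordings would see spectral leakage, which is one reason practical systems usually prefer a designed IIR/FIR filter.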

Effect of the phone’s orientation. We place the phone 20 cm away and to the left of the subject. We then rotate the phone and compute the accuracies for each phone orientation. As before, for each trial, we perform five experiments for a total of 10 minutes per phone orientation. Fig. 12 plots the results as a function of the phone’s orientation. We observe that the accuracies remain high, demonstrating that ApneaApp does not require a fixed phone orientation during operation.

Effect of the phone’s position. Next, we experiment with the phone at different positions around the subject. Specifically, we place the phone in four different positions (near the head, near the legs, and two positions to the left) along a semicircle of radius 40 cm centered at the subject, as shown in Fig. 13. Fig. 12 shows that the accuracies are high when the phone is in the left positions and slightly lower when it is placed near the head and the legs. This is because in the latter positions, the head and the legs effectively block the chest/abdomen motion. We note, however, that the maximum error is less than 0.13 breaths per minute across all phone positions.

6.2 Effect of Sleeping Position and Blankets

Next, we evaluate the accuracies for different sleeping positions and in the presence of blankets.

Effect of the subject’s sleeping position. We consider four different sleeping positions: supine, prone, lying on the left, and lying on the right. We place the smartphone at a distance of 20 cm to the left of the subjects and measure the breathing rate accuracies. As before, for each sleeping position, we monitor the breathing rate over two-minute chunks for a total of ten minutes.

[Figure 14 plot: breathing frequency error (breaths/min) per sleeping position; accuracy labels: Supine 99.95%, Prone 98.924%, Left 99.932%, Right 99.91%]

Figure 14: Effect of sleeping position. The accuracy is lower when the patient is lying with her face down (prone). In this position, both the signals from the Vernier belt and ApneaApp experience larger variations. We note, however, that in our clinical study we track the chest movements throughout the sleep duration, during which the patient’s sleeping position was not controlled.

[Figure 15 plot: breathing frequency error (breaths/min) vs. blanket thickness (cm); accuracy labels on the four bars: 99.95%, 99.925%, 99.96%, 99.95%]

Figure 15: Effect of blankets. We use four blankets with thicknesses varying from 2-5 cm. The plot shows that the accuracies are high even when blankets separate the subject from the phone.

Fig. 14 shows that the average residual error is below 0.16 breaths per minute across all the sleeping positions. We note that the accuracy is lower when the patient is lying with her face down (prone).

In this position, we noticed that the signals from both the Vernier belt and ApneaApp experience larger variations. However, we note that our clinical study tracks the chest movements throughout the sleep duration, during which the patient’s sleeping position was not controlled.

Effect of blankets. We measure the breathing frequency accuracy for various blanket thicknesses. The subjects are asked to sleep in the supine position, and the phone is placed to the left of the subject at a distance of 40 cm. We use four blankets with thicknesses varying from 2-5 cm. Fig. 15 shows that the accuracies are not noticeably degraded by the use of blankets. This demonstrates that ApneaApp is well suited to the sleep environment, which is further validated by our clinical study, where all the patients used blankets.

6.3 Breathing Signals from Multiple Subjects

As discussed in §3.1, the sonar reflections from multiple subjects arrive at the microphone at different times. Thus, ApneaApp can si-

[MobiSys’15]

(10)

WiSee

Device-free gesture recognition using wireless signals [MobiCom’13]

Qifan Pu, Sidhant Gupta, Shyam Gollakota, Shwetak Patel University of Washington


(11)

Idea: Doppler shift

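• The frequency of a wave changes when its source and the observer move relative to each other

$$f' = \frac{c + v_r}{c}\, f, \qquad \Delta f = f' - f = \frac{v_r}{c}\, f$$

where $f$ is the transmitted frequency, $v_r$ is the velocity of the signal receiver (observer) toward the source, and $c$ is the speed of light. The larger $v_r$, the larger $\Delta f$.

source: https://en.wikipedia.org/wiki/Doppler_effect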

(12)

Doppler Effect Caused by Human Mobility

• When a user moves, Rx observes a Doppler shift even if Rx itself is static

⎻ Why? The length of the reflected path varies over time

• If the user moves at speed v, what is the Doppler shift?

⎻ Δf ≤ (2f/c)·v → Why? The velocity of Rx along the reflected path is at most 2v

Detect the gesture by measuring the Doppler shift at Rx → device-free!
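A quick numerical check of the 2v bound: for a reflector at position p moving with velocity v, the reflected path length is |p − Tx| + |p − Rx|, and its rate of change is the projection of v onto the sum of two unit vectors, whose magnitude is at most 2. The Doppler shift at Rx is then −(f/c) times that rate, so |Δf| ≤ 2fv/c. This is a minimal 2-D sketch; the positions, the 0.5 m Tx-Rx separation, the random sampling, and the function names are assumptions.

```python
import numpy as np

def path_length(p, tx, rx):
    """Length of the Tx -> reflector -> Rx path."""
    return np.linalg.norm(p - tx) + np.linalg.norm(p - rx)

def path_rate(p, v, tx, rx):
    """Rate of change of the reflected path length for a reflector at p with velocity v."""
    u_tx = (p - tx) / np.linalg.norm(p - tx)   # unit vector from Tx to reflector
    u_rx = (p - rx) / np.linalg.norm(p - rx)   # unit vector from Rx to reflector
    return float(np.dot(v, u_tx + u_rx))

tx = np.array([0.0, 0.0])
rx = np.array([0.5, 0.0])          # assumed 0.5 m Tx-Rx separation
rng = np.random.default_rng(0)
speed = 0.5                        # hand speed in m/s
rates = []
for _ in range(1000):
    p = rng.uniform(-3, 3, size=2) + np.array([0.0, 4.0])   # reflector somewhere in front
    theta = rng.uniform(0, 2 * np.pi)
    v = speed * np.array([np.cos(theta), np.sin(theta)])
    rates.append(abs(path_rate(p, v, tx, rx)))
print(max(rates) <= 2 * speed)     # True

# Tx and Rx co-located, hand moving straight toward them: the bound is tight
print(path_rate(np.array([0.0, 1.0]), np.array([0.0, -speed]), tx, tx))  # -1.0
```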

(13)

Is it that Simple?

• Challenge 1

⎻ The velocity of a human gesture is VERY SMALL (e.g., 0.5 m/s)

⎻ This corresponds to a small Doppler shift, e.g., Δf = 2fv/c ≈ 17 Hz when v = 0.5 m/s and f = 5 GHz

• Challenge 2

⎻ WiFi operates in a 20 MHz-wide band → coarse resolution!!

⎻ Each 802.11 OFDM symbol includes 64 subcarriers → bandwidth of each subcarrier = 20×10⁶/64 = 312.5 kHz

[Figure: subcarriers f1, f2, f3, each 312.5 kHz wide; a Δf = 17 Hz shift cannot be observed within a 312.5 kHz band]
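The two numbers on this slide can be reproduced with a few lines of arithmetic. A minimal sketch, assuming c = 3×10⁸ m/s and the 20 MHz, 64-subcarrier figures above; the function names are illustrative.

```python
C = 3e8  # speed of light (m/s), rounded

def reflected_doppler_hz(speed_mps, carrier_hz):
    """Upper bound on the Doppler shift from a reflector moving at speed_mps: 2*f*v/c."""
    return 2 * carrier_hz * speed_mps / C

def subcarrier_spacing_hz(bandwidth_hz=20e6, n_subcarriers=64):
    """Bandwidth of one 802.11 OFDM subcarrier."""
    return bandwidth_hz / n_subcarriers

shift = reflected_doppler_hz(0.5, 5e9)
spacing = subcarrier_spacing_hz()
print(f"Doppler shift: {shift:.1f} Hz")          # Doppler shift: 16.7 Hz
print(f"Subcarrier spacing: {spacing:.1f} Hz")   # Subcarrier spacing: 312500.0 Hz
print(f"Ratio: {spacing / shift:,.0f}")          # Ratio: 18,750
```

The same helper gives about 8.3 Hz and 133.3 Hz for the 0.25 m/s and 4 m/s gesture speeds quoted in the WiSee excerpt later in this lecture, consistent with its 8-134 Hz range of interest.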

(14)

How to Identify a Small Shift Even in Wideband Channels?

Idea: Transform the WiFi signals into narrowband pulses via a large FFT!

[Figure panels: FFT over one symbol | FFT over two identical symbols]

(15)

Large FFT

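• Assume Tx sends two identical symbols, each with N samples

• If Rx performs a 2N-point FFT

For a single symbol, the IFFT at Tx and the N-point FFT at Rx are

$$x_k = \sum_{n=1}^{N} X_n e^{i2\pi kn/N}, \qquad X_n = \sum_{k=1}^{N} x_k e^{-i2\pi kn/N}$$

The 2N-point FFT over both symbols, using $x_{k+N} = x_k$ (the second symbol repeats the first), is

$$X_n = \sum_{k=1}^{N} x_k e^{-i2\pi kn/2N} + \sum_{k=N+1}^{2N} x_k e^{-i2\pi kn/2N} = \sum_{k=1}^{N} x_k e^{-i2\pi kn/2N} + \sum_{k=1}^{N} x_k e^{-i2\pi (k+N)n/2N} = \sum_{k=1}^{N} x_k e^{-i2\pi kn/2N}\left(1 + e^{-i\pi n}\right)$$

Since $1 + e^{-i\pi n}$ equals 2 for even $n$ and 0 for odd $n$, the even and odd sub-channels are

$$X_{2l} = 2\sum_{k=1}^{N} x_k e^{-i2\pi kl/N}, \qquad X_{2l+1} = 0$$

1. The bandwidth of each subcarrier is halved!

2. In theory, the odd subcarriers must be 0. So if Rx receives a pulse in the odd subcarriers → Doppler effect!!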
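The even/odd property can be checked numerically. A minimal NumPy sketch, assuming N = 64 and random QPSK data on every subcarrier (both illustrative choices); `np.fft.fft` uses the same e^{-i2πkn/N} sign convention as the FFT above, with indices starting at 0 instead of 1, which does not change which bins are empty. The Doppler-like offset of one tenth of a subcarrier and the helper name `odd_bin_energy` are hypothetical.

```python
import numpy as np

N = 64
rng = np.random.default_rng(1)
# One OFDM symbol: QPSK values on N subcarriers, turned into time samples by an IFFT
X = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)
symbol = np.fft.ifft(X)
two_symbols = np.concatenate([symbol, symbol])   # Tx repeats the same symbol

def odd_bin_energy(samples):
    """Fraction of the 2N-point FFT energy that lands in the odd bins."""
    spectrum = np.fft.fft(samples)
    return np.sum(np.abs(spectrum[1::2]) ** 2) / np.sum(np.abs(spectrum) ** 2)

print(odd_bin_energy(two_symbols))   # tiny: floating-point round-off only

# A small frequency offset (a fraction of a subcarrier) applied across the 2N samples
n = np.arange(2 * N)
shift_in_subcarriers = 0.1
shifted = two_symbols * np.exp(2j * np.pi * shift_in_subcarriers * n / N)
print(odd_bin_energy(shifted) > 0.01)   # True
```

Without motion, the even bins equal exactly twice the transmitted subcarrier values and the odd bins are empty; even a one-tenth-subcarrier offset leaks a clearly measurable share of the energy into the odd bins, which is the cue the slide describes.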

(16)

How Large an FFT Is Required?

• 2N-point FFT → halve the bandwidth

⎻ Each subcarrier is (20/64)/2 MHz = 156.25 kHz

• MN-point FFT → reduce the bandwidth by M times

⎻ Each subcarrier is (20/64)/M MHz

To get a resolution of 10 Hz, we need (20/64)×10⁶/M = 10 → M = 31,250
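The sizing rule can be written as a small calculation. A minimal sketch assuming the 20 MHz, 64-subcarrier numbers from the slides; the function names are illustrative.

```python
import math

def subcarrier_width_hz(M, bandwidth_hz=20e6, n_subcarriers=64):
    """Width of each sub-channel after an (M*N)-point FFT over M repeated symbols."""
    return bandwidth_hz / n_subcarriers / M

def symbols_needed(target_resolution_hz, bandwidth_hz=20e6, n_subcarriers=64):
    """Smallest M whose sub-channel width is at most target_resolution_hz."""
    return math.ceil(bandwidth_hz / n_subcarriers / target_resolution_hz)

print(subcarrier_width_hz(2))   # 156250.0
print(symbols_needed(10))       # 31250
```

Equivalently, frequency resolution is the reciprocal of the observation time: each 64-sample symbol lasts 64 / 20 MHz = 3.2 µs (without the cyclic prefix), so 31,250 symbols span 0.1 s, and 1 / 0.1 s = 10 Hz.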

(17)

Capturing Movement via Large FFT

FFT over 31,250 symbols → 10Hz per subcarrier

[Figure panels: Without movement | With movement (Doppler shift marked); amplitude vs. OFDM sub-channels (×10⁴)]

(18)

Capturing over Time

Frequency-time Doppler profile of an example gesture (push)

[Figure: frequency (Hz, −30 to 30) vs. time (s, 0-8.75); color scale 8-40 dB]

Figure 5: Frequency-time Doppler profile of an example gesture. The user moves her hand towards the receiver.

to be a specific kind of discontinuity between the OFDM symbols. Thus, we can perform interpolation between the OFDM symbols as described earlier. We note, however, that since all the CPs have a fixed length, such an interpolation is equivalent to resampling the OFDM symbols at a constant rate given by $\frac{\text{Symbol length} + \text{CP length}}{\text{Symbol length}}$, where Symbol length and CP length denote the lengths of the OFDM symbol and the CP, respectively. Since such resampling of the symbols does not change the Doppler pattern, in practice we simply skip the CPs to reduce computation.

3.2 Mapping Doppler Shifts to Gestures

So far, we have described how to transform the wideband 802.11 transmissions into a narrowband signal at the receiver. In this section, we show how to extract the Doppler information and map it to gestures. Specifically, we describe the following three steps: (1) Doppler extraction, which computes the Doppler shifts from the narrowband signals; (2) Segmentation, which identifies a set of segments that correspond to a gesture; and (3) Classification, which determines the most likely gesture among a set of gestures. We describe how WiSee performs each of these steps. We focus on the single-user case; in §3.3, we extend our design to work in the presence of other users.

(1) Doppler Extraction: WiSee extracts the Doppler information by computing the frequency-time Doppler profile of the narrowband signal. To do this, the receiver computes a sequence of FFTs over time. Specifically, it computes an FFT over the samples in the first half-second interval. Such an FFT gives a Doppler resolution of 2 Hz (the reciprocal of the 0.5 s window). The receiver then moves forward by 5 ms and computes another FFT over the next, overlapping half-second interval. It repeats this process to obtain a frequency-time profile.
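The sliding-FFT procedure can be sketched with NumPy as below. The narrowband sampling rate (1 kHz), the synthetic signal (a strong DC component for the Tx-Rx paths that miss the user plus a weaker reflection shifted by +20 Hz), and the function name `doppler_profile` are assumptions; the 0.5 s window and 5 ms hop follow the text, giving 2 Hz bins and one profile column every 5 ms.

```python
import numpy as np

def doppler_profile(x, fs, window_s=0.5, hop_s=0.005):
    """Sliding-window FFT power in dB: rows are frequency bins, columns are time steps."""
    win = int(round(window_s * fs))
    hop = int(round(hop_s * fs))
    starts = range(0, len(x) - win + 1, hop)
    spectra = [np.fft.fftshift(np.fft.fft(x[s:s + win])) for s in starts]
    freqs = np.fft.fftshift(np.fft.fftfreq(win, d=1.0 / fs))
    power_db = 20 * np.log10(np.abs(np.array(spectra).T) + 1e-12)
    return freqs, power_db

fs = 1000                                      # hypothetical narrowband sample rate (Hz)
t = np.arange(2 * fs) / fs                     # two seconds of samples
static = np.ones_like(t, dtype=complex)        # DC: Tx-Rx paths that do not involve the user
hand = 0.3 * np.exp(2j * np.pi * 20 * t)       # reflection off a hand with a +20 Hz Doppler shift
freqs, profile = doppler_profile(static + hand, fs)
print(profile.shape, freqs[1] - freqs[0])      # (500, 301) 2.0
```

Plotting `profile` against time and `freqs` would give an image like Fig. 5; a receiver that only cares about gesture speeds could keep just the rows whose frequencies fall in the band of interest.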

Fig. 5 plots the frequency-time Doppler profile (in dB) of a user moving her hand towards the receiver. The plot shows that, at the beginning of the gesture most of the energy is concentrated in the DC (zero) frequency. This corresponds to the signal energy between the transmitter and the receiver, on paths that do not include the human. However, as the user starts moving her hand towards the receiver, we first see increasing positive Doppler frequencies (corresponding to hand acceleration) and then decreasing positive Doppler frequencies (corresponding to hand deceleration).

We note that the WiSee receiver is only interested in the Doppler shifts produced by human gestures. Since the speeds at which a human can typically perform gestures are between 0.25 m/sec and 4 m/sec [12], the Doppler shift of interest at 5 GHz is between 8 Hz and 134 Hz. Thus, the WiSee receiver reduces its computational complexity by analyzing the FFT output corresponding to only these frequencies.

(2) Segmentation: To do this, WiSee leverages the structure of the Doppler profiles shown in Fig. 6. These correspond to the gestures in Fig. 1. The plots show that the profiles are a combination of positive and negative Doppler shifts. Further, each gesture comprises a set of segments that have positive and negative Doppler shifts. For example, the profile in Fig. 6(a) has just one segment with a positive Doppler shift, whereas Fig. 6(b) has two segments, each of which has a positive and a negative Doppler shift. Further, within each segment, the Doppler energy first increases and then decreases (which corresponds to the acceleration and deceleration of human body parts).

[Figure 6 panels (a)-(i): frequency-time Doppler profiles, one per gesture]

Figure 6: Frequency-time Doppler profiles of the gestures in Fig. 1. WiSee segments the profiles into sequences of positive and negative Doppler shifts, which uniquely identify each gesture.

A WiSee receiver leverages these properties to first find segments and then cluster segments into a gesture. Our process of finding segments is intuitively similar to packet detection in wireless communication systems. In communication, to detect the beginning of a packet, the receiver computes the average received energy over a short duration. If the ratio between this energy and the noise level is greater than a threshold, the receiver detects the beginning of a packet. Similarly, if this ratio falls below a threshold, the receiver detects the end of the packet. Likewise, in our system, the energy in each segment first increases and then decreases. So the WiSee receiver computes the average energy in the positive and negative Doppler frequencies (other than the DC and the four frequency bins around it). If the ratio between this average energy and the noise level is greater than 3 dB, the receiver detects the beginning of a segment. When this ratio falls below 3 dB, the receiver detects the end of the segment.³ To cluster segments into a single gesture, WiSee’s receiver uses a simple algorithm: if two segments are separated by less than one second, they are clustered into a single gesture.
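The segment detection and clustering rules can be sketched as follows. This is a minimal illustration, not WiSee’s code: it assumes the per-time-step ratio (in dB) between the average non-DC Doppler energy and the calibrated noise level has already been computed, and the 5 ms hop (taken from the Doppler-extraction step), the function names, and the synthetic ratio trace are assumptions.

```python
import numpy as np

def find_segments(ratio_db, threshold_db=3.0):
    """Return (start, end) index pairs where the energy-to-noise ratio exceeds threshold_db."""
    segments, start = [], None
    for i, above in enumerate(np.asarray(ratio_db) > threshold_db):
        if above and start is None:
            start = i                        # ratio rose above threshold: segment begins
        elif not above and start is not None:
            segments.append((start, i))      # ratio fell below threshold: segment ends
            start = None
    if start is not None:                    # segment still open at the end of the trace
        segments.append((start, len(ratio_db)))
    return segments

def cluster_segments(segments, hop_s, max_gap_s=1.0):
    """Group consecutive segments separated by less than max_gap_s into one gesture."""
    gestures = []
    for seg in segments:
        if gestures and (seg[0] - gestures[-1][-1][1]) * hop_s < max_gap_s:
            gestures[-1].append(seg)
        else:
            gestures.append([seg])
    return gestures

hop_s = 0.005                # one profile column every 5 ms
ratio = np.zeros(1200)       # synthetic ratio trace in dB
ratio[100:200] = 6           # segment 1
ratio[280:380] = 6           # segment 2, starts 0.4 s after segment 1 ends
ratio[1000:1100] = 6         # segment 3, starts 3.1 s after segment 2 ends
segs = find_segments(ratio)
print(segs)                                 # [(100, 200), (280, 380), (1000, 1100)]
print(len(cluster_segments(segs, hop_s)))   # 2
```

The first two segments merge into one gesture because they are only 0.4 s apart, while the third starts a new gesture.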

(3) Gesture Classification: As described earlier, the Doppler profiles in Fig. 6 can be considered as sequences of positive and negative Doppler shifts. Further, the plots show that the patterns are unique and different across the nine gestures. Thus, the receiver can classify gestures by matching the pattern of positive and negative Doppler shifts. Specifically, there are three types of segments: segments with only positive Doppler shifts, segments with only

³ The noise level is calibrated at the receiver by computing the energy in the non-DC frequencies in the absence of gestures.


(19)

Detection by Classification

Different gestures correspond to various frequency-time Doppler profiles

19

time (second)

frequency (Hz)

1.25 2.5 3.75 5 6.25 7.5 8.75

30 20 10 0

−10

−20

−30

8 16 24 32 40dB

Figure 5—Frequency-time Doppler profile of an ex- ample gesture. The user moves her hand towards the re- ceiver.

…to be a specific kind of discontinuity between the OFDM symbols. Thus, we can perform interpolation between the OFDM symbols as described earlier. We note, however, that since all the CPs have a fixed length, such an interpolation is equivalent to resampling the OFDM symbols at a constant rate given by $\frac{\text{Symbol length} + \text{CP length}}{\text{Symbol length}}$, where Symbol length and CP length denote the lengths of the OFDM symbol and the CP, respectively. Since such resampling of the symbols does not change the Doppler pattern, in practice we simply skip the CPs to reduce the computation.
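Skipping the CPs, and the equivalent constant resampling rate it stands in for, can be sketched as follows. This assumes a time-domain stream in which every OFDM symbol is a CP followed by the symbol body; the helper names are hypothetical, and the 64-sample symbol with a 16-sample CP (the 802.11a/g values at 20 MHz) is used only for illustration.

```python
import numpy as np

def resampling_rate(sym_len, cp_len):
    """(Symbol length + CP length) / Symbol length."""
    return (sym_len + cp_len) / sym_len

def skip_cyclic_prefixes(samples, sym_len, cp_len):
    """Drop the CP in front of every OFDM symbol and concatenate the symbol bodies."""
    block = sym_len + cp_len
    n_symbols = len(samples) // block            # ignore a trailing partial symbol
    frames = np.asarray(samples[:n_symbols * block]).reshape(n_symbols, block)
    return frames[:, cp_len:].reshape(-1)        # keep only the symbol bodies

# Illustrative sizes: 64-sample symbols with a 16-sample CP
print(resampling_rate(64, 16))  # 1.25
```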

3.2 Mapping Doppler Shifts to Gestures

So far, we have described how to transform the wideband 802.11 transmissions into a narrowband signal at the receiver. In this section, we show how to extract the Doppler information and map it to the gestures. Specifically, we describe the following three steps: (1) Doppler extraction, which computes the Doppler shifts from the narrowband signals; (2) segmentation, which identifies a set of segments that correspond to a gesture; and (3) classification, which determines the most likely gesture amongst a set of gestures. We describe how WiSee performs each of these steps. We focus on the single-user case; in §3.3, we extend our design to work in the presence of other users.

(1) Doppler Extraction: WiSee extracts the Doppler information by computing the frequency-time Doppler profile of the narrowband signal. To do this, the receiver computes a sequence of FFTs taken over time. Specifically, it computes an FFT over samples in the first half-second interval. Such an FFT gives a Doppler resolution of 2 Hz (the reciprocal of the 0.5 s window). The receiver then moves forward by a 5 ms interval and computes another FFT over the next overlapping half-second interval. It repeats this process to get a frequency-time profile.
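The sliding-FFT Doppler extraction can be sketched with NumPy as below. The 0.5 s window and 5 ms hop come from the text; the input sample rate `fs`, the conversion to power in dB, and the name `doppler_profile` are assumptions. Because each FFT spans 0.5 s, adjacent frequency bins are 1/0.5 = 2 Hz apart, matching the resolution quoted above.

```python
import numpy as np

def doppler_profile(x, fs, window_s=0.5, hop_s=0.005):
    """Sliding-window FFT of a narrowband complex baseband signal.

    Returns (freqs, profile_db): bin frequencies in Hz (DC in the middle) and
    one row of power in dB per 5 ms hop.
    """
    win = int(round(window_s * fs))
    hop = int(round(hop_s * fs))
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        # FFT over the current half-second window, shifted so negative shifts come first
        spectrum = np.fft.fftshift(np.fft.fft(x[start:start + win]))
        frames.append(np.abs(spectrum) ** 2)
    freqs = np.fft.fftshift(np.fft.fftfreq(win, d=1.0 / fs))
    profile_db = 10 * np.log10(np.array(frames) + 1e-12)   # small floor avoids log(0)
    return freqs, profile_db
```

With a 2 s test tone at +20 Hz sampled at 1 kHz, every row peaks in the +20 Hz bin, and there are 301 rows (one per 5 ms step that still fits a full window).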

Fig. 5 plots the frequency-time Doppler profile (in dB) of a user moving her hand towards the receiver. The plot shows that, at the beginning of the gesture most of the energy is concentrated in the DC (zero) frequency. This corresponds to the signal energy between the transmitter and the receiver, on paths that do not include the human. However, as the user starts moving her hand towards the receiver, we first see increasing positive Doppler frequencies (corresponding to hand acceleration) and then decreasing positive Doppler frequencies (corresponding to hand deceleration).

We note that the WiSee receiver is only interested in the Doppler shifts produced by human gestures. Since the speeds at which a human can typically perform gestures are between 0.25 m/sec and 4 m/sec [12], the Doppler shift of interest at 5 GHz is between 8 Hz and 134 Hz. Thus, the WiSee receiver reduces its computational complexity by analyzing the FFT output corresponding to only these frequencies.
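The 8 Hz to 134 Hz band can be checked with the standard approximation for the Doppler shift of a reflection off a moving body, $f_d = 2vf/c$, which is assumed here rather than stated in this section. The helper names are hypothetical. With $f$ = 5 GHz the sketch gives about 8.3 Hz and 133.3 Hz; the small difference from the quoted 134 Hz presumably comes from rounding or the exact channel frequency.

```python
import numpy as np

SPEED_OF_LIGHT = 3e8  # m/s

def doppler_shift_hz(speed_mps, carrier_hz=5e9):
    """Doppler shift of a reflection off a body moving at speed_mps: 2 v f / c."""
    return 2.0 * speed_mps * carrier_hz / SPEED_OF_LIGHT

def gesture_band(freqs, low_hz, high_hz):
    """Boolean mask selecting positive and negative Doppler bins within [low_hz, high_hz]."""
    magnitude = np.abs(freqs)
    return (magnitude >= low_hz) & (magnitude <= high_hz)

low = doppler_shift_hz(0.25)   # about 8.3 Hz
high = doppler_shift_hz(4.0)   # about 133.3 Hz
```

Applying `gesture_band` to the FFT bin frequencies lets the receiver ignore everything outside the band, which is the complexity reduction described above.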

(2) Segmentation: To identify the segments that make up a gesture, WiSee leverages the structure of the Doppler profiles shown in Fig. 6, which correspond to the gestures in Fig. 1. The plots show that the profiles are a combination of positive and negative Doppler shifts. Further, each gesture comprises a set of segments that have positive and negative Doppler shifts. For example, the profile in Fig. 6(a) has just one segment with a positive Doppler shift. However, Fig. 6(b) has two segments, each of which has a positive and a negative Doppler shift. Further, within each segment, the Doppler energy first increases and then decreases (which corresponds to the acceleration and deceleration of human body parts).

[Figure 6 plots: panels (a) to (i), one frequency-time Doppler profile per gesture]

Figure 6: Frequency-time Doppler profiles of the gestures in Fig. 1. WiSee segments the profiles into sequences of positive and negative Doppler shifts, which uniquely identify each gesture.


(20)

Classification

• Partition signals into segments

• Represent the moving pattern as a sequence of positive/negative Doppler Effects

Doppler effect | Value
Positive | 1
Negative | -1
Both positive/negative | 2

Compare the received sequence with the set of pre-defined sequences
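The segment encoding in the table and the comparison step can be sketched as a lookup. This assumes each segment has already been labeled with whether it contains positive and/or negative Doppler energy, and that a gesture is recognized only on an exact sequence match. The template names are hypothetical; `gesture_a` and `gesture_b` follow the descriptions of Fig. 6(a) (one positive segment) and Fig. 6(b) (two segments, each with both shifts).

```python
def segment_code(has_positive, has_negative):
    """Encode one segment: positive only -> 1, negative only -> -1, both -> 2."""
    if has_positive and has_negative:
        return 2
    if has_positive:
        return 1
    if has_negative:
        return -1
    raise ValueError("a segment must contain some Doppler energy")

# Hypothetical templates; a real system would hold one sequence per supported gesture.
TEMPLATES = {
    "gesture_a": (1,),     # one segment with only a positive shift, as in Fig. 6(a)
    "gesture_b": (2, 2),   # two segments, each with positive and negative shifts, as in Fig. 6(b)
}

def classify(segments, templates=TEMPLATES):
    """segments: list of (has_positive, has_negative) flags, one pair per segment."""
    observed = tuple(segment_code(pos, neg) for pos, neg in segments)
    for name, pattern in templates.items():
        if observed == tuple(pattern):
            return name
    return None   # no pre-defined sequence matches
```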

(21)

Practical Issue

• The Tx does not send identical symbols over time

• Solution: Decode and re-encode

⎻ Decode the data symbol as usual

⎻ Re-encode the frequency-domain symbols as if every symbol carried $X_1$:

$Y_1 = H_1 X_1$

$Y_2 = H_2 X_2 \rightarrow Y_2' = Y_2 (X_1 / X_2) \approx H_2 X_1$

…

$Y_M = H_M X_M \rightarrow Y_M' = Y_M (X_1 / X_M) \approx H_M X_1$

⎻ Convert each re-encoded symbol back to the time domain: $y'(m) = \mathrm{IFFT}(Y'_m)$

⎻ Perform a large FFT over $y'(1), \dots, y'(M)$

…
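The decode-and-re-encode steps can be sketched with NumPy as below. This minimal sketch assumes the decoded data symbols `X` are error-free and that `Y` and `X` are arrays with one row per OFDM symbol and one column per subcarrier; the function names are hypothetical. Without noise, the re-encoded symbols equal $H_m X_1$ exactly; with noise, the relation is only approximate, as on the slide.

```python
import numpy as np

def reencode(Y, X):
    """Y'_m = Y_m * (X_1 / X_m), so that Y'_m ≈ H_m X_1 for every symbol m.

    Y: (M, K) received frequency-domain symbols; X: (M, K) decoded data symbols.
    """
    Y = np.asarray(Y, dtype=complex)
    X = np.asarray(X, dtype=complex)
    return Y * (X[0] / X)                     # X_1 / X_m, broadcast over all M symbols

def large_fft(Y_prime):
    """IFFT each re-encoded symbol, concatenate y'(1)..y'(M), then take one long FFT."""
    y_prime = np.fft.ifft(Y_prime, axis=1)    # y'(m) = IFFT(Y'_m), row by row
    return np.fft.fft(y_prime.reshape(-1))    # rows are concatenated in symbol order
```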

(22)

Performance – Accuracy

• Confusion matrix

Figure 10: Confusion matrix for gestures in the home scenario. The figure shows that the average detection and classification accuracy is 94% across the nine gestures. In contrast, random guesses have an accuracy of 11.1%. This shows that WiSee can extract rich information about gestures from wireless signals.

The occupants move about and have meetings and lunches in the office as usual. We believe that, given the higher density of people, the office room is a worse scenario compared to our two-bedroom apartment. We note that there are other scenarios in which WiSee can be evaluated, which are, however, not in the scope of this paper.

Fig. 11 plots the number of false detection events per hour as a function of time. The figure shows results for different numbers of repetitions in the preamble. The plot shows that when the receiver uses a preamble with only one repetition (i.e., the gesture is performed once), the number of false events is, on average, 15.62 per hour. This rate is low, which is expected because typical human gestures do not frequently result in a positive Doppler shift followed by a negative Doppler shift. For example, in our experiments, walking caused a continuous monotone Doppler shift that was not confused with alternating positive and negative Doppler shifts. Also, as the number of repetitions in the preamble increases, the false detection rate drops significantly. Specifically, with three repetitions, the average false detection rate falls to 0.13 events per hour; with more than four repetitions, the false detection rate is zero. This is expected because typical human motion is unlikely to produce a repetitive pattern of positive and negative Doppler shifts. Further, since the WiSee receiver requires the repetitive positive and negative Doppler shifts to occur within a particular range of speeds (0.25 m/s to 4 m/s), even typical environmental and mechanical variations are unlikely to produce them.

Classifying the target human gestures in the presence of other humans: As described in §3.3, WiSee computes the MIMO channel for the target user that minimizes the interference from the other humans. We would like to evaluate the use of MIMO in classifying a target user’s gestures, in the presence of other moving humans. We run experiments in a 13 feet by 19 feet room with our WiSee receiver and transmitter. We have the target user perform the two gestures in Fig. 1(a) and Fig. 1(b). Our experiments have up to four interfering users in random locations in the room. The users were asked to perform arbitrary gestures using their arms.

[Figure 11 plot: #false positives per hour vs. time of day, one series each for one to five preamble repetitions]

Figure 11: False Detection Rate from a 24-Hour Trace. The figure plots the false detection rate in an office room with 12 people over a 24-hour period on a weekday.

[Figure 12 plot: detection + classification accuracy (%) vs. number of interfering users (1 to 4), one series each for one to five receive antennas]

Figure 12: WiSee in the presence of other interfering users. The figure plots the detection and classification accuracy of the target user in the presence of other users in a 13 × 19 sq. feet room. The plots show that, given a fixed number of antennas, the accuracy decreases as the number of interfering users increases. However, with three interfering users, the accuracy is still as high as 90% with a five-antenna receiver.

Fig. 12 plots the average recognition accuracy of the target user's gestures as a function of the number of interfering users. The figure shows results for different numbers of antennas at the WiSee receiver. The plots show that, using a five-antenna receiver, the accuracy is as high as 90% with three interfering users in the room. Further, using additional antennas significantly improves this accuracy in the presence of multiple interfering users. We note, however, that for a fixed number of transmitters and receive antennas, the classification accuracy degrades as the number of users grows (e.g., in a conference room or at a party). For example, in our experiments, the accuracy is less than 60% with four interfering users. However, since typical home scenarios do not have a large number of users in a room, WiSee can enable a significant set of interaction applications for always-available computing embedded in the environment.

Stress-testing WiSee: Since WiSee leverages MIMO to cancel the signal from the interfering human, it suffers from the near-far problem that is typical of interference cancellation systems. Specifically, reflections from an interfering user closer to the receiver can have much higher power than those of the target user. To evaluate WiSee's classification accuracy in this scenario, we run the following experiment:

We fix the location of the target user six feet away from the WiSee receiver. We then change the interfering user's location between three feet and ten feet from the receiver. The target user performs the two gestures shown in Fig. 1(a) and …

Accuracy: 0.88 to 1

(23)

Performance – False Detection

False detection can be almost eliminated if the subject repeats the preamble (pre-defined gesture) several times


(24)

Concluding Remarks

• First device-free, wireless-based gesture recognition system

• Leverage the Doppler Effect to detect gestures

• Improve the resolution using large FFT

• How to detect multiple persons?

⎻ Use multiple antennas

• Limitation: only a finite set of gestures can be detected

⎻ The Doppler shift patterns of different gestures must be distinguishable

(25)

EchoTag

Infrastructure-free indoor localization tagging [MobiCom’15]

Yu-Chih Tung and Kang Shin

University of Michigan, Ann Arbor

(26)

What is Location Tagging?


(27)

What is Location Tagging?

(28)

What is Location Tagging?

Locate the position using acoustic signals!

HOW?

(29)

Existing Solutions

• Infrastructure free

• Infrastructure-based

(30)

Existing Solutions

• Infrastructure free

⎻ SurroundSense [Mobisys’09] room-level

⎻ Batphone [Mobisys’11] room-level

⎻ RoomSense [AH’11] 300cm

⎻ Horus [Mobisys’05] 200cm

⎻ Geo [Mobisys’11] 100cm

⎻ FM [Mobisys’12] 30cm

• Infrastructure-based

⎻ Luxapose [Mobisys’14] 10cm

⎻ Cricket [Mobicom’00] 10cm

⎻ Guoguo [Mobisys’13] 6-25cm

Infrastructure-free: not accurate

Infrastructure-based: hard to deploy

(31)

EchoTag

• Active acoustic sensing

• Fine sensing resolution based on built-in sensors (microphone and speaker)

• Low cost and easy deployment

(32)

How to Use EchoTag?

(a) Outline contour (b) Sense with sound (c) Select app (d) Replay tag

(33)

EchoTag

1. Active acoustic sensing

2. Classification and optimization

(34)

Sound Fingerprint

[Figure: frequency responses illustrating the effects that shape a location's sound fingerprint]

(a) Hardware imperfection (b) Surface absorption

(c) Multipath fading by reflections from surfaces and nearby objects

(35)

Sound Fingerprint – Example

[Figure: example fingerprints showing normalized response (0 to 1) vs. frequency (11000 to 22000 Hz) and time (sec) for the channels labeled R and L]
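A location's fingerprint can be sketched as the normalized magnitude spectrum of the recorded sensing sound within the band plotted above (11 kHz to 22 kHz). This is an illustrative sketch, not EchoTag's feature extraction: the 44.1 kHz sample rate, the single FFT over the whole recording, the normalization by the peak value, and the name `acoustic_fingerprint` are all assumptions.

```python
import numpy as np

def acoustic_fingerprint(recording, fs=44100, band=(11000.0, 22000.0)):
    """Return (freqs, feature): the in-band magnitude spectrum scaled to a peak of 1."""
    spectrum = np.abs(np.fft.rfft(recording))              # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(len(recording), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])      # keep only the sensing band
    feature = spectrum[in_band]
    return freqs[in_band], feature / feature.max()         # normalize to the 0..1 scale
```

Normalizing each fingerprint to the same peak makes fingerprints from different recordings comparable even if the overall recording level differs, though the hardware, absorption, and multipath effects listed above still shape the relative response across frequencies.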

(36)

Volume Control

Similar to the linearity problem in WiFi

(37)

Classification

• Support Vector Machine (SVM)

⎻ One-against-all multi-class SVM

⎻ NoTag Classifier
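The one-against-all scheme trains one binary SVM per tag and picks the tag with the largest decision value. The sketch below uses a linear SVM fitted by plain subgradient descent on the hinge loss so that it needs only NumPy. It is not EchoTag's classifier (its kernel, features, and parameters are not given here), all names and hyperparameters are hypothetical, and the NoTag classifier is not modeled.

```python
import numpy as np

def train_binary_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Linear SVM (labels +1/-1) fitted by full-batch subgradient descent on the hinge loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        violated = y * (X @ w + b) < 1                # samples inside the margin
        yv = y[violated][:, None]
        grad_w = lam * w - (yv * X[violated]).sum(axis=0) / n
        grad_b = -y[violated].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_one_vs_all(X, labels):
    """One binary SVM per tag: that tag's samples are +1, all other samples are -1."""
    return {tag: train_binary_svm(X, np.where(labels == tag, 1.0, -1.0))
            for tag in np.unique(labels)}

def predict(models, x):
    """Pick the tag whose classifier gives the largest decision value."""
    return max(models, key=lambda tag: float(x @ models[tag][0] + models[tag][1]))
```

In practice a library SVM (for example, with a nonlinear kernel) would likely replace the hand-written trainer; the one-against-all structure stays the same.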

(38)

Sensing Optimization

• Acoustic sensing is triggered selectively

⎻ Save energy and reduce annoyance

⎻ Based on WiFi beacons and tilt

Trigger EchoTag

(39)

FCC

Electronic Frog Eye: Counting Crowd Using WiFi [INFOCOM’14]

Wei Xi, Jizhong Zhao, Xiang-Yang Li, Kun Zhao, Shaojie Tang, Xue Liu, Zhiping Jiang

Xi’an Jiaotong University, Tsinghua University, Illinois Institute of Technology, Temple University, McGill University

(40)

People Counting

• Application

⎻ Crowd control, marketing research, etc.

• Existing solutions

⎻ Camera-based: line-of-sight limitation, lighting requirement, vulnerable to object overlap, privacy concern

⎻ Device-based (RFID tags, sensors, mobile phones): not scalable, high deployment cost

http://www.axis.com/dk/en/solutions-by-application/people-counting

(41)

Device-free RF-based Counting

• RSS-based

⎻ Leverage attenuation models to localize users

⎻ Poor performance in a multipath-rich environment

• PHY-based

⎻ Exploit raw physical-layer information

⎻ Need special hardware, such as USRP

• CSI-based

⎻ Use fine-grained channel state information (attenuation and phase information of OFDM subcarriers)

⎻ Can be captured by commodity NICs

(42)

Key Idea:

# of People vs. CSI Variance

More mobile users → higher CSI variation

(43)

Why?

• Each user can be regarded as a virtual antenna, which reflects the signal toward the Rx

[Figure: signals from the Tx reflected toward the Rx by users acting as virtual antennas]

IEEE INFOCOM 2014 - IEEE Conference on Computer Communications

$Y = Y_{\text{static}} + Y_{\text{from\_user}} = HX$

$H = H_{\text{static}} + H_{\text{from\_user}} = H_{\text{static}} + \sum_{u=1}^{N} H_u \;\Rightarrow\; N\uparrow \iff \mathrm{Var}(H)\uparrow$
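The virtual-antenna argument can be illustrated with a toy simulation of $H = H_{\text{static}} + \sum_{u=1}^{N} H_u$. Purely for illustration, it assumes that each moving user adds a reflection of fixed gain whose phase is uniformly random from sample to sample; under that assumption each user adds about `user_gain**2` to Var(H). The function name and parameters are hypothetical.

```python
import numpy as np

def simulate_csi(n_users, n_samples=10_000, h_static=1.0 + 0.5j, user_gain=0.2, seed=0):
    """Toy CSI time series: a static path plus one randomly phased reflection per user."""
    rng = np.random.default_rng(seed)
    # each moving user contributes a reflection whose phase changes over time
    phases = rng.uniform(0.0, 2 * np.pi, size=(n_users, n_samples))
    h_users = user_gain * np.exp(1j * phases)
    return h_static + h_users.sum(axis=0)   # H = H_static + sum_u H_u
```

For example, `np.var(simulate_csi(n))` (NumPy computes the mean squared magnitude of the deviation for complex input) grows roughly linearly with n, at about 0.04 per user with the default gain, while the static path alone contributes no variance.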

(44)

Challenge

• Why is it difficult?

⎻ Should be resistant to environmental changes

⎻ But sensitive to human motion

• Need to learn “short-term” CSI variance

⎻ Long-term average variance is of little use when the crowd size changes frequently

• Problem: How to get short-term variance when the sample size is small?


(45)

PEM

• Percentage of non-zero elements in the dilated CSI matrix

$k = \frac{|h_{ij}| - h_{\min}}{h_{\max} - h_{\min}} \cdot M$

Normalize |h_ij| over the range (h_min, h_max) and scale it to a row index k between 0 and M

(46)

PEM

• Percentage of non-zero elements in the dilated CSI matrix

$k = \frac{|h_{ij}| - h_{\min}}{h_{\max} - h_{\min}} \cdot M$

Normalize |h_ij| over the range (h_min, h_max) and scale it to a row index k between 0 and M

→ set M[k][j] = 1

(47)

PEM

• Count the percentage of non-zero elements

[Figure: two CSI scenarios and their dilated matrices; the higher-variance CSI leaves more elements set to 1]

Var(H)↑ ⟺ #(1)↑
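The PEM steps can be sketched as follows. Several details are assumptions not fixed by the slides: rows of `csi_amplitude` are CSI samples i and columns are subcarriers j; k is obtained by flooring and clipping to the M available rows; and `h_min`/`h_max` are fixed calibration bounds rather than per-window extremes (per-window bounds would stretch even a tiny fluctuation over all rows). The function name is hypothetical.

```python
import numpy as np

def pem(csi_amplitude, n_rows, h_min, h_max):
    """Fraction of non-zero elements in the dilated CSI matrix.

    csi_amplitude: (n_samples, n_subcarriers) array of |h_ij|.
    n_rows: number of amplitude levels M (rows of the dilated matrix).
    """
    amp = np.asarray(csi_amplitude, dtype=float)
    # k = (|h_ij| - h_min) / (h_max - h_min) * M, floored and clipped to a valid row
    k = np.floor((amp - h_min) / (h_max - h_min) * n_rows).astype(int)
    k = np.clip(k, 0, n_rows - 1)
    dilated = np.zeros((n_rows, amp.shape[1]), dtype=np.uint8)
    j = np.broadcast_to(np.arange(amp.shape[1]), amp.shape)
    dilated[k, j] = 1                      # set M[k][j] = 1
    return dilated.mean()                  # share of non-zero elements, as a fraction
```

A perfectly steady channel marks one row per column (PEM = 1/M), whereas amplitudes spread over the whole range mark every row (PEM = 1), matching Var(H)↑ ⟺ #(1)↑.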

(48)