Comp2Watch - Snap2Read/Comp2Watch: Enhancing the Multimedia Browsing Experience on Handheld Device Platforms

Chapter 1 Introduction

1.2 Comp2Watch

Smart phones have made significant progress, enabling many tasks that used to be performed only on computers, and they have already changed the way we live. More and more people now watch videos on mobile devices, and their number has grown rapidly in recent years. The latest report from the Nielsen Company [17] shows that the number of Americans watching mobile video grew by more than 40% from 2009 Q4 to 2010 Q4, ending the year at nearly 25 million people. Not only has the audience grown; the average time that users spend watching videos on mobile phones also grew by nearly 20% over the same period. By the end of 2010 Q4, people spent on average 4 hours 20 minutes per month watching mobile videos.

We mainly focus on smart phones instead of pad-like computers because they fit in a pocket and are therefore always at hand. However, to the best of our knowledge, there are at least three gaps in mobile video watching:

1. Fragmented watching time: Users will not watch mobile videos if a computer or television is at hand. The most common situation is that users have only a small chunk of time (e.g., while waiting for a bus, standing in line at a store, or commuting on the subway). In these situations, users may not have enough time to watch a complete video, and once playback is paused, it is not easy to get back to the point where they left off.

Figure 2. Illustration of baseline (left) and proposed method (right). Applying an existing video summarization technique directly to a mobile interface, which has very limited space, may cause problems. Imagine putting a huge collage of a video onto a smart phone: at a low zoom level the frames are unclear, while at a high zoom level the images occupy a lot of space.

2. Slow/unstable network: Although online video streaming sites (e.g., YouTube, TED) usually have a buffering mechanism so that videos can start playing immediately instead of users having to download the whole clip first, users cannot get a quick glimpse of the main idea of the content/story until the video ends or has been played for a while. This is costly in terms of both time and network bandwidth.

3. The size of the mobile display: The screen resolution of commercial smart phones (e.g., Sony Ericsson XPERIA X12: 854*480, HTC Desire HD: 800*480) has grown close to that of PC LCD screens (e.g., XGA standard: 1024*768, WXGA standard: 1280*800). However, when it comes to physical screen size, smart phones (e.g., Apple iPhone4: 3.5 inch, HTC Desire HD: 4.3 inch) are far behind PC LCD screens (usually more than 20 inch). This comparison means that smart phones have to fit relatively large content into a tiny space. Detailed illustrations are shown in Figure 3.

Figure 3. Illustration of screen resolution comparison (top) and screen size comparison (right). Mobile screens are in green, and LCD screens are in blue.

In addition, it is known that performing actions such as pressing virtual buttons on a touch screen is somewhat difficult [15]. It is not surprising that interacting with such screens (e.g., selecting, dragging, or clicking) can be very challenging, especially when the content is so large that it must be scaled down.

To handle the first two gaps, we use a collage image composed of selected frames. The temporal order of shots is preserved so that the collage provides random access through a finger tap on the touch screen, which is much more convenient than dragging the timeline beneath the video. Also, downloading a single image instead of the whole video significantly reduces the network overhead, and a glance at the image lets users grasp the story of the video or quickly filter out videos they are not interested in.

The advantages above come from traditional video summarization techniques. However, the most important issue is the small screen of such mobile platforms (see Figure 2). To bridge this last gap, our intuition is based on the ROI (region of interest) in each image frame. Figure 4 shows examples of extracted ROI bounding boxes; we have observed that cropping the ROIs from the frames can save space without losing much context information in average cases. The user study results presented later support this observation, too.

The key contributions of Comp2Watch are:

1. To the best of our knowledge, Comp2Watch represents one of the first attempts to enable video summarization on mobile devices, thus offering a new experience of mobile video watching.

2. We observed several gaps in mobile video watching and use ROI extraction to deal with the most challenging one, thus enabling templates with non-fixed ratios and making the resulting collage more compact.

3. We propose several measurements for Compactness and evaluate them in quantitative experiments. In the user study, we evaluate both Clearness and Context Loss. These experiments show promising results for the proposed system.

Figure 4. The ROIs in video frames in talk videos (left) and movie trailers (right). White rectangles indicate ROI boundaries.

Chapter 2 Mobile Magazine Reading Enhancement – Snap2Read

The purpose of our system is to segment the document image into homogeneous blocks of maximum size (Section 2.1 - Page Segmentation), to classify them into a set of predefined categories (Section 2.2 - Zone Classification), and finally to render them to fit different screen resolutions (Section 2.3 - Mobile Adaptation). The experimental results are shown in Section 2.4.

2.1 Page Segmentation

Previous approaches usually focus on a single language or on specific types of documents. However, we cannot assume that the input for reading on mobile devices is a particular type of magazine, so, unlike traditional page segmentation methods, we do not fix any parameters related to character font size, line spacing, or layout structure.

Like other morphological methods, our method is a bottom-up approach [5]. The main idea is to group small connected components into larger regions by dilation. Unlike those methods, however, we dilate iteratively and automatically select the appropriate dilation kernel.

2.1.1 Pre-Processing

The purpose of the pre-processing step is to filter out noise, dividing lines, and blocks that are possibly images. These tend to be merged with other components during dilation, so we must make sure that the main components (mostly text) are not affected.

The detailed steps are: (1) For efficiency, resize the image to a suitable size (i.e., about 900*650). (2) Apply global threshold binarization. (3) Apply connected component analysis to the foreground (i.e., white) pixels. (4) For each component, if its height-to-width ratio is very high (i.e., 20 times), remove it from the original image. If the component is big enough (i.e., 1.25%) and its

(a) (b) (c) (d)

Figure 5. An illustration of the pre-processing steps: (a) Original image. (b) Binarization. (c) Connected component analysis. Components are in different colors. (d) Noise removal and image block pre-extraction.

(i.e., 0.1), it will be considered to be “possibly image block” and be extracted in

advance. The examples are illustrated in Figure 5.
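Steps (3) and (4) can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in this work: the function names, the 4-connectivity, and the `(x, y, width, height)` box layout are assumptions, and only the height-to-width test (ratio of 20) is taken from the text. The second criterion for "possibly image" blocks is omitted.

```python
from collections import deque

def connected_components(binary):
    """Label 4-connected foreground (1) pixels and return their bounding boxes.

    Each box is (x, y, width, height). 4-connectivity is an assumption.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((left, top, right - left + 1, bottom - top + 1))
    return boxes

def drop_dividing_lines(boxes, max_ratio=20):
    """Keep only components whose height-to-width ratio is below max_ratio."""
    return [b for b in boxes if b[3] / b[2] < max_ratio]
```

On a real page the same idea would normally be applied with an image processing library; the pure-Python version above only makes the filtering rule explicit.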

Note that in step (2), we also tried an edge detector and a local threshold binarization method, both of which are widely used in document image processing [5]. However, although the former captures the salient edges, it also turns complex image regions into many edge fragments. The latter is mostly used on scanned document images that contain only text, to make them robust to lighting changes, but it has a similar fragmenting effect on image regions. As a result, we use global threshold binarization to preserve image blocks.

2.1.2 Adaptive Dilation Threshold

In this work, we enlarge small components by a dilation operation with a square kernel to group them together. The square kernel enlarges a component both vertically and horizontally, but determining the kernel size is vital and nontrivial.

The main font size may change from one magazine to another, or from one language to another. Thus, a fixed dilation kernel size may not work in every case, and we need to determine it dynamically.

During the iterative dilation process, we find that the number of components drops rapidly at a certain kernel size, and the most suitable size is at the turning point (e.g., the red arrow in Figure 6). The physical meaning of this phenomenon is that a large number of characters and words are merged into lines or paragraphs at the same time.

We then plot the number of components against the kernel size (e.g., a blue line in Figure 6), and the turning point can be found by applying an approximate second-order derivative. To ensure that almost all components are sufficiently merged, we also add a constraint that the number of components should not be less than a certain number (i.e., 30).

Figure 6. The relationship between the number of components and the kernel size for 20 pages (blue lines). The red arrow indicates the turning point, which is the suitable kernel size.

(a) (b) (c) (d)

Figure 7. Dilation for merging small blocks and removing redundant blocks. (a) An example intermediate image from the pre-processing step, in which the title and footnote are split into several fragments. (b) Remove the large components and dilate again. (c) Some inside or non-informative blocks remain. (d) The final output, which does not contain those redundant blocks.
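The turning-point selection can be illustrated with a discrete second-order difference. The sketch below is one possible reading, not the exact rule used in this work: it assumes the turning point is the kernel size with the largest second difference of the component count (where the curve flattens after the rapid drop), and it skips sizes that leave fewer than 30 components. The function name and the sample counts are hypothetical.

```python
def pick_kernel_size(kernel_sizes, component_counts, min_components=30):
    """Return the kernel size at the turning point of the component-count curve.

    The turning point is taken (by assumption) as the interior sample with the
    largest second-order difference, restricted to samples that still contain
    at least min_components components. Returns None if no sample qualifies.
    """
    best_size, best_curvature = None, float("-inf")
    for i in range(1, len(component_counts) - 1):
        if component_counts[i] < min_components:
            continue
        # Discrete approximation of the second derivative at sample i.
        curvature = (component_counts[i + 1]
                     - 2 * component_counts[i]
                     + component_counts[i - 1])
        if curvature > best_curvature:
            best_size, best_curvature = kernel_sizes[i], curvature
    return best_size

# Hypothetical counts: a sharp drop between kernel sizes 5 and 7.
sizes = [1, 3, 5, 7, 9, 11, 13]
counts = [500, 480, 460, 120, 100, 95, 90]
chosen = pick_kernel_size(sizes, counts)
```

With these made-up counts the curve flattens right after the drop, so the selected kernel size is 7.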

In the last step, minimum enclosing rectangles are extracted from the components that are large enough (i.e., at least 0.3% of the whole page area). Some of the remaining components are not merged well enough (over-segmentation) because their spacing is larger than that of the main text (e.g., titles and footnotes). Therefore, we dilate the remaining small components again with a kernel 1.5 times as large (cf. Figure 7). Blocks inside other blocks and small blocks (i.e., less than 0.1% of the whole page area) are also removed (cf. Figure 7).
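The final removal of inside and small blocks could look like the following minimal sketch. The `(x, y, width, height)` rectangle layout and the function names are assumptions made for illustration; only the 0.1% area threshold comes from the text.

```python
def contains(outer, inner):
    """True if rectangle inner lies entirely within rectangle outer."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def remove_redundant_blocks(blocks, page_w, page_h, min_fraction=0.001):
    """Drop blocks smaller than min_fraction of the page and blocks inside others."""
    page_area = page_w * page_h
    big_enough = [b for b in blocks if b[2] * b[3] >= min_fraction * page_area]
    return [b for b in big_enough
            if not any(o != b and contains(o, b) for o in big_enough)]
```

For a 900*650 page, the threshold is 585 pixels of area, so a 20*20 fragment is dropped while a 50*50 block is kept unless it lies inside a larger block.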

2.2 Zone Classification

The rectangles (blocks) produced by the segmentation step above will be rearranged into meaningful “patches” according to the screen resolution of the device during the subsequent adaptation step. Images and text blocks are adapted differently: images cannot be further segmented, but text blocks can be split if they are too large to fit the screen. Therefore, it is necessary to distinguish image blocks from text blocks (cf. Figure 1).

2.2.1 Features for Classification

We use the early fusion scheme, in which multimodal features are concatenated into one long vector, to combine features so that we can learn a classifier by SVM (Support Vector Machine) [6] for each label (e.g., text, image). The three features we use are the spatial feature, GCM (Grid Color Moment), and PHOG (Pyramid of Histograms of Orientation Gradients) [7]. Their detailed descriptions are as follows.

The spatial feature contains the coordinates and size of a specified region (i.e., x, y, width and height).

For the color feature (GCM), we adopt the first-order moment (mean) and the second-order moment (variance). The image is partitioned into several (i.e., 8*8) sub-blocks, and the mean and variance are calculated for each block. As a result, the GCM feature is a vector with 8 * 8 (blocks) * 3 (color planes) * 2 (moments) = 384 dimensions.
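The GCM computation can be sketched as follows. This is an illustrative sketch rather than the code used in this work: it assumes the image is given as rows of `(r, g, b)` tuples, that its height and width are divisible by the grid size, and that the population variance is used.

```python
def grid_color_moment(image, grid=8):
    """Return the mean and variance of each color plane in each grid cell.

    image: list of rows, each row a list of (r, g, b) tuples.
    Assumes height and width are multiples of grid.
    The result has grid * grid * 3 * 2 values (384 for grid=8).
    """
    height, width = len(image), len(image[0])
    cell_h, cell_w = height // grid, width // grid
    feature = []
    for gy in range(grid):
        for gx in range(grid):
            for channel in range(3):
                values = [image[y][x][channel]
                          for y in range(gy * cell_h, (gy + 1) * cell_h)
                          for x in range(gx * cell_w, (gx + 1) * cell_w)]
                mean = sum(values) / len(values)
                variance = sum((v - mean) ** 2 for v in values) / len(values)
                feature.extend([mean, variance])
    return feature

# A uniform 16*16 test image: every cell has the same mean and zero variance.
flat = [[(10, 20, 30)] * 16 for _ in range(16)]
gcm = grid_color_moment(flat)
```

The length of `gcm` is 384, matching the dimension count above.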

The shape feature (PHOG descriptor [7]) represents the “local shape” and the “spatial layout” of the image. To calculate PHOG, edge contours are first extracted by the Canny edge detector, and the image is divided into $4^l$ sub-blocks at level $l$. The HOG of each grid cell at each pyramid resolution level is then calculated. In this paper, we use levels up to 2 (i.e., $l = 0, 1, 2$) and 8 bins for HOG. Thus, by concatenating the HOGs of the different resolution levels, PHOG can be formulated as a vector with $(4^0 + 4^1 + 4^2) \times 8 = 168$ dimensions.

The dimensions of the three features are 4, 384 and 168, respectively. The concatenated feature vector therefore has 556 dimensions, and each dimension is scaled into [-1, +1].
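The dimension arithmetic and the scaling step can be checked with a short sketch. The per-dimension min-max scaling shown here is an assumption (the text only states the target range), as is the choice to map a constant dimension to 0; all names are illustrative.

```python
def phog_dimensions(max_level=2, bins=8):
    """Number of PHOG dimensions: 4**l cells per level, `bins` values per cell."""
    return sum(4 ** level for level in range(max_level + 1)) * bins

SPATIAL_DIM = 4              # x, y, width, height
GCM_DIM = 8 * 8 * 3 * 2      # blocks * color planes * moments
TOTAL_DIM = SPATIAL_DIM + GCM_DIM + phog_dimensions()

def scale_columns(vectors, low=-1.0, high=1.0):
    """Min-max scale every dimension across all vectors into [low, high]."""
    scaled_columns = []
    for column in zip(*vectors):
        lo, hi = min(column), max(column)
        if hi == lo:
            # Constant dimension: mapped to 0 by assumption.
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append(
                [low + (high - low) * (v - lo) / (hi - lo) for v in column])
    return [list(row) for row in zip(*scaled_columns)]
```

In practice the minimum and maximum of each dimension would be taken from the training set and reused for prediction.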

2.2.2 Model Selection

We use the RBF (Radial Basis Function) kernel for classification, so there are two main parameters to be determined, g and c (i.e., gamma and cost, respectively). To select a suitable model for prediction, we apply 5-fold cross validation on a total of 1430 page segments and obtain an average accuracy of around 0.95.

2.3 Mobile Adaptation

Although the segmented blocks are composed of homogeneous components (cf. Figure 9 and Figure 10), they cannot be read directly: we extract them as large as possible for each region type, so they may not fit the screen resolution, and we must adjust the blocks into readable patches.

As mentioned above, only text blocks may need to be further split (e.g., the wide text block in Figure 1(E)). Generally speaking, English articles tend to stretch in the vertical direction, while Chinese articles tend to expand horizontally. Thus, we adopt a heuristic approach (taking English as an example): we scale each block along its width, split it into several patches according to its height, and pad the patches whose height is insufficient to prevent distortion (cf. Figure 8).
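The English-language heuristic above can be sketched as follows. This is a simplified illustration under stated assumptions: the block is scaled so that its width equals the screen width, it is cut at multiples of the screen height, and the last patch is padded at the bottom. The function name and the dictionary keys are hypothetical.

```python
import math

def split_into_patches(block_w, block_h, screen_w, screen_h):
    """Scale a text block to the screen width and cut it into screen-high patches.

    Returns one dict per patch with the vertical offset in the scaled block,
    the height of real content, and the padding needed to fill the screen.
    """
    scale = screen_w / block_w          # same factor on both axes, no distortion
    scaled_h = block_h * scale
    count = max(1, math.ceil(scaled_h / screen_h))
    patches = []
    for i in range(count):
        top = i * screen_h
        content_h = min(screen_h, scaled_h - top)
        patches.append({"top": top,
                        "content_height": content_h,
                        "padding": screen_h - content_h})
    return patches

# A 240*1000 block on an 800*480 portrait screen (480 wide, 800 high).
patches = split_into_patches(240, 1000, 480, 800)
```

Here the block is scaled by a factor of 2 to a height of 2000, giving three patches, the last of which holds 400 pixels of content and 400 pixels of padding.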

As for the reading sequence, images have higher priority than text, and then we rank the blocks from the upper left corner to the lower right corner (for Chinese magazines, from the upper right corner to the lower left corner).
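One way to express this ordering is a composite sort key. The sketch below is only an assumption about how the corner-to-corner ranking could be realized: blocks are compared by type first, then by their top coordinate, then by their left coordinate (reversed for Chinese magazines). The block representation and the `language` parameter are hypothetical.

```python
def reading_order(blocks, language="english"):
    """Sort blocks into a reading sequence: images first, then by position.

    blocks: dicts with keys 'kind' ('image' or 'text'), 'x', and 'y'.
    Row-major ordering (y before x) is an assumption of this sketch.
    """
    def key(block):
        kind_rank = 0 if block["kind"] == "image" else 1
        # Left-to-right for English, right-to-left for Chinese.
        x_rank = block["x"] if language == "english" else -block["x"]
        return (kind_rank, block["y"], x_rank)
    return sorted(blocks, key=key)
```

A layout with multiple columns would likely need a column-aware key instead of plain row-major sorting.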

We also provide a transparent overview window at the upper right corner of the mobile interface to indicate the current location on the whole page, so users do not have to zoom in and out repeatedly to obtain the geometric information.

2.4 Experimental Results

This section describes our dataset and how we label the ground truth, and it also

shows the evaluation of our work.

(a) (b)

Figure 8. An illustration of adaptation. (a) Original segmented blocks. (b) The adapted (scaled, padded) patches. The reading sequence is 1, 2, 3a, 3b, 4a, 4b, etc., and it guides users so that they can read the page conveniently through clicks without losing track of the page context.

2.4.1 Magazine Dataset

To the best of our knowledge, there is no public dataset for page segmentation evaluation. Previous page segmentation research usually uses private datasets that depend either on the target document genre (e.g., newspaper, journal) or on a specific language. Although the ICDAR Page Segmentation Competition has in recent years created its own dataset with a rich variety of sources, only those who participate in the competition can gain access to it. Furthermore, it does not include documents in Asian languages (e.g., Chinese, Japanese, Korean), which do not have clear bounding boxes for each word, while we expect our system to work well on both types of languages. Thus, we created our own dataset.

We selected four popular magazines: “Common Wealth” and “Business Weekly” for Chinese, and “Business Week” and “Science” for English. For each magazine, we manually filtered out advertisement pages and selected 30 scanned pages; the details are listed in Table 1.

To collect the ground truth, we use an editor named GEDI [8], a highly configurable document image annotation tool. It reads an image file and, when annotation is done, produces a corresponding XML file in GEDI format.

2.4.2 Page Segmentation and Zone Classification Performance

The evaluation metrics of previous methods are usually computed pixel-wise and are aimed at OCR, whereas our output is rectangle-based and is aimed at locating reading patches. As a result, a direct comparison between the two is not meaningful. Instead, we adopt one of the most widely used metrics, from ICDAR 2005 [9], and use its intuition to illustrate our performance. We have annotated three types of entities (i.e., categories): text, image and footnote. For each entity, the EDM (Entity Detection Metric) is calculated.

Magazine          Language   # of pages   Scanned resolution
Common Wealth     Chinese    30           1184*1573
Business Weekly   Chinese    30           963*1280
Business Week     English    30           944*1260
Science           English    30           944*1203

Table 1. Magazine dataset for segmentation and classification experiments

First, we evaluate how much a ground truth zone and a result zone overlap by keeping a global matrix MatchScore, which is defined by the function

( )

result i, and T(s): a function that counts the elements of set s.

Second, three types of matches are defined (i.e., one-to-one, one-to-many and many-to-one) according to their MatchScore. If the MatchScore of ground truth zone j and result zone i is higher than the accept threshold (i.e., 0.6), it is a one-to-one match (see Figure 11 for more explanation).

If there are K ground truth zones j_k (k = 1, 2, …, K) overlapping the same result zone i, and each of their MatchScores is between the reject threshold and the accept threshold (i.e., 0.1 < MatchScore(i, j_k) < 0.6, k = 1, 2, …, K), but their sum is higher than the accept threshold, it is a many-to-one match; a one-to-many match is defined symmetrically, with one ground truth zone overlapping several result zones.
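The thresholding logic for a single result zone can be sketched as below. The MatchScore formula itself is not legible in this text, so the sketch substitutes the intersection-over-union of two rectangles as a stand-in; that choice, the `(x, y, width, height)` layout, and the function names are all assumptions. Only the thresholds 0.6 and 0.1 come from the text.

```python
def match_score(ground_truth, result):
    """Stand-in overlap score: intersection over union of two rectangles."""
    gx, gy, gw, gh = ground_truth
    rx, ry, rw, rh = result
    inter_w = min(gx + gw, rx + rw) - max(gx, rx)
    inter_h = min(gy + gh, ry + rh) - max(gy, ry)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    return intersection / (gw * gh + rw * rh - intersection)

def classify_result_zone(result, ground_truths, accept=0.6, reject=0.1):
    """Classify one result zone against all ground truth zones."""
    scores = [match_score(g, result) for g in ground_truths]
    if any(s > accept for s in scores):
        return "one-to-one"
    # Partial overlaps: each between the reject and accept thresholds.
    partial = [s for s in scores if reject < s < accept]
    if partial and sum(partial) > accept:
        return "many-to-one"
    return "no match"
```

For example, a result zone that exactly covers two ground truth zones of half its size scores 0.5 against each; neither is accepted alone, but their sum of 1.0 exceeds 0.6, so the zone counts as a many-to-one match. The one-to-many case follows by swapping the roles of result and ground truth zones.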

For simplicity, the acceptable matched number for each entity is defined as MatchNumber_t = w₁*one-to-one + w₂*one-to-many + w₃*many-to-one, where w₁ = 1 and w₂ = w₃ = 0.75 as a penalty for partial matches. DetectRate_t and RecognAccuracy_t for entity t are then defined as DetectRate_t = MatchNumber_t/N_t and RecognAccuracy_t = MatchNumber_t/M_t, where N_t is the number of ground truth regions for the t-th entity and M_t is the number of result regions for the t-th entity. DetectRate_t and RecognAccuracy_t represent the acceptable ratio among all ground truth zones and among all result zones for the t-th entity, respectively. The Entity Detection Metric score for each entity (text, image, footnote) is then defined as

$$\mathrm{EDM}_t = \frac{2 \cdot \mathrm{DetectRate}_t \cdot \mathrm{RecognAccuracy}_t}{\mathrm{DetectRate}_t + \mathrm{RecognAccuracy}_t} \qquad (2)$$
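As a quick check of these definitions, the following sketch computes MatchNumber and EDM. The function names are illustrative, and returning 0 when both rates are 0 is an assumption added to avoid division by zero.

```python
def match_number(one_to_one, one_to_many, many_to_one, w1=1.0, w2=0.75, w3=0.75):
    """Weighted count of acceptable matches; partial matches are penalized."""
    return w1 * one_to_one + w2 * one_to_many + w3 * many_to_one

def edm(detect_rate, recogn_accuracy):
    """Entity Detection Metric: harmonic mean of the two rates."""
    if detect_rate + recogn_accuracy == 0:
        return 0.0  # assumption: define EDM as 0 when both rates are 0
    return 2 * detect_rate * recogn_accuracy / (detect_rate + recogn_accuracy)

# Text category of Common Wealth in Table 2: DetectRate 0.80, RecognAccuracy 0.86.
common_wealth_text = round(edm(0.80, 0.86), 2)
```

The rounded value is 0.83, which agrees with the EDM entry reported for that category in Table 2.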

The overall page segmentation performance is promising; see the breakdown in Table 2. Sample page segmentation results are shown in Figure 9 and Figure 10. We also found the results satisfactory when rendered as reading patches on two Android phones with different resolutions.

(a) (b) (c)

Figure 9. Chinese magazine results. (The blue, red and green bounding boxes indicate text, image and footnote, respectively.) (a) A Common Wealth example (b) A Business Weekly example (c) An over-segmentation example which divides a flow chart into text blocks.

(a) (b) (c)

Figure 10. English magazine results. (a) A Business Week example (b) A Science example (c) An over-segmentation example resulting from figures with unclear bounding boxes.

Business Weekly shows better performance than the others (cf. Table 2(a)) because its layout is less complicated and its text blocks are mostly large and rectangular. Business Week has lower performance on the image category (cf. Table 2(b)) because it has many figures and tables combined with text explanations inside their bounding boxes. The footnote category usually has lower performance because some footnotes are removed during the redundant rectangle removal step; however, they are regarded as non-informative blocks, and our system does not guide users to

Figure 11. An illustration of accept threshold selection. The ground truth blocks are marked in magenta and the result blocks in blue. Low MatchScores mostly come from small blocks (about 0.8 for logos and 0.6 for footnotes), because a trifling difference (less than 10 pixels) between the two region boundaries can amount to a large percentage of the area. Thus, smaller blocks tend to have lower MatchScores. However, this does not impede the reading process of users. As a result, we set the accept threshold at 0.6.

                  Common Wealth        Business Weekly (a)  Business Week        Science
                  T      I      F      T      I      F      T      I      F      T      I      F
N_t               213    44     55     203    49     52     195    91     84     245    77     122
M_t               199    53     42     209    54     40     251    134    67     250    81     79
DetectRate_t      0.80   0.93   0.49   0.91   0.91   0.62   0.89   0.76   0.50   0.89   0.81   0.52
RecognAccuracy_t  0.86   0.77   0.64   0.89   0.83   0.80   0.70   0.51   0.63   0.88   0.77   0.81
EDM_t             0.83   0.84   0.56   0.90   0.87   0.70   0.78   0.61(b) 0.56  0.88   0.78   0.64

Table 2. Page segmentation results (T: text, I: image, F: footnote)

Chapter 3 Mobile Video Watching Enhancement – Comp2Watch

3.1 Related Work

We have surveyed several kinds of work related to our mobile video summarization, including automatic collage generation, video summarization that assumes unlimited space, and mobile photo summarization.

Uchihashi, et al. [1] were the first to propose a comic-like layout for video summarization. Their key contributions are maintaining the temporal order and allowing the frame size to vary with the importance of a shot, and we use their work as our baseline. Although we follow a similar process for video summarization, we not only port it to the mobile environment but also analyze in detail what changes there. In Chapter 1, we described three gaps in mobile video watching. The first two gaps do not exist in PC environments and thus strongly motivate video summarization on mobile devices; the last gap is the main obstacle to it, and we try to resolve this problem by introducing ROI extraction.

For collage generation, Rother, et al. [11], Lee, et al. [12] and Goferman, et al. [13]
