Experimental Results - Mobile Magazine Reading Enhancement

Chapter 2 Mobile Magazine Reading Enhancement – Snap2Read

2.4 Experimental Results

This section describes our dataset, explains how we label the ground truth, and presents the evaluation of our work.

(a) (b)

Figure 8. An illustration of adaptation. (a) Original segmented blocks (b) The adapted (scaled, padded) patches. The reading sequence (1, 2, 3a, 3b, 4a, 4b, …) guides users so that they can read the page conveniently through clicks without losing track of page context.

2.4.1 Magazine Dataset

To the best of our knowledge, there is no public dataset for page segmentation evaluation. Previous page segmentation studies usually use their own private datasets, chosen according to the target document genre (e.g., newspaper, journal) or a specific language. In recent years the ICDAR Page Segmentation Competition has built a dataset with a rich variety of sources, but only competition participants can access it. Furthermore, it does not include documents in Asian languages (e.g., Chinese, Japanese, Korean), which do not have clear bounding boxes for each word, whereas we expect our system to work well on both types of languages. We therefore create our own dataset.

We selected four popular magazines: two in Chinese, “Common Wealth” and “Business Weekly”, and two in English, “Business Week” and “Science”. For each magazine, we manually filter out advertisement pages and select 30 scanned pages; the details are listed in Table 1.

To collect the ground truth, we use GEDI [8], a highly configurable document image annotation tool. It reads an image file and, when annotation is done, produces a corresponding XML file in GEDI format.

2.4.2 Page Segmentation and Zone Classification Performance

The evaluation metrics of previous methods are usually computed pixel-wise, which suits OCR. Our output, however, is rectangle-based and is aimed at locating reading patches, so a direct comparison between the two does not make sense. Although we do not compare them directly, we adopt one of the most widely used metrics, from ICDAR 2005 [9], and use its intuition to illustrate our performance.

We have annotated three types of entities (i.e., categories): text, image and footnote. For

each entity, the EDM (Entity Detection Metric) is calculated.

First, we evaluate how much a ground truth zone and a result zone overlap by keeping a global matrix MatchScore. Each entry MatchScore(i, j) measures the overlap between ground truth zone j and result zone i and is computed with T(s), a function that counts the elements of set s.

Magazine          Language   # of pages   Scanned resolution
Common Wealth     Chinese    30           1184*1573
Business Weekly   Chinese    30           963*1280
Business Week     English    30           944*1260
Science           English    30           944*1203

Table 1. Magazine dataset for segmentation and classification experiments

Second, three types of matches are defined (i.e., one-to-one, one-to-many and

many-to-one) according to their MatchScore: If the MatchScore of ground truth zone j

and result zone i is higher than the accept threshold (i.e., 0.6), then it is a one-to-one

match. (See Figure 11 for more explanation)

If K ground truth zones j_k (k = 1, 2, …, K) overlap the same result zone i, each MatchScore lies between the reject threshold and the accept threshold (i.e., 0.1 < MatchScore(i, j_k) < 0.6 for k = 1, 2, …, K), and their sum is higher than the accept threshold, then it is a many-to-one match. A one-to-many match is defined in the same way with the roles of ground truth zones and result zones exchanged.
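The matching rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zones are axis-aligned rectangles `(x0, y0, x1, y1)`, and MatchScore is taken to be intersection area divided by union area, which stands in for the element-counting definition based on T(s). The names `match_score` and `classify_result_zone` are hypothetical.

```python
ACCEPT, REJECT = 0.6, 0.1  # accept and reject thresholds quoted in the text

def area(r):
    """Area of a rectangle (x0, y0, x1, y1); zero if the rectangle is empty."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

def match_score(gt, res):
    """Assumed overlap measure: intersection area divided by union area."""
    inter = (max(gt[0], res[0]), max(gt[1], res[1]),
             min(gt[2], res[2]), min(gt[3], res[3]))
    i = area(inter)
    union = area(gt) + area(res) - i
    return i / union if union else 0.0

def classify_result_zone(res, gts):
    """Return 'one-to-one', 'many-to-one' or None for a single result zone."""
    scores = [match_score(g, res) for g in gts]
    if any(s > ACCEPT for s in scores):
        return "one-to-one"
    # Partial overlaps: each score lies strictly between the two thresholds.
    partial = [s for s in scores if REJECT < s < ACCEPT]
    if len(partial) >= 2 and sum(partial) > ACCEPT:
        return "many-to-one"
    return None
```

For example, a result zone that covers two ground truth zones stacked on top of each other scores 0.5 against each of them, so neither is accepted alone but together they form a many-to-one match. The one-to-many case follows by swapping the arguments.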

For simplicity, the acceptable matched number for each entity t is defined as MatchNumber_t = w₁ · one-to-one + w₂ · one-to-many + w₃ · many-to-one, where w₁ = 1 and w₂ = w₃ = 0.75 as a partial-match penalty. DetectRate_t and RecognAccuracy_t for entity t are then defined as DetectRate_t = MatchNumber_t / N_t and RecognAccuracy_t = MatchNumber_t / M_t, where N_t is the number of ground truth regions and M_t is the number of result regions for the t-th entity. DetectRate_t and RecognAccuracy_t represent the acceptable ratio among all ground truth zones and among all result zones for the t-th entity, respectively. The Entity Detection Metric score for each entity (text, image, footnote) is then defined as

$$\mathrm{EDM}_t = \frac{2 \cdot \mathrm{DetectRate}_t \cdot \mathrm{RecognAccuracy}_t}{\mathrm{DetectRate}_t + \mathrm{RecognAccuracy}_t} \tag{2}$$
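As a small worked illustration of the EDM definition, the sketch below computes the three quantities from match counts. The function names and the example counts are hypothetical; only the weights (1, 0.75, 0.75) and the formulas come from the text.

```python
def harmonic(detect_rate, recogn_accuracy):
    """EDM: harmonic mean of DetectRate and RecognAccuracy."""
    denom = detect_rate + recogn_accuracy
    return 2 * detect_rate * recogn_accuracy / denom if denom else 0.0

def edm(one_to_one, one_to_many, many_to_one, n_gt, n_res,
        weights=(1.0, 0.75, 0.75)):
    """Return (DetectRate, RecognAccuracy, EDM) for one entity type.

    n_gt is N_t (ground truth regions) and n_res is M_t (result regions).
    """
    w1, w2, w3 = weights
    match_number = w1 * one_to_one + w2 * one_to_many + w3 * many_to_one
    detect_rate = match_number / n_gt
    recogn_accuracy = match_number / n_res
    return detect_rate, recogn_accuracy, harmonic(detect_rate, recogn_accuracy)
```

Plugging the Common Wealth text figures from Table 2 (DetectRate 0.80, RecognAccuracy 0.86) into `harmonic` gives about 0.83, which agrees with the EDM reported there.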

The overall page segmentation performance is promising: EDM ranges from 0.78 to 0.90 for text, from 0.61 to 0.87 for images, and from 0.56 to 0.70 for footnotes (see the breakdown in Table 2). Sample segmentation results are shown in Figure 9 and Figure 10. We also found the results satisfactory when rendered as reading patches on two Android phones with different resolutions.

(a) (b) (c)

Figure 9. Chinese magazine results. (The blue, red and green bounding boxes indicate text, image and footnote, respectively.) (a) A Common Wealth example (b) A Business Weekly example (c) An over-segmentation example which divides a flow chart into text blocks.

(a) (b) (c)

Figure 10. English magazine results. (a) A Business Week example (b) A Science example (c) An over-segmentation example resulting from figures with unclear bounding boxes.

Business Weekly performs better than the others (cf. Table 2(a)) because its layout is less complicated and its text blocks are mostly large and rectangular. Business Week performs worse on the image category (cf. Table 2(b)) because many of its figures and tables are combined with explanatory text inside their bounding boxes. The footnote category usually performs worse because some footnotes are removed during the redundant rectangle removal step; since they are regarded as non-informative blocks, our system does not guide users to them.

Figure 11. An illustration of accept threshold selection. The ground truth blocks are marked in magenta and the result blocks in blue. Low MatchScore values mostly come from small blocks (about 0.8 for logos and 0.6 for footnotes), because a trifling difference (less than 10 pixels) between the two region boundaries can cause a large change in the percentage-of-area measure. Smaller blocks therefore tend to have a lower MatchScore. However, this does not impede the reading process of users, so we set the accept threshold at 0.6.

                   Common Wealth       Business Weekly (a)   Business Week           Science
                   T     I     F       T     I     F         T     I         F       T     I     F
N_t                213   44    55      203   49    52        195   91        84      245   77    122
M_t                199   53    42      209   54    40        251   134       67      250   81    79
DetectRate_t       0.80  0.93  0.49    0.91  0.91  0.62      0.89  0.76      0.50    0.89  0.81  0.52
RecognAccuracy_t   0.86  0.77  0.64    0.89  0.83  0.80      0.70  0.51      0.63    0.88  0.77  0.81
EDM_t              0.83  0.84  0.56    0.90  0.87  0.70      0.78  0.61 (b)  0.56    0.88  0.78  0.64

Table 2. Page segmentation results (T: text, I: image, F: footnote)

Chapter 3 Mobile Video Watching Enhancement – Comp2Watch

3.1 Related Work

We surveyed several lines of work related to mobile video summarization: automatic collage generation, video summarization based on unlimited space, and mobile photo summarization.

Uchihashi et al. [1] were the first to propose a comic-like layout for summarizing videos. Their key contributions are maintaining time order and allowing the frame size to vary with the importance of a shot, and we use their work as our baseline. Although we follow a similar process for video summarization, we not only port it to the mobile environment but also analyze in detail what changes there. In Section 1, we described three gaps for mobile video summarization. The first two gaps do not exist in PC environments and strongly support video summarization on mobile devices; the last gap is the main impediment, and we try to address it by introducing ROI extraction.

For collage generation, Rother et al. [11], Lee et al. [12] and Goferman et al. [13] have proposed some of the most representative works. [11] formulates the whole procedure as an energy minimization problem and uses graph cut and Poisson blending to assemble a smooth collage. [12] follows a similar process (i.e., image ranking, ROI selection, ROI packing, and finally image blending) to build a collage; its strength is that it runs efficiently on a mobile phone processor. More recently, [13] proposed composing images with arbitrarily shaped ROIs into a collage. The result is more compact and interesting because the space can be filled with arbitrary shapes, whereas [11] and [12] only handle rectangular ROIs.

The main difference between our work and theirs is that their images have no time order, whereas video shots do: our output collage must be time-ordered, so this layout problem cannot be solved by their approaches. Most importantly, they do not take the “smallness issue” (the third gap mentioned above) into consideration, since a high-level view of the whole collage is enough for their applications.

3.2 Keyframe Selection

The main difference from [1] in this step is that we take ROIs into consideration instead of presenting the whole image. Extracting the ROI not only allows flexibility in frame aspect ratio but also makes the composed image more compact.

First, we apply shot boundary detection to the given videos and choose the middle frame of each shot as the image representing that shot. We then group these shot images with a common hierarchical clustering method, using a predefined distance threshold (Section 3.2.1). The importance of each shot is computed from its shot length, its cluster size and the ratio of its ROI to the whole image. The importance scores are then quantized into a few levels that represent the desired template sizes. Finally, we filter out shots that are less important or that are similar to other shots within a short period (Section 3.2.2).

3.2.1 Shot Detection and Hierarchical Clustering

For each video, the color histograms of full frames are extracted for shot detection. We use a common adaptive threshold method: if two adjacent frames, or the frames over a short period, are measured to be very different, a shot boundary is declared.

After shot boundaries are detected, we take the middle frame of each shot to represent it and use these frames as the basic unit in the following steps. For simplicity, we refer to “shot images” as “shots” from now on.
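The text does not spell out the adaptive threshold, so the following is only a minimal sketch of one common variant: a boundary is declared when the histogram difference between adjacent frames exceeds the mean plus `k` standard deviations of the recent differences. The window size, `k`, the `floor` value and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev

def hist_distance(h1, h2):
    """L1 distance between two normalized color histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_shot_boundaries(histograms, window=5, k=3.0, floor=0.2):
    """Return the frame indices at which a new shot starts.

    The threshold adapts to the differences seen in the preceding window,
    and never drops below `floor` (both rules are illustrative assumptions).
    """
    diffs = [hist_distance(histograms[i - 1], histograms[i])
             for i in range(1, len(histograms))]
    boundaries = []
    for i, d in enumerate(diffs):
        recent = diffs[max(0, i - window):i]
        threshold = floor
        if recent:
            threshold = max(floor, mean(recent) + k * pstdev(recent))
        if d > threshold:
            boundaries.append(i + 1)  # diffs[i] compares frame i and frame i + 1
    return boundaries

def middle_frames(boundaries, n_frames):
    """Index of the middle frame of each shot, used as the shot image."""
    starts = [0] + boundaries
    ends = boundaries + [n_frames]
    return [(s + e - 1) // 2 for s, e in zip(starts, ends)]
```

With eight frames whose histogram flips after the fourth frame, the sketch reports a single boundary at frame 4 and picks frames 1 and 5 as the two shot images.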

Then a hierarchical clustering step is conducted. The idea of hierarchical clustering is to merge the two closest clusters iteratively. Here we use both color and PHOG [16] features to ensure that the grouped shots are similar not only in color histogram but also in edge distribution (i.e., shape).
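The merging loop can be sketched as follows. This is a bare-bones agglomerative clustering with a distance threshold, assuming each shot is described by one feature vector (for instance a color histogram concatenated with a PHOG descriptor) and assuming average-link Euclidean distance; the linkage, the metric and the function names are illustrative choices, not taken from the text.

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_distance(c1, c2, feats):
    """Average-link distance between two clusters of shot indices."""
    total = sum(euclidean(feats[i], feats[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(feats, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(feats))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], feats)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

The exhaustive pair search is quadratic per merge, which is acceptable for the few dozen to few hundred shots of a single video.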

The ROI of each shot is then detected using Harel’s work [14]. The ROI is cropped and adapted as the final collage representation. Moreover, ROI information plays an important role in both shot importance re-weighting and the layout optimization phase.

3.2.2 Importance Computation

To make good use of the space in the output collage, the sizes of the shots must be differentiated by certain criteria. [1] defines importance as “A shot is important if it is both long and rare.” They therefore formulate importance as the length of a shot normalized by its cluster size, which penalizes repeated but discontinuous near-duplicate shots. The importance I_j of a shot j that belongs to cluster k is given by:

$$I_j = L_j \log \frac{1}{W_k} \tag{3}$$

where L_j is the length of shot j, and W_k (the proportion of shots from the video that are in cluster k) can be computed from the previous clustering result by:

$$W_k = \frac{S_k}{\sum_{c=1}^{C} S_c} \tag{4}$$

where S_k is the total length of all shots in cluster k, and C is the total number of clusters.
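A short sketch of the importance measure of [1] described above, in which a shot's length is multiplied by the log of the inverse proportion of its cluster. The function name and the input layout (one length and one cluster label per shot) are assumptions made for illustration.

```python
import math

def shot_importance(shot_lengths, shot_clusters):
    """Importance per shot: length * log(1 / W), with W the cluster's share.

    W is the total length of the shot's cluster divided by the total
    length of all clusters, so long shots from rare clusters score highest.
    """
    total_per_cluster = {}
    for length, k in zip(shot_lengths, shot_clusters):
        total_per_cluster[k] = total_per_cluster.get(k, 0.0) + length
    grand_total = sum(total_per_cluster.values())
    # log(1 / W) is written as log(grand_total / cluster_total).
    return [length * math.log(grand_total / total_per_cluster[k])
            for length, k in zip(shot_lengths, shot_clusters)]
```

Note that if every shot falls into a single cluster, the proportion is 1 and all importance scores are zero, which matches the "long and rare" intuition: nothing is rare.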

However, we think the importance score should reflect not only shot length and uniqueness but also the proportion of the shot occupied by its ROI; that is, a shot with a larger ROI should be given a larger template (i.e., a higher importance score). We therefore replace the importance with a re-weighted score I^{ROI} (Equation (5)), computed from the number of pixels in the ROI of the shot and the number of pixels in the whole shot.

These importance scores are divided into levels during a rough quantization step in order to fit the pre-defined templates (see Figure 12). During this step, shots that are not important enough are filtered out (i.e., their level is set to zero), and the others are assigned levels sequentially (see Table 3).

Because the importance level is used in one of our cost functions, we set each level equal to the area of its desired template.

3.3 ROI Packing

The goal of the layout packing algorithm is to place all shots into the given two-dimensional space at their corresponding sizes (i.e., importance levels) while preserving their time order. One heuristic way to achieve this is to arrange the shots in a multi-layered layout (i.e., the whole space is divided into row blocks, and each row block contains sub-templates arranged column by column).

Table 3. The quantization step from importance score to corresponding level and template size. I is the importance score of a shot (i.e., I^{ROI}), and avg is the average …

Unlike well-studied problems such as bin packing, this layout optimization problem must also satisfy the above constraints, and it is NP-hard. To make the solution feasible, [1] proposed a “row-block-exhaustive” approach (i.e., optimizing each row block one by one). The algorithm is listed as follows:

1. Set the current row block to the top row and the starting shot i = 1.
2-3. … the current row, where c(·, ·) is the cost function that measures the difference between the target shot frame image and the matched template.
4. Apply it to the current row block and move to the next row block. i is also increased by the length of the solution.
5. Repeat from step 2 until all frames are packed.

For more detailed information, please refer to [1].
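To give a feel for the row-by-row procedure, here is a heavily simplified sketch. It is not the algorithm of [1]: templates are reduced to integer widths, a row block is a fixed number of grid columns, every candidate sequence must fill the row exactly, and the cost of a sequence is assumed to be the average absolute difference between template width and shot importance level. All names and the fallback for leftover shots are hypothetical.

```python
def row_sequences(widths, row_width):
    """Enumerate all ordered sequences of template widths that fill a row exactly."""
    if row_width == 0:
        yield ()
        return
    for w in widths:
        if w <= row_width:
            for rest in row_sequences(widths, row_width - w):
                yield (w,) + rest

def pack_rows(levels, widths=(1, 2, 3), row_width=4):
    """Pack shots row by row while preserving their time order.

    levels[i] is the importance level (desired template size) of shot i.
    Returns one tuple of template widths per row block.
    """
    rows, i = [], 0
    while i < len(levels):
        best, best_cost = None, None
        for seq in row_sequences(widths, row_width):
            shots = levels[i:i + len(seq)]
            if len(shots) < len(seq):
                continue  # not enough shots left to fill this sequence
            cost = sum(abs(t - l) for t, l in zip(seq, shots)) / len(seq)
            if best_cost is None or cost < best_cost:
                best, best_cost = seq, cost
        if best is None:
            # Hypothetical fallback: too few shots remain for a full row,
            # so they keep their desired sizes.
            best = tuple(levels[i:])
        rows.append(best)
        i += len(best)  # advance the starting shot by the solution length
    return rows
```

Each row is optimized in isolation, which is what keeps the search tractable: the number of candidate sequences depends only on the row width, not on the number of shots.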

The following three subsections describe the key changes we have made in this

algorithm to guarantee that it can work well with the extracted ROIs even in an

environment that has a limited space:

We enable the templates with non-fixed aspect ratios since the ROI region is

extracted (3.3.1). The cost function has been modified so that it considers not only the

importance of a shot, but also its aspect ratio (3.3.2). Inter-row optimization has been

introduced to eliminate the monotone layout combinations (3.3.3).

3.3.1 Non-fixed Aspect Ratio Templates

Unlike the baseline approach, we allow more flexible templates instead of fixed aspect ratio templates (see Figure 12). This not only changes the appearance of the output collage but also lets the template fit the ROI content as closely as possible.

Figure 12. The templates used in the baseline (left) and in the proposed method (right). This change demonstrates the feasibility of non-fixed aspect ratio templates, which can be extended easily.

3.3.2 The Cost Function

Given a shot s and a template T, the cost function in [1] only measures the difference in size, that is, c_{size}(s, T) = |Size(T) − l_s|, where l_s is the “importance level” of the shot described in Table 3. However, it can be replaced by any measure of difference between the target shot and the available template.

Our templates vary not only in size but also in aspect ratio. To fit a shot into a template with a different aspect ratio, the shot ROI is first scaled along the short dimension, and the ROI region is then extended along the other dimension to prevent distortion. Since this includes regions outside the ROI, the unwanted areas are counted (in pixels) in the cost. Given the scaled region \hat{R}, the cost function is then modified as:

$$c(s, T) = \alpha \cdot c_{size}(s, T) + \beta \cdot \mathrm{Area}(T - \hat{R}) \tag{7}$$

where α and β are predefined, fixed weights.
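The modified cost can be illustrated with a small sketch. It assumes that "scaling along the short dimension" means a uniform scale that makes the ROI fit inside the template, so that the rest of the template is filled with content from outside the ROI; that reading, the default weights and the function names are all assumptions.

```python
def fit_roi(roi_w, roi_h, tpl_w, tpl_h):
    """Scale the ROI uniformly so that it fits inside the template."""
    scale = min(tpl_w / roi_w, tpl_h / roi_h)
    return roi_w * scale, roi_h * scale

def template_cost(level, tpl_size, roi_w, roi_h, tpl_w, tpl_h,
                  alpha=1.0, beta=0.001):
    """Size mismatch plus the template area not covered by the scaled ROI.

    level    : importance level of the shot
    tpl_size : size (level) that the template represents
    alpha, beta : illustrative weights, not values from the text
    """
    size_cost = abs(tpl_size - level)
    scaled_w, scaled_h = fit_roi(roi_w, roi_h, tpl_w, tpl_h)
    extra_area = tpl_w * tpl_h - scaled_w * scaled_h  # pixels outside the ROI
    return alpha * size_cost + beta * extra_area
```

A square ROI placed in a 2:1 template leaves half of the template filled with non-ROI content, so it is penalized even when the template size matches the importance level; an ROI with the same aspect ratio as the template incurs no area penalty.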

3.3.3 Inter-Row Optimization vs. Intra-Row Optimization

The baseline approach can produce sufficiently diverse layout combinations on media whose size (i.e., screen width) is large enough; on mobile devices with limited screen size, however, the generated solution (i.e., template combination) usually lacks variety because the solution space is limited. We therefore add an inter-row optimization step to the original intra-row optimization.

Our idea is to penalize repeated row sequences in the minimization step: if a row sequence appears twice, its cost is multiplied by a coefficient σ, and so on. The minimization criterion of the row-block search is modified accordingly (Equation (8)), so that the penalty grows with the number of times that a certain solution has appeared consecutively. If a solution (i.e., a template combination in a row) is repeated many times, the algorithm tends to choose another combination of templates, which prevents the resulting collage from having a monotonous layout.
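The repetition penalty can be sketched as below. The exact form of the criterion is not reproduced here, so this assumes the simplest reading of the description: the cost of a candidate row is multiplied by σ once for every immediately preceding row that uses the same template combination. The function name and the default σ are hypothetical.

```python
def penalized_cost(base_cost, candidate, previous_rows, sigma=1.5):
    """Multiply the row cost by sigma for each consecutive earlier repeat.

    candidate     : tuple of templates proposed for the current row
    previous_rows : rows already placed, oldest first
    """
    repeats = 0
    for row in reversed(previous_rows):
        if row != candidate:
            break
        repeats += 1
    return base_cost * sigma ** repeats
```

One caveat of a purely multiplicative penalty is that a candidate with zero base cost is never penalized, so a practical version would likely need an additive term or a small positive floor on the cost.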

3.4 Experiments

We collect a total of 32 videos (20 talk videos from TED and 12 popular movie trailers) for the following experiments. The talk dataset and the movie dataset have 156 shots and 42 shots on average, respectively.

Talk videos are suitable for summarization on mobile devices because they are usually longer, so users need random access to resume watching after an interruption. Additionally, talk videos usually have a clear subject (e.g., the speaker or pictures on a slide), so we can extract meaningful and effective ROIs from them. We also include movie trailers, which are much more challenging, in order to evaluate a more general situation. Some example shots are shown in Figure 4.

3.4.1 Quantitative Evaluation

We expect the proposed method to present more informative content while its space consumption remains close to that of the baseline. Several measurements are used to evaluate our result: “ROI ratio”, “Adapted ratio”, “Enlarged ratio”, which is based on the adjusted scale of each shot image after adaptation, and finally “Collage size ratio”. The results are listed in Table 4.

As the ROI ratio shows, roughly 60% of the area is cropped out. However, the ROI cannot be put directly into the collage without adaptation because of its aspect ratio. After the adaptation step, nearly half of the space in both datasets is still saved.

As for the last two columns, they show that the proposed method presents clearer subjects in the collage than the baseline while using nearly the same space.

Table 4. Compactness measurement results. The first row gives the results for talk videos and the second row for movie trailers.

Dataset   ROI ratio   Adapted ratio   Enlarged ratio   Collage area ratio
Talk      36%         52%             1.81             109%
Movie     38%         57%             1.70             104%

3.4.2 User Study

The usability of a summarization system (especially on mobile) is relatively

subjective, so we also conduct a user study that includes several aspects to evaluate the

proposed method.

We invited 24 participants: 15 male and 9 female. By occupation, they comprise 6 undergraduates, 12 graduates, 4 PhD students, and 2 administrative staff members.

We provide two identical smartphones for this user study: HTC Desire handsets running Android 2.2 with a resolution of 800*480.

Figure 13. Illustration of user study. We provide 2 identical smart phones for users (left: baseline, right: proposed).

Four questions are listed below:

Q1. Clearness of both approaches.

Q2. The Context Loss in our approach.

Q3. The impression of templates with non-fixed aspect ratios.

Q4. The overall rating.

Figure 14. Q1 - The comparison of clearness in bar chart.

The first question asks users to rate how clear the subject of the content is, from 1 (not clear) to 4 (very clear). Figure 14 shows the average score among the 24 users. The baseline scored near the borderline, while our approach scored between “Clear” and “Very clear”.

Figure 15. The result of Q2 in pie chart.

The second question concerns the loss of context information. Although the proposed method enlarges the content, it also crops the context, so we want to know how serious the loss is. The result (see Figure 15) shows that over half of the users think the context information in the proposed method is slightly affected by ROI cropping, nearly 40% think it is not affected, and only 4% (i.e., one person) think it is seriously affected. Cropping is harmful to context information in general, but the effect is not noticeable when the application runs in an environment with limited space. The baseline, by comparison, keeps all context information, but the content is usually too small to be recognized. Only in some cases (e.g., a wide scene in which the position of the subject can be distinguished) can the baseline maintain enough context information.

Figure 16. The result of Q3 in pie chart.

The third question is “Does changing aspect ratio affect your impression or does this arrangement make you uncomfortable?” We ask this question because we are concerned that users may prefer the original aspect ratio, feeling that shots with a fixed aspect ratio look more like a video. Yet the result (see Figure 16) shows that nearly 80% of the users do not care about this issue.
