Chapter 2 Mobile Magazine Reading Enhancement – Snap2Read
2.4 Experimental Results
This section describes our dataset, explains how we label the ground truth, and
presents the evaluation of our work.
Figure 8. An illustration of adaptation. (a) Original segmented blocks. (b) The adapted (scaled, padded) patches. The reading sequence is 1, 2, 3a, 3b, 4a, 4b, etc., and it guides users so that they can read the page conveniently through clicks without losing track of the page context.
2.4.1 Magazine Dataset
To the best of our knowledge, there is no public dataset for page segmentation
evaluation. Previous page segmentation studies usually use their own private
datasets, chosen according to either the target document genre (e.g., newspaper, journal) or
a specific language. In recent years, the ICDAR Page Segmentation
Competition has built a dataset from a rich variety of sources, but only
participants in the competition can access it. Furthermore, that dataset does not
include documents in Asian languages (e.g., Chinese, Japanese, Korean), which do not
have clear bounding boxes for each word, whereas we expect our system to work well on
both types of languages. We therefore created our own dataset.
We selected four popular magazines: two in Chinese, “Common
Wealth” and “Business Weekly”, and two in English,
“Business Week” and “Science”. For each magazine, we manually filtered out
advertisement pages and selected 30 scanned pages; the details are listed in Table 1.
To collect the ground truth, we use GEDI [8], a highly configurable
document image annotation tool. It reads an image file and, when annotation is done,
produces a corresponding XML file in GEDI format.
2.4.2 Page Segmentation and Zone Classification Performance
The evaluation metrics of previous methods are usually computed
pixel-wise because they are aimed at OCR, whereas our output is rectangle-based because it is
aimed at locating reading patches. As a result, a direct comparison between the two is not meaningful.
Although we do not compare them directly, we adopt one of the most widely used
metrics from ICDAR 2005 [9] and use its intuition to illustrate our performance.
We have annotated three types of entities (i.e., categories): text, image and footnote. For
each entity, the EDM (Entity Detection Metric) is calculated.
First, we evaluate how much a ground truth zone and a result
zone overlap by keeping a global matrix MatchScore, which is defined by the function

MatchScore(i, j) = …    (1)

where … result i, and T(s) is a function that counts the elements of set s.

Magazine          Language   # of pages   Scanned resolution
Common Wealth     Chinese    30           1184*1573
Business Weekly   Chinese    30           963*1280
Business Week     English    30           944*1260
Science           English    30           944*1203
Table 1. Magazine dataset for segmentation and classification experiments
Second, three types of matches are defined (i.e., one-to-one, one-to-many and
many-to-one) according to their MatchScore. If the MatchScore of ground truth zone j
and result zone i is higher than the accept threshold (i.e., 0.6), then it is a one-to-one
match (see Figure 11 for how this threshold is chosen).
If there are K ground truth zones j_k (k = 1, 2, …, K) overlapping the same result zone
i, each of their MatchScore values lies between the reject threshold and the accept threshold
(i.e., 0.1 < MatchScore(i, j_k) < 0.6, k = 1, 2, …, K), and their sum is higher than the
accept threshold, then it is a many-to-one match. A one-to-many match is defined symmetrically, with K result zones overlapping the same ground truth zone.
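The matching rules above can be illustrated with a minimal Python sketch. The exact MatchScore function is not reproduced in this text, so the sketch assumes a common overlap measure for rectangles (intersection area divided by union area). The names `Rect`, `match_score` and `classify_result_zone` are hypothetical; only the thresholds 0.6 and 0.1 come from the text.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels, right/bottom exclusive (assumed layout)
Rect = Tuple[int, int, int, int]

ACCEPT = 0.6  # accept threshold from the text
REJECT = 0.1  # reject threshold from the text


def area(r: Rect) -> int:
    """Pixel count of a rectangle; zero if the rectangle is empty."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def match_score(result: Rect, truth: Rect) -> float:
    """ASSUMED overlap measure: intersection area over union area."""
    inter = (max(result[0], truth[0]), max(result[1], truth[1]),
             min(result[2], truth[2]), min(result[3], truth[3]))
    inter_area = area(inter)
    union_area = area(result) + area(truth) - inter_area
    return inter_area / union_area if union_area else 0.0


def classify_result_zone(result: Rect, truths: List[Rect]) -> str:
    """Classify one result zone against all ground truth zones."""
    scores = [match_score(result, t) for t in truths]
    if any(s > ACCEPT for s in scores):
        return "one-to-one"
    # partial overlaps: each between the reject and accept thresholds
    partial = [s for s in scores if REJECT < s < ACCEPT]
    if len(partial) >= 2 and sum(partial) > ACCEPT:
        return "many-to-one"
    return "no-match"
```

Swapping the roles of result and ground truth zones gives the one-to-many case.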
For simplicity, the acceptable matched number for each entity t is defined as
MatchNumber_t = w1*one-to-one + w2*one-to-many + w3*many-to-one, where w1 = 1
and w2 = w3 = 0.75 to penalize partial matches. DetectRate_t and RecognAccuracy_t for
entity t are then defined as DetectRate_t = MatchNumber_t/N_t and RecognAccuracy_t =
MatchNumber_t/M_t, where N_t is the number of ground truth regions for the t-th entity and M_t is the
number of result regions for the t-th entity. DetectRate_t and RecognAccuracy_t represent
the acceptable ratio among all ground truth zones and all result zones for the t-th entity,
respectively. The Entity Detection Metric score for each entity (text, image, footnote) is
then defined as

$$\mathrm{EDM}_t = \frac{2 \cdot \mathrm{DetectRate}_t \cdot \mathrm{RecognAccuracy}_t}{\mathrm{DetectRate}_t + \mathrm{RecognAccuracy}_t} \tag{2}$$

The overall page segmentation performance is promising: for example, the text EDM ranges
from 0.78 to 0.90 across the four magazines (see the breakdown in
Table 2). Sample page segmentation results are shown in Figure 9 and Figure 10. We
also found the results satisfactory when rendered as reading patches on two
Android phones with different resolutions.
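As a small worked illustration of Equation (2), the sketch below computes MatchNumber, DetectRate, RecognAccuracy and EDM in Python. The function names are hypothetical; the weights (1, 0.75, 0.75) and the formulas follow the definitions given in the text.

```python
def match_number(one_to_one: int, one_to_many: int, many_to_one: int,
                 w1: float = 1.0, w2: float = 0.75, w3: float = 0.75) -> float:
    """Weighted count of acceptable matches for one entity type."""
    return w1 * one_to_one + w2 * one_to_many + w3 * many_to_one


def edm_from_rates(detect_rate: float, recogn_accuracy: float) -> float:
    """Equation (2): harmonic mean of DetectRate and RecognAccuracy."""
    total = detect_rate + recogn_accuracy
    if total == 0:
        return 0.0
    return 2 * detect_rate * recogn_accuracy / total


def edm_from_counts(match_num: float, n_truth: int, n_result: int) -> float:
    """EDM from MatchNumber, N_t (ground truth zones) and M_t (result zones)."""
    return edm_from_rates(match_num / n_truth, match_num / n_result)
```

For instance, the rounded rates reported for Common Wealth text in Table 2 (DetectRate 0.80, RecognAccuracy 0.86) give `edm_from_rates(0.80, 0.86)`, which rounds to the reported EDM of 0.83.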
Figure 9. Chinese magazine results. (The blue, red and green bounding boxes indicate text, image and footnote, respectively.) (a) A Common Wealth example. (b) A Business Weekly example. (c) An over-segmentation example in which a flow chart is divided into text blocks.
Figure 10. English magazine results. (a) A Business Week example. (b) A Science example. (c) An over-segmentation example resulting from figures with unclear bounding boxes.
Business Weekly performs better than the other magazines (cf.
Table 2(a)) because its layout is less complicated and its text blocks are
mostly large and rectangular. Business Week has lower performance in the
image category (cf. Table 2(b)) because it has many figures and tables combined with
explanatory text inside their bounding boxes. The footnote category usually has lower
performance because some footnotes are removed during the redundant rectangle removal step;
however, they are regarded as non-informative blocks, and our system does not guide users to …
Figure 11. An illustration of accept threshold selection. The ground truth blocks are marked in magenta and the result blocks are marked in blue. Low MatchScore values mostly come from small blocks (about 0.8 for logos and 0.6 for footnotes), because a small difference (less than 10 pixels) between the two region boundaries can amount to a large percentage of the block area. Thus the smaller blocks tend to have lower MatchScore values. However, this does not impede the reading process of users. As a result, we set the accept threshold at 0.6.
                  Common Wealth              Business Weekly (a)        Business Week              Science
                  T        I        F        T        I        F        T        I        F        T        I        F
N_t               213      44       55       203      49       52       195      91       84       245      77       122
M_t               199      53       42       209      54       40       251      134      67       250      81       79
DetectRate_t      0.80     0.93     0.49     0.91     0.91     0.62     0.89     0.76     0.50     0.89     0.81     0.52
RecognAccuracy_t  0.86     0.77     0.64     0.89     0.83     0.80     0.70     0.51     0.63     0.88     0.77     0.81
EDM_t             0.83     0.84     0.56     0.90     0.87     0.70     0.78     0.61(b)  0.56     0.88     0.78     0.64
Table 2. Page segmentation results (T: text, I: image, F: footnote)
Chapter 3 Mobile Video Watching Enhancement – Comp2Watch
3.1 Related Work
We survey prior work related to our mobile video
summarization: automatic collage generation, video
summarization that assumes unlimited display space, and mobile photo summarization.
Uchihashi et al. [1] were the first to propose a comic-like
layout for summarizing videos. Their key contributions are maintaining time order
and allowing the frame size to vary with the importance of a shot, and
we use their work as our baseline. Although we follow a similar process for video
summarization, we not only port it to the mobile environment but also
analyze in detail what changes in that setting. In section 1, we described three gaps for
mobile video summarization. The first two gaps do not exist in PC environments
and strongly support video summarization on mobile devices; the last gap is the
main obstacle to it, and we address this problem by introducing
ROI extraction.
For collage generation, Rother et al. [11], Lee et al. [12] and Goferman et al. [13]
have proposed some of the most representative works. [11] formulates the whole
procedure as an energy minimization problem and uses graph cut and
Poisson blending to assemble a smooth collage. [12] follows a similar process (i.e.,
image ranking, ROI selection, ROI packing, and finally image blending) to build a
collage; its strength is that it runs efficiently on a mobile phone processor.
More recently, [13] composes images with arbitrarily shaped ROIs into a collage.
The result is more compact and interesting because the space can be filled
with arbitrary shapes, while [11] and [12] only handle rectangular ROIs.
The main difference between our work and theirs is that their images have no time order,
unlike video shots. Our output collage must be time-ordered, so our layout
problem cannot be solved by their approaches. Most importantly, they do not take the
“smallness issue” (the third gap mentioned above) into consideration, since a high-level
view of the whole collage is enough for their applications.
3.2 Keyframe Selection
The main difference from [1] in this step is that we use ROI regions
instead of presenting the whole image. Extracting the ROI not
only allows flexibility in the frame aspect ratio but also makes the
whole composed image more compact.
First, we apply shot boundary detection to the given videos and choose the middle
frame of each shot as the image representation of that shot. We then group
these shot images with a common hierarchical clustering method, using a predefined distance
threshold (Section 3.2.1). The importance of each shot is computed from its
shot length, its cluster size and the ratio of its ROI to the whole image. The importance
scores are then quantized into a small number of levels that represent the desired template sizes. Finally,
we filter out shots that are less important or that are similar to other shots within a short
period (Section 3.2.2).
3.2.1 Shot Detection and Hierarchical Clustering
For each video, the color histograms of full frames are extracted for shot
detection. We use a common adaptive threshold method: if two adjacent frames, or the frames over a
short period, are measured to be very different, a shot boundary is declared.
After shot boundaries are detected, we take the middle frame of each shot to represent the
corresponding shot and use it as the basic unit in the following steps. For simplicity,
we refer to “shot images” as “shots” from now on.
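The shot detection step can be sketched as follows, assuming NumPy and frames given as H x W x 3 uint8 arrays. The text only states that color histograms and an adaptive threshold are used, so the specific choices here are assumptions for illustration: the L1 histogram distance, the sliding window of recent differences, the multiplier `k`, and the floor `min_diff`. Only the comparison of adjacent frames is covered.

```python
import numpy as np


def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalized per-channel histogram of an H x W x 3 uint8 frame."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(frame.shape[-1])]
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()


def detect_shots(frames, window: int = 10, k: float = 3.0, min_diff: float = 0.2):
    """Return (start, end) frame index pairs, end exclusive, one per shot."""
    if len(frames) == 0:
        return []
    hists = [color_histogram(f) for f in frames]
    # L1 distance between histograms of adjacent frames (ranges from 0 to 2)
    diffs = [float(np.abs(hists[i] - hists[i - 1]).sum())
             for i in range(1, len(hists))]
    boundaries = [0]
    for i, d in enumerate(diffs):
        recent = diffs[max(0, i - window):i]
        # adaptive threshold (assumed form): mean + k * std of recent differences
        thresh = min_diff
        if recent:
            thresh = max(min_diff, float(np.mean(recent) + k * np.std(recent)))
        if d > thresh:
            boundaries.append(i + 1)
    boundaries.append(len(frames))
    return list(zip(boundaries[:-1], boundaries[1:]))


def middle_frames(shots):
    """Index of the middle frame of each shot, used as the shot image."""
    return [(start + end - 1) // 2 for start, end in shots]
```

For example, five black frames followed by five white frames yield two shots, each represented by its middle frame.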
Then a hierarchical clustering step is conducted. The idea of hierarchical clustering
is to merge the two closest clusters iteratively. Here we use both color and PHOG [16]
features to ensure that the grouped shots are similar not only in color histogram
but also in edge distribution (i.e., shape).
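The iterative merging can be sketched as below, assuming NumPy and one feature vector per shot (for example, a color histogram concatenated with a shape descriptor). The Euclidean distance and average linkage are assumptions made for illustration; the text specifies only that the two closest clusters are merged and that a predefined distance threshold is used (here the hypothetical parameter `max_dist`).

```python
import numpy as np


def hierarchical_clustering(features: np.ndarray, max_dist: float):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    features: (n_shots, dim) array of per-shot feature vectors.
    Returns a list of clusters, each a list of shot indices.
    """
    clusters = [[i] for i in range(len(features))]
    # pairwise Euclidean distances between all shots
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)

    def linkage(a, b):
        # average linkage: mean distance over all cross-cluster pairs
        return float(dist[np.ix_(a, b)].mean())

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > max_dist:
            break  # the closest pair is already too far apart
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

This brute-force version is cubic in the number of shots, which is acceptable for the tens to hundreds of shots per video considered here.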
The ROI of each shot is then detected using Harel’s work [14]. The ROI
is cropped and adapted as the final collage representation. Moreover, ROI
information plays an important role in both shot importance re-weighting and the layout
optimization phase.
3.2.2 Importance Computation
To make good use of the space of the output collage, the sizes of the shots must be differentiated by
certain criteria. [1] defines “importance” as follows: “A shot is important if it is both long
and rare.” They therefore formulate the importance as the length of a shot normalized by its
cluster size, which penalizes repeated but discontinuous near-duplicate shots. The
importance of a shot j that belongs to cluster k is given by:

$$I_j = L_j \log \frac{1}{W_k} \tag{3}$$

where $L_j$ is the length of shot j, and $W_k$ (the proportion of shots from the video
that are in cluster k) can be computed from the previous clustering result by:

$$W_k = \frac{S_k}{\sum_{i=1}^{C} S_i} \tag{4}$$

where $S_k$ is the total length of all shots in cluster k, and C is the total number of clusters.
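A minimal sketch of the length-and-rarity importance of Equations (3) and (4) follows, assuming the form used in [1]: the importance of a shot is its length times log(1/W), where W is the fraction of the total video length covered by the shot's cluster. The function name and the list-based inputs are hypothetical.

```python
import math
from typing import List


def shot_importance(lengths: List[float], labels: List[int]) -> List[float]:
    """Importance of each shot: length * log(1 / W_k) for its cluster k.

    lengths: length of each shot (e.g. in frames or seconds)
    labels:  cluster index of each shot, from the clustering step
    """
    total = sum(lengths)
    cluster_len = {}
    for length, k in zip(lengths, labels):
        cluster_len[k] = cluster_len.get(k, 0.0) + length
    # log(1 / W_k) with W_k = S_k / total equals log(total / S_k)
    return [length * math.log(total / cluster_len[k])
            for length, k in zip(lengths, labels)]
```

With lengths `[30, 5, 5]` and labels `[0, 1, 0]`, the short shot in the rare cluster 1 scores higher than the long shot in the dominant cluster 0. If all shots fall into a single cluster, every score is zero.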
However, we think the importance score should reflect not only shot length and
uniqueness but also the proportion of the shot occupied by its ROI; that is, a shot with a
larger ROI should be given a larger template (i.e., a higher
importance score). We therefore replace the importance by:

…    (5)

where the two Area terms are the number of pixels in the ROI of the shot and in the whole
shot, respectively.
These importance scores are divided into levels during a rough
quantization step in order to fit the pre-defined templates (see Figure 12). During this
step, some shots are filtered out (i.e., their level is set to zero) if they are not important
enough, and the others are assigned levels sequentially; see Table 3.
The importance level is quantized from the importance score and is used in
one of our cost functions, so we set the level equal to the area of the desired
template.
3.3 ROI Packing
The goal of the layout packing algorithm is to place all shots into the given two-dimensional
space with sizes corresponding to their importance while preserving their time
order. One heuristic way to achieve this goal is to arrange the shots into a
multi-layered layout (i.e., the whole space is divided into row blocks, and these row
blocks contain sub-templates arranged column by column).
Table 3. The quantization step from importance score to corresponding level and template size. I is the importance score of a shot (i.e., …), and … is the average …
Such a layout optimization problem differs from well-studied problems (e.g., bin packing) because of the above constraints, and it is NP-hard. To make the solution feasible,
[1] proposed a “row-block-exhaustive” approach (i.e., optimize each row block one by
one). The algorithm is listed as follows:
1. Set the current row block to the top row and the starting shot i = 1.
2. …
3. … current row, and c(·, ·) is the cost function that measures the difference between the target shot frame image and the matched template.
4. Apply it to the current row block and move to the next row block. i is also increased by the length of the solution.
5. Repeat from step 2 until all frames are packed.
For more detailed information, please refer to [1].
The following three subsections describe the key changes we have made to this
algorithm so that it works well with the extracted ROIs even in an
environment with limited space.
We allow templates with non-fixed aspect ratios because the ROI is
extracted (3.3.1). The cost function is modified so that it considers not only the
importance of a shot but also its aspect ratio (3.3.2). Inter-row optimization is
introduced to eliminate monotonous layout combinations (3.3.3).
3.3.1 Non-fixed Aspect Ratio Templates
Unlike the baseline approach, we allow more flexible templates instead of
fixed aspect ratio templates (see Figure 12). This not only changes the appearance of the
output collage but also fits the ROI content to the template as closely as possible.
3.3.2 The Cost Function
Given a shot s and a template T, the cost function in [1] only measures the
difference in size, that is, $c_{size}(s, T) = |Size(T) - l_s|$, where $l_s$ is the “importance
level” described in Table 3. However, it can be replaced by any measure of the
difference between the target shot and the available template.
Our templates vary not only in size but also in aspect ratio. To fit
a shot into a template with a different aspect ratio, the shot ROI is first scaled
along the short dimension, and then the ROI region is extended along the other
dimension to prevent distortion. Since this includes regions outside the ROI, the
unwanted area is counted (in pixels) in the cost. Given the scaled region R, the
cost function is then modified as:

$$c(s, T) = \alpha \cdot c_{size}(s, T) + \beta \cdot Area(T - R) \tag{7}$$

where α and β are predefined, fixed weights.
Figure 12. The templates used in the baseline (left) and the templates used in the
proposed method (right). This change demonstrates the feasibility of non-fixed aspect ratio templates, which can be extended easily.
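The modified cost of Equation (7) can be sketched as follows. This is an illustration under stated assumptions, not the exact implementation: the ROI is assumed to be scaled uniformly until it touches the template in one dimension, and the unwanted area is taken as the template area not covered by the scaled ROI. The function names and the default weights `alpha` and `beta` are hypothetical.

```python
def unwanted_area(roi_w: float, roi_h: float, tpl_w: float, tpl_h: float) -> float:
    """Template pixels not covered by the uniformly scaled ROI.

    The ROI is scaled (without distortion) so that it fits inside the
    template and touches it along one dimension; the remainder of the
    template must be filled with content from outside the ROI.
    """
    scale = min(tpl_w / roi_w, tpl_h / roi_h)
    return tpl_w * tpl_h - (scale * roi_w) * (scale * roi_h)


def template_cost(level: float, tpl_size: float,
                  roi_w: float, roi_h: float, tpl_w: float, tpl_h: float,
                  alpha: float = 1.0, beta: float = 0.001) -> float:
    """Size mismatch plus weighted unwanted area.

    level:    quantized importance level of the shot (desired template area)
    tpl_size: area of the template in the same units as level
    tpl_w/h:  template dimensions in pixels; roi_w/h: ROI dimensions in pixels
    """
    size_cost = abs(tpl_size - level)
    return alpha * size_cost + beta * unwanted_area(roi_w, roi_h, tpl_w, tpl_h)
```

A square 100 x 100 ROI placed in a 200 x 100 template leaves 10,000 unwanted pixels, whereas a 50 x 100 ROI fits a 100 x 200 template exactly and adds no aspect-ratio cost.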
3.3.3 Inter-Row Optimization vs. Intra-Row Optimization
The baseline approach can produce sufficiently diverse layout combinations on
media whose size (i.e., screen width) is large enough; however, on mobile devices
with limited screen size, the generated solutions (i.e., template combinations) usually
lack variety because the solution space is limited. Therefore, we add an inter-row
optimization step to the original intra-row optimization.
Our idea is to penalize repeated row sequences in the minimization step: if a row sequence appears twice, its cost is multiplied by a coefficient σ, and so on. The
minimization criterion is then modified to:

…    (8)

where n is the number of times that a certain solution has appeared consecutively. If a
solution (i.e., a template combination in a row) is repeated many times, the algorithm
tends to use another combination of templates, thus preventing the resulting collage
from having a monotonous layout.
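The repetition penalty can be illustrated with the sketch below. It assumes that the intra-row cost of a candidate is multiplied by σ raised to the number of consecutive rows that have already used the same template combination; the exact form of Equation (8) may differ. The function `pick_row`, its parameters and the default `sigma` are hypothetical.

```python
def pick_row(candidates, base_cost, last_solution, repeat_count, sigma=1.5):
    """Choose the row solution with the lowest penalized cost.

    candidates:    template combinations for one row (e.g. tuples of names)
    base_cost:     function mapping a candidate to its intra-row cost
    last_solution: combination used by the previous row (None for the first row)
    repeat_count:  number of consecutive rows that used last_solution
    Returns the chosen combination and its updated consecutive count.
    """
    def penalized(cand):
        n = repeat_count if cand == last_solution else 0
        return base_cost(cand) * sigma ** n

    best = min(candidates, key=penalized)
    new_count = repeat_count + 1 if best == last_solution else 1
    return best, new_count
```

With two candidates of cost 1.0 and 1.2 and σ = 1.5, the cheaper one is chosen for the first row, and the other is chosen for the second row because the repeated candidate's cost rises to 1.5. With σ = 1 the penalty disappears and the cheaper candidate is repeated in every row.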
3.4 Experiments
We collected a total of 32 videos (20 talk videos from TED and 12
popular movie trailers) for the following experiments. The talk dataset and the movie
dataset have 156 shots and 42 shots per video on average, respectively.
Talk videos are suitable for summarization on mobile devices because they are
usually long, so viewers need random access to resume watching after an
interruption. Additionally, talk videos usually have a clear subject (e.g., the speaker or pictures
on the slides), so we can extract meaningful and effective ROIs from them. We also
include movie trailers, which are much more challenging, in order to
evaluate a more general situation. Some example shots are shown in Figure 4.
3.4.1 Quantitative Evaluation
We expect the proposed method to present more informative content
while its space consumption remains close to that of the baseline. Several measurements
are used to evaluate our result. First, “ROI Ratio” is defined by:

…

… is the adjusted scale after adaptation of the shot image. Finally, “Collage size ratio”
is given by:

…

… 60% of the area is cropped out. However, the ROI cannot be put directly into the collage
without adaptation because of its aspect ratio. After the adaptation step, nearly half of the space
in both datasets is still saved.
The last two columns show that the proposed method
presents clearer subjects in the collage than the baseline while using roughly the same space.
Table 4. Compactness measurement results. The first row shows the results for talk videos, and the second row for movie trailers.
Dataset   ROI ratio   Adapted ratio   Enlarged ratio   Collage area ratio
Talk      36%         52%             1.81             109%
Movie     38%         57%             1.70             104%
3.4.2 User Study
The usability of a summarization system (especially on mobile devices) is relatively
subjective, so we also conducted a user study covering several aspects to evaluate the
proposed method.
We invited 24 participants, 15 male and 9 female. Their
occupations are: 6 undergraduates, 12 graduates, 4 PhD students, and 2 administrative
staff.
We provided two identical smartphones for this user study: HTC Desire phones running
Android 2.2 with a resolution of 800*480.
Figure 13. Illustration of the user study. We provide two identical smartphones for users (left: baseline, right: proposed).
Four questions are listed below:
Q1. Clearness of both approaches.
Q2. The Context Loss in our approach.
Q3. The impression of templates with non-fixed aspect ratios.
Q4. The overall rating.
Figure 14. Q1 - The comparison of clearness in bar chart.
The first question asks users to rate how clear the subject in
the content is, from 1 (not clear) to 4 (very clear). Figure 14 shows the average score
over the 24 users. The baseline received a score near the borderline, while our approach
scored between “Clear” and “Very clear”.
Figure 15. The result of Q2 in pie chart.
The second question is about the loss of context information. Although the
proposed method enlarges the content, it also crops the context, so we want to
know how serious this loss is. The result (see Figure 15) shows that over half of the
users think that the context information in the proposed method is slightly affected
by ROI cropping, nearly 40% think that it is not affected, and only 4% (i.e., one
person) think that it is seriously affected. Cropping is harmful to
context information in general; however, the effect is not noticeable when the
application runs in an environment with limited space. The baseline, by comparison,
keeps all context information, but it is usually too small to be recognized.
Only in some cases (e.g., a wide scene in which the position of the subject can be distinguished)
can the baseline maintain enough context information.
Figure 16. The result of Q3 in pie chart.
The third question is “Does changing aspect ratio affect your impression or does
this arrangement make you uncomfortable?” We ask this question because we are
concerned that users may prefer the original aspect ratio, feeling
that shots with a fixed aspect ratio look much more like a video. Yet the
result (see Figure 16) shows that nearly 80% of the users do not mind this issue.