Chapter 1 Introduction
1.2 Comp2Watch
Smart phones have progressed significantly, enabling many tasks that used to be
performed only on computers, and they have already changed the way we live. In
particular, the number of people who watch videos on mobile devices has been growing
rapidly in recent years. The latest report from Nielsen Company [17] shows that the number of
Americans watching mobile video grew more than 40% from 2009 Q4 to 2010 Q4,
ending the year at nearly 25 million people. Not only has the audience grown; the
average time that users spend watching videos on mobile phones also grew nearly 20% over
the same period. At the end of 2010 Q4, people spent on average 4 hours 20 minutes per
month watching mobile videos.
We mainly focus on smart phones instead of pad-like computers because they are
pocket portable and therefore always at hand. However, to the best of our
knowledge, there are at least three gaps in mobile video watching:
1. Fragmented watching time: Users will not watch mobile videos if there are
computers or televisions at hand. The most common situation is that users have
only a small chunk of time (e.g., while waiting for a bus, standing in line at a store, or
during a daily commute on the subway). In these situations, users may not have enough time
to watch a complete video, and once playback is paused, it is not easy
to get back to the point where they left off.
Figure 2. Illustration of baseline (left) and proposed method (right). Applying an existing video summarization technique directly to a mobile interface, which has very limited space, may cause problems. Imagine putting a huge collage of a video onto a smart phone: at a low zoom level the frames are unclear, while at a high zoom level the images occupy a lot of space.
2. Slow/unstable network: Although online video streaming sites (e.g., YouTube,
TED) usually have a buffering mechanism so that their videos can be played instantly
without users having to download the whole video clip first, users cannot
get a quick glimpse of the main idea of the content/story until the video ends or has
been played for a while. This is costly in terms of both time and network
bandwidth.
3. The size of mobile display: The screen resolution of commercial smart
phones (e.g., Sony Ericsson XPERIA X12: 854*480, HTC Desire HD: 800*480) has
grown close to the resolution of PC LCD screens (e.g., XGA standard: 1024*768,
WXGA standard: 1280*800). However, when it comes to physical screen size,
smart phones (e.g., Apple iPhone4: 3.5 inch, HTC Desire HD: 4.3 inch) are far behind
PC LCD screens (usually more than 20 inch). This comparison means that
smart phones have to fit relatively large content into a tiny space. The detailed
illustrations are shown in Figure 3.
Figure 3. Illustration of screen resolution comparison (top) and screen size comparison (right). Mobile screens are in green color, and LCD screens are in blue color.
In addition, it is known that performing actions such as pressing virtual buttons on a
touch screen is difficult [15]. It is not surprising that interacting with such
screens (e.g., selecting, dragging, or clicking) can be very challenging, especially when
the content is so large that it must be scaled down.
To handle the first two gaps, we use a collage image composed of selected frames.
The temporal order of shots is preserved so that the collage provides random access via a
finger tap on the touch screen interface, which is much more convenient than
dragging the timeline beneath the video. Also, downloading a single image instead of
the whole video can significantly reduce the network overhead, and glancing at
the image lets users grasp the story of the video or quickly
filter out videos they are not interested in.
The advantages above come from traditional video summarization techniques.
However, the most important issue is the small screen of such mobile
platforms (see Figure 2). To bridge this last gap, our intuition is based on the region of
interest (ROI) in each image frame. Figure 4 shows examples of extracted ROI bounding boxes.
We have observed that cropping the ROIs from the frames can
save space without losing much context information in average cases. The
user study results presented later support this observation, too.
The key contributions of Comp2Watch are:
1. To the best of our knowledge, Comp2Watch represents one of the first attempts
to enable video summarization on mobile devices, thus offering a new
experience of mobile video watching.
2. We observed several gaps in mobile video watching, and we use ROI
extraction to deal with the most challenging one, thus enabling templates
with non-fixed ratios and making the resulting collage more compact.
3. We propose several measurements for Compactness and evaluate them in
quantitative experiments. For the user study, we evaluate both Clearness and
Context Loss. These experiments show promising results for the proposed
system.
Figure 4. The ROIs in video frames in talk videos (left) and movie trailers (right). White rectangles indicate ROI boundaries.
Chapter 2
Mobile Magazine Reading Enhancement – Snap2Read
The purpose of our system is to segment the document image into homogeneous
blocks of maximum size (Section 2.1 - Page Segmentation), to classify them into
a set of predefined categories (Section 2.2 - Zone Classification), and finally to render
them to fit different screen resolutions (Section 2.3 - Mobile Adaptation). The
experimental results are presented in Section 2.4.
2.1 Page Segmentation
Previous approaches usually focus on a single language or on specific types of
documents. However, we cannot assume that our input is a certain type of magazine for
reading activities on mobile devices, so, unlike traditional page segmentation methods, we
do not fix any parameters related to character font size, line spacing, or layout
structure.
Like other morphological methods, our method is a bottom-up approach [5]. The
main idea is to group small connected components into larger regions by dilation.
Unlike those methods, however, we dilate iteratively and automatically select the
appropriate dilation kernel.
2.1.1 Pre-Processing
The purpose of the pre-processing step is to filter out noise, dividing lines, and
blocks that are possibly images. These tend to be merged with other components during
dilation, so we must remove them to make sure that the main components (mostly text) are
not affected.
Figure 5. An illustration of the pre-processing steps: (a) Original image. (b) Binarization. (c) Connected component analysis. Components are in different colors. (d) Noise removal and image block pre-extraction.
The detailed steps are: (1) For efficiency, resize the image to a
proper size (i.e., about 900*650). (2) Apply global threshold binarization. (3) Apply
connected component analysis to the foreground (i.e., white) pixels. (4) For each
component, if its ratio of height to width is significantly high (i.e., 20 times),
remove it from the original image. If the component is big enough (i.e., 1.25%) and its
(i.e., 0.1), it is considered a “possibly image block” and is extracted in
advance. Examples are illustrated in Figure 5.
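As a rough illustration of steps (3) and (4), the following sketch labels connected components on a binary grid and flags those whose height is at least 20 times their width. It is only a minimal pure-Python sketch: the function names, the use of 4-connectivity, and the `>=` comparison are assumptions made here for illustration, and the "possibly image block" test is omitted because its second criterion is not fully specified in the text.

```python
from collections import deque

def connected_components(binary):
    """Return inclusive bounding boxes (x0, y0, x1, y1) of 4-connected foreground regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes

def is_dividing_line(box, ratio=20):
    """Step (4): a height-to-width ratio of at least `ratio` marks a dividing line."""
    box_width = box[2] - box[0] + 1
    box_height = box[3] - box[1] + 1
    return box_height >= ratio * box_width

# Toy page: a 1-pixel-wide, 25-pixel-tall rule and a 2*2 blob.
page = [[0] * 6 for _ in range(25)]
for row in range(25):
    page[row][0] = 1
for row in (2, 3):
    for col in (3, 4):
        page[row][col] = 1

boxes = connected_components(page)
kept = [b for b in boxes if not is_dividing_line(b)]
```

In this toy example the tall rule is discarded and only the small blob is kept.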
Note that in step (2), we also tried an edge detector and a local threshold
binarization method, both of which are widely used in document image processing [5].
However, although the former captures the salient edges, it also turns complex image
regions into many edge fragments. The latter is mostly used on scanned document
images that contain only text, in order to make binarization robust to lighting changes,
but it has a similar fragmenting effect on image regions. As a result, we decided to
use global threshold binarization to preserve image blocks.
2.1.2 Adaptive Dilation Threshold
In this work, we enlarge small components with the dilation operation to group them
together, using a square kernel. It enlarges a component in both the vertical and
horizontal directions, but determining the kernel size is vital and nontrivial.
The main font size may change from one magazine to another, or from
one language to another. Thus, a fixed dilation kernel size may not
work in every case, so we need to determine it dynamically.
During the iterative dilation process, we find that the number of components
drops rapidly at a certain kernel size, and the most suitable size is at the turning point
(e.g., red arrow in Figure 6). The physical meaning of this phenomenon is that a large
number of characters and words are merged into lines or paragraphs at the same time.
Figure 6. The relationship between the number of components and kernel size for 20 pages (blue lines). The red arrow indicates the turning point, which gives a suitable kernel size.
Figure 7. Dilation for merging small blocks and removing redundant blocks. (a) An example intermediate image from the pre-processing step, in which the title and footnote are split into several fragments. (b) Remove the large components and dilate again. (c) Some inside or non-informative blocks remain. (d) The final output example, which does not contain those redundant blocks.
We then count the number of components versus kernel size (e.g., a blue line in
Figure 6), and the turning point can be found by applying an approximate second-order
differential. To ensure that almost all components are sufficiently merged, we also add a
constraint that the number of components should not be less than a certain number (i.e.,
30).
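The turning-point search can be sketched as a discrete second-order difference over the component counts. This is a hedged illustration only: the function name `select_kernel_size`, the choice of the largest second difference as the turning point (where the steep drop levels off), and the sample counts are assumptions, not the exact procedure used in our system.

```python
def select_kernel_size(kernel_sizes, component_counts, min_components=30):
    """Pick the kernel size with the largest second-order difference of the counts.

    Candidates whose component count falls below `min_components` are skipped,
    mirroring the constraint described in the text.
    """
    best_size, best_curvature = None, float("-inf")
    for i in range(1, len(component_counts) - 1):
        if component_counts[i] < min_components:
            continue
        # Central second difference: f(i-1) - 2 f(i) + f(i+1).
        second_diff = (component_counts[i - 1]
                       - 2 * component_counts[i]
                       + component_counts[i + 1])
        if second_diff > best_curvature:
            best_size, best_curvature = kernel_sizes[i], second_diff
    return best_size

# Hypothetical counts for one page: a sharp drop between kernel sizes 3 and 5.
sizes = [1, 3, 5, 7, 9, 11]
counts = [900, 850, 300, 120, 100, 90]
chosen = select_kernel_size(sizes, counts)
```

For the hypothetical counts above, the curve bends most sharply at kernel size 5, which is the size selected.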
In the last step, minimum enclosing rectangles are extracted from those
components which are large enough (i.e., at least 0.3% of the whole page area). Some
remaining components are not merged well enough (over-segmentation) because their
spacing is larger than that of the main text (e.g., title, footnote). Therefore, we
dilate the remaining small components again with a kernel 1.5 times as large (cf. Figure 7).
Inside blocks and small blocks (i.e., less than 0.1% of the whole page area) are also
removed (cf. Figure 7).
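The removal of small and inside blocks can be sketched as follows. The rectangle representation (x0, y0, x1, y1) with exclusive right and bottom edges, the helper names, and the example coordinates are assumptions introduced for illustration; only the 0.1% area threshold comes from the text.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies entirely within rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def prune_blocks(blocks, page_width, page_height, min_fraction=0.001):
    """Drop blocks smaller than `min_fraction` of the page and blocks nested in others."""
    page_area = page_width * page_height
    large = [b for b in blocks
             if (b[2] - b[0]) * (b[3] - b[1]) >= min_fraction * page_area]
    # A block is an "inside block" if a different large block fully contains it.
    return [b for b in large
            if not any(o != b and contains(o, b) for o in large)]

# Hypothetical rectangles on a 900*650 page.
blocks = [
    (0, 0, 400, 300),      # large text block, kept
    (50, 50, 100, 100),    # nested inside the first block, removed
    (500, 0, 520, 20),     # below 0.1% of the page area, removed
    (500, 100, 800, 400),  # large block, kept
]
pruned = prune_blocks(blocks, 900, 650)
```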
2.2 Zone Classification
The rectangles (blocks) produced by the segmentation step above are
rearranged into meaningful “patches” according to the screen resolution of
the device during the subsequent adaptation step. Images and text blocks are adapted
differently: images cannot be further segmented, but text blocks can be split if they are
too large to fit the screen. Therefore, it is necessary to classify blocks as
images or text (cf. Figure 1).
2.2.1 Features for Classification
We use the early fusion scheme (multimodal features concatenated into one long vector)
to combine features so that we can learn an SVM (Support Vector
Machine) [6] classifier for each label (e.g., text, image). The three features we use are
a spatial feature, GCM (Grid Color Moment), and PHOG (Pyramid of Histograms of Orientation
Gradients) [7]. Their detailed descriptions are as follows.
The spatial feature contains the coordinates and size of a specified region (i.e., x, y,
width and height).
As the color feature (GCM), we adopt the first-order moment (mean) and the
second-order moment (variance). The image is partitioned into
several (i.e., 8*8) sub-blocks, and the mean and variance are calculated for each block.
As a result, the GCM feature is a vector with 8 * 8 (blocks) * 3 (color planes) * 2
(moments) = 384 dimensions.
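A minimal sketch of the GCM computation is given below to make the 384-dimension count concrete. The image is assumed to be a list of rows of (R, G, B) tuples at least 8 pixels wide and tall, and the ordering of the values inside the vector (block, then color plane, then mean and variance) is an assumption for illustration.

```python
def grid_color_moments(image, grid=8):
    """Return mean and variance per color plane for each cell of a grid*grid partition."""
    height, width = len(image), len(image[0])
    feature = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            for channel in range(3):
                values = [p[channel] for p in pixels]
                mean = sum(values) / len(values)
                variance = sum((v - mean) ** 2 for v in values) / len(values)
                feature.extend([mean, variance])
    return feature

# A uniform 16*16 test image: every cell has the same mean and zero variance.
image = [[(10, 20, 30)] * 16 for _ in range(16)]
gcm = grid_color_moments(image)
```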
The shape feature (PHOG descriptor [7]) represents the “local shape” and the
“spatial layout” of the image. To calculate PHOG, edge contours are first extracted by the
Canny edge detector, and the image is divided into 4^l sub-blocks at level l. The HOG of
each grid cell at each pyramid resolution level is then calculated. In this paper, we set
the level up to 2 (i.e., l = 0, 1, 2) and use 8 bins for HOG. Thus, by concatenating the
HOGs of the different resolution levels, the descriptor can be formulated as a vector with
(4^0 + 4^1 + 4^2) * 8 = 168 dimensions.
The dimensions of the three features are 4, 384 and 168, respectively. The concatenated
feature vector therefore has 556 dimensions, and each dimension is scaled into [-1, +1].
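The dimension arithmetic and the scaling step can be checked with a short sketch. The per-dimension min-max scaling shown here is an assumption about how the [-1, +1] range is obtained (the text does not specify the method), and mapping a constant dimension to 0 is likewise an illustrative choice.

```python
def phog_dimension(max_level=2, bins=8):
    """Number of PHOG dimensions: 4^l cells per level, `bins` orientations per cell."""
    return sum(4 ** level for level in range(max_level + 1)) * bins

def scale_columns(vectors, lo=-1.0, hi=1.0):
    """Min-max scale every dimension of a list of feature vectors into [lo, hi]."""
    columns = list(zip(*vectors))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for vec in vectors:
        row = []
        for value, mn, mx in zip(vec, mins, maxs):
            if mx == mn:
                row.append(0.0)  # constant dimension: assumed to map to the midpoint
            else:
                row.append(lo + (hi - lo) * (value - mn) / (mx - mn))
        scaled.append(row)
    return scaled

total_dimension = 4 + 384 + phog_dimension()   # spatial + GCM + PHOG
example = scale_columns([[0, 5], [10, 5]])
```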
2.2.2 Model Selection
We use the RBF (Radial Basis Function) kernel for classification, so there are two
main parameters to be determined: g (gamma) and c (cost). To
select a suitable model for prediction, we apply 5-fold cross validation on a total of
1430 page segments and obtain an average accuracy of around 0.95.
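The 5-fold protocol can be sketched independently of any particular SVM library. In the sketch below, `evaluate` is a hypothetical caller-supplied function that would train an RBF-kernel SVM with given g and c on the training indices and return its accuracy on the held-out indices; the shuffling seed and the interleaved fold assignment are assumptions.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate, k=5):
    """Average the score of `evaluate(train_indices, test_indices)` over k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(evaluate(train, held_out))
    return sum(scores) / k

# With 1430 page segments, each fold holds 286 segments.
folds = k_fold_indices(1430)
```

In practice this loop would be repeated for each candidate (g, c) pair, keeping the pair with the best average accuracy.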
2.3 Mobile Adaptation
Although the segmented blocks are composed of homogeneous components (cf.
Figure 9 and Figure 10), they cannot be read directly: we extract them as large
as possible for each region type, so they may not fit the screen resolution, and we must
adjust the blocks into readable patches.
As mentioned above, only text blocks may need to be further split (e.g., the wide
text block in Figure 1(E)). Generally speaking, English articles tend to stretch in the
vertical direction, while Chinese articles tend to expand horizontally. Thus, we
adopt a heuristic approach (taking English as an example): we
scale each block along its width, split it into several patches according to its height,
and pad the patches whose height is not sufficient, to prevent distortion (cf. Figure 8).
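The scale-split-pad heuristic for an English text block can be sketched as follows. The sketch assumes that the block is scaled so that its width equals the screen width and that each patch is one screen tall; these choices, the function name, and the example sizes are illustrative assumptions rather than the exact implementation.

```python
import math

def adapt_text_block(block_width, block_height, screen_width, screen_height):
    """Scale a block to the screen width, cut it into screen-high patches, pad the last."""
    scale = screen_width / block_width
    scaled_height = block_height * scale
    n_patches = max(1, math.ceil(scaled_height / screen_height))
    patches = []
    for k in range(n_patches):
        top = k * screen_height
        content_height = min(screen_height, scaled_height - top)
        patches.append({
            "top": top,                                  # offset inside the scaled block
            "content_height": content_height,            # rows that carry page content
            "padding": screen_height - content_height,   # blank rows, so nothing is stretched
        })
    return scale, patches

# A hypothetical 240*900 text block shown on a 480*800 screen.
scale, patches = adapt_text_block(240, 900, 480, 800)
```

Here the block is enlarged by a factor of 2 and cut into three patches, the last of which carries 200 rows of content and 600 rows of padding.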
As for the reading sequence, images have higher priority than text, and blocks are then
ranked from the upper left corner to the lower right corner (for Chinese magazines,
from the upper right corner to the lower left corner).
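One possible way to realize this ranking is a single sort with a composite key. The block representation (dictionaries with `kind`, `x`, `y`) and the tie-breaking rule (top coordinate first, then horizontal position) are assumptions made for this sketch; the text only fixes the overall direction.

```python
def reading_order(blocks, right_to_left=False):
    """Sort blocks: images first, then top-to-bottom, then left-to-right (or right-to-left)."""
    def key(block):
        is_text = block["kind"] != "image"   # False sorts before True, so images come first
        x = -block["x"] if right_to_left else block["x"]
        return (is_text, block["y"], x)
    return sorted(blocks, key=key)

# Hypothetical blocks on one page.
page_blocks = [
    {"id": "t1", "kind": "text", "x": 0, "y": 0},
    {"id": "i1", "kind": "image", "x": 300, "y": 400},
    {"id": "t2", "kind": "text", "x": 300, "y": 0},
]
english_order = [b["id"] for b in reading_order(page_blocks)]
chinese_order = [b["id"] for b in reading_order(page_blocks, right_to_left=True)]
```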
We also provide a transparent overview window at the upper right corner of the
mobile interface to indicate the current location on the whole page. Thus, users do not
have to zoom in and out repeatedly to obtain the geometric information.
2.4 Experimental Results
This section describes our dataset and how we label the ground truth, and
presents the evaluation of our work.
Figure 8. An illustration of adaptation. (a) Original segmented blocks. (b) The adapted (scaled, padded) patches. The reading sequence is 1, 2, 3a, 3b, 4a, 4b, etc., and it is used to guide users so that they can read the page conveniently through clicks without losing track of the page context.
2.4.1 Magazine Dataset
To the best of our knowledge, there is no public dataset for page segmentation
evaluation. Previous page segmentation studies tend to use their own private
datasets, depending either on their target document genre (e.g., newspaper, journal) or on
a specific language. In recent years, the ICDAR Page Segmentation
Competition has created its own dataset from a rich variety of sources, but only those who
participate in the competition can gain access to it. Furthermore, it does not
include documents in Asian languages (e.g., Chinese, Japanese, Korean), which do not
have clear bounding boxes for each word, whereas we expect our system to work well on
both types of languages. Thus, we created a dataset of our own.
We selected 4 popular magazines: two in Chinese, “Common
Wealth” and “Business Weekly”, and two in English,
“Business Week” and “Science”. For each magazine, we manually filtered out
advertisement pages and selected 30 scanned pages; the details are listed in Table 1.
To collect the ground truth, we use an editor named GEDI [8], a highly configurable
document image annotation tool. It reads an image file, and when annotation is done, it
produces a corresponding XML file in GEDI format.
2.4.2 Page Segmentation and Zone Classification Performance
The evaluation metrics of previous methods are usually computed
pixel-wise, which is aimed at OCR, whereas our output is rectangle-based, which is
aimed at locating reading patches. As a result, comparing the two directly does not make
sense. Although we do not compare them directly, we adopt one of the most widely used
metrics from ICDAR 2005 [9] and try to illustrate our performance in its spirit.
We have annotated three types of entities (i.e., categories): text, image and footnote. For
each entity, the EDM (Entity Detection Metric) is calculated.
Magazine         Language  # of pages  Scanned resolution
Common Wealth    Chinese   30          1184*1573
Business Weekly  Chinese   30          963*1280
Business Week    English   30          944*1260
Science          English   30          944*1203
Table 1. Magazine dataset for segmentation and classification experiments
First, we evaluate how much a ground truth zone and a result
zone overlap by keeping a global matrix MatchScore, which is defined by function
( )
result i, and T(s): a function that counts the elements of set s.
Second, three types of matches are defined (i.e., one-to-one, one-to-many and
many-to-one) according to their MatchScore. If the MatchScore of ground truth zone j
and result zone i is higher than the accept threshold (i.e., 0.6), it is a one-to-one
match (see Figure 11 for more explanation).
If there are K ground truth zones j_k (k = 1, 2, ..., K) overlapping the same result zone
i, each of their MatchScores is between the reject threshold and the accept threshold
(i.e., 0.1 < MatchScore(i, j_k) < 0.6, k = 1, 2, ..., K), and their sum is higher than the
accept threshold, it is a many-to-one match; a one-to-many match is defined symmetrically.
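The threshold rules can be sketched for rectangular zones as follows. Note the assumption: the MatchScore used here is the area of intersection divided by the area of union of the two rectangles, which is one common overlap measure and is assumed only for illustration. The function names are hypothetical, and only the result-zone side (one-to-one and many-to-one) is shown.

```python
ACCEPT_THRESHOLD = 0.6
REJECT_THRESHOLD = 0.1

def area(rect):
    """Area of (x0, y0, x1, y1); degenerate or inverted rectangles count as zero."""
    return max(0, rect[2] - rect[0]) * max(0, rect[3] - rect[1])

def match_score(result, truth):
    """Assumed overlap measure: intersection area over union area."""
    inter = area((max(result[0], truth[0]), max(result[1], truth[1]),
                  min(result[2], truth[2]), min(result[3], truth[3])))
    union = area(result) + area(truth) - inter
    return inter / union if union else 0.0

def classify_result_zone(result, truths):
    """Label one result zone against all ground truth zones."""
    scores = [match_score(result, t) for t in truths]
    if any(s > ACCEPT_THRESHOLD for s in scores):
        return "one-to-one"
    partial = [s for s in scores if REJECT_THRESHOLD < s < ACCEPT_THRESHOLD]
    if len(partial) >= 2 and sum(partial) > ACCEPT_THRESHOLD:
        return "many-to-one"
    return "no-match"

result_zone = (0, 0, 100, 100)
exact = classify_result_zone(result_zone, [(0, 0, 100, 90)])
merged = classify_result_zone(result_zone, [(0, 0, 100, 50), (0, 50, 100, 100)])
missed = classify_result_zone(result_zone, [(200, 200, 300, 300)])
```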
For simplicity, the acceptable matched number for each entity t is defined as
MatchNumber_t = (w1*one-to-one + w2*one-to-many + w3*many-to-one), where w1 = 1
and w2 = w3 = 0.75 as a partial match penalty. Then DetectRate_t and RecognAccuracy_t for
entity t are defined as DetectRate_t = MatchNumber_t/N_t and RecognAccuracy_t =
MatchNumber_t/M_t, where N_t is the number of ground truth regions for the t-th entity and
M_t is the number of result regions for the t-th entity. DetectRate_t and RecognAccuracy_t
represent the acceptable ratio among all ground truth zones and among all result zones for
the t-th entity, respectively. The Entity Detection Metric score for each entity (text,
image, footnote) is then defined as
EDM_t = \frac{2 \times DetectRate_t \times RecognAccuracy_t}{DetectRate_t + RecognAccuracy_t}    (2)
The overall page segmentation performance is promising; see the breakdown in
Table 2. Sample page segmentation results are shown in Figure 9 and Figure 10. We
also found the results satisfactory when rendering them as reading patches on two
Android phones with different resolutions.
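Since EDM is the harmonic mean of DetectRate and RecognAccuracy, the entries of Table 2 can be spot-checked with a few lines of code. The function names are hypothetical, and the match counts passed to `match_number` are made-up values for illustration.

```python
def match_number(one_to_one, one_to_many, many_to_one, w1=1.0, w2=0.75, w3=0.75):
    """Weighted count of acceptable matches; partial matches are penalised."""
    return w1 * one_to_one + w2 * one_to_many + w3 * many_to_one

def edm(detect_rate, recogn_accuracy):
    """Entity Detection Metric: harmonic mean of the two ratios."""
    total = detect_rate + recogn_accuracy
    return 2 * detect_rate * recogn_accuracy / total if total else 0.0

# Spot check against the text category of Common Wealth and Business Weekly in Table 2.
common_wealth_text = round(edm(0.80, 0.86), 2)
business_weekly_text = round(edm(0.91, 0.89), 2)
example_matches = match_number(10, 4, 0)   # hypothetical counts
```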
Figure 9. Chinese magazine results. (The blue, red and green bounding boxes indicate text, image and footnote, respectively.) (a) A Common Wealth example. (b) A Business Weekly example. (c) An over-segmentation example which divides a flow chart into text blocks.
Figure 10. English magazine results. (a) A Business Week example. (b) A Science example. (c) An over-segmentation example resulting from figures with unclear bounding boxes.
Business Weekly appears to perform better than the others (cf.
Table 2(a)) because its layout is less complicated and its text blocks are
mostly large and rectangular. Business Week has lower performance on the
image category (cf. Table 2(b)) because it has many figures and tables combined with
text explanations inside their bounding boxes. The footnote category usually has lower
performance because parts of the footnotes are removed during the redundant rectangle
removal step; however, they are thought of as non-informative blocks, and our system does
not guide users to
Figure 11. An illustration of accept threshold selection. The ground truth blocks are marked in magenta and the result blocks in blue. Low MatchScores mostly come from small blocks (about 0.8 for logos and 0.6 for footnotes), because a trifling difference (less than 10 pixels) between the two region boundaries can amount to a large percentage of the area. Thus, smaller blocks tend to have lower MatchScores. However, this situation does not impede the reading process of users. As a result, we set the accept threshold at 0.6.
                  Common Wealth           Business Weekly(a)      Business Week           Science
                  T       I       F       T       I       F       T       I       F       T       I       F
N_t               213     44      55      203     49      52      195     91      84      245     77      122
M_t               199     53      42      209     54      40      251     134     67      250     81      79
DetectRate_t      0.80    0.93    0.49    0.91    0.91    0.62    0.89    0.76    0.50    0.89    0.81    0.52
RecognAccuracy_t  0.86    0.77    0.64    0.89    0.83    0.80    0.70    0.51    0.63    0.88    0.77    0.81
EDM_t             0.83    0.84    0.56    0.90    0.87    0.70    0.78    0.61(b) 0.56    0.88    0.78    0.64
Table 2. Page segmentation results (T: text, I: image, F: footnote)
Chapter 3
Mobile Video Watching Enhancement – Comp2Watch
3.1 Related Work
We have surveyed several kinds of work related to our mobile video
summarization. Previous work includes automatic collage generation, video
summarization based on unlimited space, and mobile photo summarization.
Uchihashi, et al. [1] were the first to propose a comic-like
layout for summarizing videos. Their key contributions are maintaining time order
and allowing the frame size to vary with the importance of a shot, and
we use their work as our baseline. Although we follow a similar process for video
summarization, we not only transplant it to the mobile environment but also
analyze in detail what changes as a result. In Chapter 1, we described three gaps in
mobile video watching. The first two gaps do not exist in PC environments
and are strong arguments for video summarization on mobile devices; the last gap is the
main impediment to it, and we try to solve this problem by introducing
ROI extraction.
For collage generation, Rother, et al. [11], Lee, et al. [12] and Goferman, et al. [13]