Chapter 1 Introduction
1.3 Organization of This Thesis
The remainder of this thesis is organized as follows. Chapter 2 presents the background knowledge, including color spaces, the MPEG-2 standard, shot change detection methods, and the MPEG-7 multimedia description schemes. Chapter 3 discusses our proposed baseball video analysis and summarization procedure, including field dominant color detection, shot classification, pitching shot detection, PSU segmentation, change event detection, highlight event detection, and multi-level summarization. Chapter 4 describes the implementation of our experiments in detail and presents the experimental results. Chapter 5 states the conclusions and future work.
Chapter 2 Background Knowledge
In this chapter, we introduce the background knowledge of our research. Section 2.1 describes the color models. Section 2.2 describes the MPEG-2 bitstream structure. Section 2.3 describes the shot change detection methods. Section 2.4 introduces the MPEG-7 multimedia description schemes.
2.1 Color models
Color is a powerful descriptor that facilitates object identification and extraction. The purpose of a color model is to facilitate the specification of colors in some standard way [12]. Most of the color models in use today are oriented toward hardware or toward specific applications. In this section, we introduce the RGB, YCbCr, and HSI color models.
RGB Color Model:
The RGB color model can be represented in a three-dimensional coordinate system, in which a color is associated with a three-dimensional vector (R, G, B). The RGB color model is the most commonly used hardware-oriented model, but it is not natural for human visual perception. The RGB color model is shown in Fig. 2.1.
Fig. 2.1 The RGB Color Model
YCbCr Color Model:
The YCbCr color model separates the luminance Y and two chromaticity values Cb and Cr from the color. Taking advantage of this property, the luminance and chromaticity can be coded with different numbers of bits. This is useful in image compression, and the model is widely used in JPEG and MPEG. The conversion of RGB to YCbCr is given as Eq. 2.1 [13], whose luminance row is
$Y = 0.299R + 0.587G + 0.114B$ (2.1)
HSI Color Model:
The HSI color model is formed with Hue, Saturation, and Intensity. Hue describes a pure color. Saturation shows the degree to which the pure color is diluted. Intensity is the brightness of the color. This model decouples the luminance and chromaticity, and is natural for human visual perception. Therefore, it has advantages in some applications. In this thesis, the video color features that we extract are based on the HSI color model. The HSI color model and the conversion functions from RGB are shown as follows.
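As an illustration of the RGB-to-HSI conversion, the following minimal Python sketch uses the common textbook formulation with R, G, and B normalized to [0, 1]. The function name `rgb_to_hsi` and the handling of the undefined cases (black and gray pixels) are assumptions made for this sketch; the exact conversion functions used in this thesis may differ.

```python
import math

def rgb_to_hsi(r, g, b):
    """Convert RGB in [0, 1] to (hue in degrees, saturation, intensity)."""
    intensity = (r + g + b) / 3.0
    if intensity == 0:
        # Black: hue and saturation are undefined, so 0 is used by convention.
        return 0.0, 0.0, 0.0
    saturation = 1.0 - min(r, g, b) / intensity
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        # Gray: hue is undefined, so 0 is used by convention.
        return 0.0, saturation, intensity
    # Clamp the ratio to guard against rounding slightly outside [-1, 1].
    theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
    hue = theta if b <= g else 360.0 - theta
    return hue, saturation, intensity
```

With this formulation, pure red, green, and blue map to hues of 0, 120, and 240 degrees, respectively, each with full saturation.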
2.2 Overview of MPEG-2 Standard
MPEG-2 [14] is an ISO/IEC standard proposed by the Moving Picture Experts Group (MPEG) to support digital TV applications and high-quality video compression. In our experiments, all of the test videos are in MPEG-2 format. A standard MPEG-2 video stream contains six layers: sequence, group of pictures (GOP), picture, slice, macroblock (MB), and block. Fig. 2.3 [17] illustrates the MPEG video stream structure.
Fig. 2.3 The MPEG-2 video stream structure
Sequence:
The sequence layer is the highest syntactic structure of the coded bitstream. It holds parameters that are used in the decoding process.
Group of pictures (GOP):
The GOP provides the random access point. It consists of several frames of three types (pictures): I frames (intra-coded frames), P frames (predictive-coded frames), and B frames (bidirectionally predictive-coded frames). Fig. 2.4 shows the structure of the GOP.
Fig. 2.4 The GOP structure (frame sequence I B B P B B P B B P B B, with forward and bidirectional motion compensation)
Picture:
MPEG defines three types of pictures (frames): I frames (intra-coded frames), P frames (predictive-coded frames), and B frames (bidirectionally predictive-coded frames). They use different coding methods. Each type of frame consists of three components: a luminance component (Y) and two chrominance components (Cb, Cr). During encoding and decoding, I frames use only their own information, and they also serve as references for P frames and B frames. P frames are coded using forward motion estimation and motion compensation from the previous I frame or P frame, and they serve as references for P frames and B frames. B frames are coded using forward and backward motion compensation from previous and future reference frames. They do not serve as references.
Slice layer:
A slice prevents bitstream error propagation from influencing the whole frame. It is a series of macroblocks. All the macroblocks of a slice shall be in the same horizontal row.
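The reference relations among I, P, and B frames can be made concrete with a small sketch. The Python function below is an illustration only (the name `reference_frames` and the string encoding of the GOP are assumptions, not part of the MPEG-2 standard). Given the frame types in display order, it lists for each frame the indices of the frames it is predicted from: none for I frames, the previous I or P frame for P frames, and the nearest previous and next I or P frames for B frames.

```python
def reference_frames(pattern):
    """Map each frame index (display order, 0-based) to the indices it references."""
    # Only I and P frames can serve as references.
    anchors = [i for i, t in enumerate(pattern) if t in "IP"]
    refs = {}
    for i, t in enumerate(pattern):
        prev = [a for a in anchors if a < i]
        nxt = [a for a in anchors if a > i]
        if t == "I":
            refs[i] = []                      # intra-coded: no reference
        elif t == "P":
            refs[i] = prev[-1:]               # forward prediction only
        else:
            refs[i] = prev[-1:] + nxt[:1]     # B frame: past and future anchors
    return refs

# One GOP followed by the I frame that starts the next GOP.
refs = reference_frames("IBBPBBPBBPBBI")
print(refs[3])   # the first P frame references frame 0
print(refs[10])  # a trailing B frame references frames 9 and 12
```

The trailing I frame is included so that the last two B frames have a future reference, which is why a B frame near the end of a GOP depends on the first frame of the next one in this sketch.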
Macroblock (MB) layer:
An MB is the basic unit used for motion estimation and compensation. An MB is a 16×16 region in the frame. It consists of a section of the luminance component and the spatially corresponding chrominance components. There are four types of MB in MPEG-2. An IMB (intra-coded MB) is coded by itself. An FMB (forward-predicted MB), a BMB (backward-predicted MB), and a BIMB (bidirectionally predicted MB) use forward reference, backward reference, and bidirectional reference, respectively. I frames contain only IMBs. P frames contain IMBs and FMBs. B frames contain IMBs, FMBs, BMBs, and BIMBs.
Block layer:
A block is a set of 8×8 pixels and is the basic unit used in the DCT.
Inter Coding:
Inter-coding is done through the motion compensation process. During the encoding process, each MB of a P or B frame is tested to compare the costs of motion compensation and intra-coding, and the more economical one is chosen. During the motion compensation process, the encoder finds the best matching region in the reference frame(s) and calculates the prediction error and one or two motion vectors (MVs) for each MB of the current frame. Fig. 2.5 [15] illustrates motion compensation examples.
Fig. 2.5 (a) Motion compensation example for an FMB or a BMB (b) Motion compensation example for a BIMB
Spatial domain compression:
During the spatial domain compression process, the blocks of the frame are fed into the DCT, and then quantization, zig-zag scanning, run-length coding, and entropy coding are performed. The flow chart is shown in Fig. 2.6. Blocks in an IMB contain the original information, and blocks in an FMB, BMB, or BIMB contain the prediction errors of motion compensation.
Fig. 2.6 Flow chart of spatial domain process
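To illustrate the middle stages of this pipeline, the following Python sketch applies quantization, zig-zag scanning, and run-length coding to a block of coefficients. It is a simplified illustration under stated assumptions: the DCT and entropy coding stages are omitted, a single uniform quantization step stands in for the quantization matrices of MPEG-2, the DC coefficient is not treated separately, and the function names are hypothetical.

```python
def quantize(block, step):
    """Uniform quantization (an assumption; MPEG-2 uses quantization matrices)."""
    return [[int(round(v / step)) for v in row] for row in block]

def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag scan order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Walk the anti-diagonals; the direction alternates with the diagonal index.
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

def run_length(coeffs):
    """Encode as (zero_run, value) pairs followed by an end-of-block marker."""
    pairs, run = [], 0
    for v in coeffs:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the marker
    return pairs

def scan_and_encode(block):
    """Zig-zag scan a square block of quantized coefficients, then run-length code it."""
    coeffs = [block[r][c] for r, c in zigzag_order(len(block))]
    return run_length(coeffs)
```

Because the zig-zag scan visits low-frequency coefficients first, the many zero-valued high-frequency coefficients that remain after quantization collapse into the end-of-block marker.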
2.3 Shot Change Detection Method
A shot is defined as a continuous sequence of frames captured by a single camera during a time period [16]. It can represent an event or a subject. Shot change detection is the first step of video processing. In some applications, such as video retrieval and indexing, a shot is the basic unit of a video segment. A shot change is a transition between two adjacent shots. There are two main types of shot changes: abrupt and gradual. In an abrupt shot change, the transition from one shot to another occurs within a single frame. In a gradual shot change, the transition occurs across multiple frames, with video editing effects making the transition look smooth. Fig. 2.7 illustrates the two types of transitions.
Fig. 2.7 Abrupt and gradual shot change chart
Shot change detection locates shot changes by comparing the differences between adjacent frames. Several shot change detection methods have been proposed, such as the pixel based method, the histogram based method, the feature based method, the DCT coefficient based method, the DC image based method, and the MB based method. These approaches can be roughly divided into uncompressed domain and compressed domain methods. The former operate on raw data, and the latter operate on compressed data such as MPEG. In the following paragraphs, we briefly discuss these methods.
Uncompressed domain:
The pixel based method detects shot changes by pair-wise pixel comparison of frames. It captures content variation well but is too sensitive to noise and computationally costly.
The histogram based method compares the histograms of adjacent frames. This method is more robust to noise, but it discards the spatial information of pixels, which affects the accuracy. The feature based method extracts the edge features of the corresponding frames. It is good at capturing shape characteristics but is computationally costly and too sensitive to noise.
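The histogram based method can be sketched in a few lines of Python. This is a minimal illustration rather than any of the cited methods: frames are assumed to be flat lists of 8-bit gray levels, and the bin count, the normalized absolute-difference measure, and the threshold value are all assumptions chosen for the example.

```python
def histogram(frame, bins=16, max_value=256):
    """Gray-level histogram of a frame given as a flat list of pixel values."""
    hist = [0] * bins
    for p in frame:
        hist[p * bins // max_value] += 1
    return hist

def histogram_difference(h1, h2):
    """Sum of absolute bin differences, normalized to the range [0, 1]."""
    total = sum(h1) + sum(h2)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(h1, h2)) / total

def detect_abrupt_changes(frames, threshold=0.5):
    """Return the indices i at which frame i starts a new shot."""
    hists = [histogram(f) for f in frames]
    return [
        i for i in range(1, len(hists))
        if histogram_difference(hists[i - 1], hists[i]) > threshold
    ]
```

A single fixed threshold of this kind only targets abrupt changes; gradual transitions spread the difference over several frames and usually need a different test.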
Compressed domain:
The DCT coefficient based method uses the DCT coefficients of I frames to detect the shot change positions. This method is computationally efficient. However, because it only utilizes I frames, it cannot locate the shot change positions precisely. The DC image is a thumbnail consisting of the DC values of the original MPEG frame. The DC image based method compares adjacent DC images to find the shot change locations. This method is efficient and robust to noise. The MB based methods [20] use the MB information to find the shot change positions. This type of method is simple and fast, but the MB information may differ between encoders.
Lee [17] proposed a compressed domain shot change detection method for MPEG video. In the first phase, the computationally efficient MB information is used to select the shot change candidate frames. In the second phase, only the DC images of the selected candidate frames are extracted to detect the precise shot changes. Thus, the method is computationally efficient, and the two-phase detection process yields more precise detection results. In this thesis, our test baseball videos have been preprocessed with shot change detection using the program provided by Lee [17].
2.4 Overview of MPEG-7 Multimedia Description Schemes
MPEG-7 [1] is an ISO/IEC standard proposed by the Moving Picture Experts Group (MPEG), and its formal name is “Multimedia Content Description Interface”. MPEG-7 will not replace other MPEG standards such as MPEG-1, MPEG-2, and MPEG-4, since it is intended to provide a comprehensive set of tools for describing the multimedia content rather than representing the content itself. However, MPEG-7 does not standardize approaches for multimedia content description. The objective of MPEG-7 is only to standardize the interfaces (descriptors) between the client application and the search engine [22].
MPEG-7 defines Multimedia Description Schemes (MDS) that combine low-level and high-level features to describe the multimedia content. Fig. 2.8 [1] illustrates an overview of the organization of the MPEG-7 MDS, which consists of six components: Basic Elements, Content Management, Content Description, Navigation and Access, Content Organization, and User Interaction.
Fig. 2.8 The overview of the MPEG-7 Multimedia Description Scheme.
(1)Basic Elements:
These description schemes (DSs) provide specific data types and mathematical structures, and address specific needs of multimedia data description such as time, position, persons, individuals, groups, organizations, and textual annotation.
(2)Content Management:
These DSs describe the creation information, usage information and media description.
(3)Content Description:
The content description elements describe the structure (regions, video frames, audio segments) and semantics (objects, events, abstract annotations) of the multimedia data. In the structural aspect, the multimedia data is described from the viewpoint of content structure. In the conceptual aspect, the multimedia data is described from the viewpoint of real-world semantics and conceptual notions.
(4) Navigation and Access:
These DSs facilitate browsing and retrieval of multimedia content by defining summaries, decompositions, and variations of the multimedia content. There are two types of summary DSs: hierarchical mode and sequential mode. Hierarchical mode summaries have multiple levels. The levels close to the root provide coarser summaries, and the levels farther from the root provide more detailed summaries. Sequential mode summaries provide a sequence of video frames, possibly synchronized with audio, which may compose a slide show or an audio-visual skim [1].
(5)Content Organization:
The content organization DSs organize collections of multimedia content segments and describe their common properties. They can group multimedia content into clusters and describe the relations among the clusters.
(6) User Interaction:
These DSs deal with user information, such as personal preferences and user histories.
Chapter 3 The Proposed Method
In this chapter, we discuss our baseball video analysis and summarization scheme in detail. In section 3.1, we describe the field dominant color detection method. In section 3.2, we use motion features and color features to classify shots into categories and detect the pitching shots. In section 3.3, we retrieve the pitching semantic units (PSUs), then detect the change events and the highlight events, and provide the multi-level baseball video summaries.
3.1 The Field Dominant Color Detection Method
There are two dominant colors in a baseball field: the grass color and the sand color. They are very useful for shot classification and identification. However, they differ from stadium to stadium, so the dominant color ranges of baseball fields are different. Fig. 3.1 shows two different baseball fields and the corresponding HSI histograms. The charts in the left column belong to (a), and the others belong to (b). The figure reveals that the dominant colors of these two fields are notably different. Thus, we cannot set a single fixed value for the color of the field.
Fig. 3.1 Two different fields, (a) and (b), and their HSI dominant color histograms
In [3], Zhong et al. built a tennis court dominant color information database that contains enough games so that the color information of a new game will be similar to one in the database. In [2], Ekin et al. assume the existence of a single dominant color (a tone of grass) that indicates the soccer field, and detect this dominant color range by analyzing the HSI color histograms. In our method, we suppose that baseball fields contain two dominant colors, and we use the histogram analysis method to detect the color ranges.
Fig. 3.2 displays the flow chart of the field dominant color detection process. First, two predefined rough field dominant color ranges, RGS (a tone of grass with hue 60∼180 and saturation > 0.06) and RSD (a tone of sand with hue 0∼60 and saturation > 0.06), are used to choose appropriate frames for further analysis. The precise field dominant color ranges are subsets of the rough ranges. Fig. 3.3 shows the two rough field dominant color ranges. Second, we choose the frames that contain sufficiently high field dominant color ratios for further analysis. The choosing rules specified in Eq. 3.1-3.3 eliminate inappropriate frames, such as close-ups, spectators, and the lounge. Because the grass and sand regions always appear in the bottom of the frame, the ratios are measured in the bottom half of the frame: $Ratio_{RGS}$ is defined as the ratio of RGS in the bottom half of the frame, and $Ratio_{RSD}$ is the ratio of RSD in the bottom half of the frame. $T_{RGS}$, $T_{RSD}$, and $T_{RGS+RSD}$ are the corresponding thresholds in the rules.
Fig. 3.2 Flow chart of field dominant color detection (rough color ranges, frame choosing, HSI histogram analysis, precise color ranges)
Fig. 3.3 Rough Dominant Color Ranges
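The frame choosing step can be sketched as follows. The rough ranges follow the hue and saturation bounds of RGS and RSD given in the text, but several details are assumptions made only for this illustration: hue is taken in degrees, the boundary hue of 60 is assigned to RGS, the three threshold values are hypothetical, and the way the three conditions are combined (all must hold) is one plausible reading rather than the actual form of Eq. 3.1-3.3.

```python
def in_rgs(h, s):
    """Rough grass range: hue 60-180 and saturation above 0.06."""
    return 60 <= h <= 180 and s > 0.06

def in_rsd(h, s):
    """Rough sand range: hue 0-60 and saturation above 0.06."""
    return 0 <= h < 60 and s > 0.06

def bottom_half_ratios(frame):
    """Return (Ratio_RGS, Ratio_RSD) over the bottom half of the frame.

    `frame` is a list of rows; each pixel is an (h, s, i) tuple.
    """
    rows = frame[len(frame) // 2:]
    pixels = [p for row in rows for p in row]
    if not pixels:
        return 0.0, 0.0
    n = len(pixels)
    ratio_rgs = sum(1 for h, s, _ in pixels if in_rgs(h, s)) / n
    ratio_rsd = sum(1 for h, s, _ in pixels if in_rsd(h, s)) / n
    return ratio_rgs, ratio_rsd

def is_field_frame(frame, t_rgs=0.2, t_rsd=0.05, t_sum=0.5):
    """Hypothetical choosing rule: every ratio must exceed its threshold."""
    ratio_rgs, ratio_rsd = bottom_half_ratios(frame)
    return ratio_rgs > t_rgs and ratio_rsd > t_rsd and ratio_rgs + ratio_rsd > t_sum
```

Restricting the count to the bottom half keeps a frame with sky or stands in its upper part from being rejected merely because the field covers only part of the image.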
Finally, we analyze the histograms of the chosen frames to get the precise field dominant color ranges. We compute two sets of histograms from the chosen frames. One describes the grass color, where the grass color information comes from the pixels in RGS. The other describes the sand color, where the sand color information comes from the pixels in RSD.
Each set of histograms contains three components: Hue, Saturation, and Intensity. We then analyze each histogram according to Eq. 3.4-3.9 [2], where H refers to the histogram, $i_{peak}$ is the peak index of each histogram, $[i_{min}, i_{max}]$ is the determined range of the histogram, and K is a value between 0 and 1 used to adjust $i_{min}$ and $i_{max}$. Fig. 3.4 illustrates the concept of the histogram analysis. After analyzing the six histogram ranges, we obtain two precise dominant color ranges: the grass color range (GS) and the sand color range (SD).
Fig. 3.4 Histogram Analysis Chart
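The histogram analysis can be illustrated with a short Python sketch. It assumes that the range conditions take the thresholding form used in the dominant color analysis of [2], namely that $[i_{min}, i_{max}]$ is the contiguous interval around the peak in which every bin satisfies $H[i] \geq K \cdot H[i_{peak}]$. The function name `dominant_range` and the default value of K are assumptions for the example, and the circular nature of hue is ignored.

```python
def dominant_range(hist, k=0.2):
    """Return (i_min, i_peak, i_max) for one histogram.

    The range grows outward from the peak while the neighboring bin
    still holds at least k times the peak count.
    """
    i_peak = max(range(len(hist)), key=hist.__getitem__)
    level = k * hist[i_peak]
    i_min = i_peak
    while i_min > 0 and hist[i_min - 1] >= level:
        i_min -= 1
    i_max = i_peak
    while i_max < len(hist) - 1 and hist[i_max + 1] >= level:
        i_max += 1
    return i_min, i_peak, i_max
```

A larger K narrows the range around the peak, and a smaller K widens it. Applying the function to the hue, saturation, and intensity histograms of the grass set and of the sand set gives the six ranges mentioned above.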
3.2 Shot Classification and Pitching Shot Detection
In this section, we first classify shots into some categories according to the motion features and color features. The shot classification results will be used to detect the pitching shots and highlight events in the following analysis processes.
3.2.1 Shot Classification
Motion energy:
Zhang et al. [21] proposed the concept of motion energy for video frames to represent the motion activity of frames. The video frames are divided into a number of blocks. The motion energy of the nth frame is the summation of the displacements of all blocks from the (n−1)th frame to the nth frame. In a baseball video, each shot may contain different content variations and a different number of frames. We compute the motion energy of each shot in a baseball video to estimate shot motion activity, and classify shots into categories according to their motion energy. Instead of computing motion information from consecutive frames with computationally expensive operations, we directly use the motion vector (MV) information of P frames to efficiently approximate the motion energy.
In Eq. 3.10, we define the motion energy of a frame, $ME(f)$, as the summation of the total motion vector magnitudes of the frame. The motion energy of a shot is computed from the P frames in the shot, using the number of total P frames in the shot. After obtaining the motion energy, we classify shots into specified categories. Eq. 3.12 states that a shot belongs to S (small motion energy shot), M (medium motion energy shot), or L (large motion energy shot) according to the magnitude of its motion energy, where two corresponding thresholds separate the categories.
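A minimal Python sketch of this classification is given below. It is an illustration under stated assumptions, not a reproduction of Eq. 3.10-3.12: the magnitude of a motion vector is taken as its Euclidean length, the shot motion energy is taken as the average frame motion energy over the P frames of the shot, and the function names and the threshold parameters `t_small` and `t_large` are hypothetical.

```python
import math

def frame_motion_energy(motion_vectors):
    """Sum of MV magnitudes of one P frame; each MV is a (dx, dy) pair."""
    return sum(math.hypot(dx, dy) for dx, dy in motion_vectors)

def shot_motion_energy(p_frames):
    """Average frame motion energy over the P frames of a shot (assumed form)."""
    if not p_frames:
        return 0.0
    return sum(frame_motion_energy(mvs) for mvs in p_frames) / len(p_frames)

def classify_motion(energy, t_small, t_large):
    """Return 'S', 'M', or 'L' for small, medium, or large motion energy."""
    if energy < t_small:
        return "S"
    if energy < t_large:
        return "M"
    return "L"

# Two P frames with motion energies 5 and 10 give a shot motion energy of 7.5.
energy = shot_motion_energy([[(3, 4), (0, 0)], [(6, 8)]])
print(classify_motion(energy, t_small=5.0, t_large=10.0))
```

Averaging over P frames keeps shots of different lengths comparable, which matters because shots in a baseball video vary widely in duration.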
Flow magnitude between frames:
In [18], Kobla et al. proposed the concept of the flow of frames to represent the motion relation of adjacent frames in the video. They use the forward-predicted and backward-predicted motion vectors of B frames in a Sub-GOP (SGOP) structure to derive the flow vectors of each frame in the SGOP structure. Fig. 3.5 represents the flows between the frames in an SGOP structure, where $R_i$ and $R_j$ are two reference frames, each of which may be an I frame or a P frame. The flow of the kth frame in the video, $flow_k$, can be thought of as the collection of its flow vectors. The flow magnitude of a frame is defined as the summation of the lengths of all its flow vectors. Because the flow direction relative to the video play sequence is always backward, we are more interested in the flow magnitude.
Fig. 3.5 Flow between Frames
Through the flow magnitude information, we can estimate the motion activities of the corresponding successive frames. If flow magnitudes of frames are large, the content of corresponding frames is active. On the contrary, if flow magnitudes of frames are small, the content of corresponding frames is static. If a number of successive flow magnitudes are very small in a shot, then we can suppose that the shot contains static states or still frames. Some still frames caused by video editing usually appear with a replay. Fig. 3.6(a) presents a static state (motionless content) in a shot, and Fig. 3.6 (c) is the corresponding flow magnitude chart.
Fig. 3.6 (b) shows a replay, and Fig. 3.6 (d) is the corresponding flow magnitude chart. In Fig. 3.6 (c) and (d), we can see a number of successive flow magnitudes that are very small, and we name this phenomenon SFMS (Successive Flow Magnitudes are Small).
Fig. 3.6 (a) static state shot (b) replay shot (c) flow magnitudes of (a) (d) flow magnitudes of (b)
We classify the shots that contain the SFMS phenomenon as R type shots (containing static states or replay characteristics). Eq. 3.13 defines a rule to detect SFMS, where $Mag(flow_k)$ is the flow magnitude of the kth frame, $fm_s$ is a small flow magnitude threshold, $n$ is a threshold on the number of successive frames, and $l$ is the number of total frames of the shot. In Eq. 3.14, if a shot contains SFMS, then it belongs to the R type.
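The SFMS rule can be sketched in Python as a search for a run of consecutive small flow magnitudes. This is an illustrative reading of the rule described in the text, not the exact form of Eq. 3.13: the function name `has_sfms` is hypothetical, the parameters `fm_s` and `n` mirror the two thresholds, and the strict comparison is an assumption.

```python
def has_sfms(flow_magnitudes, fm_s, n):
    """Return True if at least n successive flow magnitudes are below fm_s."""
    run = 0
    for mag in flow_magnitudes:
        # Extend the current run of small magnitudes, or reset it.
        run = run + 1 if mag < fm_s else 0
        if run >= n:
            return True
    return False

def is_r_type(flow_magnitudes, fm_s, n):
    """A shot is R type if its flow magnitudes contain the SFMS phenomenon."""
    return has_sfms(flow_magnitudes, fm_s, n)
```

For example, with `fm_s = 5` and `n = 4`, the magnitude sequence 50, 40, 1, 0, 2, 1, 60 contains four successive small values and would be labeled R type under these assumptions.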
Grass color ratio:
If the grass color ratio of a shot is high, we suppose that the shot is relevant to in-field events. If the ratio is low, we suppose that the shot is irrelevant to in-field events. Therefore, we compute the grass color ratio of each shot and classify shots into the Gs (small grass color ratio) type and the Gl (large grass color ratio) type. Eq. 3.15 gives the computation of the grass ratio of a shot, $gs(shot)$, where $gs(f_k)$ is the grass color ratio of the frame $f_k$, and the frames $f_k$ are the sampling frames included in the corresponding shot. In Eq. 3.18, a shot is classified into Gs or Gl by $gs(shot)$.
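The grass color ratio classification can be sketched as follows. The sketch assumes that the grass ratio of a shot is the average of the grass ratios of its sampled frames and that a single threshold separates the two types; both are assumptions for illustration, as are the function names, the threshold parameter, and the representation of the precise grass range GS as a predicate on a pixel.

```python
def frame_grass_ratio(pixels, is_grass):
    """Fraction of pixels in a frame that fall in the grass color range."""
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if is_grass(p)) / len(pixels)

def shot_grass_ratio(sampled_frames, is_grass):
    """Average grass ratio over the sampled frames of a shot (assumed form)."""
    if not sampled_frames:
        return 0.0
    ratios = [frame_grass_ratio(f, is_grass) for f in sampled_frames]
    return sum(ratios) / len(ratios)

def classify_grass(ratio, threshold):
    """Return 'Gl' for a large grass color ratio and 'Gs' for a small one."""
    return "Gl" if ratio >= threshold else "Gs"

# Hypothetical GS range on hue only, used here purely as an example predicate.
is_grass = lambda p: 80 <= p[0] <= 140
```

Sampling frames instead of processing every frame keeps the computation light, since the grass ratio usually changes slowly within a shot.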