In this section, we review previous studies on visual content adaptation. According to their design purposes, they are classified into three major categories: transcoding, cropping, and hybrid approaches. Their advantages and disadvantages for small displays are also briefly described.
Earlier works [CV05, AWSZ05, XLS05] have extensively explored video transcoding techniques. The basic transcoding process converts a coded video signal from its original format into another one. The output format is determined entirely by network and device constraints. Well-known transcoding methods include spatial resolution adjustment, temporal resolution adjustment, bit-rate adjustment, and coding syntax conversion [AWSZ05, XLS05]. Scalable video coding is considered a special kind of transcoding technique [WOZ01, Tun02]. Scalability is accomplished by providing multiple versions of a video stream so that different clients can obtain the same content at lower quality; e.g., Tung [Tun02] developed a unified MPEG-4 video codec that supports universal scalability. For clients with small displays, spatial transcoding is always required, but it causes excessive spatial resolution reduction or visual quality degradation. Once visual content is scaled down below its minimal perceptible size, the quality of service (QoS) or quality of experience (QoE) is usually far from acceptable [CXF+03, KMS05]. For example, some important details, such as the gesture of an actor in a drama or the location of the ball in a sports game, become difficult or even impossible to recognize. Another difficulty with spatial transcoding occurs when the aspect ratio of the target screen is inconsistent with that of the source video. If both dimensions of the video are linearly reduced to fit the screen, black borders remain (sometimes known as the letterbox) and valuable display area is wasted. On the other hand, if the video is non-linearly resized to occupy the whole screen, the resulting shape distortion of objects will annoy the viewer [Zet98].
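The aspect-ratio trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from any of the cited systems. The function names `letterbox_fit` and `stretch_distortion` are hypothetical. The sketch also assumes that "fitting" means scaling both axes by one common factor, and that "filling" means scaling each axis independently.

```python
def letterbox_fit(src_w, src_h, dst_w, dst_h):
    """Scale the source by one common factor so that it fits inside the target.

    Returns (out_w, out_h, wasted), where `wasted` is the fraction of the
    target screen left as black borders.
    """
    # Integer cross-multiplication compares the two aspect ratios exactly.
    if dst_w * src_h <= dst_h * src_w:
        # The source is relatively wider: width is the limiting dimension.
        out_w = dst_w
        out_h = max(1, round(src_h * dst_w / src_w))
    else:
        # The source is relatively taller: height is the limiting dimension.
        out_h = dst_h
        out_w = max(1, round(src_w * dst_h / src_h))
    wasted = 1.0 - (out_w * out_h) / (dst_w * dst_h)
    return out_w, out_h, wasted


def stretch_distortion(src_w, src_h, dst_w, dst_h):
    """Ratio of horizontal to vertical scale when each axis is resized
    independently to fill the whole target (1.0 means shapes are preserved).
    """
    return (dst_w * src_h) / (dst_h * src_w)
```

For a 1920x1080 source and an assumed 320x240 screen, `letterbox_fit` yields a 320x180 picture with a quarter of the screen left black, while `stretch_distortion` returns 0.75, i.e., objects appear at three quarters of their proper width.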
Much attention has then been paid to cropping-based approaches [NYHK05, CSE05]. First, Mohan et al. [MSL99] proposed a general framework for adapting multimedia web documents, in which each media item (e.g., a video clip) is described with a multimodal and multiresolution representation hierarchy called the InfoPyramid. An importance value is subjectively assigned to each item combination as a transcoding hint for content servers to dynamically select the best output. Similar ideas are also applied in Lee’s work [LCC+01]. Instead of treating a video frame as a whole, selective presentation (or frame cropping) is allowed to improve the visibility of the user’s regions of interest (ROIs). Following their work, Chen et al. [CXF+03] developed an image adaptation system based on a visual attention model. Using a simulated cognitive mechanism of the human visual system (HVS) [IKN98, PS00, CCW05], the most important region is automatically determined. Better perceptual results have also been reported in other ROI-based video applications [LCS03, HWC05, HWG04]. However, from the viewpoint of content authors, cropping not only destroys the carefully worked-out compositions but also distorts the overall conveyed messages. For example, if a visual scene is composed of multiple key objects, some of them are necessarily discarded, and a single ROI fails to show an overview of their interrelationship. Moreover, significant information loss leads viewers to misinterpret the original meaning that the authors want to communicate.
Figure 3.2: Examples of semantic distortion in adapted videos: (a) and (c) are two original frames from the classical film “Lawrence of Arabia”, and (b) and (d) are the corresponding adapted results using [fil], respectively. With partial coverage, the two men in (a) no longer look into each other’s eyes while chatting in (b), and the man in (d) seems more likely to burn himself with the burning match than to just hold it as in (c). (Courtesy of FlikFX Ltd.)
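To illustrate why a single ROI can drop key objects, the sketch below selects the fixed-size crop window that covers the largest total importance. It is an illustrative assumption, not the attention model of [CXF+03] or any other cited method. The function name `best_crop` and the per-pixel importance map are hypothetical, and the exhaustive search over a summed-area table is chosen only for simplicity.

```python
def best_crop(importance, crop_h, crop_w):
    """Return (top, left, total) of the crop_h x crop_w window whose summed
    importance is largest. `importance` is a non-empty 2D list of numbers.
    Ties are resolved in favor of the first window in row-major order.
    """
    rows, cols = len(importance), len(importance[0])
    if not (0 < crop_h <= rows and 0 < crop_w <= cols):
        raise ValueError("crop window must fit inside the importance map")

    # Summed-area table with a zero border: sat[r][c] holds the sum of
    # importance[0:r][0:c], so any window sum costs four lookups.
    sat = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0.0
        for c in range(cols):
            row_sum += importance[r][c]
            sat[r + 1][c + 1] = sat[r][c + 1] + row_sum

    best = None
    for top in range(rows - crop_h + 1):
        for left in range(cols - crop_w + 1):
            bottom, right = top + crop_h, left + crop_w
            total = (sat[bottom][right] - sat[top][right]
                     - sat[bottom][left] + sat[top][left])
            if best is None or total > best[0]:
                best = (total, top, left)
    total, top, left = best
    return top, left, total


# Hypothetical 4x6 map with two separate key objects (values 5 and 4, 4).
example_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 5, 0, 0, 0, 0],
    [0, 0, 0, 0, 4, 4],
    [0, 0, 0, 0, 0, 0],
]
print(best_crop(example_map, 2, 2))
```

In this toy map, no 2x2 window can contain both objects, so the selected window keeps the pair of 4s (8 of the 13 importance units) and discards the object of importance 5 entirely. This is the loss of interrelationship between key objects that the paragraph above describes.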
Preserving the complete video context and clarifying the specific user interest need not be an either-or choice. Some hybrid approaches that lie between the two opposite extremes have been proposed. Liu et al. [LXMZ03] presented a novel solution for browsing large pictures on mobile devices. All of the important regions are serially displayed, and an optimal browsing path is calculated according to the predicted shifting of visual attention. Pan and scan [Zet98] is an analogous technique for high-resolution video sources, but its discontinuous nature severely annoys the audience [Zet98, fil]. Besides, the requirement of additional temporal resolutions conflicts with the primary video structures. The FlikFX corporation developed an award-winning commercial system for transferring wide-screen films to the 4:3 aspect ratio of TV screens [fil]. The intention is to generate a visually approximate replacement without object distortions. Therefore, each film frame is condensed by eliminating background portions of little significance, and the main actors are artificially brought together to concentrate the viewer’s attention. However, because the original spatial interrelationship of video objects is not considered, semantic distortions are often generated, cf. Figure 3.2. Recent work introduces non-uniform manipulations of the background and foreground information. Setlur et al. [STG+04] decomposed an image into separate objects and unequally shrank them according to their relative visual importance. A side effect is that the relative size of different objects may be altered. Liu et al. [LG05] exploited a non-linear warping transformation to emphasize the attractive foreground image regions, but severe visual distortions are inevitable. Overall, the issue of maintaining user-perceived visual rationality in adapted results is not well addressed in the adaptation literature. Furthermore, although experiments show that non-uniform processing is more flexible and achieves superior performance, most discussions are confined to still images. These observations motivate our approach for motion pictures.