
1.2 A brief history


In this section, I provide a brief personal synopsis of the main developments in computer vision over the last 30 years (Figure 1.6); at least, those that I find personally interesting and which appear to have stood the test of time. Readers not interested in the provenance of various ideas and the evolution of this field should skip ahead to the book overview in Section 1.3.

Figure 1.6 A rough timeline of some of the most active topics of research in computer vision. 1970s: digital image processing; blocks world and line labeling; generalized cylinders; pictorial structures; stereo correspondence; intrinsic images; optical flow; structure from motion. 1980s: image pyramids; scale-space processing; shape from shading, texture, and focus; physically-based modeling; regularization; Markov Random Fields; Kalman filters; 3D range data processing. 1990s: projective invariants; factorization; physics-based vision; graph cuts; particle filtering; energy-based segmentation; face recognition and detection; subspace methods; image-based modeling and rendering; texture synthesis and inpainting. 2000s: computational photography; feature-based recognition; MRF inference algorithms; category recognition; learning.

1970s. When computer vision first started out in the early 1970s, it was viewed as the visual perception component of an ambitious agenda to mimic human intelligence and to endow robots with intelligent behavior. At the time, it was believed by some of the early pioneers of artificial intelligence and robotics (at places such as MIT, Stanford, and CMU) that solving the “visual input” problem would be an easy step along the path to solving more difficult problems such as higher-level reasoning and planning. According to one well-known story, in 1966, Marvin Minsky at MIT asked his undergraduate student Gerald Jay Sussman to “spend the summer linking a camera to a computer and getting the computer to describe what it saw” (Boden 2006, p. 781).5 We now know that the problem is slightly more difficult than that.6

What distinguished computer vision from the already existing field of digital image processing (Rosenfeld and Pfaltz 1966; Rosenfeld and Kak 1976) was a desire to recover the three-dimensional structure of the world from images and to use this as a stepping stone towards full scene understanding. Winston (1975) and Hanson and Riseman (1978) provide two nice collections of classic papers from this early period.

Early attempts at scene understanding involved extracting edges and then inferring the 3D structure of an object or a “blocks world” from the topological structure of the 2D lines (Roberts 1965). Several line labeling algorithms (Figure 1.7a) were developed at that time (Huffman 1971; Clowes 1971; Waltz 1975; Rosenfeld, Hummel, and Zucker 1976; Kanade 1980). Nalwa (1993) gives a nice review of this area. The topic of edge detection was also an active area of research; a nice survey of contemporaneous work can be found in Davis (1975).

5 Boden (2006) cites Crevier (1993) as the original source. The actual Vision Memo was authored by Seymour Papert (1966) and involved a whole cohort of students.

6 To see how far robotic vision has come in the last four decades, have a look at the towel-folding robot at http://rll.eecs.berkeley.edu/pr/icra10/ (Maitin-Shepard, Cusumano-Towner, Lei et al. 2010).

Figure 1.7 Some early (1970s) examples of computer vision algorithms: (a) line labeling (Nalwa 1993) © 1993 Addison-Wesley, (b) pictorial structures (Fischler and Elschlager 1973) © 1973 IEEE, (c) articulated body model (Marr 1982) © 1982 David Marr, (d) intrinsic images (Barrow and Tenenbaum 1981) © 1973 IEEE, (e) stereo correspondence (Marr 1982) © 1982 David Marr, (f) optical flow (Nagel and Enkelmann 1986) © 1986 IEEE.


Three-dimensional modeling of non-polyhedral objects was also being studied (Baumgart 1974; Baker 1977). One popular approach used generalized cylinders, i.e., solids of revolution and swept closed curves (Agin and Binford 1976; Nevatia and Binford 1977), often arranged into parts relationships7 (Hinton 1977; Marr 1982) (Figure 1.7c). Fischler and Elschlager (1973) called such elastic arrangements of parts pictorial structures (Figure 1.7b). This is currently one of the favored approaches being used in object recognition (see Section 14.4 and Felzenszwalb and Huttenlocher 2005).

A qualitative approach to understanding intensities and shading variations and explaining them by the effects of image formation phenomena, such as surface orientation and shadows, was championed by Barrow and Tenenbaum (1981) in their paper on intrinsic images (Figure 1.7d), along with the related 2½-D sketch ideas of Marr (1982). This approach is again seeing a bit of a revival in the work of Tappen, Freeman, and Adelson (2005).

More quantitative approaches to computer vision were also developed at the time, including the first of many feature-based stereo correspondence algorithms (Figure 1.7e) (Dev 1974; Marr and Poggio 1976; Moravec 1977; Marr and Poggio 1979; Mayhew and Frisby 1981; Baker 1982; Barnard and Fischler 1982; Ohta and Kanade 1985; Grimson 1985; Pollard, Mayhew, and Frisby 1985; Prazdny 1985) and intensity-based optical flow algorithms (Figure 1.7f) (Horn and Schunck 1981; Huang 1981; Lucas and Kanade 1981; Nagel 1986).

7 In robotics and computer animation, these linked-part graphs are often called kinematic chains.

The early work in simultaneously recovering 3D structure and camera motion (see Chapter 7) also began around this time (Ullman 1979; Longuet-Higgins 1981).

A lot of the philosophy of how vision was believed to work at the time is summarized in David Marr’s (1982) book.8 In particular, Marr introduced his notion of the three levels of description of a (visual) information processing system. These three levels, very loosely paraphrased according to my own interpretation, are:

• Computational theory: What is the goal of the computation (task) and what are the constraints that are known or can be brought to bear on the problem?

• Representations and algorithms: How are the input, output, and intermediate information represented and which algorithms are used to calculate the desired result?

• Hardware implementation: How are the representations and algorithms mapped onto actual hardware, e.g., a biological vision system or a specialized piece of silicon? Conversely, how can hardware constraints be used to guide the choice of representation and algorithm? With the increasing use of graphics chips (GPUs) and many-core architectures for computer vision (see Section C.2), this question is again becoming quite relevant.

As I mentioned earlier in this introduction, it is my conviction that a careful analysis of the problem specification and known constraints from image formation and priors (the scientific and statistical approaches) must be married with efficient and robust algorithms (the engineering approach) to design successful vision algorithms. Thus, it seems that Marr’s philosophy is as good a guide to framing and solving problems in our field today as it was 25 years ago.

1980s. In the 1980s, a lot of attention was focused on more sophisticated mathematical techniques for performing quantitative image and scene analysis.

Image pyramids (see Section 3.5) started being widely used to perform tasks such as image blending (Figure 1.8a) and coarse-to-fine correspondence search (Rosenfeld 1980; Burt and Adelson 1983a,b; Rosenfeld 1984; Quam 1984; Anandan 1989). Continuous versions of pyramids using the concept of scale-space processing were also developed (Witkin 1983; Witkin, Terzopoulos, and Kass 1986; Lindeberg 1990). In the late 1980s, wavelets (see Section 3.5.4) started displacing or augmenting regular image pyramids in some applications (Adelson, Simoncelli, and Hingorani 1987; Mallat 1989; Simoncelli and Adelson 1990a,b; Simoncelli, Freeman, Adelson et al. 1992).
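As a minimal sketch of how such a pyramid can be constructed (my illustration, not code from the book; the function names are made up), assuming a grayscale image stored as a 2D NumPy array and using the 5-tap binomial kernel popularized by Burt and Adelson:

```python
import numpy as np

def downsample(img):
    """Blur with a 5-tap binomial (approximately Gaussian) filter,
    then drop every other row and column."""
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # Burt-Adelson kernel
    pad = np.pad(img, 2, mode="reflect")            # reflect-pad the borders
    h, w = img.shape
    rows = sum(k[i] * pad[:, i:i + w] for i in range(5))  # filter horizontally
    out = sum(k[i] * rows[i:i + h, :] for i in range(5))  # filter vertically
    return out[::2, ::2]                            # decimate by a factor of two

def gaussian_pyramid(img, levels=4):
    """Return a [full-res, half-res, quarter-res, ...] stack."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

# Coarse-to-fine matching would start at pyr[-1] and refine toward pyr[0].
pyr = gaussian_pyramid(np.random.rand(256, 256))
print([p.shape for p in pyr])  # [(256, 256), (128, 128), (64, 64), (32, 32)]
```

Coarse-to-fine correspondence search exploits exactly this structure: a match found cheaply at the coarsest level seeds and constrains the search at the next finer level.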

8 More recent developments in visual perception theory are covered in (Palmer 1999; Livingstone 2008).

Figure 1.8 Examples of computer vision algorithms from the 1980s: (a) pyramid blending (Burt and Adelson 1983b) © 1983 ACM, (b) shape from shading (Freeman and Adelson 1991) © 1991 IEEE, (c) edge detection (Freeman and Adelson 1991) © 1991 IEEE, (d) physically based models (Terzopoulos and Witkin 1988) © 1988 IEEE, (e) regularization-based surface reconstruction (Terzopoulos 1988) © 1988 IEEE, (f) range data acquisition and merging (Banno, Masuda, Oishi et al. 2008) © 2008 Springer.


The use of stereo as a quantitative shape cue was extended by a wide variety of shape-from-X techniques, including shape from shading (Figure 1.8b) (see Section 12.1.1 and Horn 1975; Pentland 1984; Blake, Zimmerman, and Knowles 1985; Horn and Brooks 1986, 1989), photometric stereo (see Section 12.1.1 and Woodham 1981), shape from texture (see Section 12.1.2 and Witkin 1981; Pentland 1984; Malik and Rosenholtz 1997), and shape from focus (see Section 12.1.3 and Nayar, Watanabe, and Noguchi 1995). Horn (1986) has a nice discussion of most of these techniques.

Research into better edge and contour detection (Figure 1.8c) (see Section 4.2) was also active during this period (Canny 1986; Nalwa and Binford 1986), including the introduction of dynamically evolving contour trackers (see Section 5.1.1) such as snakes (Kass, Witkin, and Terzopoulos 1988), as well as three-dimensional physically based models (Figure 1.8d) (Terzopoulos, Witkin, and Kass 1987; Kass, Witkin, and Terzopoulos 1988; Terzopoulos and Fleischer 1988; Terzopoulos, Witkin, and Kass 1988).

Researchers noticed that a lot of the stereo, flow, shape-from-X, and edge detection algorithms could be unified, or at least described, using the same mathematical framework if they were posed as variational optimization problems (see Section 3.7) and made more robust (well-posed) using regularization (Figure 1.8e) (see Section 3.7.1 and Terzopoulos 1983; Poggio, Torre, and Koch 1985; Terzopoulos 1986b; Blake and Zisserman 1987; Bertero, Poggio, and Torre 1988; Terzopoulos 1988). Around the same time, Geman and Geman (1984) pointed out that such problems could equally well be formulated using discrete Markov Random Field (MRF) models (see Section 3.7.2), which enabled the use of better (global) search and optimization algorithms, such as simulated annealing.
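To make the flavor of such formulations concrete, here is a representative first-order regularization energy for reconstructing a smooth function f from noisy measurements d (my notation, not a formula quoted from the cited papers):

E(f) = \iint \Big[ \big(f(x, y) - d(x, y)\big)^2 + \lambda \big(f_x^2 + f_y^2\big) \Big] \, dx \, dy.

The first (data) term penalizes deviation from the measurements, while the second (smoothness) term, weighted by λ, makes the inverse problem well-posed; the MRF view replaces the integral with a sum over pixels and the squared gradients with pairwise clique potentials, which is what opens the door to discrete global optimizers such as simulated annealing.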

Online variants of MRF algorithms that modeled and updated uncertainties using the Kalman filter were introduced a little later (Dickmanns and Graefe 1988; Matthies, Kanade, and Szeliski 1989; Szeliski 1989). Attempts were also made to map both regularized and MRF algorithms onto parallel hardware (Poggio and Koch 1985; Poggio, Little, Gamble et al. 1988; Fischler, Firschein, Barnard et al. 1989). The book by Fischler and Firschein (1987) contains a nice collection of articles focusing on all of these topics (stereo, flow, regularization, MRFs, and even higher-level vision).

Three-dimensional range data processing (acquisition, merging, modeling, and recognition; see Figure 1.8f) continued being actively explored during this decade (Agin and Binford 1976; Besl and Jain 1985; Faugeras and Hebert 1987; Curless and Levoy 1996). The compilation by Kanade (1987) contains a lot of the interesting papers in this area.

1990s. While a lot of the previously mentioned topics continued to be explored, a few of them became significantly more active.

A burst of activity in using projective invariants for recognition (Mundy and Zisserman 1992) evolved into a concerted effort to solve the structure from motion problem (see Chapter 7). A lot of the initial activity was directed at projective reconstructions, which did not require knowledge of camera calibration (Faugeras 1992; Hartley, Gupta, and Chang 1992; Hartley 1994a; Faugeras and Luong 2001; Hartley and Zisserman 2004). Simultaneously, factorization techniques (Section 7.3) were developed to solve efficiently problems for which orthographic camera approximations were applicable (Figure 1.9a) (Tomasi and Kanade 1992; Poelman and Kanade 1997; Anandan and Irani 2002) and then later extended to the perspective case (Christy and Horaud 1996; Triggs 1996). Eventually, the field started using full global optimization (see Section 7.4 and Taylor, Kriegman, and Anandan 1991; Szeliski and Kang 1994; Azarbayejani and Pentland 1995), which was later recognized as being the same as the bundle adjustment techniques traditionally used in photogrammetry (Triggs, McLauchlan, Hartley et al. 1999). Fully automated (sparse) 3D modeling systems were built using such techniques (Beardsley, Torr, and Zisserman 1996; Schaffalitzky and Zisserman 2002; Brown and Lowe 2003; Snavely, Seitz, and Szeliski 2006).

Work begun in the 1980s on using detailed measurements of color and intensity combined with accurate physical models of radiance transport and color image formation created its own subfield known as physics-based vision. A good survey of the field can be found in the three-volume collection on this topic (Wolff, Shafer, and Healey 1992a; Healey and Shafer 1992; Shafer, Healey, and Wolff 1992).

Figure 1.9 Examples of computer vision algorithms from the 1990s: (a) factorization-based structure from motion (Tomasi and Kanade 1992) © 1992 Springer, (b) dense stereo matching (Boykov, Veksler, and Zabih 2001), (c) multi-view reconstruction (Seitz and Dyer 1999) © 1999 Springer, (d) face tracking (Matthews, Xiao, and Baker 2007), (e) image segmentation (Belongie, Fowlkes, Chung et al. 2002) © 2002 Springer, (f) face recognition (Turk and Pentland 1991a).


Optical flow methods (see Chapter 8) continued to be improved (Nagel and Enkelmann 1986; Bolles, Baker, and Marimont 1987; Horn and Weldon Jr. 1988; Anandan 1989; Bergen, Anandan, Hanna et al. 1992; Black and Anandan 1996; Bruhn, Weickert, and Schnörr 2005; Papenberg, Bruhn, Brox et al. 2006), with (Nagel 1986; Barron, Fleet, and Beauchemin 1994; Baker, Black, Lewis et al. 2007) being good surveys. Similarly, a lot of progress was made on dense stereo correspondence algorithms (see Chapter 11, Okutomi and Kanade (1993, 1994); Boykov, Veksler, and Zabih (1998); Birchfield and Tomasi (1999); Boykov, Veksler, and Zabih (2001), and the survey and comparison in Scharstein and Szeliski (2002)), with the biggest breakthrough being perhaps global optimization using graph cut techniques (Figure 1.9b) (Boykov, Veksler, and Zabih 2001).

Multi-view stereo algorithms (Figure 1.9c) that produce complete 3D surfaces (see Section 11.6) were also an active topic of research (Seitz and Dyer 1999; Kutulakos and Seitz 2000) that continues to be active today (Seitz, Curless, Diebel et al. 2006). Techniques for producing 3D volumetric descriptions from binary silhouettes (see Section 11.6.2) continued to be developed (Potmesil 1987; Srivasan, Liang, and Hackwood 1990; Szeliski 1993; Laurentini 1994), along with techniques based on tracking and reconstructing smooth occluding contours (see Section 11.2.1 and Cipolla and Blake 1992; Vaillant and Faugeras 1992; Zheng 1994; Boyer and Berger 1997; Szeliski and Weiss 1998; Cipolla and Giblin 2000).

Tracking algorithms also improved a lot, including contour tracking using active contours (see Section 5.1), such as snakes (Kass, Witkin, and Terzopoulos 1988), particle filters (Blake and Isard 1998), and level sets (Malladi, Sethian, and Vemuri 1995), as well as intensity-based (direct) techniques (Lucas and Kanade 1981; Shi and Tomasi 1994; Rehg and Kanade 1994), often applied to tracking faces (Figure 1.9d) (Lanitis, Taylor, and Cootes 1997; Matthews and Baker 2004; Matthews, Xiao, and Baker 2007) and whole bodies (Sidenbladh, Black, and Fleet 2000; Hilton, Fua, and Ronfard 2006; Moeslund, Hilton, and Krüger 2006).

Image segmentation (see Chapter 5) (Figure 1.9e), a topic which has been active since the earliest days of computer vision (Brice and Fennema 1970; Horowitz and Pavlidis 1976; Riseman and Arbib 1977; Rosenfeld and Davis 1979; Haralick and Shapiro 1985; Pavlidis and Liow 1990), was also an active topic of research, producing techniques based on minimum energy (Mumford and Shah 1989) and minimum description length (Leclerc 1989), normalized cuts (Shi and Malik 2000), and mean shift (Comaniciu and Meer 2002).

Statistical learning techniques started appearing, first in the application of principal component eigenface analysis to face recognition (Figure 1.9f) (see Section 14.2.1 and Turk and Pentland 1991a) and linear dynamical systems for curve tracking (see Section 5.1.1 and Blake and Isard 1998).
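As a minimal sketch of the eigenface idea (mine, not code from Turk and Pentland's paper; the helper names and array shapes are illustrative): each face image is unrolled into a vector, the principal components of the training set become the "eigenfaces," and recognition is nearest-neighbor matching in the resulting low-dimensional subspace.

```python
import numpy as np

def fit_eigenfaces(faces, k=8):
    """faces: (n_images, n_pixels) array of vectorized, same-size face images.
    Returns the mean face and the top-k principal components (eigenfaces)."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data matrix; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(face, mean, eigenfaces):
    """Coordinates of one vectorized face in the eigenface subspace."""
    return eigenfaces @ (face - mean)

def recognize(face, gallery, labels, mean, eigenfaces):
    """Nearest-neighbor identification in the low-dimensional subspace."""
    q = project(face, mean, eigenfaces)
    dists = [np.linalg.norm(q - project(g, mean, eigenfaces)) for g in gallery]
    return labels[int(np.argmin(dists))]

# Example with random stand-in data: 20 "faces" of 32x32 pixels.
faces = np.random.rand(20, 32 * 32)
mean, basis = fit_eigenfaces(faces, k=8)
print(recognize(faces[3], faces, list(range(20)), mean, basis))  # prints 3
```

The appeal of the subspace view is that distances are computed among k coefficients rather than thousands of pixels, which is what made the approach practical on 1990s hardware.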

Perhaps the most notable development in computer vision during this decade was the increased interaction with computer graphics (Seitz and Szeliski 1999), especially in the cross-disciplinary area of image-based modeling and rendering (see Chapter 13). The idea of manipulating real-world imagery directly to create new animations first came to prominence with image morphing techniques (Figure 1.5c) (see Section 3.6.3 and Beier and Neely 1992) and was later applied to view interpolation (Chen and Williams 1993; Seitz and Dyer 1996), panoramic image stitching (Figure 1.5a) (see Chapter 9 and Mann and Picard 1994; Chen 1995; Szeliski 1996; Szeliski and Shum 1997; Szeliski 2006a), and full light-field rendering (Figure 1.10a) (see Section 13.3 and Gortler, Grzeszczuk, Szeliski et al. 1996; Levoy and Hanrahan 1996; Shade, Gortler, He et al. 1998). At the same time, image-based modeling techniques (Figure 1.10b) for automatically creating realistic 3D models from collections of images were also being introduced (Beardsley, Torr, and Zisserman 1996; Debevec, Taylor, and Malik 1996; Taylor, Debevec, and Malik 1996).

Figure 1.10 Recent examples of computer vision algorithms: (a) image-based rendering (Gortler, Grzeszczuk, Szeliski et al. 1996), (b) image-based modeling (Debevec, Taylor, and Malik 1996) © 1996 ACM, (c) interactive tone mapping (Lischinski, Farbman, Uyttendaele et al. 2006a), (d) texture synthesis (Efros and Freeman 2001), (e) feature-based recognition (Fergus, Perona, and Zisserman 2007), (f) region-based recognition (Mori, Ren, Efros et al. 2004) © 2004 IEEE.

2000s. This past decade has continued to see a deepening interplay between the vision and graphics fields. In particular, many of the topics introduced under the rubric of image-based rendering, such as image stitching (see Chapter 9), light-field capture and rendering (see Section 13.3), and high dynamic range (HDR) image capture through exposure bracketing (Figure 1.5b) (see Section 10.2 and Mann and Picard 1995; Debevec and Malik 1997), were re-christened as computational photography (see Chapter 10) to acknowledge the increased use of such techniques in everyday digital photography. For example, the rapid adoption of exposure bracketing to create high dynamic range images necessitated the development of tone mapping algorithms (Figure 1.10c) (see Section 10.2.1) to convert such images back to displayable results (Fattal, Lischinski, and Werman 2002; Durand and Dorsey 2002; Reinhard, Stark, Shirley et al. 2002; Lischinski, Farbman, Uyttendaele et al. 2006a). In addition to merging multiple exposures, techniques were developed to merge flash images with non-flash counterparts (Eisemann and Durand 2004; Petschnigg, Agrawala, Hoppe et al. 2004) and to interactively or automatically select different regions from overlapping images (Agarwala,

