Chapter 7 Experimental Results
7.3 Real-time camera match-moving method
The following experiments were performed to assess the precision of the proposed system and to compare the estimated camera positioning with its actual motion. The experiments comprised an angular measurement and a displacement measurement; each measurement comprised three submeasurements along three axes, namely, the X-axis, Y-axis, and Z-axis.
For the angular measurement, an actual camera was mounted on an electric rotary plate (see Figure 7.21.a) to simulate the rotation of the camera about its Y-axis, as a cameraman would perform it. The precise angular variation of the rotation was measured simultaneously with an electric angle meter (see Figure 7.21.b), while the movements of the camera in the X-axis and Z-axis directions were detected and recorded by an electronic level placed above the camera.
Figure 7.20. Male part of the user survey histogram comparing the random and MK-CPN methods.
The displacement measurement was performed by mounting an actual camera on a rail to simulate the “dolly” movement that a cameraman would perform, that is, the linear displacement of the camera in the X-axis or Z-axis direction. The displacement was measured with a laser range finder (see Figure 7.22).
Table 7.5. Angular and Linear Displacement Error Measurement

                     Angular                       Displacement
                     (average error per degree)    (average error per centimeter)
X-axis               ± 0.03                        ± 0.04
Y-axis               ± 0.01                        ± 0.04
Z-axis               ± 0.07                        ± 0.04
Avg. accuracy (%)    96%                           96%
Figure 7.21. Angular measurement tools (a) electric rotary plate (b) electric angle meter.
Figure 7.22. Displacement measurement tools (a) dolly rail (b) laser range finder.
Both the angular and displacement experimental results are provided in Table 7.5.
These results show that the proposed system provides satisfactory precision for X-axis and Y-axis angular measurements but more limited precision for Z-axis rotation. However, practical video recordings involve only limited amounts of Z-axis rotation. The displacement measurements give the accuracy of the system for each 1-cm displacement; the error was within 0.04 cm on every axis.
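The text does not state how the average accuracy in Table 7.5 was computed. One reading that is consistent with the listed values is 100 × (1 − mean error per unit); this formula is an assumption, and the sketch below only checks that it reproduces the 96% figures from the per-axis errors in the table.

```python
# Per-axis average errors copied from Table 7.5.
angular_error_per_degree = {"X": 0.03, "Y": 0.01, "Z": 0.07}
displacement_error_per_cm = {"X": 0.04, "Y": 0.04, "Z": 0.04}


def average_accuracy(errors_per_unit):
    """Assumed formula (not stated in the text): 100 * (1 - mean error per unit)."""
    mean_error = sum(errors_per_unit.values()) / len(errors_per_unit)
    return 100.0 * (1.0 - mean_error)


angular_accuracy = average_accuracy(angular_error_per_degree)
displacement_accuracy = average_accuracy(displacement_error_per_cm)
print(round(angular_accuracy), round(displacement_accuracy))
```

Under this assumed formula, both values round to 96, which matches the last row of the table.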
Chapter 8
Conclusions and Future Work
Our automatic lecture recording system could be applied to a wide range of presentations, ranging from large academic or business seminars, speeches, and presentations, to small general lectures and classroom teaching. In addition to recording the valuable information in a speech, our system could cut personnel costs required for shooting presentations. Whereas a single-camera system might produce a monotonous video, our system uses a variety of shooting rules to produce diverse videos, which audiences experience as unusually immersive and engaging.
We proposed an automatic real-time lecture recording system, the smart lecture recording (SLR) system. The proposed system combines three virtual cameramen (VCs) and a virtual director (VD).
There are three VCs in the proposed system: speaker cameraman (SC), audience cameraman (AC), and hall cameraman (HC); all three use camera subsystems that include pan-tilt-zoom (PTZ) cameras. The SC is composed of a PTZ camera and a depth camera. The SC can track the speaker and perform pose recognition in real time. Three postures are predefined for lecture scenarios, namely pointing, illustrating, and relaxing.
The VD system has two main jobs: shot selection and visual instruction. The VD receives multiple views from the VCs and then considers four types of content analysis criteria to evaluate the quality of each view. We also implemented a learning mechanism for choosing the most suitable shot among the input shots. Our learning mechanism is more flexible than a traditional decision-making model. The experimental results demonstrated that our system can be applied in the real world. This research mainly focused on three topics: image feature extraction, assessment of image quality by content analysis, and multiple decision-making for shot selection.
Previous systems have commonly applied optical analysis and aesthetic analysis to estimate the quality levels of shots. Our VD applies common optical analysis and aesthetic analysis, but it also applies action analysis to consider the factors that may interest viewers, and continuity analysis to consider the fluency of a shot when switching.
Real directors do not follow fixed, standardized rules for shot selection. In our design of a multiple decision-making system for shot selection, we proposed a learning mechanism based on our counter propagation network (CPN). Our system can learn skills from real directors, and experiments have proven that our system works like a professional director in the real world. Because our system’s learning patterns involve supervised learning, it must be provided with training materials that contain labeled expected output. In the present work, the expected output was provided by a student with directorial experience. If possible, professional directors could provide more extensive expected output to later versions of this system, which would bring the experimental results of future versions closer to professional broadcasting standards.
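To make the supervised learning idea concrete, the following is a minimal sketch of a generic forward-only counter propagation network: a competitive (Kohonen) layer followed by a Grossberg output layer. It is not the CPN variant used in this work. The three-element feature vectors (one hypothetical quality score per view), the one-hot labels standing in for a director's choices, the learning rates, and the seeding of the hidden units from training samples are all assumptions made for illustration.

```python
import math


class CounterPropagationNetwork:
    """Forward-only CPN: competitive (Kohonen) layer plus Grossberg output layer."""

    def __init__(self, prototypes, n_outputs):
        # One Kohonen weight vector per hidden unit, seeded from given prototypes.
        self.kohonen = [list(p) for p in prototypes]
        # One Grossberg weight vector (the unit's output values) per hidden unit.
        self.grossberg = [[0.0] * n_outputs for _ in prototypes]

    def winner(self, x):
        """Index of the hidden unit whose weight vector is closest to x."""
        return min(range(len(self.kohonen)),
                   key=lambda j: math.dist(x, self.kohonen[j]))

    def train(self, samples, targets, epochs=20, alpha=0.3, beta=0.3):
        for _ in range(epochs):
            for x, y in zip(samples, targets):
                j = self.winner(x)
                # Kohonen update: move the winning unit toward the input.
                self.kohonen[j] = [w + alpha * (xi - w)
                                   for w, xi in zip(self.kohonen[j], x)]
                # Grossberg update: move the winner's outputs toward the target.
                self.grossberg[j] = [v + beta * (yi - v)
                                     for v, yi in zip(self.grossberg[j], y)]

    def predict(self, x):
        """Index of the largest output of the winning hidden unit."""
        out = self.grossberg[self.winner(x)]
        return max(range(len(out)), key=lambda k: out[k])


# Hypothetical training data: scores for (speaker, audience, hall) views,
# labeled with the view a human director selected (one-hot).
samples = [[0.9, 0.2, 0.1], [0.8, 0.3, 0.2],
           [0.2, 0.9, 0.1], [0.3, 0.8, 0.2],
           [0.1, 0.2, 0.9], [0.2, 0.3, 0.8]]
targets = [[1, 0, 0], [1, 0, 0],
           [0, 1, 0], [0, 1, 0],
           [0, 0, 1], [0, 0, 1]]

net = CounterPropagationNetwork([samples[0], samples[2], samples[4]], n_outputs=3)
net.train(samples, targets)
chosen = net.predict([0.85, 0.25, 0.15])  # index of the selected view
```

Because the output layer is trained only from labeled selections, the behavior of such a network depends directly on who provides the labels, which is why more extensive expected output from professional directors would matter.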
To facilitate the experimental recordings, the current VD system was implemented on a laptop. The system simultaneously operated on signals received from three cameras, and it was required to calculate vast quantities of information at any given moment. To prevent latency and jitter on the output screen, salient object detection was implemented on the camera side: each VC calculated salient information and passed it to the VD. Thus, the VD host could devote its computing power to producing a smooth broadcast.
For video editing and postproduction, we present a system that provides a real-time automatic camera match-moving method for virtual–real synthesis before film postproduction. The proposed system consists of two subsystems. One is the high-accuracy camera-tracking system, which can reconstruct the camera angles of live-action footage in real time. The other subsystem is the real-time virtual–real preview system, which controls virtual cameras and optimizes the stereo parameter settings. The preview system is implemented as a computer graphics software plug-in that inputs the virtual camera key frames photographed from specific camera angles and outputs the rendered footage. The goal of the proposed method is to build a quick and robust match-moving method for VFX synthesis before postproduction.
Because the field of view of each depth camera is limited to approximately 57 degrees and the real object is generally illuminated by high-power lighting during the shooting, the depth information acquired from the depth camera may not be perfect, and thus, the precision of the measured depth map can be adversely affected. Therefore, to enlarge the depth measurement range of the depth camera, a future depth camera unit could combine several depth cameras facing in different directions to generate different depth frames; the depth frames of those depth cameras could be composed into one depth map.
In the future, we plan to integrate the VRMM component and speech transcription into the online SLR system. To combine the VRMM component with the SLR system, the temporal depth fusion needs to be adapted to PTZ cameras, because a PTZ camera supports only pan, tilt, and zoom operations and has no shift operation.
Thanks to the development of deep learning technology, the accuracy of speech transcription has improved significantly. Speech transcription would allow the SLR system to synthesize subtitles onto the recorded video instantly, which would make the SLR system more practical. Furthermore, we intend to replace the existing 480p PTZ cameras with 1080p (Full HD) high-resolution cameras. High-resolution cameras are now mainstream equipment and give users a better experience. However, using high-resolution cameras will also place additional load on the system. For example, high-resolution images can increase memory usage more than sixfold. In addition, the content analysis in the rating algorithm must also take computational efficiency into account.
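The growth in memory usage can be checked with a short calculation. The sketch below assumes that the 480p frames are 640 × 480 pixels and that frames are held uncompressed as 8-bit RGB; neither detail is stated in the text, and other 480p frame sizes would give a somewhat different ratio.

```python
def frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame, assuming 8-bit RGB (3 bytes per pixel)."""
    return width * height * bytes_per_pixel


sd_bytes = frame_bytes(640, 480)     # assumed 480p frame size
hd_bytes = frame_bytes(1920, 1080)   # 1080p (Full HD) frame size
ratio = hd_bytes / sd_bytes
print(f"{sd_bytes} B -> {hd_bytes} B per frame, ratio {ratio:.2f}")
```

Under these assumptions, one frame grows from 921,600 bytes to 6,220,800 bytes, a factor of 6.75, and every per-frame buffer held by the content analysis grows by the same factor.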
In addition, we want to integrate deep learning and active learning into our system. As discussed in Chapter 1, a limitation of CNNs is their local feature learning property, whereas content analysis usually requires global and temporal information. However, CNNs are well suited to object recognition and detection. For example, a CNN could be used to detect humans and tools in the shot, and this information could then be used to increase event detection accuracy.
In some situations, unlabeled data are abundant but manual labeling is exorbitantly expensive. In this scenario, learning algorithms can actively query for labels. The labeling task would be done automatically by a decision system or manually by a professional user. This type of iterative supervised learning is called active learning. Because the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in normal supervised learning. The concept diagram is shown in Figure 8.1.
The benefits of active learning are as follows: it enables the achievement of online learning, it converges rapidly, and it demonstrates a high recognition rate.
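The query loop of active learning can be illustrated with a deliberately small example: pool-based uncertainty sampling for a one-dimensional threshold classifier. This is a generic sketch, not the mechanism planned for the SLR system. The function `oracle` stands in for the decision system or professional user who supplies labels, and the pool values and the true boundary of 0.6 are hypothetical.

```python
def active_learn_threshold(pool, oracle, max_queries=20):
    """Learn a threshold t (x >= t means positive) with as few labels as possible.

    Assumes the pool is separable: negatives lie below positives.
    Returns the estimated threshold and the number of labels requested.
    """
    pool = sorted(pool)
    lo, hi = pool[0], pool[-1]   # largest known negative, smallest known positive
    if oracle(lo) or not oracle(hi):
        raise ValueError("expected a negative at the low end and a positive at the high end")
    queries = 2
    while queries < max_queries:
        boundary = (lo + hi) / 2
        candidates = [x for x in pool if lo < x < hi]
        if not candidates:
            break                # no unlabeled point is left near the boundary
        # Query the point the learner is least certain about: closest to the boundary.
        query = min(candidates, key=lambda x: abs(x - boundary))
        queries += 1
        if oracle(query):
            hi = query
        else:
            lo = query
    return (lo + hi) / 2, queries


pool = [i / 100 for i in range(101)]           # 101 unlabeled examples
threshold, queries = active_learn_threshold(pool, oracle=lambda x: x >= 0.6)
print(threshold, queries)
```

In this toy setting the learner requests far fewer labels than the 101 examples in the pool, which is the saving that the paragraph above attributes to letting the learner choose its own examples.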
The proposed SLR system with the deep learning and active learning concepts will not be limited to shooting live speeches; it will also be practicable for recording live performances, video games, concerts, product launches, and numerous other events.
Compared to the common criteria system, which is more dependent on the shooting situation, our system could utilize diverse methods to make the application system more flexible by offering various types of training materials.
Figure 8.1. Concept of active learning.
Appendix
1. Projection Screen Location
In general, the projection screen area is much brighter than the other areas, so this characteristic can be used to detect the screen area. First, convert the color image from the RGB color space to the HSV color space, and then connect the pixels with high values (V) into areas (Figure A.1). Unsuitable areas such as lights and windows must be filtered.
Before we detect the screen area, a color transform is performed. The color space proposed here is the HSV color space (Figure A.2). The three channels of HSV are Hue, Saturation, and Value (intensity).
To find the screen, we consider the value channel. Using Otsu's method, which automatically performs clustering-based image thresholding, together with the connected component technique, we can obtain several candidate regions (see Figure A.3). Then, we use a size filter to find the most appropriate region and locate the area (as in Figure A.4).
Figure A.1. Result of screen detection (red/green rectangles).
Figure A.2. HSV color space (from Wikipedia).
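The procedure described in this section (Otsu thresholding of the value channel, connected component grouping, and a size filter) can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in the system: the tiny synthetic image, the `min_area` parameter, and the rule of keeping the largest surviving region are assumptions.

```python
from collections import deque


def otsu_threshold(values):
    """Threshold t in 0..255 that maximizes the between-class variance."""
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]                 # number of pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def bright_regions(image, threshold):
    """4-connected components of pixels above threshold, with box and area."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or image[y][x] <= threshold:
                continue
            queue, area = deque([(y, x)]), 0
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                area += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append({"box": (left, top, right, bottom), "area": area})
    return regions


def locate_screen(value_channel, min_area):
    """Return the (left, top, right, bottom) box of the largest bright region."""
    flat = [v for row in value_channel for v in row]
    t = otsu_threshold(flat)
    candidates = [r for r in bright_regions(value_channel, t) if r["area"] >= min_area]
    return max(candidates, key=lambda r: r["area"])["box"] if candidates else None


# Synthetic value channel: dark room, a bright screen, and a small lamp.
value_channel = [[30] * 12 for _ in range(8)]
for y in range(2, 6):
    for x in range(3, 10):
        value_channel[y][x] = 220
value_channel[0][0] = value_channel[0][1] = 240
screen_box = locate_screen(value_channel, min_area=10)
```

In the synthetic image, the two-pixel lamp passes the brightness threshold but is rejected by the size filter, which mirrors the filtering of lights and windows described above.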
2. Laser Point Detection
When a speaker shines a laser spot on the screen, it directs the audience's attention to the highlighted text and pictures. The spot produced by the laser pen has a higher intensity than the rest of the screen area. Detection is therefore restricted to the screen area, where the spot has the highest intensity. By using these features, we can design an algorithm to detect whether the speaker is using a laser pointer.
The detection procedure is performed on each pair of successive frames within the projection screen area. First, find any pixels that differ from those of the previous frame and check whether the difference is higher than a constant threshold. Next, check whether the intensity of those pixels is higher than an intensity threshold. If both answers are “yes,” then perform the connected component technique to group the pixels together and filter unsuitable areas by size; the laser light point can be located within the remaining region.
Figure A.3. Candidates of projection screen (including light and windows).
Figure A.4. Detected projection screen (green rectangle).
After finding the laser point in the projection screen area, our system orders the PTZ camera to fill the whole area of the shot with the projection screen (see Figure A.5) and simultaneously sends a message notifying the VD that the speaker is currently using a laser pointer.
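The two per-pixel tests of the laser detection procedure (a large change from the previous frame and a high intensity) can be sketched as follows. The grouping and size-filter steps are omitted here, and the threshold values, the frame contents, and the screen box are hypothetical.

```python
def laser_candidates(prev_frame, curr_frame, screen_box,
                     diff_threshold, intensity_threshold):
    """Return (x, y) pixels inside screen_box that pass both laser-spot tests."""
    left, top, right, bottom = screen_box
    hits = []
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            # Test 1: the pixel changed strongly since the previous frame.
            changed = abs(curr_frame[y][x] - prev_frame[y][x]) > diff_threshold
            # Test 2: the pixel is very bright in the current frame.
            bright = curr_frame[y][x] > intensity_threshold
            if changed and bright:
                hits.append((x, y))
    return hits


# Hypothetical intensity frames: a uniform screen, then three changes.
prev_frame = [[200] * 6 for _ in range(6)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[2][3] = 255   # laser spot inside the screen area
curr_frame[4][1] = 120   # changed but dim (for example, a shadow)
curr_frame[0][0] = 255   # bright change outside the screen area

hits = laser_candidates(prev_frame, curr_frame, screen_box=(1, 1, 4, 4),
                        diff_threshold=30, intensity_threshold=240)
```

Only the pixel at (3, 2) passes both tests in this example: the dim change fails the intensity test, and the change outside the screen box is never examined.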
3. Baton Detection
When a speaker uses a baton, the speaker typically waves it to emphasize the content on the projection screen and to guide the audience. Suppose the speaker waves the baton continuously in the screen area while using it to assist the speech; then our system can apply motion detection to find the area in which the baton moves. Just like laser spot detection, baton detection is limited to the projection screen. The baton is thinner and more elongated than the speaker (see Figure A.6). By using those features, our algorithm can detect whether the speaker is using a baton. The detection procedure is performed on each pair of successive frames; only the pixels within the projection screen area are compared. First, we calculate the image difference and compare it against a predefined threshold to extract the pixels that suggest a large motion. Then, we perform the connected component technique to find the rectangular bounding box, and we filter unsuitable areas by size; the baton can be located within the region returned by this method.
Figure A.5. Laser spot detection. The PTZ camera moves and shoots the whole screen area.
In addition, when a speaker uses a baton, the focus of the picture is also in the baton’s region. After finding the area where the baton is waving in the screen area, our system controls the PTZ camera to zoom in and shoot the baton’s area, and it simultaneously sends a message notifying the VD that the speaker is using a baton. If the system has been tracking the baton but at some point the baton is no longer detected, then the VC must zoom out and go back to tracking the speaker.
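The camera behavior described above (zoom in while the baton is detected, then zoom out and return to the speaker once it is lost) amounts to a small per-frame decision rule. The following sketch assumes that an earlier detection step returns a bounding box or `None` for each frame; the action names, the message string, and the area limits of the size filter are hypothetical.

```python
def passes_size_filter(box, min_area=50, max_area=5000):
    """Keep only bounding boxes whose area is plausible for a waving baton."""
    left, top, right, bottom = box
    area = (right - left + 1) * (bottom - top + 1)
    return min_area <= area <= max_area


def camera_instruction(box, tracking_baton):
    """Map one frame's detection to (PTZ action, message for the VD, new tracking flag)."""
    if box is not None and passes_size_filter(box):
        return "zoom_in_on_baton", "speaker_uses_baton", True
    if tracking_baton:
        # The baton was being tracked but is no longer detected.
        return "zoom_out_and_track_speaker", None, False
    return "track_speaker", None, False


# Hypothetical detections for five successive frames (None = nothing found).
detections = [None, (10, 10, 60, 14), (12, 10, 62, 14), None, None]
tracking = False
actions = []
for box in detections:
    action, message, tracking = camera_instruction(box, tracking)
    actions.append(action)
print(actions)
```

Keeping the tracking flag outside the function makes the zoom-out step happen exactly once, on the first frame in which the baton disappears.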
Figure A.6. Baton detection. The PTZ camera moves and shoots the area in which the baton is waving.
5. Visual Instruction List Table
Table A.2. Visual instruction list (speaker posture: pointing)
Table A.3. Visual instruction list (speaker posture: illustrating)
Table A.4. Visual instruction list (speaker posture: relaxing)