Chapter 4 Open-Domain Question Answering on Heterogeneous Data
6. QA on Video Films
6.3 QA Experiment on Video OCR
The test data come from 19 Discovery programs (about 298KB in total), which are spoken in English and carry Chinese captions. Each program is about 1 hour long.
A total of 15,353 lines of captions were extracted, an average of 808 lines per program.
The basic information unit of a film is treated as a passage from which answers are extracted.
In our experiments, passages are segmented wherever a pause lasts longer than 5 seconds.
This yields 3,876 passages in total, an average of 204 passages per program.
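The pause-based segmentation described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the `Caption` record, its `start`/`end` timestamps, and the choice to measure a pause as the gap between the end of one caption and the start of the next are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Caption:
    # Hypothetical caption record; times are seconds from the start of the program.
    start: float
    end: float
    text: str


def segment_passages(captions: List[Caption], max_pause: float = 5.0) -> List[List[Caption]]:
    """Group time-ordered captions into passages.

    A new passage starts whenever the gap between the previous caption's end
    and the current caption's start is longer than max_pause seconds
    (assumed definition of "pause").
    """
    passages: List[List[Caption]] = []
    for cap in captions:
        if passages and cap.start - passages[-1][-1].end <= max_pause:
            passages[-1].append(cap)   # pause short enough: same passage
        else:
            passages.append([cap])     # first caption, or pause longer than max_pause
    return passages


# Toy data: a 6-second pause separates the second and third captions.
captions = [
    Caption(0.0, 2.0, "line one"),
    Caption(2.5, 4.0, "line two"),
    Caption(10.0, 12.0, "line three"),
    Caption(12.5, 14.0, "line four"),
]
passages = segment_passages(captions)
print([len(p) for p in passages])
```

With the 5-second threshold, the toy captions fall into two passages of two lines each; a gap of exactly 5 seconds would not trigger a split, matching the "longer than 5 seconds" criterion.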
Thirty-nine questions were collected from the web site of Discovery Channel (http://chinese.discovery.com/sch/), where they appear as links in the program list page. The site offers questions for each program for educational purposes. The QA results are not good. The original model
answered 3 of the 39 questions, while the OCR-similarity-integrated model answered one more. The main reason is that these questions are quite hard in terms of question level (Moldovan, et al., 2000).
7. Conclusion
This paper sketches a new view of question answering on heterogeneous data.
Table 3 compares the heterogeneous data in the QA task. Once information passages and a similarity measurement are defined for each data type, our QA system is capable of handling plain texts, summaries, HTML documents with tables, and videos.
Table 3. Comparison of Heterogeneous Data
              Plain text         Summary            Table                        Video
              Document           Document           Document and Table           Film, Captions
              Sentence, Passage  Sentence, Passage  Interpretation, Value-Cells  Film Fragment Divided by Pause
              Lexical Matching   Lexical Matching   Lexical Matching             Lexical Matching and OCR Similarity
Presented as  Text               Text               Text or Tables               Film Fragment
There are several interesting future directions, for example, how query-based summarization can help a QA task and how to integrate the context of tables.
In addition, background linguistic technologies for OCR texts, such as word segmentation, IR, and named entity extraction, have to be redefined.
Appendix A.
(1) Question Foci
PERSON, LOCATION, TIME, QUANTITY, SELECTION, METHOD, DESCRIPTION, REASON, and OBJECT.
(2) Hand-Tagged Questions
These are some examples of hand-tagged questions for training Question-Focus decision rules. Boxed texts are question words. A question focus is given in front of each question, and is printed in bold.
LOCATION (Where is Grass Valley?)
TIME (When did Taiwan history start?)
METHOD (How to improve the absorption of Calcium?)
Appendix B.
Question Focus Decision Rules
These are some examples of Question-Focus decision rules. “Term” is the question word found in the sentence, and TermNext (TermPrev) is the term following (preceding) the question word.
Rule 3: Term = (where) -> class LOCATION
Rule 17: Term = (who) -> class PERSON
Rule 21: Term = (how), TermNext = (to come, to do) -> class METHOD
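To illustrate how such rules might be applied, here is a minimal sketch of a rule table and a lookup function. It is a sketch under stated assumptions: the English glosses stand in for the Chinese question words, the `RULES` layout and the `question_focus` function are hypothetical names introduced for this example, and a rule listing several TermNext values is assumed to match any one of them.

```python
from typing import List, Optional, Tuple

# Each rule: (rule number, question word, allowed following terms or None, focus class).
# English glosses stand in for the Chinese terms; this layout is an assumption.
Rule = Tuple[int, str, Optional[Tuple[str, ...]], str]

RULES: List[Rule] = [
    (3, "where", None, "LOCATION"),
    (17, "who", None, "PERSON"),
    (21, "how", ("to come", "to do"), "METHOD"),
]


def question_focus(term: str, term_next: Optional[str] = None) -> Optional[str]:
    """Return the focus class of the first matching rule, or None if no rule fires."""
    for _rule_id, rule_term, next_terms, focus in RULES:
        if term != rule_term:
            continue
        # A rule without a TermNext condition matches on the question word alone.
        if next_terms is None or term_next in next_terms:
            return focus
    return None


print(question_focus("where"))           # matches Rule 3
print(question_focus("how", "to do"))    # matches Rule 21
print(question_focus("how", "many"))     # no rule in this small table fires
```

Only the three example rules listed above are encoded, so most questions return None here; a full rule set would cover the remaining foci in Appendix A.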
Appendix C.
Chinese Questions for Experiments on Plain Text and Summarization.
Q1. (Where does the first sunlight shine on China?)
Q77. (What kind of game is “The Hero”?)
Q280. (What will be the star industry in the 21st century?)
References
Chang, C.Y. (1997) A Discourse Analysis of Questions in Mandarin Conversation, Master Thesis, National Taiwan University, June 1997.
Chen, H.H. and Huang, S.J. (1999) “A Summarization System for Chinese News from Multiple Sources,” Proceedings of 4th IRAL, Taiwan, pp. 1-7, 1999.
Chen, H.H., Tsai, S.C., and Tsai, J.H. (2000) “Mining Tables from Large Scale HTML Texts,” Proceedings of 18th COLING, pp. 166-172, 2000.
Chen, H.H. and Lin, C.J. (2000) “A Multilingual News Summarizer,” Proceedings of 18th COLING, pp. 159-165, 2000.
Chen, K.J., Huang, C.R., Chang, L.P., and Hsu, H.L. (1996) “Sinica Corpus: Design Methodology for Balanced Corpora,” Proceedings of the 11th PACLIC, pp. 167-176, 1996.
Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien (2004) “Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval,” Proceedings of the 27th ACM-SIGIR, pp. 146-153.
Christiane Fellbaum (Ed.) (1998) WordNet: An Electronic Lexical Database, The MIT Press, 1998.
Sanda Harabagiu, Dan Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu (2001), “The Role of Lexico-Semantic Feedback in Open-Domain Textual Question Answering,” the Proceedings of the 39th ACL and 10th EACL, pp. 274-281, 2001.
Sanda Harabagiu, Marius Pasca, and Steve Maiorano (2000), “Experiments with Open-Domain Textual Question Answering,” the Proceedings of the 18th COLING, pp. 292-298, 2000.
Lynette Hirschman and R. Gaizauskas (2001) “Natural Language Question Answering: the View from Here,” Natural Language Engineering, Cambridge University Press, Vol. 7, No. 4, 2001, pp. 275-300.
Eduard Hovy, Ulf Hermjakob, and Chin-Yew Lin (2001), “The Use of External Knowledge in Factoid QA,” the Proceedings of TREC 2001, pp. 644-652, 2001.
Hovy, E. and Marcu, D. (1998a) Automated Text Summarization, Tutorial in 17th COLING-ACL, Montreal, Quebec, Canada, 1998.
Hovy, E. and Marcu, D. (1998b) Multilingual Text Summarization, Tutorial in AMTA-98, 1998.
Hurst, M. (1999) “Layout and Language: A Corpus of Documents Containing Tables,” Proceedings of AAAI Fall Symposium, 1999.
Hurst, M. and Douglas, S. (1997) “Layout and Language: Preliminary Experiments in Assigning Logical Structure to Table Cells,” Proceedings of ANLP ’97, pp. 217-220, 1997.
Lin, C.J. and Chen, H.H. (2000) “Description of NTU System at TREC-9 QA Track,” Proceedings of The Ninth Text REtrieval Conference (TREC-9), pp. 389-406, 2000.
Liu, C.C. (2001) Video OCR and Video Search, Master Thesis, National Taiwan University, 2001.
Mani, I. and Bloedorn, E. (1997) “Multi-document Summarization by Graph Search and Matching,” Proceedings of 14th National Conference on Artificial Intelligence, Providence, pp. 622-628.
Moldovan, D., Harabagiu, S., Pasca, M., Mihalcea, R., Girju, R., Goodrum, R., and Rus, V. (2000) “The Structure and Performance of an Open-Domain Question Answering System,” Proceedings of 38th ACL, pp. 563-570, October 2000.
Ng, H.T., Lim, C.Y., and Koo, J.L.T. (1999) “Learning to Recognize Tables in Free Text,” Proceedings of 37th ACL, pp. 443-450, 1999.
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
Deepak Ravichandran and Eduard Hovy (2002), “Learning Surface Text Patterns for a Question Answering System,” the Proceedings of ACL, 2002.
Radev, D.R. and McKeown, K.R. (1998) “Generating Natural Language Summaries from Multiple On-Line Sources,” Computational Linguistics, Vol. 24, No. 3, pp. 469-500, 1998.
Sato, T., Kanade, T., Hughes, E.K., Smith, M.A., and Satoh, S. (1999) “Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions,” Multimedia Systems, Vol. 7, pp. 385-394, 1999.
Singhal, A., Abney, S., Bacchiani, M., Collins, M., Hindle, D., and Pereira, F. (1999) “AT&T at TREC-8,” Proceedings of TREC 8, Gaithersburg, pp. 317-330, November 1999.
M. M. Soubbotin (2001), “Patterns of Potential Answer Expressions as Clues to the Right Answers,” the Proceedings of TREC 2001, pp. 293-302, 2001.
Ellen Voorhees (2000) “QA Track Overview (TREC) 9,” [on-line] Available: http://trec.nist.gov/presentations/TREC9/qa/index.htm
Ellen Voorhees (2001) “Overview of the TREC 2001 Question Answering Track,” the Proceedings of TREC-10, pp. 42-51, 2001.
Ellen Voorhees (2002) “Overview of the TREC 2002 Question Answering Track,” Proceedings of the Eleventh Text Retrieval Conference, Gaithersburg, Maryland, November 19-22, 2002.