{eser,rajase,rsriram,vaithyan,huaiyu}@us.ibm.com
ABSTRACT
Existing search tools provide little support for information interaction activities beyond simple keyword-based source retrieval. In order to effectively support the full lifecycle of information interaction activities, tools need to support users in extracting, retrieving, manipulating, presenting, collaborating, and using information in their day-to-day activities. In this paper, we present our efforts in expanding the simple search interaction paradigm beyond source retrieval to collaborative information extraction and argue that we need to move towards an information interaction language, one that is simple and informal, yet supports users in the full lifecycle of information activities.
Author Keywords
Information Interaction, Search, Collaboration, Information Extraction.
INTRODUCTION
Searching is the most basic way of accessing information on the web. A study by Nielsen points out that 88% of web interactions start with a search [4]. Advances in search technologies have transformed the way we think about information seeking. Advanced search algorithms, large-scale systems, and a simple search interaction paradigm have played significant roles in bringing quality, performance, and usability to the overall information interaction experience.
Over the years, researchers have developed various models of user-information interaction that go beyond source-retrieval tasks [3,1,6,5]. For example, Marchionini’s model also covers information extraction, analysis, and use [3]. In this paper we consider information interaction activities that begin with source retrieval and span information extraction, manipulation, presentation, and integration/use. Source retrieval refers to finding relevant source documents in response to a user query. Information extraction, in our context, refers to examining and extracting valuable and relevant pieces of information from these documents, perhaps building a knowledge base of extracted information. Once extracted, information can be further manipulated to satisfy more complex queries that relate information in the knowledge base, perhaps aggregating information extracted from multiple source documents/queries. Bates’ model also suggests that information needs are likely to be fulfilled not by one query/document but by a set of queries and documents [1]. Optionally, users may want to present information visually (or in other forms) to develop further insight and/or share it with others. Next, information is utilized in some form in work activities, perhaps integrated into business processes (Figure 1). Finally, search is typically an iterative process in which the results of a query lead to further queries. As Shneiderman et al. suggest, queries are likely to be refined and reformulated to satisfy information needs [6].
Figure 1. Information Lifecycle Activities.
Without question, search tools have been very successful in answering user information needs for keyword-based source retrieval. The simple keyword textbox and search button interface significantly lowered barriers to entry, allowing millions of users to search and retrieve documents of interest. Without the complexity of formal query languages or logic expressions, users simply enter search terms into a textbox and find relevant information. We believe these search terms are in fact a simple and informal information interaction language. It is simple in that users do not have to specify any predicates indicating any kind of relationship among terms; it is informal in that it does not necessarily conform to a formal language syntax.
While in current tools the information interaction language is limited to source-retrieval tasks, we strongly believe it has to expand to cover the full lifecycle of information interaction, allowing users to express how information should be extracted, manipulated, presented, and integrated with processes. A 2002 study by Spink et al. points out that over the years the query formulation skills of users have not improved significantly beyond typical queries containing just a few terms [7]. This cautions us that whatever information interaction languages the community may develop have to be simple and informal, yet able to satisfy needs in the full lifecycle of information interaction.
Below, we describe AVATAR, which demonstrates our efforts in expanding the simple search interaction language beyond source retrieval to collaborative information extraction and search.
AVATAR: COLLABORATIVE INFORMATION EXTRACTION AND SEARCH
AVATAR is an information extraction and search engine that exploits annotators to automatically tag information in documents. Annotations are performed by using high-precision information extraction techniques to extract facts (e.g. date, time, phone number), concepts (e.g. person, organization), and relationships (e.g. person’s phone number) from text [2]. These facts, concepts, and relationships are represented and indexed in a structured data store such that queries can run efficiently.
In AVATAR, search queries are interpreted in the context of extracted information and converted into one or more precise queries over the structured store. For example, a query “steve phone” can utilize results from several annotators such as Person, PhoneNumber, PersonPhone, and PersonHomePage and yield multiple interpretations, including web pages that contain Steve’s phone number, pages that mention Steve and a phone number, and pages authored by Steve where he mentions a phone number (Figure 2). These interpretations in effect begin a dialogue between the system and the user to satisfy the user’s information needs.
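The interpretation step for the “steve phone” example can be sketched in Python as follows. This is only an illustration under stated assumptions: the keyword table, the `Interpretation` type, and the `interpret` function are hypothetical names introduced here, and the sketch lists which annotators each interpretation would draw on without modeling AVATAR's actual query translation or structured store.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interpretation:
    description: str             # human-readable reading of the query
    annotators: Tuple[str, ...]  # annotators the structured query would draw on


# Hypothetical mapping from query keywords to concept annotators.
CONCEPT_KEYWORDS = {"phone": "PhoneNumber"}


def interpret(query: str) -> List[Interpretation]:
    """Enumerate candidate readings of a keyword query (illustrative only)."""
    terms = query.lower().split()
    concepts = {CONCEPT_KEYWORDS[t] for t in terms if t in CONCEPT_KEYWORDS}
    # Terms that do not name a concept are treated as person names here.
    names = [t for t in terms if t not in CONCEPT_KEYWORDS]

    interpretations: List[Interpretation] = []
    if "PhoneNumber" in concepts:
        for name in names:
            who = name.capitalize()
            interpretations += [
                Interpretation(f"pages containing {who}'s phone number",
                               ("PersonPhone",)),
                Interpretation(f"pages mentioning {who} and a phone number",
                               ("Person", "PhoneNumber")),
                Interpretation(f"pages authored by {who} that mention a phone number",
                               ("PersonHomePage", "PhoneNumber")),
            ]
    return interpretations


for reading in interpret("steve phone"):
    print(reading.description, "->", ", ".join(reading.annotators))
```

Each returned reading corresponds to one of the three interpretations listed in the text; a real system would rank them and run each as a query over the indexed annotations.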
AVATAR supports end-user development of annotators to facilitate collaborative information extraction (Figure 3).
Our approach enables end-users to define annotator patterns for information extraction just as they would perform a search, leveraging the same simple interaction paradigm. This is achieved by allowing users to simply type patterns that detect facts, concepts, and relationships into a textbox (Figure 4).
Figure 2. User query terms (e.g. “steve phone”) are interpreted in the context of annotators (e.g. Person, PhoneNumber, PersonPhone, PersonHomePage), returning precise query results with various possible interpretations.
It is important to note that supporting collaborative end-user development of annotators is a must in information extraction for several reasons. First, no single user is likely to know all the patterns needed to detect a piece of information. Only by collaboratively building on each other’s patterns is it possible to achieve high recall rates. Second, many concepts are context and location sensitive. For example, phone number patterns vary significantly across different parts of the world. It is impractical to expect any system to have complete coverage. Finally, a sufficiently large number of annotators is necessary to satisfy any realistic information needs. Work of this scale is only achievable through collaboration within communities.
In AVATAR, to define a new annotator, users can either start from scratch or build on existing annotators. For example, to define an annotator to tag cell phone numbers, one can use the built-in phone number annotator and the contextual text around it, by means of simple patterns such as cell: PhoneNumber, cell at PhoneNumber, PhoneNumber (cell), and PhoneNumber (mobile) (Figure 4).
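A minimal Python sketch of how such contextual patterns could specialize an existing annotator is given below. It is an illustration, not AVATAR's implementation: the US-style regular expression standing in for the built-in PhoneNumber annotator, the whitespace-tolerant compilation, and the names `compile_pattern` and `tag_cell_phones` are all assumptions of this sketch.

```python
import re
from typing import Dict, List

# Assumed stand-in for the built-in PhoneNumber annotator (US-style numbers only).
ANNOTATORS: Dict[str, str] = {
    "PhoneNumber": r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}",
}

# The four cell phone patterns from the text, with spacing normalized.
CELL_PATTERNS = [
    "cell: PhoneNumber",
    "cell at PhoneNumber",
    "PhoneNumber (cell)",
    "PhoneNumber (mobile)",
]


def compile_pattern(pattern: str) -> "re.Pattern":
    """Turn a user pattern into a regex: annotator names expand to their
    expressions, everything else is literal text, whitespace is optional."""
    pieces = []
    for token in re.findall(r"\w+|[^\w\s]+", pattern):
        if token in ANNOTATORS:
            pieces.append(f"(?P<{token}>{ANNOTATORS[token]})")
        else:
            pieces.append(re.escape(token))
    return re.compile(r"\s*".join(pieces), re.IGNORECASE)


def tag_cell_phones(text: str) -> List[str]:
    """Return phone numbers that appear in a cell phone context."""
    found = []
    for pattern in CELL_PATTERNS:
        for match in compile_pattern(pattern).finditer(text):
            found.append(match.group("PhoneNumber"))
    return found


print(tag_cell_phones("Call my cell: 408-555-1234 or office 650-555-9876"))
```

Only the number preceded by "cell:" is tagged; the office number matches PhoneNumber but none of the specializing patterns.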
Figure 3. AVATAR annotators can automatically tag content using high-precision information extraction techniques.
Figure 4. Users can collaboratively develop new annotators (e.g. CellPhoneNumber) building on existing annotators (e.g. PhoneNumber) through a simple search-like interface by defining patterns.
Annotator patterns can be of varying levels of complexity. Expert users can use regular expressions to detect concepts purely based on the format of the information. Novice users, on the other hand, can utilize existing annotators and AVATAR’s natural language relaxation heuristics to define precise annotators. Typically, these patterns can build on a single annotator along with contextual text to specialize a concept, such as phone number vs. cell phone number.
Other patterns can build on two annotators to define a relationship between two concepts, such as a person’s phone number using Person and PhoneNumber annotators. In this case, the user would define the relationship often by specifying text that would occur between the concepts, such as Person can be reached at PhoneNumber. Users can also use dictionaries to define concepts. In this case, the dictionary would simply contain a collection of words associated with the defined concept. For example, to define an annotator to tag U.S. States, the user would build a dictionary of state names, such as “AZ”, “CA”, “NY”, etc. and upload the dictionary to the system.
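A dictionary-based annotator of the kind described above can be sketched in a few lines of Python. The factory function `dictionary_annotator`, the annotation tuple layout, and the whole-word, case-sensitive matching are assumptions made for illustration; the source does not specify how AVATAR matches dictionary entries.

```python
import re
from typing import Callable, Iterable, List, Tuple

# (annotator name, start offset, end offset, matched text)
Annotation = Tuple[str, int, int, str]


def dictionary_annotator(name: str, entries: Iterable[str]) -> Callable[[str], List[Annotation]]:
    """Build an annotator that tags whole-word occurrences of dictionary entries."""
    # Longest entries first so multi-word entries win over their prefixes.
    ordered = sorted(set(entries), key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    regex = re.compile(rf"\b(?:{alternation})\b")

    def annotate(text: str) -> List[Annotation]:
        return [(name, m.start(), m.end(), m.group()) for m in regex.finditer(text)]

    return annotate


# A tiny excerpt of a U.S. states dictionary, as in the example from the text.
us_state = dictionary_annotator("USState", ["AZ", "CA", "NY"])
print(us_state("Offices in CA and NY, not CANADA"))
```

Matching on word boundaries keeps "CA" from being tagged inside "CANADA"; a production dictionary would likely need further disambiguation, since short abbreviations collide with ordinary words.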
In order to support collaborative development, AVATAR also provides a facility to import and export annotator definitions. This way, users can share their annotators with their friends and colleagues, who can further build on these.
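One plausible way to exchange annotator definitions is a plain serialized file. The Python sketch below uses JSON; the `AnnotatorDefinition` fields, the `kind` values, and the function names are hypothetical and chosen only for illustration, since the source does not describe AVATAR's actual import/export format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AnnotatorDefinition:
    name: str
    kind: str                      # assumed kinds: "regex", "dictionary", "pattern"
    patterns: List[str]            # expressions, dictionary entries, or patterns
    builds_on: List[str] = field(default_factory=list)  # annotators it reuses


def export_annotators(definitions: List[AnnotatorDefinition]) -> str:
    """Serialize definitions to a JSON string that can be shared with others."""
    return json.dumps([asdict(d) for d in definitions], indent=2)


def import_annotators(payload: str) -> List[AnnotatorDefinition]:
    """Rebuild definitions from a shared JSON string."""
    return [AnnotatorDefinition(**item) for item in json.loads(payload)]


cell = AnnotatorDefinition(
    name="CellPhoneNumber",
    kind="pattern",
    patterns=["cell: PhoneNumber", "PhoneNumber (mobile)"],
    builds_on=["PhoneNumber"],
)
shared = export_annotators([cell])
print(import_annotators(shared)[0].name)
```

Recording `builds_on` explicitly would let a recipient check that the annotators a definition depends on are available before using it.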
We expect three groups of AVATAR users: 1) Expert users, who are technically savvy enough to build complex annotators from scratch, 2) Knowledgeable users, who are less technically savvy but can build on existing annotators, and 3) Novice users, who never develop their own annotators but can utilize shared annotators to improve the quality of their search. We believe that by leveraging the simple search-like interaction model for annotator development we support users with a variety of skills in performing effective information extraction.
A Use Case: Finding Product Prices
Let’s go through a use case scenario to demonstrate the collaborative information extraction with AVATAR. In this use case several users each with different skills and backgrounds will build annotators leveraging each other’s annotators.
Consider Aaron who works in the finance department, developing custom applications. As a programmer he is comfortable with regular expressions and often uses them in his applications. Aaron does his searches using AVATAR, and has developed a regular expression based annotator to detect numbers, called Number, with the regular expression \d+(\.\d{0,2})?. Using the Number annotator Aaron keeps track of various figures in the budget, which are typically distributed over many spreadsheets, documents, and internal web pages.
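To see what the Number expression accepts, it can be run directly with Python's `re` module. The function name `find_numbers` and the sample sentence are illustrative assumptions; only the regular expression itself comes from the text.

```python
import re
from typing import List

# Aaron's Number expression, copied from the text.
NUMBER = re.compile(r"\d+(\.\d{0,2})?")


def find_numbers(text: str) -> List[str]:
    """Return every substring tagged by the Number expression."""
    # finditer with group(0) yields the whole match rather than the inner group.
    return [match.group(0) for match in NUMBER.finditer(text)]


print(find_numbers("Budget: 1250.75 and 3.14159 and 42."))
# prints ['1250.75', '3.14', '159', '42.']
```

The output shows two properties of the expression as written: at most two fractional digits are consumed, so a longer fraction is split into two matches, and a trailing period is included because the fractional part may contain zero digits. Both are acceptable for budget figures, but an annotator built on Number inherits them.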
Michael is a product manager and uses AVATAR from time to time to search the web to check various blog sites to see what consumers are saying about his products. To do this more effectively he has developed a simple dictionary based annotator, Product, which detects product names such as Garmin Streetpilot, Nuvi, Forerunner, etc.
Mark, on the other hand, works in the human resources department and is interested in purchasing a GPS for his niece. While he doesn’t have any software development skills, he is a savvy web user. Using AVATAR, he built two simple annotators to help him find a good price for the product. One annotator, called Price, is built on Aaron’s Number annotator and identifies prices using the pattern \$ Number to detect dollar signs preceding numbers. Another one, called ProductPrice, uses the Product and Price annotators to detect product prices using the following patterns:
1. Product costs Price
2. Product is available at Price
3. Product: Price
4. Product Price: Price
Using the ProductPrice annotator Mark simply does a web search to find retailers and their prices for various products he is interested in, all in the search results without even going to the retailer web site.
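The layering in this use case (Number, then Price, then ProductPrice) can be sketched in Python as below. Several details are assumptions of the sketch rather than AVATAR behavior: annotator references are written in braces so that the literal word "Price" in the fourth pattern is not confused with the Price annotator, whitespace between tokens is optional, the Product dictionary holds only the three names mentioned in the text, and `compile_pattern` and `find_product_prices` are hypothetical helper names.

```python
import re
from typing import Dict, List, Tuple

# Aaron's Number expression (with a non-capturing group) and layers built on it.
NUMBER = r"\d+(?:\.\d{0,2})?"
PRICE = rf"\$\s*{NUMBER}"  # Mark's "\$ Number" pattern
PRODUCT = "|".join(re.escape(p) for p in ["Garmin Streetpilot", "Nuvi", "Forerunner"])

ANNOTATORS: Dict[str, str] = {"Product": PRODUCT, "Price": PRICE}

# Mark's four ProductPrice patterns; {Name} marks an annotator reference.
PRODUCT_PRICE_PATTERNS = [
    "{Product} costs {Price}",
    "{Product} is available at {Price}",
    "{Product}: {Price}",
    "{Product} Price: {Price}",
]


def compile_pattern(pattern: str) -> "re.Pattern":
    """Expand {Name} references into named groups; other tokens are literal."""
    pieces = []
    for token in re.findall(r"\{\w+\}|\w+|[^\w\s{}]+", pattern):
        if token.startswith("{"):
            name = token[1:-1]
            pieces.append(f"(?P<{name}>{ANNOTATORS[name]})")
        else:
            pieces.append(re.escape(token))
    return re.compile(r"\s*".join(pieces), re.IGNORECASE)


def find_product_prices(text: str) -> List[Tuple[str, str]]:
    """Return (product, price) pairs matched by any ProductPrice pattern."""
    pairs = []
    for pattern in PRODUCT_PRICE_PATTERNS:
        for match in compile_pattern(pattern).finditer(text):
            pairs.append((match.group("Product"), match.group("Price")))
    return pairs


sample = "The Nuvi costs $349.99 at one store; Forerunner: $199 elsewhere."
print(find_product_prices(sample))
```

Because Price is defined in terms of Number and ProductPrice in terms of Product and Price, an improvement to any lower-level annotator carries over to everything built on it, which is the point of the collaborative scenario.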
CONCLUSION
In order to effectively support the full lifecycle of information interaction activities, tools need to support users in extracting, retrieving, manipulating, presenting, collaborating, and using information in their day-to-day activities. Towards this end, we present AVATAR, an information extraction and search engine that exploits the familiar search interaction paradigm/language to also perform information extraction using annotators to tag content. AVATAR aims to leverage communities of users who develop patterns to extract valuable information by building on each other’s work.
We believe that research on information interaction languages is a fruitful area, one to which the HCI research community can contribute significantly. In our experience with AVATAR, the design of such an information interaction language necessitates going beyond usability and also addressing expressibility and scalability concerns. From this perspective, we found collaboration among HCI and Database researchers particularly rewarding.
REFERENCES
1. Bates, M. (1989). The Design of Browsing and Berrypicking Techniques for the Online Search Interface. Online Review. Vol. 13, No. 5, pp. 407–424.
2. Kandogan, E., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., and Zhu, H. 2006. Avatar semantic search: a database approach to information retrieval. In Proc. ACM SIGMOD International Conference on Management of Data (Chicago, IL, USA, June 27 - 29, 2006). SIGMOD '06. pp. 790-792.
3. Marchionini, Gary N. (1995). Information seeking in electronic environments. Cambridge, Eng.: Cambridge University Press.
4. Nielsen, J. (2004) When Search Engines Become Answer Engines. www.useit.com/alertbox/20040816.html
5. Pirolli, P. & Card, S.K. (1999). Information Foraging. Psychological Review. APA, Vol. 106, No. 4, pp. 643–675.
6. Shneiderman, B., Byrd, D., & Croft, B. (1998). Sorting Out Search – A User-Interface Framework for Text Searches. Communications of the ACM. ACM Press, Vol. 41, No. 4, pp. 95–98.
7. Spink, A., Jansen, B., Wolfram, D., and Saracevic, T. (2002). From E-Sex to E-Commerce: Web Search Changes. IEEE Computer, Vol. 35, No. 3. IEEE Computer Society, pp. 107–109.
8. White, R. W., Kules, B., Drucker, S., Schraefel, M. (2006). Communications of the ACM, Vol. 49, No. 4.