Introduction - A Study on Using Short-Document Classification Techniques to Improve Micro-blog Government Services

Because of the prevalence of Web2.0 techniques, the World Wide Web (WWW) has become a huge information source. In contrast to the traditional Web, where users passively read content created by website owners, Web2.0 enables individuals to serve as information providers and express opinions about different issues. The Web has thus become a knowledge base for various applications (Craven et al. 2000), and governments have found that they can exploit it to facilitate communications with the public about government services (Wigand 2010). In the past, efficient communication channels were lacking, but various Web2.0 applications, such as micro-blogs and social media, can now provide instant and efficient communications between governments and citizens. These techniques have also motivated the development of the Government2.0 paradigm, which refers to the use of Web2.0 technologies to publicize and commoditize government services (Nam 2011). Specifically, Government2.0 aims to establish direct and instant communication channels between governments and citizens through online comments, live chats, and message threads. Depending on the characteristics of a specific service, governments can choose the technique that is most convenient for people to use (Nam 2011).

Among current Web2.0 techniques, micro-blogging is the most popular application. Micro-blogging is a new weblog formalism that enables users to exchange information through short messages (Java et al. 2007). In Nielsen’s 2009 survey on the size of Internet communities¹, Twitter² achieved the largest community growth rate, and it is now the most famous micro-blogging platform. As shown in Table 1-1, Twitter’s growth rate between February 2008 and February 2009 was 1382%, which was far greater than that of other Internet communities.

1 http://blog.nielsen.com/nielsenwire/online_mobile/twitters-tweet-smell-of-success/

2 http://twitter.com

Table 1-1. Growth rates of the most popular social network communities, February 2008 to February 2009.

Rank  Site         Feb 08      Feb 09      % growth
1     Twitter.com  475,000     7,038,000   1382%
2     Zimbio       809,000     2,752,000   240%
3     Facebook     20,043,000  65,704,000  228%
4     Multiply     821,000     2,394,000   192%
5     Wikia        1,381,000   3,758,000   172%
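The "% growth" column of Table 1-1 appears to be the relative change between the two monthly counts, that is, (Feb 09 - Feb 08) / Feb 08, rounded to a whole percentage. The short sketch below recomputes the column under that assumption. The function name `percent_growth` and the dictionary `counts` are illustrative names, not part of the original survey.

```python
def percent_growth(before: int, after: int) -> int:
    """Relative change from `before` to `after`, as a rounded whole percentage."""
    return round((after - before) / before * 100)


# Counts copied from Table 1-1: (Feb 08, Feb 09).
counts = {
    "Twitter.com": (475_000, 7_038_000),
    "Zimbio": (809_000, 2_752_000),
    "Facebook": (20_043_000, 65_704_000),
    "Multiply": (821_000, 2_394_000),
    "Wikia": (1_381_000, 3_758_000),
}

# Recompute the "% growth" column (assumed formula, see lead-in).
growth = {site: percent_growth(b, a) for site, (b, a) in counts.items()}

for site, pct in growth.items():
    print(f"{site}: {pct}%")
```

Under this reading, the recomputed values agree with the percentages reported in the table for all five sites.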

The main reason for Twitter’s popularity is its availability (Sankaranarayanan et al. 2009). Since mobile computing is ubiquitous, users can send tweets (i.e., short messages) anytime and anywhere through various mobile devices (e.g., cell phones). In addition, the restricted length of tweets (i.e., no more than 140 characters) forces users to express opinions in a clear and concise manner. Twitter’s friendly interface also motivates users to form virtual communities. Because these communities are quite different from people’s normal social circles in the real world, users generally acquire different information from Twitter. This subculture, together with the properties above, explains the huge success of Twitter (Sankaranarayanan et al. 2009).

Java et al. (2007) investigated the topological and geographical properties of Twitter, and found that users rely heavily on tweets to communicate their daily activities and share the latest information.

Many governments are aware of the merits of tweeting and now utilize it in various government services (Wigand 2010). The increasing usage of Twitter appears to be associated with the fact that it is free and provides instant communications (Wigand 2010). For instance, New York City extended its 311 system with Twitter³ so that citizens can submit questions about the city’s facilities or request help from a facility. City officials provide instant feedback and address users’ requests in a timely manner. NASA also utilized Twitter to publicize the Mars Phoenix landing project⁴. The service attracted nearly 160,000 followers who helped NASA share information about the progress of the mission.

3 https://twitter.com/#!/311NYC

Although there are many advantages in using Twitter to communicate about government services, there is a risk of information overload. As the membership of Twitter is growing exponentially, the number of daily tweets about government services may be too large for officials to handle manually. Thus, there is an urgent need for text mining techniques to manage the large volume of tweets. Text classification is a classic text mining method for managing a vast number of documents (Manning et al. 2008). Given a set of categories and the corresponding training documents, a classification algorithm learns a classifier, which then assigns new documents to the appropriate categories automatically. Many effective classification algorithms have been proposed for classifying documents, but they may not be suitable for tweet classification because tweets are short (Phan et al. 2008; Chen et al. 2011). A short message contains few words, so its term representation is sparse, and this text sparseness reduces classification accuracy.
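To make the classification setting concrete, the following is a minimal sketch of a standard multinomial Naive Bayes text classifier with add-one smoothing. It is a textbook baseline, not the modified model proposed in this paper. The categories ("noise", "parking") and the training messages are invented for illustration, and the function names `train_naive_bayes` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Learn log priors and add-one-smoothed log likelihoods.

    docs: list of (tokens, label) pairs.
    """
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}

    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        total = sum(word_counts[c].values())
        # Add-one (Laplace) smoothing keeps unseen (word, class) pairs non-zero.
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood, vocab


def classify(tokens, model):
    """Return the category with the highest posterior log score."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Invented toy training set (assumption, for illustration only).
train = [
    ("loud music next door all night".split(), "noise"),
    ("construction noise early morning".split(), "noise"),
    ("car blocking my driveway".split(), "parking"),
    ("illegal parking on my street".split(), "parking"),
]
model = train_naive_bayes(train)
print(classify("loud construction noise".split(), model))
print(classify("car parking street".split(), model))
```

In this sketch, words that never occur in the training data are skipped, so a short message made up entirely of unseen words is decided by the prior alone. That behavior is one simple way to see why sparseness hurts short-text classification.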

In this paper, we modify the Naive Bayes classification model (Manning et al. 2008) to classify the daily tweets of government services. To resolve the text sparseness problem, we extend training tweets with Wikipedia⁵, the largest online encyclopedia. In addition, we model the temporal patterns of tweets to refine the prior probability of our Naive Bayes classifier, and we apply two different smoothing techniques to leverage the temporal distribution. Besides using Wikipedia as an external knowledge base, we also consider user content (e.g., tweets posted by users) that is not directly reported to 311NYC. Finally, we apply WordNet⁶, a large lexical database of English, to reformulate the conditional probabilities of the Naive Bayes model. Evaluations based on the NYC 311 Twitter service show that the proposed approach classifies daily tweets accurately and alleviates the text sparseness problem.

4 https://twitter.com/#!/MarsPhoenix

5 http://www.wikipedia.org/

6 http://wordnet.princeton.edu/
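As a rough illustration of what refining the prior with temporal patterns could look like, the sketch below estimates a time-dependent prior P(category | hour) from past labeled tweets. The introduction does not specify the time unit or the two smoothing techniques, so the hour-of-day granularity, the additive smoothing with parameter `alpha`, the function name `temporal_prior`, and the toy history are all assumptions made for this example only.

```python
from collections import Counter


def temporal_prior(history, categories, hour, alpha=1.0):
    """Estimate P(category | hour) with additive smoothing.

    history: iterable of (hour, category) pairs from past labeled tweets.
    alpha:   smoothing strength (assumed, not taken from the paper).
    """
    counts = Counter(c for h, c in history if h == hour)
    total = sum(counts.values())
    k = len(categories)
    # Smoothing gives every category a non-zero prior, even in hours
    # where it has never been observed.
    return {c: (counts[c] + alpha) / (total + alpha * k) for c in categories}


# Invented history: hour of day and category of past tweets (assumption).
history = [
    (2, "noise"), (2, "noise"), (2, "noise"),
    (9, "parking"), (9, "parking"), (9, "noise"),
]
categories = ["noise", "parking"]

night = temporal_prior(history, categories, hour=2)
morning = temporal_prior(history, categories, hour=9)
unseen = temporal_prior(history, categories, hour=15)
print(night, morning, unseen)
```

A prior of this kind could replace the time-independent class prior in a Naive Bayes score. For an hour with no past observations, the smoothed estimate falls back to a uniform distribution over the categories.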

The remainder of the paper is organized as follows. In Section 2, we discuss the concept of Government2.0 and review related work on short text classification methods. The proposed classification approach is presented in Section 3, and its performance is evaluated in Section 4. Finally, Section 5 presents our conclusions.
