National Chengchi University (國立政治大學)
The input to the model is $w_i$, and the output could be $w_{i-2}$, $w_{i-1}$, $w_{i+1}$, $w_{i+2}$. So the task here is "predicting the context given a word". The context is also not limited to the immediately adjacent words: training instances can be created by skipping a constant number of words in the context.
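The construction of training instances described above can be sketched as follows. This is a minimal illustration, not the Word2vec implementation itself; the function name `skipgram_pairs` and the `window` parameter are assumptions introduced here.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (input word, context word) pairs for a symmetric window.

    For the word at position i, every word at positions i-window..i+window
    (excluding i itself) is emitted as one training instance.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical example sentence, already tokenized.
pairs = skipgram_pairs(["the", "app", "crashes", "on", "launch"], window=2)
```

With `window=2`, the word "crashes" is paired with "the", "app", "on" and "launch", which corresponds to the outputs $w_{i-2}$, $w_{i-1}$, $w_{i+1}$, $w_{i+2}$ for the input $w_i$.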
2.3 Sentiment Analysis
Sentiment analysis, also known as opinion mining, refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. It is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service.
Pang and Lee [12] summarize and introduce current trends in opinion mining and sentiment analysis, covering techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. They also introduce the new challenges raised by sentiment-aware applications. Liu [19] surveys important research on sentiment analysis and introduces structured methods that operate on unstructured data (opinions). Mukherjee and Bhattacharyya [21] advocate an approach to identify feature-specific expressions of opinion in product reviews. Their system learns the set of significant relations to be used by dependency parsing and a threshold parameter to merge the opinion expressions. Kouloumpis et al. [18] investigate the utility of linguistic features for detecting the sentiment of Twitter messages; the micro-blog messages and their tags are captured and analysed with a supervised approach. Passonneau and Rebecca [22] use POS-specific prior polarity features and a tree kernel to obviate the need for tedious feature engineering.
3 Methodology
The methodology of AppCAT is described below. The architecture overview is shown in Figure 1, and the source code of AppCAT is available on the author's GitHub [4].
Figure 1: AppCAT architecture
The first part of the AppCAT architecture is review preprocessing, which includes download, translation, tokenization, stop-word filtering, stemming, and dictionary construction. The second part fetches product features: it defines the initial product features and forms the product feature set. The third part is sentiment analysis on product features, which includes finding the review subject and analyzing sentiment. The final part calculates the sentiment of each product feature for the app and provides a bar chart for visualization.
3.1 Review preprocess
Here we introduce how AppCAT downloads app reviews and preprocesses them, including translation, tokenization, stemming, stop-word filtering, and dictionary construction.
App review data are collected from the App Store API [6]. The API response includes the app track id and a list of reviews. Each review contains a user id, the user's rating of the app (from 1 to 5 stars), the review title, and the review content. An example review is shown below.
After the download is complete, AppCAT performs natural language preprocessing, described in the next section.
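The review record described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, field names and dictionary keys (`track_id`, `user_id`, `rating`, `title`, `content`) are hypothetical and do not reproduce the actual App Store API response format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Review:
    user_id: str
    rating: int   # 1 to 5 stars
    title: str
    content: str

    def __post_init__(self) -> None:
        # Reject ratings outside the 1..5 star range.
        if not 1 <= self.rating <= 5:
            raise ValueError(f"rating must be between 1 and 5, got {self.rating}")


@dataclass
class AppReviews:
    track_id: int
    reviews: List[Review]


def parse_reviews(payload: dict) -> AppReviews:
    """Convert an already-decoded payload (hypothetical keys) into typed records."""
    return AppReviews(
        track_id=int(payload["track_id"]),
        reviews=[
            Review(
                user_id=str(r["user_id"]),
                rating=int(r["rating"]),
                title=r["title"],
                content=r["content"],
            )
            for r in payload["reviews"]
        ],
    )


# Hypothetical payload, for illustration only.
sample = {
    "track_id": 123456,
    "reviews": [
        {"user_id": "u1", "rating": 5, "title": "Great", "content": "Fast and stable."},
        {"user_id": "u2", "rating": 2, "title": "Slow", "content": "Takes long to load."},
    ],
}
app = parse_reviews(sample)
```

Validating the rating when the record is built keeps malformed reviews out of the later preprocessing steps.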
3.1.2 Natural Language Processing
Reviews in the App Store are often written in various languages, such as English, Japanese, Spanish, or French, so we translate all reviews into a single language to facilitate the subsequent word tokenization, stemming, and product feature extraction. We choose English as the translation target because most comments are originally written in English, and we use the Google Translate API for the translation. After all the review texts are translated to English, word tokenization breaks sentences into words and other elements. For each token in the list, we use NLTK in Python to perform word stemming, reducing inflected words to their word stem. For example, "argue", "argued", "arguing", and "argus" reduce to the stem "argu". After stemming, we filter the stop words from the tokens.
Stop words are words that commonly appear in documents, such as "the", "is", "at", and "which", and need to be filtered out before further processing. After preprocessing is complete, reviews are stored in MongoDB, a NoSQL database, for further processing and analysis.
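The tokenize, stem, and filter steps above can be sketched as follows. This is a minimal sketch, not AppCAT's code: the regular-expression tokenizer, the four-word stop list (taken from the examples in the text) and the stand-in `toy_stem` function are assumptions made so the example runs without extra data. The stemmer is passed in as a parameter, so NLTK's Porter stemmer can be supplied instead.

```python
import re
from typing import Callable, FrozenSet, List

# Tiny illustrative stop list; a real run would use a fuller list.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "is", "at", "which"})


def preprocess(
    text: str,
    stem: Callable[[str], str],
    stop_words: FrozenSet[str] = STOP_WORDS,
) -> List[str]:
    """Tokenize, stem, then drop stop words (same order as in the text)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [stem(tok) for tok in tokens]
    return [s for s in stems if s not in stop_words]


def toy_stem(word: str) -> str:
    """Stand-in stemmer for illustration only: strips a trailing 'ing'."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


tokens = preprocess("The app crashes at launch, which is annoying", toy_stem)

# With NLTK installed, the Porter stemmer can be passed in instead:
#   from nltk.stem import PorterStemmer
#   tokens = preprocess(text, PorterStemmer().stem)
```

Because stemming runs before stop-word filtering here, the stop list has to match the stemmed forms of the words it is meant to remove.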
3.1.3 Filter and Add Frequent Words
AppCAT maintains a dictionary to filter tokens: tokens that are not in the dictionary are filtered out. The default dictionary consists of English words. Some words that appear frequently in reviews are not English words, such as facebook, UI, and wma. If a word appears frequently in reviews but is not in the English dictionary, AppCAT marks it as a frequent word and stores it in the dictionary.
After that, tokens that are not in the dictionary are filtered out.
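The dictionary update and filtering described above can be sketched as follows. The text does not state how "frequent" is decided, so the count threshold `min_count` is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from typing import Iterable, List, Set


def add_frequent_words(
    dictionary: Set[str],
    reviews: Iterable[List[str]],
    min_count: int = 3,  # assumed threshold; not given in the text
) -> Set[str]:
    """Return the dictionary extended with frequent out-of-dictionary tokens."""
    counts = Counter(tok for review in reviews for tok in review)
    frequent = {
        tok for tok, n in counts.items()
        if n >= min_count and tok not in dictionary
    }
    return dictionary | frequent


def filter_tokens(tokens: List[str], dictionary: Set[str]) -> List[str]:
    """Keep only the tokens found in the dictionary."""
    return [t for t in tokens if t in dictionary]


# Toy data for illustration only.
english = {"login", "slow", "good"}
reviews = [
    ["facebook", "login", "slow"],
    ["facebook", "ui", "good"],
    ["facebook", "ui", "xqz"],
    ["ui", "slow"],
]
extended = add_frequent_words(english, reviews, min_count=3)
kept = filter_tokens(["facebook", "ui", "xqz"], extended)
```

In this toy run "facebook" and "ui" each appear three times and are added to the dictionary, while "xqz" appears once and is filtered out.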
Table 2: Examples of frequent words that are not in the English dictionary
Frequent Words
Like all products and services, applications have various product features, for example price, energy consumption, efficiency, and user interface. Users may leave comments on those product features. Currently, AppCAT defines some product features that users often comment on and care about. Word2vec is then used to transform keywords into vectors and find keywords similar to the initial product features.
Finally, we form the product feature sets.
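The expansion of an initial keyword into a product feature set can be sketched with cosine similarity over word vectors. This is a sketch under stated assumptions: the three-dimensional vectors are made-up toy values rather than trained Word2vec embeddings, and `expand_feature`, `top_n` and `min_sim` are hypothetical names and parameters.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_feature(
    seed: str,
    vectors: Dict[str, Sequence[float]],
    top_n: int = 2,
    min_sim: float = 0.8,
) -> List[str]:
    """Return the seed keyword followed by its most similar keywords."""
    seed_vec = vectors[seed]
    scored = [(w, cosine(seed_vec, v)) for w, v in vectors.items() if w != seed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [seed] + [w for w, sim in scored[:top_n] if sim >= min_sim]


# Toy vectors for illustration only; real vectors would come from Word2vec.
vectors = {
    "performance": [0.9, 0.1, 0.0],
    "speed": [0.8, 0.2, 0.1],
    "lag": [0.7, 0.3, 0.0],
    "price": [0.0, 0.1, 0.9],
    "ui": [0.1, 0.9, 0.1],
}
feature_set = expand_feature("performance", vectors)
```

The similarity floor `min_sim` keeps unrelated words out of the set even when fewer than `top_n` close neighbours exist.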
3.2.1 Define Initial Product Feature Keywords
According to a recent survey [11], what mobile application users really care about is privacy and application performance, so AppCAT currently takes these as product feature concerns. According to an AppDynamics report [1], 86% of users have uninstalled apps after using them only once because of poor performance. A test conducted by JWPlayer [7] suggests that 70% of app users will leave within 11 seconds of waiting for content to load (for example, waiting for an internet connection within an app). Based on the above survey and research, AppCAT defines some initial keywords for those domain-specific features, such as "performance", "ui", and "memory". As a result,