Bookle ：書籍推薦系統 - 基於讀者的評論與閱讀行為

(1)

國立交通大學

資訊學院資訊科技（IT）產業研發碩士專班

碩士論文

Bookle

：書籍推薦系統

基於讀者的評論與閱讀行為

Bookle : A Book Recommender System

Based on Readers’ Reviews and Reading Behaviors

研究生：卓建益

指導教授：黃俊龍　教授

(2)

Bookle

：書籍推薦系統

基於讀者的評論與閱讀行為

Bookle : A Book Recommender System

Based on Readers’ Reviews and Reading Behaviors

研究生：卓建益

Student: Chien-I Tso

指導教授：黃俊龍

Advisor: Jiun-Long Huang

國立交通大學

資訊學院資訊科技（IT）產業研發碩士專班

碩士論文

A Thesis

Submitted to College of Computer Science

National Chiao-Tung University

in Partial Fulﬁllment of the Requirements

for the Degree of

Master

in

Industrial Technology R & D Master Program on

Computer Science and Engineering

August 2010

Hsinchu, Taiwan, Republic of China

(3)

Bookle

：書籍推薦系統

基於讀者的評論與閱讀行為

學生：卓建益

指導教授：黃俊龍

國立交通大學資訊學院資訊科技（IT）產業研發碩士專班

摘

要

推薦系統就好比是一種獨特且專門的資訊搜尋引擎，在本篇論文中，我們嚐試從新的角度來開發書籍推薦系統，傳統的書籍推薦系統，大多透過記錄過去使用者的購買行為來做推薦，或是要求使用者以填充特定資料的方式來過濾搜尋結果，而我們的做法是藉由使用者描述書籍內容的方式做為搜尋依據，企圖找出更能貼近讀者需求的書籍。在實作上，我們收集眾多的書籍評論作為基礎資料，藉由資料探勘與資訊擷取技術，找出評論中的關鍵資訊，統整這些資訊後，進而推測出各書籍的主要內容，更進一步的，為了評斷每篇評論的可信度與價值，系統利用閱讀行為衡量評論品質，讓各篇評論獲得應有的價值，讓推薦的結果更具說服力。

(4)

Bookle : A Book Recommender System

Based on Readers’ Reviews and Reading Behaviors

Student: Chien-I Tso

Advisor: Jiun-Long Huang

Industrial Technology R & D Master Program of

Computer Science College

National Chiao-Tung University

ABSTRACT

Recommender system is like a search engine for speciﬁc and specialized information. In

this paper, we attempt to develop a book recommender system from a new perspective. Most of the traditional book recommender systems are developed through the records of

purchasing behaviors of users, or speciﬁc words user are required to key in for ﬁltering search results. However, what our system needs is the description about the book content

that users are interested in. The purpose of our work is attempting to ﬁnd the best books that meet the needs of readers.

In the implementation of the project, we collected a large amount of book reviews as

source data. With data mining and information retrieval technology, we planed to ﬁnd the critical information out from reviews and integrated the information, and then the main

contents of books can be speculated. Furthermore, in order to determine the credibility and value of each review, the system would measure review quality by reading behaviors.

So that the reviews would be correctly scored which makes the result of recommendation become more convincing.

(5)

誌

謝

隨著九月的到來，我在交大碩班的日子漸告終結，在這段日子裡，我認識了很多新朋友，每個人都是我生命中的貴人。感謝我的指導老師黃俊龍老師引領我進入Web API的世界，讓我學習到很多新觀點與技術，開拓了我的視野，感謝在友訊研發中心認識了工作夥伴必安跟昱樺，我們在一起工作打混的有趣日子讓人難以忘懷，感謝研究室裡的好夥伴金亮、濬仲、信翰，跟你們在同一個實驗室裡學習成長，讓我感覺到我很幸運，你們每位都是好ㄉㄉ，我從你們身上學習了很多，感謝辛苦幫我修改論文的士詮學長，抱歉我的英文害您一個頭兩個大，還有實驗室裡的每位學長與學弟們，有大家的努力讓實驗室越來越茁壯！感謝產碩專班的好同學坤彥、雯謹、俊樺、家茵，因您們的存在，產碩專班的感情一天比一天更好，感謝產碩專班的助理雅珽，有您的提醒讓我們這群時常忘東忘西的傻學生都能夠順利畢業，感謝未來公司裡的Tom、Eason、東哥、David、Danny、Rola，在暑假實習的日子裡，承蒙您們的照顧，在此也要特別感謝Eason對本篇論文的支援，更得要感謝曾建超老師、胡誌麟老師、林順傑老師，在口試過程中給予我建議與指導，讓我的論文內容更臻完美，最後最後，我必須要感謝默默在我背後支持我的家人，有您們的支持讓我無後顧之憂，您們是最棒的家人，還有已經陪伴我六年多的女友思佩，雖然你修改我論文時很愛嫌東嫌西，但我知道這就是我們的生活樂趣，要感謝的人真得很多，或許有些人我已忘卻了名字，但是我依舊會記得您的臉龐，真的！有您們真好，謝謝。

(6)

目

錄

書名頁 . . . i 中文摘要 . . . ii 英文摘要 . . . iii 誌謝 . . . iv 目錄 . . . v 表目錄 . . . vii 圖目錄 . . . viii 1. Introduction . . . . 1 2. Related Work . . . . 4

2.1 Automatic Keyword Extraction . . . 7

2.1.1 Keyword Extraction Tool . . . 7

3. Research Topic and Methodology . . . 9

3.1 System Conception . . . 9

3.2 Client Information Collection . . . 10

3.3 Server Operation Scenario . . . 12

3.3.1 Key Extraction from the Review . . . 12

3.3.2 Keywords’ Weight Calculation . . . 12

3.3.3 Discovering the related Books according to the Keywords . . . 13

3.3.4 Presentation of Bookle Service . . . 14

(7)

4.1 The Back-end of Server . . . 15

4.1.1 The Workﬂow of Back-end . . . 16

4.2 The Front-end of Server . . . 21

4.2.1 The Workﬂow of Recommender . . . 22

5. Implementation and Experiment . . . . 23

5.1 Client Storage . . . 23 5.2 Communication . . . 23 5.3 Results Presentation . . . 24 5.4 Experiment . . . 26 5.4.1 Data Source . . . 26 5.4.2 Experiment Results . . . 28 6. Conclusion . . . . 29 Bibliography . . . 30

(8)

表

目

錄

(9)

圖

目

錄

1.1 The process structure of Bookle recommender system. . . 3

2.1 E-book readers . . . 6

2.2 The keyword extraction process . . . 7

3.1 Bookle system overview . . . 10

3.2 The keywords of a review of “The Lost Symbol”. . . 12

3.3 The weight of the keywords in the book “The Lost Symbol”. . . 13

3.4 The books related to the keyword “mystery”. . . 13

3.5 The results of the description “The mystery and suspense thriller”. . . 14

4.1 The Architecture of Server Back-end . . . 16

4.2 Keywords Extraction . . . 16

4.3 Review Quality Measurement . . . 19

4.4 The Weight for Keywords . . . 20

4.5 The Books Related to Keywords . . . 20

4.6 The Architecture of Server Front-end . . . 21

5.1 The communication of client-server . . . 24

5.2 Bookle Website Demo . . . 25

5.3 The Snapshot of Experiment Page . . . 26

(10)

Chapter 1 Introduction

A recommender system is like a specialized and speciﬁc information search engine,

which tries to show the items that users might be interested in with users’ and items’ pro-ﬁles. Generally speaking, most recommender systems are based on one of the following

methods: content-based filtering, collaborative filtering, and hybrid method that com-bines the above two techniques. Content-based filtering recommender systems [1] make

suggestions by selecting the features of the product information about consumers’ prefer-ences. Such systems always need users to check a variety of options about the products.

Collaborative ﬁltering recommender systems [2] make suggestions to customers by analyz-ing the previous users’ collective experience about the products [3]. Such systems believe

that the users who accept the products before, will also accept the similar products in the future. The hybrid recommender systems [4] combine the characteristics of the above two

methods. Such systems use the past users’ experience at the features of the products as data. After calculating and integrating the data, they make suggestions to the customers.

In this paper, we attempt to build a communication-style book recommender system

“Bookle”. It works just like you ask to staﬀ what you want in the bookstore. Users could use Bookle to obtain the suggested books by describing the book content, instead

of searching titles or abstract keywords. Our system is an informed recommender, which collected a lot of customer book reviews as data source. In addition, Our system is like a

book content-orientation search engine, making suggestions from the book content that users describe. Users could use Bookle service for book recommendation as using Google

(11)

The main method of Bookle is similar to Collaborative Filtering. This approach requires a large number of reader ratings. Collaborative ﬁltering systems use a collection

of historical ratings data of n users on m products as input, collected by asking users to rate products [5]. Collecting such rating data need the users to spend time responding,

and the values of ratings might not exactly provide reliable estimations of user preferences. Hence, the features of books are much diﬃcult to deﬁne. An alternated approach is needed

for Bookle to determine the quality of books. Hence, Bookle adds some concepts from Content-based ﬁltering [6], obtaining the content of books from reader reviews. To a

certain degree, the reviews could represent the summary of the book by readers. By the reviews, Bookle could speculate about the content of books and make the suggestions

based on the content information.

In our recommender system “Bookle”, we wish that the factors inﬂuence to the

result of recommendations do not only from the content of reviews, but also the factor

of reviewers’ level of expertise in books, or the degree of reviewers preference for books. With this information, the system can measure the qualities of reviews. In the past, the

common methods to measure article quality is to rate articles by users. For example, users can score the articles through the radio button or the star-rating section in a web

page. However, this scoring model is easily manipulated by man-made. The inadequate number of rating will make the score too high or too low. And there is a another unfair

case about articles rating. A commenter publishes his non-mainstream views seriously, but his articles can not obtain good evaluation. Because his opinions are diﬀerent from

the others. In fact, his articles still have the reference value. In conclusion, measuring the article quality by users is not a appropriate method.

After some discussion, we decide to use reading behaviors to determine the review quality. The trend of mobile reading is rising. The system can collect the information of

reading behaviors from handhold devices. The qualities of diﬀerent reviews are diﬀerent. Someone wrote reviews for fun, but someone wrote reviews seriously. There is an approach

to determine the qualities of product reviews by consumers level of expertise in the product [7]. This approach needs to record the proﬁles of customers for ranking their level of

experience. Here we do not want to record any users’ personal proﬁle in this system, i.e., users do not need to create an account for sharing their reviews in Bookle. Everyone

(12)

is anonymous in Bookle’s eye, and Bookle rates their review qualities by analyzing the reading behaviors.

Bookle collects the data source from users, and the part of the data including book in-formation, reading behaviors, and personal review. After the server-side data integration,

the results of “Text-mining process“ will be associated with the keyword list of books, and no other insigniﬁcant book information to be generated. After that the “Book-ranking

process” sorts the related books for the keywords. Finally, users could use the service from the “Recommender process”. Because the recommender is based on ﬁltering data from

reviews, instead of the keyword search from book name or abstract (e.g., Google book search), Bookle could recommend the books which are much close to the users desired.

Figure 1.1: The process structure of Bookle recommender system.

In the following chapters, we will describe some related work and useful techniques

about our research in chapter 2. Then we explain our idea and development scenarios in chapter 3. In chapter 4, we introduce the system architecture and explain the all

compo-nents in the system workﬂow step by step. The implementation results and experiment are shown in chapter 5. Finally, we make a conclusion and mention some future works

(13)

Chapter 2 Related Work

It is diﬃcult to deﬁne whether a book is worth to be recommended or not. Readers

often do not know whether the book is recommended to their friends or not until they ﬁnish this book. Books do not like the normal products that have standard speciﬁcations (e.g.,

Consumer Electronics Products), they do not have the speciﬁc features to determine their quality before reading them. The common book recommender systems make suggestions

by purchase experience. When people want to buy a book at a shopping website, they are interested in a section, “Customers Who Bought This Item Also Bought”. But we

do not know if those recommended books ﬁt your need. The quality of the book content is hard to determine, and it is subjective by the diﬀerent users. Users often discuss the

quality of a book by sharing their reviews. Many readers also mention the contents of books when they share their feelings in their reviews. For the book content-orientation

recommendation, we think that the book reviews are the useful information to predict the content of books.

When we think how to ﬁnd a data source of the book reviews that is valid, another

problem occurred in our mind: “Are the qualities of each review all equal?” [8] Some people wrote reviews about their feelings concisely, by using only two or three sentences.

On the contrary, some people wrote their reviews detail, including the contents of books, their feelings, and so on. In intuition, the latter review has better quality. But, the

re-view with more sentences possess the better quality? Some malice users would post some insigniﬁcant information to website message boards. Such information is most

(14)

the algorithm method. Most internet companies use artiﬁcial methods to do inspections for the spam. Hence, determine the review quality from the length of review content is

not a good idea.

An informed recommender system based on consumer product reviews [7]. This

informed recommender uses prioritized consumer product reviews to make recommen-dations. By using text-mining techniques, it maps each piece of each review comment

automatically into an ontology. The ontology consists with two parts. One part is “Prod-uct quality” that represents the opinion of the provider’s valuation of the prod“Prod-uct features.

The other part is “Opinion Quality” that includes several variables to measure the opinion provider’s expertise with the product. Hence, this approach also considers the quality of

reviews by “Opinion Quality”. However, the ontology of this approach is used for the products that have speciﬁc features, which is not suitable for books. To determine the

opinion quality, the ontology needs additional space to record the user’s proﬁle and

ex-perience. In our work, Bookle do not store any user’s personal proﬁle or skill. Because Bookle want to determine the review quality by user reading behaviors, and who be the

writer that is not necessary.

Some research investigate the relationship between reading behaviors and users’

in-terests. A research considers this topic by recording user’s behaviors including under-line, highlight, circle, annotation and bookmark to capture user’s interest precisely in

E-learning system [9]. Another research considers this topic by two processes. First step is to measure that whether a article is interesting by users reading time period. And the

second is to discover the match pattern from the articles in which a user is interested. Finally, this system can judge what articles would be interesting to the user [10]. A

re-search attempts to ﬁnd the interesting books for children from web logs information which including reading time period, and reading progress [11]. Hence, Bookle attempts to use

reading behaviors to determine review quality. The loving degree for the books of readers would present in their reading behaviors. For example, the less time period the readers

spent for reading a book means the higher degree they like the book. However, how could we obtain the reading behaviors from readers? Fortunately, mobile reading with handhold

devices is a new trend. Users are unnecessary to bring many books everywhere for read-ing. They just need to download the electronic ﬁles of books to their handhold device.

(15)

(a) Amazon Kindle source: Amazon.com

(b) Sony PRS reader source: Sony.com

Figure 2.1: E-book readers

In many kinds of handhold devices, “E-book” device is the newer product and the most suitable device for reading. The characteristic of such device is the screen without

back-light, just like watching the paper. Depending on this device, an E-book may be readable in low light environment. Many newer E-book devices have the ability to display motion,

enlarge or change fonts, use Text-to-speech software to read the text aloud, search for key terms, ﬁnd deﬁnitions, or allow highlighting bookmarking and annotation. The devices

that utilize E-Ink can imitate the look and ease of readability of a printed work while consuming very little power, allowing continuous reading for weeks at a time. E-book is

not similar to the LCD panel devices (e.g., Netbook or iPad) that is used for multimedia, but E-book is more suitable for reading text content. The well-known E-book devices on

the market are Amazon Kindle, iRex iLiad, and Sony PRS series and so on.

With E-book devices, we attempt to make an universal device side library for record-ing readrecord-ing behaviors and implement this library in it at ﬁrst. In addition, we provide

the function that can write the book reviews in E-book devices. When users ﬁnish read-ing books, they can write the book review immediately and the reviews would be one

part of our data source. We could obtain the valid review data and reading behaviors together by using this approach [12]. Finally, we could determine the review quality by

(16)

2.1 Automatic Keyword Extraction

Automatic keyword extraction is a technique to identify meaningful and

represen-tative fragments, or words. Keyword is the minimal unit to express the topic of many documents. Most automation applications use keyword extraction technique in

unstruc-tured documents, such as data mining, automatic answer, automatic ﬁltering and so on. [13] In other words, keyword extraction is the basis and core technique for all

doc-umentation automations. Keyword extraction has been investigated in many diﬀerent academic ﬁelds. One kind of keyword extraction was be studies by the viewpoint of

linguistics. The research of linguistics from the perspective of linguistic analysis of the subject system and method of extraction, use of lexical knowledge, syntactic knowledge,

semantic knowledge, and chapter subject knowledge extraction of diﬀerent levels. Such research processes called “Information Extraction” and “Natural Language Processing”.

As Figure 2.2 shown, the data of readers’ reviews is converted into useful information (keywords) after “Keyword Extraction Process”.

Figure 2.2: The keyword extraction process

2.1.1 Keyword Extraction Tool

In our research, we also use keyword extraction as preliminary work for the data of reviews. We attempt many diﬀerent kinds of keyword extraction tools. We ﬁnd the most

(17)

• Keyword Analysis Tool1_{: This tool is a web-base service. Users use it by assigning}

a web page URL, necessary keywords (optional), and stop keywords (optional) in

its website. The results contain the keywords with only one word, the terms com-posed with multi-words named “keyphrases”, and their frequency of “keywords”

and “keyphrases”. Keyword Analysis Tool is really powerful in keyword extraction. However, this tool is unsuitable for our work. Because it can only input data by

assigning a URL, and without any open API for programmers. But our data of reviews was stored in database, and we do not intend to make web page for putting

them temporarily. Hence, Keyword Analysis Tool is not an appropriate tool for us.

• Yahoo! Term Extraction2_{: The Term Extraction Web Service provides a list of}

signiﬁcant words or phrases extracted from a larger content. It provides a open

API for programmers. By this open API, we can post the content of reviews to the

service, and obtain the results in XML or JSON format. The only weakness of this service is the results containing only keywords list without any other information

about keywords’ frequency or score. But it is still an useful tool for us to do keyword extraction, so we use this service in our work.

• Yahoo! CAS Web Service3_{: The CAS Web Service just like the Term Extraction}

Service, and it is more versatile. This service is developed by Taiwan Yahoo!. In addition to keyword extraction service, it also provide “Word Segmentation” service.

Speciﬁcally, the results of keyword extraction have the score for each keyword. The score can make us to realize the importance of the keyword in the review. Here

we do not use this service in our research at present stage, because this service was only used for Traditional-Chinese, and our research focuses on English review at

beginning.

1_{Keyword Analysis Tool: http://seokeywordanalysis.com/}

(18)

Chapter 3 Research Topic and Methodology

Our recommender system is a communication-style service. Users request the

rec-ommendation by describing the book content that they want, and our system returns a suggestion of books to users. Similar to the collaborative ﬁltering system, we need to

collect suﬃcient reviews as source data. By those data source, we could build a book recommender database though a series of calculation process. Finally, the source data

becomes the useful information for users.

3.1 System Conception

Bookle recommender system is a client-server base service. With the client-side

library, any handhold devices could be the clients. The main job of clients is that collected the readers reading behaviors, saved the book reviews, and published them to Bookle

server. Like a lot of on-line services, Bookle would be implemented as web-base service. The server side could be cut into two parts “front-end” and “back-end”. The back-end

of Bookle server is responsible for accepting the information from clients and integrating them by a series of calculations. The front-end of Bookle server is a website that provides

service interface for users. The users could describe the book content that they want and obtain the book recommendation list at this website. As Figure 3.1 shown, it expresses

(19)

Figure 3.1: Bookle system overview

3.2 Client Information Collection

Bookle attempts to measure the quality of the reviews by reading behaviors. To

protect readers’ privacy, Bookle do not want to collect the information of who issued the reviews. In fact, Bookle only needs the data about how they read the book. We focus

on the behaviors of reading, rather than the individual. For the quality of each review, we respect each reviewer’s comment on the content. Readers may choose “bad” as a

comment for the review, but the review is still valuable for our research.

To obtain the information of reading behaviors and reviews, Bookle needs to store them in the devices. We use the existing E-book functions to obtain useful data, such

as bookmarking, adding to favorite, noting focus, etc. For more convenient maintenance, Bookle use a light database system to store the information in devices. When users read

books, the users’ reading behaviors would be saved in this storage unconsciously. The information that Bookle chooses to save in devices is described in the following chapter.

1. Book Name

2. ISBN (International Standard Book Number): an unique numeric commercial book identiﬁer usually based upon the 10-digit or 14-digit numbers.

3. Date of Create at: The date when the book download to E-book device.

4. Date of Finish at: The date when the book read process arrives to 100%.

(20)

6. Note: The note (paintings focus with circle or line) numbers that reader created for the book.

7. isFavorite: Is the book added into “my favorite”?

8. Review: The review that reader wrote for the book.

The above data is stored in the client-side device. When users ﬁnish writing reviews

and publish them to the server-side machine, the above data would be transfered to the server together. At server side, we use these data of reading behaviors to determine the

review quality. We will explain why Bookle choose to store above information in chapter 4 detail, and illustrate how could we determine the review quality with these data in

(21)

3.3 Server Operation Scenario

The main tasks for Bookle server-side are information integration and service

presen-tation. For understanding our idea, we use the data of the book “The Lost Symbol” as an example to explain server operation. After collecting the book reviews and the reading

behaviors pair by pair from E-book, Bookle would integrate the data and present service as the following scenario:

3.3.1 Key Extraction from the Review

First, Bookle obtains the key information from reviews by keyword extraction tool. With those keywords, Bookle could infer the book content and the feelings which readers

read through the book. In Figure 3.2, there is an example that Bookle obtain the keywords from a review about book “The Lost Symbol”.

Figure 3.2: The keywords of a review of “The Lost Symbol”.

3.3.2 Keywords’ Weight Calculation

In next, Bookle calculates the weight of each keyword extracted from the reviews in

one book according to their occurrences and the review quality. The weight means the keyword importance to the book. When a keyword often appear in diﬀerent reviews, it

will have a higher weight. In other words, if a keyword with high weight that is expressed people often referred to it in their reviews. It has closer relationship for this keyword to the

(22)

from diﬀerent reviews. As Figure 3.3 shown, there are the weights of the keywords from all reviews about book “The Lost Symbol”.

Figure 3.3: The weight of the keywords in the book “The Lost Symbol”.

3.3.3 Discovering the related Books according to the Keywords

This step is in a reverse thinking. After Bookle obtain the weight of every keyword

in the books, Bookle start to obtain the related books for each keyword from the weight.

For each keyword, the top-K values of keyword weight are discovered from all books, and stored to the database. Although the keywords are the same, but they are related to K

diﬀerent books. Hence, Bookle can obtain the most relevant K books for every keyword. In Figure 3.4, it shows the related books about keyword “mystery”.

(23)

3.3.4 Presentation of Bookle Service

When users give a query to the service by describing the book content. Bookle parse

the description and obtain the keywords from the query. With these keywords, Bookle ﬁnd the books from the relevant books of keywords created at previous step. As Figure

3.5 shown, the books is related to description, “The mystery suspense thriller”.

(24)

Chapter 4 Server Framework and Algorithm

Most of the recommender systems would be conducted modularly for more

conve-nient maintenance and management, and Bookle is no exception. Bookle server could be generally divided into two parts. One part is the front-end for user service and

commu-nication with client, and the other part is the back-end for a series of calculation process. The following of this chapter will explain the Bookle server architecture and workﬂow in

detail.

4.1 The Back-end of Server

The main task of the back-end is to convert the review data into useful information.

The data which Bookle server collects from clients (E-book Users) needs to go through four main processing modules as Figure 4.1 shown.

• Text-mining: To obtain keywords from reviews by keyword extraction service.

(Ya-hoo! Term Extraction Service)

• Review-rating: To integrate the reading behaviors and to rate reviews quality. • Weight-calculating: To calculate the weight of each keyword related to books. • Book-ranking: To rank the related books for each keyword by keyword weight.

(25)

Figure 4.1: The Architecture of Server Back-end

4.1.1 The Workflow of Back-end

For converting the review data into useful information, the back-end of Bookle server runs the calculus processing as following workﬂow:

Step 1 : Extract keywords from each review. Bookle utilizes “Yahoo! Term Extraction

tool” (mentioned in 2.1.1) to extract the representative keywords from reviews and

store them into server database. If keywords is more than 20, Bookle records the ﬁrst 20 keywords as Figure 4.2 shown.

Figure 4.2: Keywords Extraction

Step 2-A : Calculate the average value of reading behaviors for each book. To rate

review quality, Bookle acquires three diﬀerent average values from reading behav-iors. The ﬁrst value is “Average Experience Period” that is a time period in days

as the unit. The “Experience Period” is the diﬀerence of Date of F inish at and

Date of Create at (mentioned in 3.2). The value of “Experience Period” does not

mean the actual reading time that user spent for a book. The value could be the best indicator for measuring how users love the book. When readers are very

inter-ested in a book, they would desire to go through whole book as soon as possible. The second value is the amount of “Average Bookmarks” of the book. Users could

(26)

set a bookmark at speciﬁc page in E-book for recording reading progress or just making a mark. The amount of bookmarks is also one of criteria for rating review

quality. The last one is the amount of “Average Notes” in a book. Users could make the section that they feel interested in with a circle or an underline in pages.

The amount of notes represents how seriously users read the book, and it is also a good reference for rating review quality.

ExperienceP eriod(EP ) = Date of F insih at− Date of Create at (4.1) If there are n reviews for a book, let Ri be the review i. Let EPi, BMi, and N otei

represent the Experience Period, Bookmark number, and Note number respectively. The formula for calculating the average value of reading behaviors can be expressed

as following:

• Average Experience Period

avg.EP =

∑n i=1EPi

n (4.2)

• Average Bookmark Number

avg.BM =

∑n

i=1BMi

n (4.3)

• Average Note Number

avg.N ote =

∑n

i=1N otei

n (4.4)

Step 2-B : Measure the quality for each review. The main assessment method bases on

the data about reading behaviors. There are four criteria for Bookle to measure the review quality.

1. Base Quality(BQ): The “Base Quality” is measured by the value about “Ex-perience Period”. This is the most obvious indicator to determine the degree

how readers love one book, and this value is regarded as the most inﬂuential variables of review quality. When a reader likes a book very much, the value of

Experience Period will be very small, even though he or she is in a busy life. If the Experience Period that someone spent on a book is ten times faster than

(27)

calculates the Base Quality by the ratio of average Experience Period and per-sonal Experience Period. The Base Quality value of a review is calculated as

Formula 4.5 shown.

BaseQuality(BQ) = avg.EP

P ersonalEP (4.5)

2. Bookmark Quality(BmQ): The value of “Bookmark Quality” is derived from

the amount of bookmarks. In the process of how to decide the value of Book-mark Quality, Bookle consider the relationship about the Experience Period

and bookmark numbers. For example, Kevin has a lot of leisure time to do things in which he interested. Hence, he could read his favorite books in a

short period of time. From his reading behaviors, the “Experience Period” is seven times faster than the average and he made few bookmarks. For another

example, Alice is a busy oﬃce worker, and she must make good use of spare time on her interests. Hence, it took her longer time period than Kevin to read

through the same book. From her reading behaviors, the “Experience Period” is three times faster than the average and she made many bookmarks. If Bookle

only consider the value of “Experience Period”, the fronter has better review quality. However, Alice made more eﬀorts on reading the book than Kevin, so

the later should have better review quality. It should be realized that Bookle need to provide the additional quality score from the data of bookmarks for the

later. Generally speaking, a review with more bookmarks is valued with higher Bookmark Quality in the case of the same Experience Period. To obtain the

Bookmark Quality, Bookle calculates the quotient of personal bookmark num-bers and average bookmark numnum-bers for relative comparison. The complete

value of “Bookmark Quality” is the calculation about “Experience Period” and “Bookmark Number” as Formula 4.6 shown.

BookmarkQuality(BmQ) = avg.EP P ersonalEP ∗

P ersonalBM

avg.BM (4.6)

3. Note Quality(NQ): The expression of Note is a circle or an underline on pages for making a key block. The readers who make notes while reading are usually

more diligent, but personal reading habits and types of books are also factors which should be taken into consideration. It generates more positive eﬀects

(28)

Bookle measures Note Quality by comparing personal note numbers with the average note numbers of a book. If personal note numbers are ten times more

than the average value, it would be scored full marks in this criteria.

N oteQuality(N Q) = P ersonalN ote

avg.N ote (4.7)

4. Favorite Quality(FQ): If readers put a book into virtual bookcase “My Fa-vorite”, the review quality would be rewarded extra bonus.

In conclusion, the total value of review quality is the sum of the above four criteria as Formula 4.8 shown. We can adjust the constant terms of the four criteria

dy-namically by situation. The complete measurement of review quality is expressed as Figure 4.3 shown.

ReviewQuality = αBQ + βBmQ + γN Q + δF Q (4.8)

Figure 4.3: Review Quality Measurement

Step 3 : Calculate keyword weight for each keyword in each book. Keywords are

ex-tracted from reviews of books at Step 1, and Bookle would assign a weight for every

keyword as Figure 4.4 shown. Bookle calculates the weight of each keyword by occurrences and quality of reviews which the keywords occurred. Same keywords

keywords might reoccur in diﬀerent reviews of one book, and the quality values of diﬀerent reviews combine into the weight of the keyword. In generally, the more

(29)

weight of a keyword by the sum of the review quality value from which the keyword is extracted. For keyword T of the book B, if keyword T reoccurs in n reviews Rn

and the review quality of Rn is Qn, the weight of keyword T for this book B is the

sum of Qn as Formula 4.9. W eight(T ) = n ∑ i=1 Qn (4.9)

Figure 4.4: The Weight for Keywords

Step 4 : For each keyword, rank the books that related to the keyword. At step 3, Bookle

obtains information about keyword weights of all books. At this step, Bookle utilizes the information to rank books for keywords. The same keywords might have relation

with different books, because readers have the same experience between the books. The same keywords in different books would be weighted in different value. Bookle

ranks books for each keyword by sorting the top-K weights. After this step, Bookle obtains information about the top-K books for each keyword as Figure 4.5 shown.

Figure 4.5: The Books Related to Keywords

Finally, Bookle can make the suggestion by the information from Step 4 when users

request for recommender service. The information about books related to keywords could help Bookle to speculate what books users want. This step is also the ﬁnal step of back-end

(30)

4.2 The Front-end of Server

The main function of the front-end is regarded as a user interface or programmer

interface. There are two main tasks in front-end, one is the website service [14], and the other is the restful API for the uniﬁed usage.

• Website Service: Bookle designs a website for recommender service. Users could

describe book contents in which they are interested. And then, Bookle would show a recommendation list of related book sorted by “Bookle Score”.

• Restful API: Bookle provides two restful API for programmer usage. The

program-mer could develop all kinds of service through the two API. Bookle restful API

requests should be submitted using an HTTP POST request rather than GET.

1. Review Acceptor: Accept information about user reading behaviors and re-views from handhold device (E-book) and store them in server database. The

information must include book ISBN, Date of Create at, Date of Finish at, bookmark numbers, note numbers, isFavorite message, and review (mentioned

at chapter 3.2).

2. Recommender: Response the book recommendation query. The POST message

must include book description. The result format of Recommender could be

XML or JSON by the parameter “type” in POST.

(31)

4.2.1 The Workflow of Recommender

When users submit the book description to request Bookle recommender service.

Bookle would deal with the description as following steps:

Step 1 : Users request Bookle recommender service by describing the book content in

which they are interested.

Step 2 : Extract keywords from the user description by term extraction tool (mentioned

at chapter 2.1.1).

Step 3 : Select the related books from the “Book Ranking for Keywords” table by the

extracted keywords.

Step 4 : Cross compare to ﬁnd the same book and sum up the weight of keywords as

Bookle Score.

Step 5 : Reply a recommendation list to the users with top-10 books in Bookle Score.

For more convenience, the results of recommendation could be packed in the XML

or JSON format. Programmers can show the result of recommendation in many diﬀer-ent ways easily. Programmers could develop their services at handhold devices or any

application environments. At present, Bookle provides the base service for users by the website.

(32)

Chapter 5 Implementation and Experiment

5.1 Client Storage

Because the power of handhold devices is limited, Bookle use a light and convenience database system named “SQLite” 1_{. Dislike general database systems, SQLite is not a}

client-server based database. Whole database (including deﬁnition, tables, index, and data) were existed in single ﬁle. Because handhold devices only have limited computing

ability and battery capacity, “SQLite” is the best choice of database system for Bookle client-side.

5.2 Communication

When readers read through a book and write reviews, they can publish the review to the Bookle server. Meanwhile, the information of reading behaviors would also be

transfered to Bookle server. For transferring those information to server, Bookle make the communication API base on “libcURL”2 _{library. “libcURL” is a free and easy-to-use}

client-side URL transfer library, supporting FTP, HTTP, HTTPS, SMTP, POP3, etc. The main task for this communication API is responsible for submitting review information

from client to server and dealing with the responses. Relatively, there are some restful APIs in Bookle server-side for accepting requests from client. As the Figure 5.1 shown,

(33)

this is the concept diagram about client-server communication.

Figure 5.1: The communication of client-server

5.3 Results Presentation

As Figure 5.2 shown, it is a snapshot of Bookle website that provides recommender

service. Users can describe book contents they want in the text ﬁeld, and click the button “Conﬁrm” to request recommender service. The results of recommendation would be

showed at the bottom of this page. For example, someone writes the description “The story about true love.” and request to the service. The Bookle service responses a

recommendation list with two books, one is “Breaking Dawn”, and the other is “Eclipse”. In this recommendation list, Bookle would provide an indication to express the degree of

consistency for the description by “Bookle Score”. This score is calculated by the weight of keywords extracted from description. The more reviews are related to one book, the

better score the book gets. In other words, this score could be regarded as the popularity of books. Books with higher score indicating more relevant to the user needs. Users can

reference this information to discover the books in which they are interested.

Most recommender systems use the users transaction records as data source. The

most common type of this recommender service is “Customers Who Bought This Item Also Bought”. When you go shopping at online store, you can often see a block about

this service at the website. However, it usually can not provide what you really want. Actually, it provides information about popular products. When a product sells better, it

is more possible to appear in this block. For example, Andy wants to buy some love stories at online store (e.g., Amazon.com). He decides to buy the love story “Breaking Dawn”,

(34)

Figure 5.2: Bookle Website Demo

and references to the block named “Customers Who Bought This Item Also Bought” for more books at the same web page. He sees the book “Harry Potter and Deathly Hallows”

in this block, but the book is a fantasy and mystery ﬁction.

Diﬀerent from traditional recommender services, Bookle utilizes reader reviews as data source. The results of recommendation are the most relevant books about user’s

description, but it does not mean that “Customers Who Bought This Item Also Bought” service should be discarded. This service oﬀers the related products and popular products

information, and Bookle service like a book content search engine which lists books that you want to read. The two systems are designed on diﬀerent data source and utilizes

diﬀerent methods, but work for similiar purpose. However, the two systems have their own characteristics and advantages.

(35)

5.4 Experiment

In experiment, we utilizes a common search method to be compared with Bookle.

When users describe the book content in experiment web page, they will obtain two kinds of results. One result is recommended by Bookle service, and the other is recommended

by keyword search from book title or abstract, which is called default method temporarily. If one party does not response any result, it will get 0 points. As Figure 5.3 shown, it

is a snapshot of the experiment web page. Users can grade the results from the ratio buttons. According to the satisfaction of results, users can score for results from 1 to 5.

The information recorded in the experiment including description strings, the scores for Bookle, and the scores for default method.

Figure 5.3: The Snapshot of Experiment Page

5.4.1 Data Source

Because we can not obtain large amount of data from E-book devices at this stage, we need to use the existing resources. The main data source can be divided into two

parts. One part is the data of book reviews. For obtaining large amount of review data, we crawl the book reviews from Amazon.com as data source. And the other part is the

data of reading behaviors. On existing resource, it is hard to acquire the data of reading behaviors. We need to make some data of reading behaviors appropriately. The artiﬁcial

(36)

data of reading behaviors contained “Experience Period”, “Bookmark numbers”, “Note numbers” and “isFavorite”. We utilize certain speciﬁc rules to generate the data of reading

behaviors.

• Book Reviews: The reviews of top-100 books from Amazon.com in April and May,

and the amount of reviews of a book is more than 50. Currently there are 50 books and more than 18,000 reviews in the server database.

• Reading Behaviors

1. Experience period(Date of Creat At): To generate a period value P by

Nor-mal Distribution Random Number. ExpectedV alue = P ageN umber/15 and

StandardDeviation = 1.

Date of Create At = Date of F inish At− P (Unix Timestamp format)

2. Experience period(Date of F inish At): It is the date that readers posted re-views at Amazon.com in UNIX Timestamp format.

3. Bookmark Number: To generate a random integer by Normal Distribution

Random Number. ExpectedV alue = P ageN umber/15 and StandardDeviation = 1.

4. Note Number: To generate a random integer by Normal Distribution Random

Number. ExpectedV alue = P ageN umber/5 and StandardDeviation = 1.

(37)

5.4.2 Experiment Results

In the experimental process, we invited the 6 volunteers to rate our service. They

tested a total of 50 descriptions to our system, and we use the parameter settings for measuring review quality as Table 5.1 shown.

Table 5.1: The Criteria Ratio of Review Quality in Experiment

Base Quality Bookmark Quality Note Quality Favorite Quality

Percentage 55% 25% 15% 5%

As the Figure 5.4(a) shown, It is often occurred in default method that no results is responded because the keywords of the description can not be found from book title

or abstract. Comparing to the situation of results, the average score of Bookle is sightly better than that of default method in Figure 5.4(b). Generally speaking, most of the

volunteers think the results of Bookle are more adequate to the requirements than that of default method.

(a) Whether the Results for Description (b) Average Scores of Results

(38)

Chapter 6 Conclusion

Bookle is a recommender system for books. Books are special products, it is hard

to rate the quality of books by speciﬁc features. Bookle attempts to make suggestion in another way that users obtain book recommendation by describing book content they

wanted. Bookle utilizes the reviews as basic data source and extracts key elements from reviews by keyword extraction technique. Research about using reviews in recommender

systems is still in its infancy. To the best of our research, this is the ﬁrst attempt to build a recommender system for books based on reader reviews in free-form text. To be

more credible, Bookle measures the review quality by reading behaviors. Although Bookle already takes many elements into consideration, there are still a lot could be improved.

First, so far there are no incentives for users to write reviews for Bookle. We suggest network beneﬁts for users willing to share their reviews, so the social network connection

with Bookle should be built as soon as possible. Users would write comments because they want to share their opinions with friends. The second one is the importance of a keyword

in one review. Bookle thinks that the keywords of a review have the same importance, but it should distinguish the importance of individual keyword in one review. Some keyword

extraction services could provide the importance degree of keywords in results. The last but not the least, Bookle can consider more criteria of reading behaviors, for example, the

max time period which users read the book and how many times users ﬂip over pages in a continuous reading. Taking those criteria into consideration could enhance the reliability

of reviews and avoid improper operation of users eﬀectively. With the three improvements mentioned above, Bookle would gradually got enhanced in the future.

(39)

Bibliography

[1] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender sys-tems: a survey of the state-of-the-art and possible extensions,” IEEE Transactions

on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734–749, 2005.

[2] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, “Collaborative ﬁltering

rec-ommender systems,” pp. 291–324, 2007.

[3] D. Billsus and M. You, “Learning collaborative information ﬁlters,” Proc. 15th Int’l

Conf. Machine Learning, pp. 46–54, 1998.

[4] R. Burke, “Hybrid recommender systems: Survey and experiments,” User Modeling

and User-Adapted Interaction, vol. 12, no. 4, pp. 331–370, 2002.

[5] W. Yang, Z. Wang, and M. You, “An improved collaborative ﬁltering method for

recommendations’ generation,” in Systems, Man and Cybernetics, 2004 IEEE

Inter-national Conference on, vol. 5, pp. 4135–4139, 2004.

[6] P. Melville, R. J. Mooney, and R. Nagarajan, “Content-boosted collaborative ﬁlter-ing for improved recommendations,” Eighteenth National Conference on Artificial

Intelligence, pp. 187–192, 2002.

[7] S. Aciar, D. Zhang, S. Simoﬀ, and J. Debenham, “Informed recommender: Basing

recommendations on consumer product reviews,” IEEE Intelligent Systems, vol. 22, no. 3, pp. 39–47, 2007.

[8] D. Cosley, S. K. Lam, I. Albert, J. A. Konstan, and J. Riedl, “Is seeing believing?: how recommender system interfaces aﬀect users’ opinions,” in CHI ’03: Proceedings

of the SIGCHI conference on Human factors in computing systems, pp. 585–592,

ACM, 2003.

[9] Y. Liang, Z. Zhao, and Q. Zeng, “Mining user’s interest from reading behavior in e-learning system,” in Software Engineering, Artificial Intelligence, Networking, and

Parallel/Distributed Computing, 2007. SNPD 2007. Eighth ACIS International Con-ference on, vol. 2, pp. 417–422, 2007.

(40)

[10] M. Morita and Y. Shinoda, “Information ﬁltering based on user behavior analysis and best match text retrieval,” in SIGIR ’94: Proceedings of the 17th annual

interna-tional ACM SIGIR conference on Research and development in information retrieval,

pp. 272–281, Springer-Verlag New York, Inc., 1994.

[11] R. Chen, A. Rose, and B. B. Bederson, “How people read books online: mining and

visualizing web logs for use information,” in ECDL’09: Proceedings of the 13th

Eu-ropean conference on Research and advanced technology for digital libraries, pp. 364–

369, Springer-Verlag, 2009.

[12] W. Woerndl, C. Schueller, and R. Wojtech, “A hybrid recommender system for

context-aware recommendations of mobile applications,” in ICDEW ’07:

Proceed-ings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop,

pp. 871–878, IEEE Computer Society, 2007.

[13] A. M. R´aez and R. Steinberger, “Why keywording matters,” High Energy Physics

Libraries Webzine, 2004.

[14] P. Kazienko, “Web-based recommender systems and user needs –the comprehensive

view,” in Proceeding of the 2008 conference on New Trends in Multimedia and

Net-work Information Systems, pp. 243–258, IOS Press, 2008.

[15] J. A. Konstan, “Introduction to recommender systems,” in SIGMOD ’08:

Proceed-ings of the 2008 ACM SIGMOD international conference on Management of data,

pp. 1373–1374, ACM, 2008.

[16] J. O’Donovan and B. Smyth, “Trust in recommender systems,” in IUI ’05:

Proceed-ings of the 10th international conference on Intelligent user interfaces, pp. 167–174,

ACM, 2005.

[17] M. Degemmis, P. Lops, and G. Semeraro, “A content-collaborative recommender that exploits wordnet-based user proﬁles for neighborhood formation,” User Modeling and

User-Adapted Interaction, vol. 17, no. 3, pp. 217–255, 2007.

[18] M. Elahi, “Context-aware intelligent recommender system,” in IUI ’10: Proceeding of

(41)

[19] M. Balabanovi´c and Y. Shoham, “Fab: content-based, collaborative recommenda-tion,” Commun. ACM, vol. 40, no. 3, pp. 66–72, 1997.

[20] P. Massa and P. Avesani, “Trust-aware recommender systems,” in RecSys ’07:

Pro-ceedings of the 2007 ACM conference on Recommender systems, pp. 17–24, ACM,

2007.

[21] T. Tran, “Designing recommender systems for e-commerce: an integration approach,” in ICEC ’06: Proceedings of the 8th international conference on Electronic commerce,

pp. 512–518, ACM, 2006.

[22] G. Adomavicius and A. Tuzhilin, “Context-aware recommender systems,” in RecSys

’08: Proceedings of the 2008 ACM conference on Recommender systems, pp. 335–336,

ACM, 2008.

[23] Liu and Ziming, “Reading behavior in the digital environment: Changes in reading behavior over the past ten years,” Journal of Documentation, vol. 61, pp. 700–712,

January 2005.

[24] K. O’Hara and A. Sellen, “A comparison of reading paper and on-line documents,”

in CHI ’97: Proceedings of the SIGCHI conference on Human factors in computing

Bookle ： 書籍推薦系統 - 基於讀者的評論與閱讀行為

國 立 交 通 大 學

資訊學院資訊科技（IT）產業研發碩士專班

碩 士 論 文

Bookle

： 書籍推薦系統

基於讀者的評論與閱讀行為

Bookle : A Book Recommender System

Based on Readers’ Reviews and Reading Behaviors

研 究 生： 卓建益

指導教授： 黃俊龍 教授

Bookle

： 書籍推薦系統

基於讀者的評論與閱讀行為

Bookle : A Book Recommender System

Based on Readers’ Reviews and Reading Behaviors

研 究 生 ： 卓建益

Student: Chien-I Tso

指 導 教 授 ： 黃俊龍

Advisor: Jiun-Long Huang

國 立 交 通 大 學

資訊學院資訊科技（IT）產業研發碩士專班

碩 士 論 文

A Thesis

Submitted to College of Computer Science

National Chiao-Tung University

in Partial Fulﬁllment of the Requirements

for the Degree of

Master

in

Industrial Technology R & D Master Program on

Computer Science and Engineering

August 2010

Hsinchu, Taiwan, Republic of China

Bookle

： 書籍推薦系統

基於讀者的評論與閱讀行為

學 生 ： 卓建益

指 導 教 授 ： 黃俊龍

國立交通大學資訊學院資訊科技（IT）產業研發碩士專班

摘

要

Bookle : A Book Recommender System

Based on Readers’ Reviews and Reading Behaviors

Student: Chien-I Tso

Advisor: Jiun-Long Huang

Industrial Technology R & D Master Program of

Computer Science College

National Chiao-Tung University

ABSTRACT

誌

誌

誌

謝

謝

謝

目

目

目

錄

錄

錄

表

表

表

目

目

目

錄

錄

錄

圖

圖

圖

目

目

目

錄

錄

錄

Bookle ：書籍推薦系統 - 基於讀者的評論與閱讀行為

國立交通大學

碩士論文

：書籍推薦系統

研究生：卓建益

指導教授：黃俊龍　教授

：書籍推薦系統

研究生：卓建益

指導教授：黃俊龍

國立交通大學

碩士論文

：書籍推薦系統

學生：卓建益

指導教授：黃俊龍