
A Collaborative Filtering-Based Two Stage Model with Item Dependency for Course Recommendation

Eric L. Lee

Computer Science and Information Engineering National Taiwan University

r01922164@csie.ntu.edu.tw

Tsung-Ting Kuo

Department of Biomedical Informatics University of California, San Diego

tskuo@ucsd.edu

Shou-De Lin

Computer Science and Information Engineering National Taiwan University

sdlin@csie.ntu.edu.tw

ABSTRACT

Recommender systems have been studied for decades, with numerous promising models proposed. Among them, Collaborative Filtering (CF) models are arguably the most successful, due to their high recommendation accuracy and their elimination of privacy-sensitive personal meta-data from training. This paper extends CF-based models to the task of course recommendation. We point out several challenges in applying existing CF models to build a course recommendation engine, including the lack of ratings and meta-data, the imbalance of the course registration distribution, and the need to model course dependency. We then propose several ideas to address these challenges. Eventually, we combine a two-stage CF model regularized by course dependency with a graph-based recommender built on a course-transition network, achieving an AUC as high as 0.97 on a real-world dataset.

CCS Concepts

• Information systems → Database management system engines • Information systems → Data mining • Information systems → Collaborative filtering

Keywords

Collaborative Filtering; Matrix Factorization; Recommendation Systems; Data Mining.

1. INTRODUCTION

Collaborative Filtering (CF) based techniques have become very popular for designing recommendation systems. Among them, the Matrix Factorization (MF) model, which jointly learns the user and item latent factors for CF, has enjoyed tremendous success and has become one of the standard solutions. This paper extends the existing matrix factorization model to handle a different type of recommendation task: recommending courses to students.

Education institutions normally offer a spectrum of courses in different areas for students to choose from, and in many cases the number can be overwhelming. For example, in 2012 more than ten thousand courses were offered at National Taiwan University (NTU). Thus, selecting suitable courses for the upcoming semester is a demanding task for students; more importantly, improper course selection can lead to serious wasted effort for students and extra administrative burden for faculty.

There have been a few previous works on course recommendation [1] [9]. These models mainly rely on meta-information such as the curriculum information from each department, feedback from students, or the grades received by former students. Here we argue that the existence of such meta-data is not guaranteed, and even when it exists, it might not be available due to privacy concerns. To address this concern, our goal is to design a general-purpose, privacy-preserving course recommendation system that requires only minimal personal information (i.e., course registration records) from students.

To develop a course recommendation system, a first thought would be to treat students as users and courses as items and deploy a CF-based solution. To achieve this, we would normally require some 'ratings' from students for courses, specifying how much they like each course. Based on these ratings, a CF-based solution can utilize the similarity between students as well as the similarity between courses to predict the level of a student's interest in the courses they have not yet taken.

However, here we argue that there are several practical challenges that hinder the effectiveness of conventional CF models for the course recommendation task:

1. Potentially lack of rating data from students to courses.

Traditional CF-based methods rely on ratings from users to items. However, such rating data might not be universally available for training. For instance, although students' feedback ratings for each course at National Taiwan University partially exist, they are kept private for privacy reasons, with only statistical aggregates of the ratings made available to the corresponding instructors. On the other hand, it is much less controversial to obtain 'course registration' information, namely a binary value indicating whether a student registered for a course.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

KDD’16, August 13–17, 2016, San Francisco, CA, USA.

Copyright 2016 ACM 1-58113-000-0/00/0010 …$15.00.

DOI: http://dx.doi.org/10.1145/12345.67890


Figure 1. One Class Collaborative Filtering (OCCF) problem vs. traditional recommendation problem.

Acknowledging the lack of meta-information, in this paper we model the course recommendation task as a One Class Collaborative Filtering (OCCF) challenge, for which only a partially observed binary matrix indicating whether a student registered for a course is available. See Figure 1 for a comparison between the traditional CF problem and the OCCF problem. Compared to the traditional CF problem, OCCF faces more challenges, which are detailed in Sections 2 and 3.

2. Imbalance of the user-item matrix.

The available registration information normally includes the records of both current and graduated students. Note that for current students, only the records of previously registered courses are available. That is, the more senior a student is, the more data are available.

We begin by analyzing the data to be used to train our model. First, we observe that courses are normally not taken uniformly across different seniority levels. Some courses are usually taken early, while others are taken in the later stages of a student's academic career. For example, Table 1 shows the distribution of seniority for four different courses at NTU during 2008–2013. It shows that Chinese Literature and Calculus are usually taken by freshmen, while Database Systems is taken by upperclassmen. We found such imbalanced distributions to be very common: during 2008–2013, more than 70% of the courses at NTU were dominated by students of the same level, meaning that more than 50% of the registered students were of the same seniority.

Such an imbalanced distribution tells us that the course registration records of students of different seniority are likely to be very different. This difference can cause serious problems for a CF-based model. Assume now that the goal is to recommend courses to a student s who is entering his senior year. A conventional CF model would recommend courses taken by other students similar to s, where similarity is determined by the number of common courses two students have previously taken together. However, the imbalance of the course-registration records implies that two students of different seniority (e.g., a freshman vs. a junior) will not be very similar, so the students most similar to s are likely of the same seniority. On the other hand, most of the course records from students of the same seniority would not be very helpful for recommendation, since it is likely that very few students of the same seniority have taken those courses in the past.

Table 1. Distribution of course registration for students with different grade level.

Course              Freshmen  Sophomore  Junior  Senior
Chinese Literature  96.5%     1.5%       1.1%    0.7%
Calculus            85.8%     6.6%       3.7%    2.3%
Data Structure      3.8%      73.4%      12.7%   7.6%
Database Systems    17.4%     9.5%       21.6%   51.7%

This brings up a dilemma: in order to generate effective recommendations for s, a model should indeed include the records of senior or graduated students who have taken higher-level courses. However, based on the definition of CF, those senior students are not necessarily the ones most similar to s. Furthermore, the popularity of courses can change over time (e.g., the Neural Network course at NTU has become much more popular recently due to the rise of deep learning), so simply looking up the courses taken by more senior students might not be an ideal solution. This is the main challenge this paper tries to handle.

3. Courses are coherent and not independent.

Different from product recommendation, in which most products are independent and the order of purchase matters little, there are strong temporal and ordering correlations between the courses taken by a student. For example, we would not recommend calculus to those who have already taken advanced calculus, while recommending advanced calculus to those who just took calculus seems reasonable. Conventional CF does not consider such dependency, which becomes another main focus of our solution.

To address the abovementioned challenges, we first adopt the Bayesian Personalized Ranking Matrix Factorization (BPR-MF) [11] method, which models the OCCF problem as a ranking problem.

Next, we propose a novel two-stage framework to handle the second challenge. In the first stage, we use the registration records of all students to learn the latent features of the courses. Then, based on the learned course latent features, in the second stage we learn the students' latent factors to optimize the ranking of courses for each student. To model the course dependency mentioned in the third challenge, we build an item transition network to capture the probability of a course being taken after another. This transition network serves two purposes: first, it is used to regularize our two-stage CF model; second, it is used to build a Personalized PageRank model for recommendation. The results from these two models are then combined to generate our final outputs.

We compare our model to several baseline solutions on a real-world course registration dataset containing about 14K students and about 900K course records. The experimental results show that the proposed model produces significantly better results, with the final ensemble model reaching 0.97 in AUC.

The rest of the paper is organized as follows. In Section 2 we describe related work. The details of the proposed method are described in Section 3, and the experiments are shown in Section 4. We then conclude in the final section.


2. RELATED WORK

2.1 Course Recommendation Systems

There are several existing works on course recommendation.

Parameswaran et al. [9] propose a model that uses knowledge obtained from the curriculum of each department to recommend courses. Bendakir et al. propose a model, RARE [1], that discovers rules from historical data. However, these solutions require extra meta-information, such as the department a student belongs to, the course-registration constraints implemented by each department, and the feedback of students on each course, which conflicts with the goal of not requiring personal meta-information that we have set up in this paper.

2.2 Collaborative Filtering

Collaborative Filtering (CF) techniques have long been used to model explicit feedback from users. Compared to content-based techniques [10] [14], CF methods are more general, require less information, and in many situations produce superior results.

One of the most straightforward CF models is k-nearest-neighbor based CF (kNN-CF) [3] [4] [13] [16], which relies on user-wise or item-wise similarity. In general, the similarity measure (e.g., the Pearson correlation) of kNN-CF is chosen through a trial-and-error process on validation datasets.

Recently, Matrix Factorization (MF) based methods have become popular and are widely accepted as the state-of-the-art single model for CF, as researchers have found that, given sufficient rating data, MF methods outperform many other methods in competitions such as the Netflix Challenge [28] [29] [30] [31] and the KDD Cup [26] [32] [33] [34] [35]. Compared to kNN-CF methods, MF methods are usually more efficient and effective, as they are able to discover the latent features hidden behind the interactions between users and items. However, MF tends to overfit the data, so extensions have been proposed to address this issue, such as regularized least-squares optimization with case weights (WR-MF) [5] [8] and max-margin MF [15] [17].

However, the abovementioned CF techniques are designed to model explicit feedback from users (e.g., music ratings). In many practical scenarios, such as course recommendation, only implicit feedback (e.g., binary values indicating whether a student took a course) is available. Therefore, these CF algorithms cannot be applied directly to tasks such as course recommendation.

2.3 One-Class Collaborative Filtering

The problem of applying CF to implicit feedback is known as the One-Class Collaborative Filtering (OCCF) task [8]. In OCCF, the magnitude of a user's preference is usually subtle; therefore, it is hard to distinguish negative examples from unlabeled examples.

OCCF can be regarded as a ranking problem in which we need to rank the positive instances above the others. Exploiting the idea of ranking optimization, Rendle et al. [12] propose the Bayesian Personalized Ranking Matrix Factorization (BPR-MF) framework to optimize the Area Under the receiver-operating-characteristic Curve (AUC) for the Matrix Factorization (MF) model. Pan et al. [8] also provide several approaches to this problem, such as the Weighted Alternating Least Squares (WALS) method, which gives missing examples different weights to avoid the drawbacks of two extreme cases: all missing as unknown and all missing as negative. Another method proposed in [8] is the Sampling ALS Ensemble (SALS-ENS), which applies bootstrap aggregating (bagging) techniques. In our work, BPR-MF is adopted due to its simplicity and flexibility.

2.4 Sequence Recommendation

Sequence recommendation is probably the recommendation task closest to course recommendation; it aims at recommending items to users in the next period of time.

Several studies of sequence recommendation for various applications have been proposed. For example, Claudio et al. [19] present a case-based recommendation approach for music playlist recommendation; Yong et al. [21] develop two cost-aware latent factor models to recommend tours by considering both travel cost and the tourist's interests; Ziegler et al. [25] propose topic diversification to balance and diversify personalized recommendation lists; Yap et al. [24] design a framework that learns user-specific sequence importance knowledge using a competence score measure; Hsieh et al. [22] solve the urban route recommendation task, proposing a model that focuses on time-sensitive routes; Liu et al. [23] devise a hybrid recommendation method that combines a segmentation-based sequential rule method with k-nearest-neighbor collaborative filtering for a particular e-commerce application; and Cheng et al. [20] develop a personalized model to recommend Point-of-Interest (POI) sequences.

Although these studies all consider sequential behavior for recommendation, their methods mostly involve ad-hoc algorithms, particular scoring functions, special features, or additional information for each specific application. In our paper, we focus on proposing a general method that solves the real-world course recommendation problem without using meta-data or application-specific content information (e.g., detailed student or course information).

The CF-based solution proposed by Rendle et al. [18] for sequence recommendation seems applicable to our problem. This model brings Matrix Factorization (MF) and Markov Chains (MC) together to consider both personalized preference and the actions (e.g., purchase behavior) in the last basket (time period), and has been successfully evaluated on the purchase data of an online drug store.

However, the solution in [18] cannot be directly applied to course recommendation. First, it assumes sparse transition behavior (e.g., purchasing) and a long training period (e.g., more than a thousand time slots). In our problem, the transition behavior (course registration) is dense, and the number of temporal slots is small (only 4 academic years). Thus, it would be hard to learn a meaningful Markov Chain. Also, the issue of distribution imbalance described previously is not handled by this model.

3. METHODOLOGY

We are given a registration matrix R indicating whether a student s registered for a course c in the past. As shown in the left part of Figure 1, an element of R is 1 if the student did take the course, and ? (indicating "unknown") if the student has not yet registered for the course. We model the course recommendation problem as: given the registration matrix R, predict the probabilities that the unknown elements are actually 1.

To solve this problem, we propose a two-stage CF model with dependency regularization. In the following subsections, we introduce the three main components of our model: Bayesian Personal Ranking Matrix Factorization (BPR-MF), two-stage training, and course dependency regularization.


3.1 Bayesian Personal Ranking Matrix Factorization (BPR-MF)

We first introduce a variation of the well-known Matrix Factorization (MF) technique, adapted to our scenario.

The basic MF model aims at finding two matrices, P (the latent feature matrix for students) and Q (the latent feature matrix for courses), whose product best recovers the input matrix R by minimizing the squared error between $R_{sc}$ and $P_s \cdot Q_c$. The objective function of MF is

$\min_{P,Q} \sum_{(s,c) \in W} (R_{sc} - P_s \cdot Q_c)^2$, (1)

where W represents the set of all existing (student, course) pairs. Eventually, MF learns the values in $P_s$ and $Q_c$ as the latent student/course features, and uses the corresponding inner product of P and Q to produce the prediction score.
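As a concrete illustration, the objective in Equation (1) can be minimized with plain stochastic gradient descent over the observed entries. The following is a minimal sketch, not the paper's actual implementation; the function name, input format (a list of observed (student, course, value) triples), and hyperparameters are our own choices:

```python
import numpy as np

def train_mf(entries, n_students, n_courses, k=8, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Minimize Eq. (1) by SGD over the observed (student, course, value) triples W."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_students, k))  # student latent features
    Q = rng.normal(scale=0.1, size=(n_courses, k))   # course latent features
    for _ in range(epochs):
        for s, c, r in entries:
            err = r - P[s] @ Q[c]   # residual R_sc - P_s . Q_c
            p_s = P[s].copy()       # use the pre-step value for both updates
            P[s] += lr * (err * Q[c] - reg * P[s])
            Q[c] += lr * (err * p_s - reg * Q[c])
    return P, Q
```

The small weight-decay term (`reg`) plays the role of the usual L2 regularizer that keeps the factors from growing unboundedly.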

However, applying the MF technique directly to our problem leads to a serious drawback. In the given scenario, the values in the registration matrix R are either one (i.e., registered) or unknown. Directly applying MF to such data leads to a useless solution in which the model predicts every entry as 1 to minimize the error.

The problem we are trying to solve is known as One Class Collaborative Filtering (OCCF). To solve OCCF, we adopt the Bayesian Personalized Ranking Matrix Factorization (BPR-MF) [11] model. BPR-MF optimizes the Area Under the receiver-operating-characteristic Curve (AUC) instead of the squared error. The reason to optimize AUC is that it yields a model that produces a faithful ranking of instances, giving higher prediction scores to the courses to be recommended.

Given a student s, denote the set of courses that s has taken as $I_s^+$ and those s has not taken as $I_s^-$. BPR-MF maximizes the difference between the likelihood that the courses are in $I_s^+$ and the likelihood that the courses are in $I_s^-$.

Formally, let $P_{sk}$ be the k-th latent feature of student s in P, $Q_{ik}$ the k-th latent feature of a taken course $i \in I_s^+$ in Q, and $Q_{jk}$ the k-th latent feature of an untaken course $j \in I_s^-$ in Q. The objective function of BPR-MF is defined as:

$Error := \sum_{(s,i,j) \in D} \ln(1 + \exp(-y_{sij})) + \lambda(\|P\|^2 + \|Q\|^2)$, (2)

where

$D := \{(s, i, j) \mid i \in I_s^+ \text{ and } j \in I_s^-\}$ (3)

is the set of positive-negative course tuples for student s, and

$y_{sij} = \sum_{k=1}^{K} P_{sk}(Q_{ik} - Q_{jk})$ (4)

is the pairwise difference of the likelihoods between $I_s^+$ and $I_s^-$, K is the number of latent features, and $\lambda$ is a regularization term. We can further define

$u_{sij} = \frac{-\exp(-y_{sij})}{1 + \exp(-y_{sij})}$. (5)

Then, BPR-MF can be learned with Stochastic Gradient Descent (SGD), after deriving the partial derivatives with respect to $P_{sk}$, $Q_{ik}$, and $Q_{jk}$:

$\frac{\partial Error}{\partial P_{sk}} = u_{sij} \cdot (Q_{ik} - Q_{jk}) + 2\lambda P_{sk}$ (6)

$\frac{\partial Error}{\partial Q_{ik}} = u_{sij} \cdot P_{sk} + 2\lambda Q_{ik}$ (7)

$\frac{\partial Error}{\partial Q_{jk}} = u_{sij} \cdot (-P_{sk}) + 2\lambda Q_{jk}$. (8)

Figure 2. Four parts of the course registration records. The blue parts are records for all former students. The orange part contains records of current students. The red part contains mostly the courses not yet taken by current students.

However, the SGD-based optimization procedure has complexity $O(|I_s^+| \cdot |I_s^-|)$. This can cause a significant computational burden because $|I_s^-|$ is usually very large (i.e., there are many courses a student has not yet taken). To overcome this limitation in the course recommendation problem, we perform down-sampling on the negative set $I_s^-$: for a given student s, we randomly sample one negative course from $I_s^-$ for every instance in $I_s^+$. This significantly improves the training speed.
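To make the procedure concrete, the following pure-Python sketch trains BPR-MF with the one-negative-per-positive down-sampling described above. Function name, input format (a list of positive (student, course) pairs), and hyperparameters are our own assumptions; the comments point to the corresponding equations:

```python
import math
import random

def train_bpr_mf(pos, n_students, n_courses, k=8, lr=0.05, reg=0.01, epochs=200, seed=0):
    """BPR-MF by SGD, sampling one negative course per positive instance."""
    rnd = random.Random(seed)
    P = [[rnd.gauss(0, 0.1) for _ in range(k)] for _ in range(n_students)]
    Q = [[rnd.gauss(0, 0.1) for _ in range(k)] for _ in range(n_courses)]
    taken = {s: set() for s in range(n_students)}
    for s, i in pos:
        taken[s].add(i)
    for _ in range(epochs):
        for s, i in pos:
            j = rnd.randrange(n_courses)      # down-sample one negative from I_s^-
            while j in taken[s]:
                j = rnd.randrange(n_courses)
            y = sum(P[s][f] * (Q[i][f] - Q[j][f]) for f in range(k))  # Eq. (4)
            u = -math.exp(-y) / (1.0 + math.exp(-y))                  # Eq. (5)
            for f in range(k):
                ps, qi, qj = P[s][f], Q[i][f], Q[j][f]
                P[s][f] -= lr * (u * (qi - qj) + 2 * reg * ps)  # Eq. (6)
                Q[i][f] -= lr * (u * ps + 2 * reg * qi)         # Eq. (7)
                Q[j][f] -= lr * (-u * ps + 2 * reg * qj)        # Eq. (8)
    return P, Q
```

After training, a student's score for a course is the inner product of the corresponding rows of P and Q, and courses are recommended in descending score order.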

3.2 Two-Stage Training for BPR-MF

Although the BPR-MF model provides a nice solution to the OCCF problem, we still need to handle the issue of the imbalanced distribution of taken courses described previously. To elaborate the issue and our solution, we first divide the course registration records into four parts, as shown in Figure 2. The students are divided into two groups: graduated students, whose course selection information throughout their academic career is available; and current students, who require the course recommendation service for the upcoming semester. For the current students, we can only obtain their registration data from previous years. We also divide the courses roughly into two groups: the fundamental courses that are more likely to be taken by lowerclassmen, and the advanced ones that are more likely to be taken by upperclassmen. The matrix can therefore be divided into four parts:

1. The upper-left part represents the fundamental courses that have been taken by students who have already graduated (FG).

2. The upper-right part represents the advanced courses that have been taken by students who have already graduated (AG).

3. The lower-left part represents the fundamental courses that have been taken by current students (FC).

4. The lower-right part represents the advanced courses that are more likely to be taken by the current students in the upcoming year (AR).

Note that AR is expected to be much sparser than AG, since not as many current students have taken advanced courses. As a result, the BPR-MF model tends to downgrade the probabilities inside AR, which leads to an inferior model that tends NOT to recommend advanced courses to current students, the opposite of our goal.


Algorithm 1. Two-stage BPR-MF.

Input: FG, AG, FC, learning rate α
Output: P·Q

// First Stage: train Q using FG, AG, FC
1:  repeat
2:    for (u, i) ∈ {FG, AG, FC} do
3:      Draw a negative sample j from $I_u^-$ in {FG, AG, FC}
        $P_{uk} := P_{uk} - \alpha \cdot \partial Error / \partial P_{uk}$
        $Q_{ik} := Q_{ik} - \alpha \cdot \partial Error / \partial Q_{ik}$
        $Q_{jk} := Q_{jk} - \alpha \cdot \partial Error / \partial Q_{jk}$
4:    end for
5:  until convergence

// Second Stage: fix Q, and then update P using FC
6:  repeat
7:    for (u, i) ∈ FC do
8:      Draw a negative sample j from $I_u^-$ in FC
        $P_{uk} := P_{uk} - \alpha \cdot \partial Error / \partial P_{uk}$
9:    end for
10: until convergence

Our idea for addressing this deficiency is to learn the course latent features and the student latent features separately, in two stages. The latent course features can in fact be used to represent the similarity between courses, namely whether two courses are taken by a similar set of students. In order to obtain latent course features that preserve the connection between the fundamental and advanced courses, we train a BPR-MF model using data from FG, AG, and FC, excluding AR. Also, during the negative sampling stage, entries in AR cannot be sampled as negatives, to avoid bias. The learned latent course features are kept as the input to the second stage.

In the second stage, we fix Q and run the BPR-MF model again using only data from FC. The goal is to learn refined latent features for the current students only. Since Q is fixed, the learning algorithm has to respect the dependency learned between fundamental and advanced courses while training the student latent features. Note that there is no need to use FG to learn the latent features of graduated students, since they no longer need recommendations. The final prediction is obtained from the inner product of the fixed Q and the newly obtained P' from the second stage. The details are shown in Algorithm 1.

It should be noted that the latent student features in the decomposed student matrix P cannot be utilized directly. As mentioned above, the latent student features for current students tend to have lower preference towards advanced courses.

Table 2. Example course records for three students, four courses, and three grade levels.

Student  Freshman  Sophomore  Junior
s1       A, B      C          D
s2       A         C, D       B
s3       C         A, D       B

Figure 3. The item transition network for the example shown in Table 2.

3.3 Item Transition Network for Regularizing Two-Stage BPR-MF

By applying the two-stage training for BPR-MF, we can now address the issue of data imbalance. However, we argue that the explicit dependency between courses should be modeled as well.

That is, certain courses are more likely to be taken right after another. Here we first propose the item transition network to model the dependency between courses. Next, such network is exploited as a regularization term in our two-stage BPR-MF model. The goal is to strengthen the connection between dependent courses for recommendation. A simple pruning heuristic is applied to speed up the process.

3.3.1 Constructing the Item Transition Network

We propose a directed, weighted, homogeneous graph called the item transition network, to model the dependency between the courses. Intuitively, an item transition network is defined as follows:

Node: a course.

Link: the dependency between two courses; the source node represents the course taken in a year, and the destination node represents the course taken in the next year.

Weight of link: ranges from [0,1], representing the probability that the target course is taken in the next year after the source course is taken.

We illustrate the construction of the item transition network using an example (Table 2). In this example, there are three students (s1, s2, and s3), four courses (A, B, C, and D), and three grade levels. The three students are all now in their junior year. Student s1 took courses A and B in his freshman year, C in his sophomore year, and D in his junior year. The course records of students s2 and s3 are represented similarly in Table 2.


The item transition network constructed from this example of course records is shown in Figure 3. We demonstrate the generation of the weights using out-links of node A (i.e., the probabilities that students take course A in a year, and then take course B, C or D in the next year). There is exactly one student (s3) who takes course B the year after he/she takes course A. There are two students (s1, s2) who take course C the year after they take course A. There is only one student (s2) who takes course D the year after he/she takes course A. Therefore, the weight of the link from A to B is 1 / (1+2+1) = 0.25, the weight of the link from A to C is 2 / (1+2+1) = 0.5, and similarly the weight of the link from A to D is 1 / (1+2+1) = 0.25. In this way, we can compute the weights for all links in the network, as shown in Figure 3. Note that only the links with non-zero weights are included in the item transition network.
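The construction above can be sketched in a few lines. The input layout (a dict mapping each student to a chronological list of per-year course sets) and the function name are our own assumptions, not the paper's code; the test data reproduce the A→B/C/D weights (0.25, 0.5, 0.25) worked out above:

```python
from collections import defaultdict

def build_transition_network(records):
    """records: {student: [set of courses taken in year 1, year 2, ...]}.
    Returns {source: {target: weight}}, where the weights of the out-links
    of each source course sum to 1 over the observed next-year transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for years in records.values():
        for y in range(len(years) - 1):
            for src in years[y]:          # course taken in one year
                for dst in years[y + 1]:  # course taken in the next year
                    counts[src][dst] += 1
    network = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        network[src] = {dst: c / total for dst, c in dsts.items()}
    return network
```

Only links that actually occur get an entry, matching the rule that zero-weight links are excluded from the network.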

3.3.2 Regularizing the Two-Stage BPR-MF using Item Transition Network

To incorporate the course dependency information modeled in the item transition network, we add a soft constraint on the connected courses in our two-stage BPR-MF model. Inspired by Ma et al. [7], we impose this soft constraint by adding a regularization term to the original BPR-MF error function (Equation (2)) as below:

$Error' := Error + \frac{\beta}{2} \sum_{f \in I} \sum_{g \in N(f)} w(f, g) \cdot \|Q_f - Q_g\|^2$, (9)

where $\beta$ is a positive constant parameter, I is the set of all courses, N(f) is the set of courses in the item network pointed to by course f, and w(f, g) is the weight of the link from f to g.

The partial derivatives are as follows:

$\frac{\partial Error'}{\partial P_{sk}} = \frac{\partial Error}{\partial P_{sk}}$ (10)

$\frac{\partial Error'}{\partial Q_{ik}} = \frac{\partial Error}{\partial Q_{ik}} + t_{ik}$ (11)

$\frac{\partial Error'}{\partial Q_{jk}} = \frac{\partial Error}{\partial Q_{jk}} + t_{jk}$, (12)

where

$t_{fk} = \beta \sum_{g \in N(f)} w(f, g) \cdot (Q_{fk} - Q_{gk})$. (13)

We then apply the above partial derivatives in Algorithm 1, yielding the two-stage BPR-MF model with course dependency regularization.
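A small helper for the regularization gradient $t_f$ of Equation (13) might look as follows; the data layout (Q as a dict of course → feature list, the network as nested dicts of weights) and the function name are assumptions for illustration:

```python
def dependency_gradient(Q, network, f, beta=0.1):
    """t_f (Eq. 13): per-dimension gradient contribution of the course
    dependency regularizer for course f; network[f] maps each neighbor
    g in N(f) to the link weight w(f, g)."""
    k = len(Q[f])
    t = [0.0] * k
    for g, w in network.get(f, {}).items():
        for d in range(k):
            t[d] += beta * w * (Q[f][d] - Q[g][d])
    return t
```

This vector is simply added to the BPR-MF gradients of the connected course factors during SGD, pulling dependent courses toward each other in the latent space.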

Furthermore, we observed in the data that the item transition network is dense (the average node degree is 99.8), which can seriously hurt training efficiency. In practice, we simply apply a threshold to remove edges whose weight is smaller than 3%, which yields a much sparser network for training.

3.4 Personalized PageRank and Ensemble

Besides using the information in the item transition network as a regularization term, another way to utilize the course dependencies is to run a centrality algorithm (such as a random walk) on the item transition network to identify important courses for recommendation based on dependencies. That is, we can rank and recommend the nodes (courses) based on their importance in the item transition network.

Therefore, we design a Personalized PageRank algorithm to recommend courses to each student. The main differences from the original PageRank algorithm with damping factor γ are as follows:

Table 3. Statistics of the NTU dataset.

Class                   Number of Students  Number of Course Registrations
Class 2008 (2008–2011)  4,736               311,283
Class 2009 (2009–2012)  4,686               299,772
Class 2010 (2010–2013)  4,555               285,561
Total                   13,977              896,616

• Because the courses we would like to recommend are for the next time period, we limit the start/restart nodes to a set C, which includes only the courses the student has taken at the current time. For example, in our experiment we recommend courses to senior students, so in the Personalized PageRank algorithm we limit the start/restart nodes to the set of courses taken at the junior level.

• With probability γ, the algorithm walks to a neighbor with probability proportional to the weight of the link to that neighbor.

• With probability 1 − γ, the algorithm restarts from a randomly selected node in the subset C of the nodes of the item transition network.
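The restart-limited walk described above can be sketched as a power iteration; the function name, dangling-node handling (returning the mass of nodes without out-links to the restart set), and default parameters are our own assumptions:

```python
def personalized_pagerank(network, restart_set, gamma=0.85, iters=50):
    """Personalized PageRank on the item transition network: with prob.
    gamma follow a weighted out-link; with prob. 1 - gamma restart
    uniformly inside restart_set (the courses taken at the current level)."""
    nodes = set(network) | {g for dsts in network.values() for g in dsts} | set(restart_set)
    restart = 1.0 / len(restart_set)
    rank = {n: (restart if n in restart_set else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n, r in rank.items():
            out = network.get(n)
            if out:
                for g, w in out.items():
                    nxt[g] += gamma * r * w
            else:
                # dangling node: return its mass to the restart set
                for c in restart_set:
                    nxt[c] += gamma * r * restart
        for c in restart_set:
            nxt[c] += (1.0 - gamma) * restart
        rank = nxt
    return rank
```

Courses are then ranked by their stationary scores; because the restart set contains only the student's current courses, the ranking is personalized to that student.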

Finally, as has been shown in several previous studies [26] [27], an ensemble of CF-based models and graph-based models can significantly improve the performance of a recommendation system. We therefore combine the proposed two-stage BPR-MF model with dependency regularization and the Personalized PageRank as our final model for recommendation. Since both models produce ranking results, we ensemble them using a supervised ranking-based model, the linear RankSVM [6].

4. EXPERIMENTS

To evaluate the proposed ideas, we compare our model with several algorithms using real-world course registration records. The dataset, experiment settings, compared methods, results, discussion, and statistical tests are described in the following subsections.

4.1 Dataset and Experiment Settings

We collected 6 years of course registration records for all NTU students, from 2008 to 2013. That is, for students who started as freshmen at NTU in 2008, 2009, and 2010, we have their full 4-year registration records. For evaluation purposes, we ignore students whose 4-year registration records are incomplete. The statistics of our dataset are shown in Table 3; the dataset contains 13,977 students and 896,616 course registration records.

In our experiment, we aim to recommend advanced courses to students who are entering their senior year. We believe this is a more useful course recommendation scenario, since freshman and sophomore students generally need to take required entry-level courses and therefore have less freedom in course selection, whereas senior students have much higher flexibility to choose courses of their interests.


That is, we have the ground truth of senior course registration data for class 2008, 2009, and 2010 students. We use the senior registration records of class 2010 students as the testing data, and all remaining data as the training data. From the training set, we further hold out the senior-year registration records of class 2009 as the validation set to tune the parameters of all competing models.

We choose the Area Under the receiver-operating-characteristic Curve (AUC) [2] as the evaluation metric. For each student, we rank the predicted scores for all courses and compare them with the gold-standard registration records from year 2013 in the test set to calculate that student's AUC. Then, the average AUC over all students is reported.
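Per-student AUC can be computed directly from the rank statistic: the probability that a registered course is scored above an unregistered one, with ties counted as half. A minimal sketch, including the averaging over students used for reporting (the data layout is illustrative):

```python
def auc(pos_scores, neg_scores):
    """AUC for one student: fraction of (registered, unregistered) course
    pairs ranked correctly, ties counted as 0.5. Equivalent to the area
    under the ROC curve."""
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))

def mean_auc(per_student):
    """Average of per-student AUCs, as reported in the experiments.
    `per_student` is a list of (pos_scores, neg_scores) pairs."""
    return sum(auc(p, n) for p, n in per_student) / len(per_student)
```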

4.2 Compared Methods

We compare four different sets of models in the experiment: memory-based models, graph-based models, our BPR-MF based solutions, and the ensemble of different types of solutions.

4.2.1 Popularity Baseline

One potentially powerful baseline is to always recommend the most popular courses. Thus, we implement a simple non-personalized baseline that recommends to every user the most popular courses based on historical data.
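A minimal sketch of this baseline, assuming the historical data is given as (student, course) registration pairs (the data format is illustrative):

```python
from collections import Counter

def popularity_ranking(registrations):
    """Non-personalized baseline: score every course by its number of
    historical registrations and recommend the same top courses to everyone.
    `registrations` is an iterable of (student, course) pairs."""
    counts = Counter(course for _, course in registrations)
    return [course for course, _ in counts.most_common()]
```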

4.2.2 Memory-Based Collaborative Filtering

The idea of memory-based CF methods is to recommend courses based on the similarity of students: courses taken by similar peers have a higher chance of being recommended. To implement this strategy, we first calculate the similarity of student pairs based on their course registration data. Then, we calculate the score of a student s for a course c according to this similarity, as shown in the following equation:

$\mathrm{score}(s, c) := \sum_{s' \in S \cap c} \mathrm{sim}(s', s),$ (14)

where S is the set of all senior or graduated students, s' ∈ S∩c represents the students that have taken course c, and sim(s', s) is the similarity of two students s' and s. Note that in the similarity-based model we focus on finding the similarity between a student and his/her senior peers, rather than students of the same grade, since experiments show that bringing students of the same level into consideration can hurt the performance. This is reasonable, as such records might carry a negative bias toward advanced courses.

We apply the following two similarity functions in the memory-based CF models:

Number of Intersectional Courses. The similarity is defined using the number of courses that two students have taken in common previously:

$\mathrm{sim}_{\mathrm{intersection}}(s', s) := \sum_t |F_{s'}(t) \cap F_s(t)|,$ (15)

where F_{s'}(t) and F_s(t) are the sets of all courses that students s' and s have taken in the t-th grade, respectively; t is 1 for freshman, 2 for sophomore, and so on.

Jaccard Similarity. To normalize the similarity by the total number of courses a student has taken, we apply the following Jaccard similarity function:

$\mathrm{sim}_{\mathrm{Jaccard}}(s', s) := \sum_t \frac{|F_{s'}(t) \cap F_s(t)|}{|F_{s'}(t) \cup F_s(t)|}.$ (16)
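Equations (14)-(16) can be sketched as follows, assuming each student's history is a mapping from grade t to the set of courses taken in that grade (the data structures are illustrative):

```python
def sim_intersection(courses_a, courses_b):
    """Eq. (15): per-grade intersection counts, summed over grades.
    `courses_a` / `courses_b` map grade t -> set of courses taken then."""
    return sum(len(courses_a[t] & courses_b[t])
               for t in courses_a if t in courses_b)

def sim_jaccard(courses_a, courses_b):
    """Eq. (16): per-grade Jaccard similarity, summed over grades."""
    total = 0.0
    for t in courses_a:
        if t in courses_b:
            union = courses_a[t] | courses_b[t]
            if union:
                total += len(courses_a[t] & courses_b[t]) / len(union)
    return total

def score(student, course, seniors, sim):
    """Eq. (14): sum the similarity to every senior who has taken `course`.
    `seniors` maps s' -> (per-grade course sets, set of all courses taken)."""
    return sum(sim(hist, student)
               for hist, taken in seniors.values() if course in taken)
```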

Table 4. Experiment results.

Category      Method                                                AUC
Baseline      Course Popularity                                     0.8172
Memory-based  Number of Intersectional Courses                      0.8678
Memory-based  Jaccard Similarity                                    0.8922
Graph-based   Personalized PageRank                                 0.9334
Model-based   BPR-MF                                                0.9366
Model-based   BPR-MF + Two-Stage Training                           0.9404
Model-based   Our Model (BPR-MF + Two-Stage Training
              + Course Dependency Regularization)                   0.9427
Ensemble      Our Model + Personalized PageRank                     0.9709

4.3 Results and Discussion

The experiment results are shown in Table 4. The baseline popularity-based model performs the worst, which is reasonable as it is not personalized. Memory-based models perform better than the baseline, but not as well as the other models, which is consistent with outcomes on conventional recommendation tasks. The graph-based model using the item transition network outperforms the memory-based models, suggesting that modeling item transitions is more useful than modeling user similarity in course recommendation. The BPR-MF model performs slightly better than the graph-based model, probably because BPR-MF considers both item similarity and user similarity. Adding two-stage training to BPR produces a significant improvement (see the next section for the hypothesis tests), and adding course dependency regularization further boosts the performance.

Interestingly, combining the graph-based solution with our model yields a large jump in performance, from 0.9427 to 0.9709. The results also corroborate previous findings from the Netflix competition [28] [29] [30] [31] that combining diverse recommender models can significantly boost the outcome.

4.4 Hypothesis Tests

Although Table 4 shows improvements from using two-stage training and course dependency regularization, we would like to perform a deeper analysis of the effectiveness of these two methods. Thus, we conduct the following two hypothesis tests:

Test 1: comparing BPR-MF + Two-Stage Training (target model 1) with BPR-MF only (original model 1).

Test 2: comparing BPR-MF + Two-Stage Training + Course Dependency Regularization (target model 2) with BPR-MF + Two-Stage Training (original model 2).

For both tests, we first calculate the difference of the AUC between the two models (i.e., AUC of the target model minus AUC of the original model) for each of the 4,555 students in the test set (Table 3), and then perform a hypothesis test.

We set the significance level to 0.05 and use the P-value as the measure to accept or reject the hypotheses. The P-value for Test 1 is 3.24×10⁻¹², showing that the two-stage model is significantly better than the original BPR-MF. The P-value for Test 2 is 0.0135, which also demonstrates the usefulness of the course dependency regularization.
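The paper does not name the exact test used; for paired per-student differences, a paired test on the mean difference is a common choice. The sketch below is purely illustrative: it uses a normal approximation to the t statistic, which is reasonable with thousands of students:

```python
import math
from statistics import mean, stdev

def paired_test_pvalue(deltas):
    """One-sided paired test on per-student AUC differences (target model
    minus original model). Hypothetical sketch using a normal approximation
    to the t statistic, justified here by the large sample size."""
    n = len(deltas)
    t = mean(deltas) / (stdev(deltas) / math.sqrt(n))
    # One-sided p-value under the standard normal: P(Z > t).
    return 0.5 * math.erfc(t / math.sqrt(2))
```

A p-value below the 0.05 significance level would lead us to conclude the target model improves on the original one.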


5. CONCLUSION

The capability to recommend items in a real-world setting is highly valuable in practice, as real-world ratings may be only binary, the training data for a new period of time may be missing, and the relationships among the items to be recommended may be complicated. In this paper, we demonstrate how such a challenging recommendation task can be solved using a two-stage collaborative filtering model with dependency regularization. We show how the one-class issue can be mitigated using the BPR-MF method, propose a novel two-stage training method to learn the parameters from incomplete training data, and devise a transition network to integrate item dependency as a regularization term in our model. Most importantly, with the growing awareness of privacy, our method provides a way to build applications that recommend items without content information. It should be noted that our current model mainly focuses on recommending courses for upperclassmen (e.g., students in their junior or senior year) and might not be very effective for lowerclassmen due to the lack of registered courses for training. Future work will focus on extending the model to deal with cold-start students.

6. REFERENCES

[1] N. Bendakir and E. Aimeur. Using association rules for course recommendation. In Proceedings of the AAAI Workshop on Educational Data Mining, volume 3, 2006.

[2] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145-1159, 1997.

[3] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1), 2004.

[4] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 1999 Conference on Research and Development in Information Retrieval, August 1999.

[5] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining (ICDM 2008), pages 263-272, 2008.

[6] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 217-226. ACM, 2006.

[7] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 287-296. ACM, 2011.

[8] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. M. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In IEEE International Conference on Data Mining (ICDM 2008), pages 502-511, 2008.

[9] A. Parameswaran, P. Venetis, and H. Garcia-Molina. Recommendation systems with complex constraints: A course recommendation perspective. ACM Transactions on Information Systems (TOIS), 29(4):20, 2011.

[10] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In The adaptive web, pages 325-341. Springer, 2007.

[11] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452-461. AUAI Press, 2009.

[12] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of ACM WWW'01, pages 285-295, New York, NY, USA, 2001. ACM.

[14] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, Mar. 2002.

[15] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1329-1336. MIT Press, Cambridge, MA, 2005.

[16] L. Ungar and D. Foster. Clustering methods for collaborative filtering. In Proceedings of the Workshop on Recommendation Systems. AAAI Press, Menlo Park, California, 1998.

[17] M. Weimer, A. Karatzoglou, and A. Smola. Improving maximum margin matrix factorization. Machine Learning, 72(3):263-276, 2008.

[18] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.

[19] C. Baccigalupo and E. Plaza. Case-based sequential ordering of songs for playlist recommendation. In ECCBR'06 Proceedings of the 8th European Conference on Advances in Case-Based Reasoning, pages 286-300, 2006.

[20] C. Cheng, H. Yang, M. R. Lyu, and I. King. Where you like to go next: Successive point-of-interest recommendation. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 2605-2611. AAAI Press, 2013.

[21] Y. Ge, Q. Liu, H. Xiong, A. Tuzhilin, and J. Chen. Cost-aware travel tour recommendation. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.

[22] H.-P. Hsieh, C.-T. Li, and S.-D. Lin. Measuring and recommending time-sensitive routes from location-based data. ACM Transactions on Intelligent Systems and Technology (TIST), 5(3):45, 2014.

[23] D.-R. Liu, C.-H. Lai, and W.-J. Lee. A hybrid of sequential rules and collaborative filtering for product recommendation. Information Sciences, 179(20):3505-3519, 2009.

[24] G.-E. Yap, X.-L. Li, and P. S. Yu. Effective Next-Items Recommendation via Personalized Sequential Pattern Mining. Springer Berlin Heidelberg, 2012.

[25] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web, WWW '05, pages 22-32, New York, NY, USA, 2005. ACM.

[26] H.-F. Yu, H.-Y. Lo, H.-P. Hsieh, J.-K. Lou, T. G. McKenzie, J.-W. Chou, P.-H. Chung, et al. Feature engineering and classifier ensemble for KDD Cup 2010. In Proceedings of the KDD Cup 2010 Workshop, pages 1-16, 2010.

[27] Y.-C. Chen, Y.-S. Lin, Y.-C. Shen, and S.-D. Lin. A modified random walk framework for handling negative ratings and generating explanations. ACM Transactions on Intelligent Systems and Technology (TIST), 4(1):12, 2013.

[28] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of KDD Cup and Workshop, volume 2007, 2007.

[29] R. M. Bell and Y. Koren. Lessons from the Netflix Prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75-79, 2007.

[30] Y. Koren. The BellKor solution to the Netflix Grand Prize. Netflix Prize Documentation, 81, 2009.

[31] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix Prize. In Algorithmic Aspects in Information and Management, pages 337-348. Springer Berlin Heidelberg, 2008.

[32] P.-L. Chen, et al. A linear ensemble of individual and blended models for music rating prediction. JMLR W&CP, 2012.

[33] H.-Y. Lo, et al. Learning to improve Area-Under-FROC for imbalanced medical data classification using an ensemble method. ACM SIGKDD Explorations Newsletter, 10(2):43-46, 2008.

[34] H.-Y. Lo, et al. An ensemble of three classifiers for KDD Cup 2009: Expanded linear model, heterogeneous boosting, and selective Naïve Bayes. JMLR W&CP, 7, 2009.

[35] T. G. McKenzie, et al. Novel models and ensemble techniques to discriminate favorite items from unrated ones for personalized music recommendation. JMLR W&CP, 2012.
