If you have used any commercial online system in the last 10 years, you have probably seen these recommendations. Some are like Amazon's "customers who bought X also bought Y." These will be dealt with in the next chapter under the topic of basket analysis. Others are based on predicting the rating of a product, such as a movie.
This last problem was made famous by the Netflix Challenge, a million-dollar public machine learning challenge run by Netflix. Netflix (well known in the U.S. and U.K., but not available elsewhere) is a movie rental company. Traditionally, you would receive DVDs in the mail; more recently, the business has focused on online streaming of videos. From the start, one of the distinguishing features of the service was that it gave every user the option of rating films they had seen, using these ratings to then recommend other films. In this mode, you not only have the information about which films the user saw, but also their impression of them (including negative impressions).
In 2006, Netflix made available a large number of customer ratings of films in its database, and the goal was to improve on their in-house algorithm for ratings prediction. Whoever was able to beat it by 10 percent or more would win 1 million dollars. In 2009, an international team named BellKor's Pragmatic Chaos was able to beat that mark and take the prize. They did so just 20 minutes before another team, The Ensemble, passed the 10 percent mark as well! It was an exciting photo finish for a competition that lasted several years.
Unfortunately, for legal reasons, this dataset is no longer available (although the data was anonymous, there were concerns that it might be possible to discover who the clients were and reveal the private details of movie rentals). However, we can use an academic dataset with similar characteristics. This data comes from GroupLens, a research laboratory at the University of Minnesota.
Machine learning in the real world
Much has been written about the Netflix Prize and you may learn a lot reading up on it (this book will have given you enough to start to understand the issues). The techniques that won were a mix of advanced machine learning with a lot of work in the preprocessing of the data. For example, some users like to rate everything very highly, others are always more negative; if you do not account for this in preprocessing, your model will suffer. Other, less obvious normalizations were also necessary for a good result: how old the film is, how many ratings it received, and so on. Good algorithms are a good thing, but you always need to "get your hands dirty" and tune your methods to the properties of the data you have in front of you.
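As a small illustration of the kind of preprocessing mentioned above, here is a minimal sketch of per-user mean-centering. The toy matrix and the variable names are assumptions made for this example; this is not the preprocessing used by the winning teams.

```python
import numpy as np

# Toy ratings (hypothetical): rows are users, columns are movies,
# and 0 means "not rated".
ratings = np.array([
    [5, 4, 5, 0],   # a generous rater
    [2, 1, 0, 2],   # a harsh rater
], dtype=float)

rated = ratings > 0
# Mean over the ratings each user actually gave (zeros are ignored).
user_mean = ratings.sum(axis=1) / rated.sum(axis=1)
# Subtract each user's mean only where a rating exists.
centered = np.where(rated, ratings - user_mean[:, None], 0.0)
```

After centering, the generous user's 5 and the harsh user's 2 for the first movie both become the same value: a third of a star above that user's own average.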
We can formulate this as a regression problem and apply the methods that we learned in this chapter. It is not a good fit for a classification approach. We could certainly attempt to learn a five-class classifier, with one class for each possible grade.
There are two problems with this approach:
• Errors are not all the same. For example, mistaking a 5-star movie for a 4-star one is not as serious a mistake as mistaking a 5-star movie for a 1-star one.
• Intermediate values make sense. Even if our inputs are only integer values, it is perfectly meaningful to say that the prediction is 4.7. We can see that this is a different prediction than 4.2.
These two factors together mean that classification is not a good fit for the problem. The regression framework is more meaningful.
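The two points above can be made concrete with a short sketch that compares a classification-style loss with a regression-style loss. The function names here are illustrative only and do not come from any library.

```python
true_rating = 5
near_miss = 4   # off by one star
far_miss = 1    # off by four stars

def zero_one_loss(truth, pred):
    """Classification-style loss: 1 for any wrong label, 0 otherwise."""
    return int(truth != pred)

def squared_error(truth, pred):
    """Regression-style loss: grows with the size of the mistake."""
    return (truth - pred) ** 2

# Both mistakes look identical to a classifier...
zero_one = [zero_one_loss(true_rating, p) for p in (near_miss, far_miss)]
# ...but squared error separates them clearly.
squared = [squared_error(true_rating, p) for p in (near_miss, far_miss)]

# Fractional predictions are meaningful too: 4.7 is closer to 5 than 4.2 is.
frac_a = squared_error(true_rating, 4.7)
frac_b = squared_error(true_rating, 4.2)
```

The zero-one loss assigns the same cost to both mistakes, while the squared error charges 1 for the near miss and 16 for the far miss.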
We have two choices: we can build movie-specific or user-specific models. In our case, we are going to first build user-specific models. This means that, for each user, we take the ratings that user has given as our target variable. The inputs are the ratings given by the other users. The model will give a high weight to users who are similar to our user (or a negative weight to users who like more or less the same movies that our user dislikes).
The system is just an application of what we have developed so far. You will find a copy of the dataset and code to load it into Python on the book's companion website. There you will also find pointers to more information, including the original MovieLens website.
The loading of the dataset is just basic Python, so let us jump ahead to the learning.
We have a sparse matrix, where there are entries from 1 to 5 whenever we have a rating (most of the entries are zero to denote that this user has not rated these movies). This time, as a regression method, for variety, we are going to be using the LassoCV class:
from sklearn.linear_model import LassoCV
reg = LassoCV(fit_intercept=True, alphas=[.125,.25,.5,1.,2.,4.])
By passing the constructor an explicit set of alphas, we can constrain the values that the inner cross-validation will use. You may note that the values are powers of two, from 1/8 up to 4. We will now write a function which learns a model for the user i:
# isolate this user
u = reviews[i]
We are only interested in the movies that the user u rated, so we must build up the index of those. There are a few NumPy tricks in here: u.toarray() converts from a sparse matrix to a regular array. Then, we ravel() that array to convert from a row array (that is, a two-dimensional array with a first dimension of 1) to a simple one-dimensional array. We compare it with zero and ask where this comparison is true.
The result, ps, is an array of indices; those indices correspond to movies that the user has rated:
u = u.toarray().ravel()
ps, = np.where(u > 0)
# Build an array with indices [0...N] except i
us = np.delete(np.arange(reviews.shape[0]), i)
x = reviews[us][:,ps].T
Finally, we select only the movies that the user has rated:
y = u[ps]
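To see what these indexing steps produce, here is a minimal sketch on a tiny, made-up ratings matrix (4 users by 5 movies). The data is hypothetical; only the indexing mirrors the steps above.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 4 users (rows) x 5 movies (columns), 0 = not rated.
reviews = sparse.csr_matrix(np.array([
    [5, 0, 3, 0, 1],
    [4, 0, 0, 2, 1],
    [0, 5, 4, 0, 0],
    [5, 1, 0, 0, 2],
], dtype=float))

i = 0                                 # the user we want to model
u = reviews[i].toarray().ravel()      # dense 1-D vector of this user's ratings
ps, = np.where(u > 0)                 # indices of the movies user i rated
us = np.delete(np.arange(reviews.shape[0]), i)  # every user except i
x = reviews[us][:, ps].T              # rows: rated movies, columns: other users
y = u[ps]                             # targets: user i's own ratings
```

Here ps is [0, 2, 4], so x has one row per movie that user 0 rated and one column per remaining user, while y holds the three ratings user 0 gave.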
Cross-validation is set up as before. Because we have many users, we are going to only use four folds (more would take a long time and we have enough training data with just 75 percent of the data):
err = 0
kf = KFold(len(y), n_folds=4)
for train,test in kf:
    # Now we perform a per-movie normalization
    # this is explained below
    xc,x1 = movie_norm(x[train])
    reg.fit(xc, y[train]-x1)
    # We need to perform the same normalization while testing
    xc,x1 = movie_norm(x[test])
    p = reg.predict(xc)
    e = (p+x1)-y[test]
    err += np.sum(e*e)
We did not explain the movie_norm function. This function performs per-movie normalization: some movies are just generally better and get higher average marks:
def movie_norm(x):
    xc = x.copy().toarray()
We cannot use xc.mean(1) because we do not want to have the zeros counting for the mean. We only want the mean of the ratings that were actually given:
    x1 = np.array([xi[xi > 0].mean() for xi in xc])
In certain cases, there were no ratings and we got a NaN value, so we replace it with zeros using np.nan_to_num, which does exactly this task:
    x1 = np.nan_to_num(x1)
Now we normalize the input by removing the mean value from the non-zero entries:
    for i in range(xc.shape[0]):
        xc[i] -= (xc[i] > 0) * x1[i]
Implicitly, this also leaves the missing ratings at a value of zero, which now corresponds to the average. Finally, we return the normalized array and the means:
    return xc,x1
You might have noticed that we converted to a regular (dense) array. This has the added advantage that it makes the optimization much faster: while scikit-learn works well with the sparse values, the dense arrays are much faster (if you can fit them in memory; when you cannot, you are forced to use sparse arrays).
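The per-movie normalization can be checked by hand on a small dense array. The following is a vectorized variant written for illustration (the name movie_norm_dense is hypothetical, and it takes a dense array rather than a sparse matrix); it is a sketch that should behave like the loop-based function, not a replacement for it.

```python
import numpy as np

def movie_norm_dense(x):
    """Center each row (movie) on the mean of its non-zero entries."""
    xc = np.array(x, dtype=float)          # work on a dense float copy
    rated = xc > 0
    counts = rated.sum(axis=1)
    # Mean of the ratings actually given; rows with no ratings get 0.
    x1 = np.divide(xc.sum(axis=1), counts,
                   out=np.zeros(xc.shape[0]), where=counts > 0)
    # Subtract the row mean only where a rating exists.
    xc -= rated * x1[:, None]
    return xc, x1

toy = [[4, 0, 2],    # mean of the given ratings is 3
       [0, 0, 0],    # nobody rated this movie
       [5, 5, 0]]    # mean of the given ratings is 5
xc, x1 = movie_norm_dense(toy)
```

The first row becomes [1, 0, -1], the unrated movie stays at zero, and the third row becomes all zeros because both of its ratings equal the row mean.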
When compared with simply guessing the average value for that user, this approach is 80 percent better. The results are not spectacular, but it is a start. On one hand, this is a very hard problem and we cannot expect to be right with every prediction: we perform better when the users have given us more reviews. On the other hand, regression is a blunt tool for this job. Note how we learned a completely separate model for each user. In the next chapter, we will look at other methods that go beyond regression for approaching this problem. In those models, we integrate the information from all users and all movies in a more intelligent manner.
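Putting the pieces of this section together, here is a self-contained sketch of the per-user model. It makes several assumptions that are not part of the text: it uses the KFold class from sklearn.model_selection found in recent scikit-learn releases, the function name learn_for is hypothetical, and the ratings are random synthetic data rather than MovieLens, so the resulting error says nothing about real performance.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def movie_norm(x):
    """Center each row on the mean of its non-zero entries (dense output)."""
    xc = x.toarray().astype(float)
    # Rows without any rating get a mean of 0.
    x1 = np.array([xi[xi > 0].mean() if (xi > 0).any() else 0.0 for xi in xc])
    for i in range(xc.shape[0]):
        xc[i] -= (xc[i] > 0) * x1[i]
    return xc, x1

def learn_for(reviews, i, n_splits=4):
    """Cross-validated mean squared error of a Lasso model for user i."""
    u = reviews[i].toarray().ravel()
    ps, = np.where(u > 0)                      # movies rated by user i
    us = np.delete(np.arange(reviews.shape[0]), i)
    x = reviews[us][:, ps].T.tocsr()           # rows: movies, columns: users
    y = u[ps]
    reg = LassoCV(fit_intercept=True, alphas=[.125, .25, .5, 1., 2., 4.])
    err = 0.0
    for train, test in KFold(n_splits=n_splits).split(y):
        xc, x1 = movie_norm(x[train])
        reg.fit(xc, y[train] - x1)
        xc, x1 = movie_norm(x[test])
        e = (reg.predict(xc) + x1) - y[test]
        err += np.sum(e * e)
    return err / len(y)

# Synthetic stand-in for the ratings matrix (an assumption, not MovieLens):
# 40 users x 60 movies, about half of the entries rated from 1 to 5.
rng = np.random.default_rng(0)
dense = rng.integers(1, 6, size=(40, 60)) * (rng.random((40, 60)) < 0.5)
reviews = sparse.csr_matrix(dense.astype(float))
mse = learn_for(reviews, 0)
```

Looping learn_for over all users and comparing against a per-user average baseline would be one way to produce the kind of comparison described above.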
Summary
In this chapter, we started with the oldest trick in the book, ordinary least squares. It is still sometimes good enough. However, we also saw that more modern approaches that avoid overfitting can give us better results. We used Ridge, Lasso, and Elastic nets; these are the state-of-the-art methods for regression.
We once again saw the danger of relying on training error to estimate generalization: it can be an overly optimistic estimate, to the point where our model has zero training error and yet we know that it is completely useless. When thinking through these issues, we were led into two-level cross-validation, an important point that many in the field still have not completely internalized. Throughout, we were able to rely on scikit-learn to support all the operations we wanted to perform, including an easy way to achieve correct cross-validation.
At the end of this chapter, we started to shift gears and look at recommendation problems. For now, we approached these problems with the tools we knew: penalized regression. In the next chapter, we will look at new, better tools for this problem. These will improve our results on this dataset.
This recommendation setting also has the disadvantage that it requires users to have rated items on a numeric scale. Only a fraction of users actually do so. There is another type of information that is often easier to obtain: which items were purchased together. In the next chapter, we will also see how to leverage this information in a framework called basket analysis.