Handling large data on a single computer
4.5 Case study 2: Building a recommender system inside a database
In reality most of the data you work with is stored in a relational database, but most databases aren’t suitable for data mining. But as shown in this example, it’s possible to adapt our techniques so you can do a large part of the analysis inside the database itself, thereby profiting from the database’s query optimizer, which will optimize the code for you. In this example we’ll go into how to use the hash table data structure and how to use Python to control other tools.
4.5.1 Tools and techniques needed
Before going into the case study we need to have a quick look at the required tools and theoretical background to what we’re about to do here.
TOOLS
■ MySQL database —Needs a MySQL database to work with. If you haven’t installed a MySQL community server, you can download one from www.mysql.com.
Appendix C: “Installing a MySQL server” explains how to set it up.
■ MySQL database connection Python library—To connect to this server from Python you’ll also need to install SQLAlchemy or another library capable of communicat-ing with MySQL. We’re using MySQLdb. On Windows you can’t use Conda right off the bat to install it. First install Binstar (another package management service) and look for the appropriate mysql-python package for your Python setup.
conda install binstar
binstar search -t conda mysql-python
The following command entered into the Windows command line worked for us (after activating the Python environment):
conda install --channel https://conda.binstar.org/krisvanneste mysql-python
Again, feel free to go for the SQLAlchemy library if that’s something you’re more com-fortable with.
■ We will also need the pandas python library, but that should already be installed by now.
With the infrastructure in place, let’s dive into a few of the techniques.
TECHNIQUES
A simple recommender system will look for customers who’ve rented similar movies as you have and then suggest those that the others have watched but you haven’t seen yet. This technique is called k-nearest neighbors in machine learning.
A customer who behaves similarly to you isn’t necessarily the most similar customer.
You’ll use a technique to ensure that you can find similar customers (local optima) without guarantees that you’ve found the best customer (global optimum). A com-mon technique used to solve this is called Locality-Sensitive Hashing. A good overview of papers on this topic can be found at http://www.mit.edu/~andoni/LSH/.
The idea behind Locality-Sensitive Hashing is simple: Construct functions that map similar customers close together (they’re put in a bucket with the same label) and make sure that objects that are different are put in different buckets.
Central to this idea is a function that performs the mapping. This function is called a hash function: a function that maps any range of input to a fixed output. The simplest hash function concatenates the values from several random columns. It doesn’t matter how many columns (scalable input); it brings it back to a single col-umn (fixed output).
You’ll set up three hash functions to find similar customers. The three functions take the values of three movies:
■ The first function takes the values of movies 10, 15, and 28.
■ The second function takes the values of movies 7, 18, and 22.
■ The last function takes the values of movies 16, 19, and 30.
This will ensure that the customers who are in the same bucket share at least several movies. But the customers inside one bucket might still differ on the movies that weren’t included in the hashing functions. To solve this you still need to compare the customers within the bucket with each other. For this you need to create a new dis-tance measure.
The distance that you’ll use to compare customers is called the hamming distance.
The hamming distance is used to calculate how much two strings differ. The distance is defined as the number of different characters in a string. Table 4.2 offers a few examples of the hamming distance.
Table 4.2 Examples of calculating the hamming distance
String 1 String 2 Hamming distance
Hat Cat 1
Hat Mad 2
Tiger Tigre 2
Paris Rome 5
Comparing multiple columns is an expensive operation, so you’ll need a trick to speed this up. Because the columns contain a binary (0 or 1) variable to indicate whether a customer has bought a movie or not, you can concatenate the information so that the same information is contained in a new column. Table 4.3 shows the “mov-ies” variable that contains as much information as all the movie columns combined.
This allows you to calculate the hamming distance much more efficiently. By handling this operator as a bit, you can exploit the XOR operator. The outcome of the XOR operator (^) is as follows:
1^1 = 0 1^0 = 1 0^1 = 1 0^0 = 0
With this in place, the process to find similar customers becomes very simple. Let’s first look at it in pseudo code:
Preprocessing:
1 Define p (for instance, 3) functions that select k (for instance, 3) entries from the vector of movies. Here we take 3 functions (p) that each take 3 (k) movies.
2 Apply these functions to every point and store them in a separate column. (In literature each function is called a hash function and each column will store a bucket.)
Querying point q:
1 Apply the same p functions to the point (observation) q you want to query.
2 Retrieve for every function the points that correspond to the result in the corre-sponding bucket.
Stop when you’ve retrieved all the points in the buckets or reached 2p points (for example 10 if you have 5 functions).
3 Calculate the distance for each point and return the points with the mini-mum distance.
Table 4.3 Combining the information from different columns into the movies column. This is also how DNA works: all information in a long string.
Column 1 Movie 1 Movie 2 Movie 3 Movie 4 movies
Customer 1 1 0 1 1 1011
Customer 2 0 0 0 1 0001
Let’s look at an actual implementation in Python to make this all clearer.
4.5.2 Step 1: Research question
Let’s say you’re working in a video store and the manager asks you if it’s possible to use the information on what movies people rent to predict what other movies they might like. Your boss has stored the data in a MySQL database, and it’s up to you to do the analysis. What he is referring to is a recommender system, an automated system that learns people’s preferences and recommends movies and other products the cus-tomers haven’t tried yet. The goal of our case study is to create a memory-friendly rec-ommender system. We’ll achieve this using a database and a few extra tricks. We’re going to create the data ourselves for this case study so we can skip the data retrieval step and move right into data preparation. And after that we can skip the data explo-ration step and move straight into model building.
4.5.3 Step 3: Data preparation
The data your boss has collected is shown in table 4.4. We’ll create this data ourselves for the sake of demonstration.
For each customer you get an indication of whether they’ve rented the movie before (1) or not (0). Let’s see what else you’ll need so you can give your boss the recom-mender system he desires.
First let’s connect Python to MySQL to create our data. Make a connection to MySQL using your username and password. In the following listing we used a database called “test”. Replace the user, password, and database name with the appropriate val-ues for your setup and retrieve the connection and the cursor. A database cursor is a control structure that remembers where you are currently in the database.
Table 4.4 Excerpt from the client database and the movies customers rented
Customer Movie 1 Movie 2 Movie 3 … Movie 32
Jack Dani 1 0 0 1
Wilhelmson 1 1 0 1
…
Jane Dane 0 0 1 0
Xi Liu 0 0 0 1
Eros Mazo 1 1 0 1
…
import MySQLdb import pandas as pd user = '****' password = '****' database = 'test'
mc = MySQLdb.connect('localhost',user,password,database) cursor = mc.cursor()
nr_customers = 100
colnames = ["movie%d" %i for i in range(1,33)]
pd.np.random.seed(2015)
generated_customers = pd.np.random.randint(0,2,32 * nr_customers).reshape(nr_customers,32)
data = pd.DataFrame(generated_customers, columns = list(colnames)) data.to_sql('cust',mc, flavor = 'mysql', index = True, if_exists =
'replace', index_label = 'cust_id')
We create 100 customers and randomly assign whether they did or didn’t see a certain movie, and we have 32 movies in total. The data is first created in a Pandas data frame but is then turned into SQL code. Note: You might run across a warning when running this code. The warning states: The “mysql” flavor with DBAPI connection is deprecated and will be removed in future versions. MySQL will be further supported with SQLAlchemy engines.
Feel free to already switch to SQLAlchemy or another library. We’ll use SQLAlchemy in other chapters, but used MySQLdb here to broaden the examples.
To efficiently query our database later on we’ll need additional data preparation, including the following things:
■ Creating bit strings. The bit strings are compressed versions of the columns’
content (0 and 1 values). First these binary values are concatenated; then the resulting bit string is reinterpreted as a number. This might sound abstract now but will become clearer in the code.
■ Defining hash functions. The hash functions will in fact create the bit strings.
■ Adding an index to the table, to quicken data retrieval.
CREATINGBITSTRINGS
Now you make an intermediate table suited for querying, apply the hash functions, and represent the sequence of bits as a decimal number. Finally, you can place them in a table.
First, you need to create bit strings. You need to convert the string “11111111” to a binary or a numeric value to make the hamming function work. We opted for a numeric representation, as shown in the next listing.
Listing 4.8 Creating customers in the database
First we establish the connection; you’ll need to fill out your own username, password, and schema-name
Store the data inside a Pandas data frame and write the data frame in a MySQL table called
“cust”. If this table already exists, replace it.
def createNum(x1,x2,x3,x4,x5,x6,x7,x8):
return [int('%d%d%d%d%d%d%d%d' % (i1,i2,i3,i4,i5,i6,i7,i8),2) for (i1,i2,i3,i4,i5,i6,i7,i8) in zip(x1,x2,x3,x4,x5,x6,x7,x8)]
By converting the information of 32 columns into 4 numbers, we compressed it for later lookup. Figure 4.11 shows what we get when asking for the first 2 observations (customer movie view history) in this new format.
store[0:2]
The next step is to create the hash functions, because they’ll enable us to sample the data we’ll use to determine whether two customers have similar behavior.
Listing 4.9 Creating bit strings
We represent the string as a numeric value. The string will be a concatenation of zeros (0) and ones (1) because these indicate whether someone has seen a certain movie or not. The strings are then regarded as bit code. For example:
0011 is the same as the number 3. What def createNum() does: takes in 8 values, concatenates these 8 column values and turns them into a string, then turns the byte code of the string into a number.
Test if the function works correctly. Binary code 1111 is the same as 15 (=1*8+1*4+1*2+1*1). If the assert fails, it will raise an assert error; otherwise nothing will happen.
Translate the
Figure 4.11 First 2 customers' information on all 32 movies after bit string to numeric conversion
CREATINGAHASHFUNCTION
The hash functions we create take the values of movies for a customer. We decided in the theory part of this case study to create 3 hash functions: the first function com-bines the movies 10, 5, and 18; the second comcom-bines movies 7, 18, and 22; and the third one combines 16, 19, and 30. It’s up to you if you want to pick others; this can be picked randomly. The following code listing shows how this is done.
def hash_fn(x1,x2,x3):
return [b'%d%d%d' % (i,j,k) for (i,j,k) in zip(x1,x2,x3)]
assert hash_fn([1,0],[1,1],[0,0]) == [b'110',b'010']
store['bucket1'] = hash_fn(data.movie10, data.movie15,data.movie28) store['bucket2'] = hash_fn(data.movie7, data.movie18,data.movie22) store['bucket3'] = hash_fn(data.movie16, data.movie19,data.movie30) store.to_sql('movie_comparison',mc, flavor = 'mysql', index = True,
index_label = 'cust_id', if_exists = 'replace')
The hash function concatenates the values from the different movies into a binary value like what happened before in the createNum() function, only this time we don’t convert to numbers and we only take 3 movies instead of 8 as input. The assert function shows how it concatenates the 3 values for every observation. When the client has bought movie 10 but not movies 15 and 28, it will return b’100’ for bucket 1. When the client bought movies 7 and 18, but not 22, it will return b’110’ for bucket 2. If we look at the current result we see the 4 variables we created earlier (bit1, bit2, bit3, bit4) from the 9 handpicked movies (figure 4.12).
The last trick we’ll apply is indexing the customer table so lookups happen more quickly.
Listing 4.10 Creating hash functions
Define hash function (it is exactly like the createNum() function without the final conversion to a number and for 3 columns instead of 8).
Test if it works correctly (if no error is raised, it works).
It’s sampling on columns but all observations will be selected.
Create hash values from customer movies, respectively [10,15, 28], [7,18, 22], [16,19, 30].
Store this information in database.
Figure 4.12 Information from the bit string compression and the 9 sampled movies
ADDINGANINDEXTOTHETABLE
Now you must add indices to speed up retrieval as needed in a real-time system. This is shown in the next listing.
def createIndex(column, cursor):
sql = 'CREATE INDEX %s ON movie_comparison (%s);' % (column, column) cursor.execute(sql)
createIndex('bucket1',cursor) createIndex('bucket2',cursor) createIndex('bucket3',cursor)
With the data indexed we can now move on to the “model building part.” In this case study no actual machine learning or statistical model is implemented. Instead we’ll use a far simpler technique: string distance calculation. Two strings can be compared using the hamming distance as explained earlier in the theoretical intro to the case study.
4.5.4 Step 5: Model building
To use the hamming distance in the database we need to define it as a function.
CREATINGTHEHAMMINGDISTANCEFUNCTION
We implement this as a user-defined function. This function can calculate the distance for a 32-bit integer (actually 4*8), as shown in the following listing.
Sql = '''
CREATE FUNCTION HAMMINGDISTANCE(
A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT )
RETURNS INT DETERMINISTIC RETURN
b'11111111',b'00000000',b'11011111',b'11111111' ,b'11111111',b'10001001',b'11011111',b'11111111' )'''
pd.read_sql(Sql,mc)
Listing 4.11 Creating an index
Listing 4.12 Creating the hamming distance
Create function to easily create indices. Indices will quicken retrieval.
Put index on bit buckets.
Define function. It takes 8 input arguments: 4 strings of length 8 for the first customer and another 4 strings of length 8 for the second customer. This way we can compare 2 customers side-by-side for 32 movies.
The function is stored in a database. You can only do this once; running this code a second time will result in an error: OperationalError:
(1304, 'FUNCTION HAMMING-DISTANCE already exists').
To check this function you can run this SQL statement with 8 fixed strings. Notice the “b” before each string, indicating that you’re passing bit values. The outcome of this particular test should be 3, which indicates the series of strings differ in only 3 places.
This runs the query.
If all is well, the output of this code should be 3.
Now that we have our hamming distance function in position, we can use it to find similar customers to a given customer, and this is exactly what we want our application to do. Let’s move on to the last part: utilizing our setup as a sort of application.
4.5.5 Step 6: Presentation and automation
Now that we have it all set up, our application needs to perform two steps when con-fronted with a given customer:
■ Look for similar customers.
■ Suggest movies the customer has yet to see based on what he or she has already viewed and the viewing history of the similar customers.
First things first: select ourselves a lucky customer.
FINDINGASIMILARCUSTOMER
Time to perform real-time queries. In the following listing, customer 27 is the happy one who’ll get his next movies selected for him. But first we need to select customers with a similar viewing history.
customer_id = 27
sql = "select * from movie_comparison where cust_id = %s" % customer_id cust_data = pd.read_sql(sql,mc)
sql = """ select cust_id,hammingdistance(bit1, bit2,bit3,bit4,%s,%s,%s,%s) as distance
from movie_comparison where bucket1 = '%s' or bucket2 ='%s' or bucket3='%s' order by distance limit 3""" %
(cust_data.bit1[0],cust_data.bit2[0], cust_data.bit3[0], cust_data.bit4[0],
cust_data.bucket1[0], cust_data.bucket2[0],cust_data.bucket3[0]) shortlist = pd.read_sql(sql,mc)
Table 4.5 shows customers 2 and 97 to be the most similar to customer 27. Don’t for-get that the data was generated randomly, so anyone replicating this example might receive different results.
Now we can finally select a movie for customer 27 to watch.
Listing 4.13 Finding similar customers
Pick customer from database.
We do two-step sampling. First sampling: index must be exactly the same as the one of the selected customer (is based on 9 movies). Selected people must have seen (or not seen) these 9 movies exactly like our customer did.
Second sampling is a ranking based on the 4-bit strings. These take into account all the movies in the database.
We show the 3 customers that most resemble customer 27.