Trying SGD in action - Large Scale Machine Learning with Python

To conclude this chapter, we will implement two examples: one for classification, based on the Forest Covertype data, and one for regression, based on the bike-sharing dataset. We will see how to put the previous insights into practice.

Starting with the classification problem, there are two noticeable aspects to consider.

First, it is a multiclass problem. Second, we noticed that the instances are stored in some kind of order, so the distribution of classes changes along the stream. As an initial step, we will shuffle the data using the ram_shuffle function defined earlier in the chapter, in the Paying attention to the ordering of instances section:

In: import os

local_path = os.getcwd()
source = 'covtype.data'
ram_shuffle(filename_in=local_path+'\\'+source, \
            filename_out=local_path+'\\shuffled_covtype.data', \
            header=False)
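The call above relies on ram_shuffle, which is defined earlier in the chapter and is not reproduced in this section. As a rough reminder of the idea, the following is a minimal sketch of an in-memory shuffle that compresses each row with zlib. The name ram_shuffle_sketch, the seed parameter, and the toy file are assumptions made for illustration, and the chapter's own function may differ in its details.

```python
import os
import random
import tempfile
import zlib


def ram_shuffle_sketch(filename_in, filename_out, header=True, seed=None):
    """Shuffle the rows of a text file in memory, keeping an optional header first."""
    with open(filename_in, 'rb') as f:
        # Compress every row so that a larger file can fit in RAM
        zlines = [zlib.compress(line, 9) for line in f]
    first_row = zlines.pop(0) if header and zlines else None
    random.Random(seed).shuffle(zlines)
    with open(filename_out, 'wb') as f:
        if first_row is not None:
            f.write(zlib.decompress(first_row))
        for zline in zlines:
            f.write(zlib.decompress(zline))


# Usage on a small throwaway file (hypothetical data, every row ends with a newline)
workdir = tempfile.mkdtemp()
path_in = os.path.join(workdir, 'toy.data')
path_out = os.path.join(workdir, 'shuffled_toy.data')
with open(path_in, 'w') as f:
    f.writelines('%d,%d\n' % (i, i % 7 + 1) for i in range(100))
ram_shuffle_sketch(path_in, path_out, header=False, seed=1)
with open(path_in) as f_in, open(path_out) as f_out:
    original_rows, shuffled_rows = f_in.readlines(), f_out.readlines()
```

The sketch builds paths with os.path.join rather than a hard-coded backslash, so it also runs outside Windows.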

Because we compress the rows in memory and shuffle them without much disk usage, we can quickly obtain a new working file. The following code trains SGDClassifier with log loss (equivalent to a logistic regression) and tells it in advance which classes are present in the dataset. The forest_type list contains all the class codes, and it is passed to the partial_fit method of the SGD learner at every call (though passing it only on the first call would suffice).

For validation purposes, we define a cold start at 200,000 observed cases. After that point, one instance out of every ten is left out of training and used for validation. This schema allows reproducibility even if we pass over the data multiple times: at every pass, the same instances are left out as an out-of-sample test, which allows the creation of a validation curve to test the effect of multiple passes over the same data.

The holdout schema is accompanied by progressive validation, too: each case after the cold start is evaluated before being fed to the training. Although progressive validation provides interesting feedback, it works only for the first pass; after the initial pass, all the observations (except the ones in the holdout schema) become in-sample instances. In our example, we are going to make only one pass.
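The way the two validation schemas share the stream can be sketched as a function that assigns a role to each instance from its position alone. This is an illustrative sketch: the function name instance_role and the exact conditions are assumptions, chosen to agree with the cold start of 200,000 cases and the one-in-ten holdout described above.

```python
def instance_role(n, cold_start=200000, k_holdout=10):
    """Return the role of the n-th streamed instance (n is counted from zero)."""
    seen = n + 1
    if seen < cold_start:
        return 'train'          # cold start: the instance is only learned from
    if (seen - cold_start) % k_holdout == 0:
        return 'holdout'        # evaluated, never used for training
    return 'progressive'        # evaluated first, then used for training


# Count the roles over the first 250,000 instances of the stream
counts = {'train': 0, 'holdout': 0, 'progressive': 0}
for n in range(250000):
    counts[instance_role(n)] += 1
```

Because the role depends only on the position in the stream, a second pass over the same file would hold out exactly the same instances, which is what makes the holdout estimate reproducible across passes.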

As a reminder, the dataset has 581,012 instances, and it may take a while to stream and model with SGD (it is quite a large-scale problem for a single computer).

Although we placed a limiter to observe just 250,000 instances, allow your computer to run for about 15-20 minutes before expecting results:

In: import csv, time
import numpy as np
from sklearn.linear_model import SGDClassifier

source = 'shuffled_covtype.data'
SEP=','
forest_type = [t+1 for t in range(7)]
SGD = SGDClassifier(loss='log', penalty=None, random_state=1, average=True)
# ...
iterator = csv.reader(R, delimiter=SEP)
for n, row in enumerate(iterator):
    # ...
    features = np.array(map(float,row[:-1])).reshape(1,-1)
    # MACHINE LEARNING
    # ...
    SGD.partial_fit(features, response, classes=forest_type)
print '%s FINAL holdout accuracy: %0.3f' % (time.strftime('%X'), accuracy / ((n+1-cold_start) / float(k_holdout)))
print '%s FINAL progressive accuracy: %0.3f' % (time.strftime('%X'), prog_accuracy / float(prog_count))

Out:
18:45:59 holdout accuracy: 0.621
18:45:59 progressive accuracy: 0.617
18:45:59 FINAL holdout accuracy: 0.621
18:45:59 FINAL progressive accuracy: 0.617

As the second example, we will try to predict the number of shared bicycles in Washington based on weather and time information. Given the historical order of the dataset, we do not shuffle it, and we treat the problem as a time series problem.

Our validation strategy is to test the results after having seen a certain number of examples, in order to replicate the uncertainty of forecasting from that moment onward.
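This time-ordered validation can be sketched with a toy learner standing in for the regressor. The helper time_ordered_holdout, the RunningMean class, and the ten-point series are hypothetical and serve only to show the split: everything before predictions_start is used for learning, everything from that point on is only forecast.

```python
def time_ordered_holdout(stream, predictions_start, fit, predict):
    """Learn from the first rows, then only score the rows that follow."""
    errors = []
    for n, (x, y) in enumerate(stream):
        if (n + 1) >= predictions_start:
            errors.append(predict(x) - y)   # holdout: forecast only, no learning
        else:
            fit(x, y)                       # learning phase
    return errors


class RunningMean(object):
    """Toy stand-in for a learner: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def fit(self, x, y):
        self.total += y
        self.count += 1

    def predict(self, x):
        return self.total / self.count if self.count else 0.0


model = RunningMean()
stream = [(t, float(t)) for t in range(10)]   # hypothetical ordered series
errors = time_ordered_holdout(stream, 8, model.fit, model.predict)
```

Since the stream is never shuffled, the holdout errors reflect what a forecaster would face when predicting forward from the cut-off point.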

Some of the features are categorical, so we apply the FeatureHasher class from Scikit-learn. Each category is recorded in a dictionary under a key that joins the variable name and the category code. The value assigned to each of these keys is one, so that it resembles a binary variable in the sparse vector that the hashing trick creates.
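The dictionary encoding and the hashing step can be illustrated with the standard library alone. This is a sketch of the general hashing trick, not of FeatureHasher itself: it uses an MD5 digest in place of the signed hash that Scikit-learn applies, and the column names, the example record, and the vector size of 1000 are assumptions.

```python
import hashlib


def row_to_feature_dict(row, categorical):
    """Map each categorical value to a 'name_code' key with value 1."""
    return {'%s_%s' % (name, row[name]): 1 for name in categorical}


def hash_features(features, n_features=1000):
    """Project a {key: value} dictionary onto a sparse {index: value} vector."""
    vector = {}
    for key, value in features.items():
        digest = hashlib.md5(key.encode('utf-8')).hexdigest()
        index = int(digest, 16) % n_features
        # Colliding keys simply add up in the same position
        vector[index] = vector.get(index, 0.0) + value
    return vector


row = {'season': '1', 'weathersit': '2', 'temp': '0.24'}   # hypothetical record
features = row_to_feature_dict(row, categorical=['season', 'weathersit'])
features.update({'temp': float(row['temp'])})              # numeric feature kept as is
hashed = hash_features(features)
```

No dictionary of known categories has to be kept in memory: a category never seen before still maps to a position in the fixed-size vector.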

In: import csv, time, os
# ...
SGD = SGDRegressor(loss='squared_loss', penalty=None, random_state=1, average=True)
# ...
iterator = csv.DictReader(R, delimiter=SEP)
for n, row in enumerate(iterator):
    # ...
    features.update(numeric_features)
    hashed_features = h.transform([features])
    # MACHINE LEARNING
    if (n+1) >= predictions_start:
        # HOLDOUT AFTER N PHASE
        predicted = SGD.predict(hashed_features)
        val_rmse += (apply_exp(predicted) \
            - apply_exp(target))**2
        val_rmsle += (predicted - target)**2
        if (n-predictions_start+1) % 250 == 0 \
            and (n+1) > predictions_start:
            print '%s holdout RMSE: %0.3f' \
                % (time.strftime('%X'), (val_rmse \
                / float(n-predictions_start+1))**0.5),
            print 'holdout RMSLE: %0.3f' % ((val_rmsle \
                / float(n-predictions_start+1))**0.5)
    else:
        # LEARNING PHASE
        SGD.partial_fit(hashed_features, target)
print '%s FINAL holdout RMSE: %0.3f' % \
    (time.strftime('%X'), (val_rmse \
    / float(n-predictions_start+1))**0.5)
print '%s FINAL holdout RMSLE: %0.3f' % \
    (time.strftime('%X'), (val_rmsle \
    / float(n-predictions_start+1))**0.5)
Out:

18:02:54 holdout RMSE: 281.065 holdout RMSLE: 1.899
18:02:54 holdout RMSE: 254.958 holdout RMSLE: 1.800
18:02:54 holdout RMSE: 255.456 holdout RMSLE: 1.798
18:52:54 holdout RMSE: 254.563 holdout RMSLE: 1.818
18:52:54 holdout RMSE: 239.740 holdout RMSLE: 1.737
18:52:54 FINAL holdout RMSE: 229.274
18:52:54 FINAL holdout RMSLE: 1.678
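The two error measures reported above differ only in the scale on which the squared errors are accumulated. The listing calls apply_exp without showing its definition; the sketch below assumes that the target is modeled as log(x + 1) and that apply_exp is the inverse of that transform. The three pairs of counts are made-up values for illustration.

```python
import math


def apply_log(x):
    return math.log(x + 1.0)     # assumed transform applied to the target


def apply_exp(x):
    return math.exp(x) - 1.0     # inverse of apply_log


# Hypothetical (actual count, forecast count) pairs from a holdout period
pairs = [(100.0, 80.0), (250.0, 300.0), (10.0, 12.0)]

val_rmse, val_rmsle = 0.0, 0.0
for actual, forecast in pairs:
    target, predicted = apply_log(actual), apply_log(forecast)
    val_rmse += (apply_exp(predicted) - apply_exp(target)) ** 2   # original scale
    val_rmsle += (predicted - target) ** 2                        # log scale

rmse = (val_rmse / len(pairs)) ** 0.5
rmsle = (val_rmsle / len(pairs)) ** 0.5
```

Under this assumption, RMSE is dominated by errors on the busiest hours, whereas RMSLE weighs relative errors, so a miss of 2 bicycles out of 10 counts about as much as a miss of 50 out of 250.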

Chapter 2

Summary

In this chapter, we have seen how learning is possible out-of-core by streaming data, no matter how big it is, from a text file or database on your hard disk. These methods certainly apply to much bigger datasets than the examples that we used to demonstrate them (which actually could be solved in-memory using non-average, powerful hardware).

We also explained SGD, the core algorithm that makes out-of-core learning possible, and we examined its strengths and weaknesses, emphasizing that streams need to be truly stochastic (that is, in random order) for it to be effective, unless the order is part of the learning objective. In particular, we introduced the Scikit-learn implementation of SGD, limiting our focus to the linear and logistic regression loss functions.

Finally, we discussed data preparation, introduced the hashing trick and validation strategies for streams, and wrapped up the acquired knowledge on SGD by fitting two different models, one for classification and one for regression.

In the next chapter, we will keep on enriching our out-of-core capabilities by figuring out how to enable non-linearity in our learning schema and hinge loss for support vector machines. We will also present alternatives to Scikit-learn, such as Liblinear, Vowpal Wabbit, and StreamSVM. Although operating as external shell commands, all of them could be easily wrapped and controlled by Python scripts.

In document: Large Scale Machine Learning with Python (pages 93-100)