The Scikit-learn SGD implementation - Large Scale Machine Learning with Python

A good number of online learning algorithms can be found in the Scikit-learn package. Not all machine learning algorithms have an online counterpart, but the list is interesting and steadily growing. As for supervised learning, we can divide available learners into classifiers and regressors and enumerate them.

As classifiers, we can mention the following:

• sklearn.naive_bayes.MultinomialNB

• sklearn.naive_bayes.BernoulliNB

• sklearn.linear_model.Perceptron

• sklearn.linear_model.PassiveAggressiveClassifier

• sklearn.linear_model.SGDClassifier

Scalable Learning in Scikit-learn

As regressors, we have two options:

• sklearn.linear_model.PassiveAggressiveRegressor

• sklearn.linear_model.SGDRegressor

They can all learn incrementally, updating themselves instance by instance, though only SGDClassifier and SGDRegressor are based on the stochastic gradient descent optimization that we previously described, and they are the main topics of this chapter. The SGD learners are well suited to large-scale problems because their complexity is bounded by O(k*n*p), where k is the number of passes over the data, n is the number of instances, and p is the number of features (only the non-zero features, if we are working with sparse matrices). They are therefore linear-time learners: training time grows in direct proportion to the number of examples shown.

Other online algorithms will be used as comparative benchmarks. Moreover, all of these algorithms share the same API, based on the partial_fit method, for both online learning (one instance at a time) and mini-batch learning (when you stream larger chunks rather than a single instance). Sharing the same API makes all these learning techniques interchangeable in your learning framework.
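As a minimal sketch of this interchangeability, the following loop feeds the same mini-batches to several of the classifiers listed earlier through partial_fit. The synthetic data, the batch size of 20, and the names learners and scores are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Synthetic binary features and labels, invented for this sketch
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 10)).astype(float)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
classes = np.array([0, 1])

# Different learners, same partial_fit interface
learners = {
    'MultinomialNB': MultinomialNB(),
    'BernoulliNB': BernoulliNB(),
    'Perceptron': Perceptron(random_state=0),
    'SGDClassifier': SGDClassifier(random_state=0),
}

scores = {}
for name, learner in learners.items():
    # Stream the data in mini-batches of 20 rows
    for start in range(0, len(X), 20):
        learner.partial_fit(X[start:start + 20], y[start:start + 20],
                            classes=classes)
    # Accuracy on the same data, only to show that every learner was trained
    scores[name] = learner.score(X, y)
```

Swapping one learner for another only requires changing the entry in the dictionary; the streaming loop stays the same.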

Contrary to the fit method, which uses all the available data for its immediate optimization, partial_fit performs a partial optimization based on each of the single instances passed. Even if a whole dataset is passed to the partial_fit method, the algorithm processes it one element at a time rather than as an entire batch, which keeps the complexity of the learning operations linear. Moreover, after partial_fit, a learner can be updated indefinitely by subsequent partial_fit calls, making it well suited to online learning from continuous streams of data.
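The repeated updating can be sketched as follows, assuming a hypothetical next_batch function that stands in for a continuous data stream and a made-up vector of true coefficients. Each partial_fit call refines the same SGDRegressor instead of restarting it.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
true_w = np.array([2.0, -1.0, 0.5])  # coefficients invented for this sketch

def next_batch(size=50):
    """Hypothetical stand-in for a stream: returns a fresh mini-batch."""
    X = rng.normal(size=(size, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=size)
    return X, y

model = SGDRegressor(random_state=42)
snapshots = []
for _ in range(40):
    X_batch, y_batch = next_batch()
    model.partial_fit(X_batch, y_batch)  # updates the existing coefficients
    snapshots.append(model.coef_.copy())

# Largest coefficient error after the first and after the last update
first_error = np.abs(snapshots[0] - true_w).max()
last_error = np.abs(snapshots[-1] - true_w).max()
```

Comparing first_error with last_error shows the estimate moving toward the generating coefficients as more batches arrive.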

When classifying, the only caveat is that, at the first call, the learner needs to know how many classes it is going to learn and how they are labeled. This is done through the classes parameter of partial_fit, which takes the list of all the class labels (for instance, their numeric values).
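A small sketch of this caveat, using made-up data in which the first mini-batch contains only two of the three classes: the full list of labels is passed through classes on the first call, so the learner can later accept examples of the class it has not yet seen.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# All the labels of the problem, known in advance (illustrative values)
all_classes = np.array([0, 1, 2])

# The first mini-batch happens to contain only classes 0 and 1
X_first = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y_first = np.array([0, 1, 0, 1])

clf = SGDClassifier(random_state=0)
clf.partial_fit(X_first, y_first, classes=all_classes)

# A later mini-batch introduces class 2; classes can now be omitted
X_later = np.array([[5.0, 5.0], [4.8, 5.2]])
y_later = np.array([2, 2])
clf.partial_fit(X_later, y_later)

known_labels = clf.classes_.tolist()   # labels declared on the first call
coef_shape = clf.coef_.shape           # one row of coefficients per class
```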

The labels therefore have to be explored beforehand by streaming through the data in order to record the labels of the problem, also taking note of their distribution in case the classes are unbalanced, that is, one class is numerically much larger or much smaller than the others (the Scikit-learn implementation offers a way to handle this problem automatically). If the target variable is numeric, it is still useful to know its distribution, but this is not necessary to successfully run the learner.
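One possible way to do this preliminary exploration is sketched below. The stream_chunks helper and the synthetic unbalanced data are assumptions for illustration. The weights are computed by hand as n_samples / (n_classes * count), an inverse-frequency heuristic, and are passed explicitly through the class_weight parameter.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import SGDClassifier

def stream_chunks(X, y, size):
    """Hypothetical helper that yields the data chunk by chunk."""
    for start in range(0, len(y), size):
        yield X[start:start + size], y[start:start + size]

# Synthetic unbalanced problem: 900 negatives and 100 positives
rng = np.random.RandomState(1)
y = np.array([0] * 900 + [1] * 100)
rng.shuffle(y)
X = rng.normal(size=(1000, 4)) + y[:, None]

# First pass: record the labels and their frequencies
label_counts = Counter()
for _, y_chunk in stream_chunks(X, y, 100):
    label_counts.update(y_chunk.tolist())

classes = np.array(sorted(label_counts))
n_samples = sum(label_counts.values())
# Inverse-frequency weights: rarer classes receive larger weights
class_weight = {c: n_samples / (len(classes) * label_counts[c])
                for c in label_counts}

# Second pass: learn incrementally with the explicit weights
clf = SGDClassifier(class_weight=class_weight, random_state=1)
for X_chunk, y_chunk in stream_chunks(X, y, 100):
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```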

In Scikit-learn, we have two implementations: one for classification problems (SGDClassifier) and one for regression problems (SGDRegressor). The classification implementation can handle multiclass problems using the one-vs-all (OVA) strategy.

This strategy implies that, given k classes, k models are built, one for each class against all the instances of other classes, therefore creating k binary classifications.

This will produce k sets of coefficients and k vectors of predictions with their probabilities. In the end, the emitted probabilities of the classes are compared against each other, and the instance is assigned to the class with the highest probability.
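The one-vs-all structure can be inspected on a small three-class problem, as in the following sketch. The Iris dataset is chosen here only for convenience. With the hinge loss used below, the per-class outputs are decision scores rather than probabilities, but the assignment rule is the same: the class with the highest score wins.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)   # 150 instances, 4 features, 3 classes

clf = SGDClassifier(loss='hinge', random_state=0).fit(X, y)

# One set of coefficients (and one intercept) per class
coef_shape = clf.coef_.shape
intercept_shape = clf.intercept_.shape

# One column of scores per binary "class versus the rest" model
scores = clf.decision_function(X)

# Assign each instance to the class whose binary model scores highest
predicted = clf.classes_[np.argmax(scores, axis=1)]
```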

If we need to give actual probabilities for the multinomial distribution, we can simply normalize the results by dividing by their sum. (A softmax layer in a neural network, which we will see in the following chapters, performs a similar normalization after exponentiating its inputs.)

Both classification and regression SGD implementations in Scikit-learn feature different loss functions (the cost function, the core of the stochastic gradient descent optimization).
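Returning to the normalization of one-vs-all outputs, the arithmetic can be sketched with plain NumPy. The probabilities in ova_probs are made-up numbers standing for the outputs of three binary models on two instances.

```python
import numpy as np

# Rows are instances, columns are the k = 3 binary one-vs-all models
ova_probs = np.array([[0.9, 0.3, 0.2],
                      [0.1, 0.6, 0.5]])

# Divide each row by its sum so that the k values add up to 1
normalized = ova_probs / ova_probs.sum(axis=1, keepdims=True)

# The predicted class is unchanged by the normalization
labels = normalized.argmax(axis=1)
```

The first row becomes roughly 0.64, 0.21, and 0.14, since its values sum to 1.4 before normalization.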

For classification, expressed by the loss parameter, we can rely on the following:

• loss='log': Classical logistic regression (named 'log_loss' from Scikit-learn 1.1 onward)

• loss='hinge': Soft margin, that is, a linear support vector machine

• loss='modified_huber': A smoothed hinge loss

For regression, we have three loss functions:

• loss='squared_loss': Ordinary least squares (OLS) for linear regression (named 'squared_error' from Scikit-learn 1.0 onward)

• loss='huber': Huber loss for robust regression against outliers

• loss='epsilon_insensitive': A linear support vector regression

We will present some examples using the classical statistical loss functions, which are logistic loss and OLS. Hinge loss and support vector machines (SVMs) will be discussed in the next chapter, as their functioning requires a detailed introduction.
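Switching loss function only requires changing the loss parameter, as the following sketch on synthetic data suggests. It is limited to the loss names that have stayed the same across Scikit-learn releases; the logistic and squared losses are selected in the same way, under the name that matches the installed version. The data and the coefficient vector w are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier, SGDRegressor

# Synthetic data: a linear signal with two irrelevant features
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
w = np.array([1.5, -2.0, 0.0, 0.7, 0.0])
y_reg = X @ w + rng.normal(scale=0.1, size=300)   # numeric target
y_clf = (X @ w > 0).astype(int)                   # binary target

# Same estimator class, different cost function
classifiers = {loss: SGDClassifier(loss=loss, random_state=0).fit(X, y_clf)
               for loss in ('hinge', 'modified_huber')}
regressors = {loss: SGDRegressor(loss=loss, random_state=0).fit(X, y_reg)
              for loss in ('huber', 'epsilon_insensitive')}

# Training accuracy of each classifier, for a quick comparison
accuracies = {loss: clf.score(X, y_clf) for loss, clf in classifiers.items()}
```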

As a reminder (so that you won't have to consult a supplementary machine learning book), if we define the regression function as h, with predictions given by h(X), where X is the matrix of features, then the following is the suitable formulation:
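A standard way to write it, assuming a weight vector w and an intercept b (notation introduced here for illustration, since the text names only h and X), is:

```latex
h(X) = Xw + b
```

For a single instance x_i, a row of X, this reads h(x_i) = x_i^T w + b.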


Consequently, the OLS cost function to be minimized is as follows:
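In its usual form, assuming n instances with targets y_i, a linear h with weights w and intercept b, and the common convention of a factor 1/2 that simplifies the gradient, the cost can be written as:

```latex
J(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( h(x_i) - y_i \right)^2
```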

In logistic regression, the binary outcome 0/1 is transformed into odds (more precisely, log-odds), with πy being the probability of a positive outcome; the formula is as follows:
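A standard statement of this transformation, assuming a linear model with weights w and intercept b (symbols introduced here for illustration), is the logistic function together with its equivalent log-odds form:

```latex
\pi_y = P(y = 1 \mid x) = \frac{1}{1 + e^{-(x^\top w + b)}}
\quad\Longleftrightarrow\quad
\log \frac{\pi_y}{1 - \pi_y} = x^\top w + b
```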

The logistic cost function, consequently, is defined as follows:
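In its usual form (the negative average log-likelihood), assuming n instances with outcomes y_i in {0, 1} and writing \pi_i for the predicted probability of a positive outcome for instance i, it reads:

```latex
J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log\left(1 - \pi_i\right) \right]
```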
