In the previous section, we learned how to minimize a cost function by taking a step into the opposite direction of a gradient that is calculated from the whole training set; this is why this approach is sometimes also referred to as batch gradient descent.
Now imagine we have a very large dataset with millions of data points, which is not uncommon in many machine learning applications. Running batch gradient descent can be computationally quite costly in such scenarios since we need to reevaluate the whole training dataset each time we take one step towards the global minimum.
A popular alternative to the batch gradient descent algorithm is stochastic gradient descent, sometimes also called iterative or on-line gradient descent. Instead of updating the weights based on the sum of the accumulated errors over all samples
x
( )i :( )
( )
( )(
i i)
( )i,
i
y z
η φ
∆ = w ∑ − x
We update the weights incrementally for each training sample:
( )
( )
( )( y
iz
i)
( )iη − φ x
Although stochastic gradient descent can be considered as an approximation of gradient descent, it typically reaches convergence much faster because of the more frequent weight updates. Since each gradient is calculated based on a single training example, the error surface is noisier than in gradient descent, which can also have the advantage that stochastic gradient descent can escape shallow local minima more readily. To obtain accurate results via stochastic gradient descent, it is important to present it with data in a random order, which is why we want to shuffle the training set for every epoch to prevent cycles.
In stochastic gradient descent implementations, the fixed learning rate η
is often replaced by an adaptive learning rate that decreases over time, for example,
[
1]
2c
number of iterations c+ where
c
1 andc
2 are constants.Note that stochastic gradient descent does not reach the global minimum but an area very close to it. By using an adaptive learning rate, we can achieve further annealing to a better global minimum
Another advantage of stochastic gradient descent is that we can use it for online learning. In online learning, our model is trained on-the-fly as new training data arrives. This is especially useful if we are accumulating large amounts of data—for example, customer data in typical web applications. Using online learning, the system can immediately adapt to changes and the training data can be discarded after updating the model if storage space in an issue.
A compromise between batch gradient descent and stochastic gradient descent is the so-called mini-batch learning. Mini-batch learning can be understood as applying batch gradient descent to smaller subsets of the training data—for example, 50 samples at a time. The advantage over batch gradient descent is that convergence is reached faster via mini-batches because of the more frequent weight updates.
Furthermore, mini-batch learning allows us to replace the for-loop over the training samples in Stochastic Gradient Descent (SGD) by vectorized operations, which can further improve the computational efficiency of our learning algorithm.
Since we already implemented the Adaline learning rule using gradient descent, we only need to make a few adjustments to modify the learning algorithm to update the weights via stochastic gradient descent. Inside the fit method, we will now update the weights after each training sample. Furthermore, we will implement an additional partial_fit method, which does not reinitialize the weights, for on-line learning. In order to check if our algorithm converged after training, we will calculate the cost as the average cost of the training samples in each epoch.
Furthermore, we will add an option to shuffle the training data before each epoch to avoid cycles when we are optimizing the cost function; via the random_state parameter, we allow the specification of a random seed for consistency:
from numpy.random import seed class AdalineSGD(object):
"""ADAptive LInear NEuron classifier.
Parameters
def __init__(self, eta=0.01, n_iter=10, shuffle=True, random_state=None):
self.eta = eta self.n_iter = n_iter self.w_initialized = False self.shuffle = shuffle
if random_state:
self._initialize_weights(X.shape[1]) self.cost_ = []
"""Fit training data without reinitializing the weights"""
if not self.w_initialized:
"""Shuffle training data"""
r = np.random.permutation(len(y)) return X[r], y[r]
def _initialize_weights(self, m):
"""Initialize weights to zeros"""
self.w_ = np.zeros(1 + m) self.w_initialized = True
def _update_weights(self, xi, target):
"""Apply Adaline learning rule to update the weights"""
output = self.net_input(xi)
"""Compute linear activation"""
return self.net_input(X) def predict(self, X):
"""Return class label after unit step"""
return np.where(self.activation(X) >= 0.0, 1, -1)
The _shuffle method that we are now using in the AdalineSGD classifier works as follows: via the permutation function in numpy.random, we generate a random sequence of unique numbers in the range 0 to 100. Those numbers can then be used as indices to shuffle our feature matrix and class label vector.
We can then use the fit method to train the AdalineSGD classifier and use our plot_decision_regions to plot our training results:
>>> ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
>>> ada.fit(X_std, y)
>>> plot_decision_regions(X_std, y, classifier=ada)
>>> plt.title('Adaline - Stochastic Gradient Descent')
>>> plt.xlabel('sepal length [standardized]')
>>> plt.ylabel('petal length [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
>>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Average Cost')
>>> plt.show()
The two plots that we obtain from executing the preceding code example are shown in the following figure:
As we can see, the average cost goes down pretty quickly, and the final decision boundary after 15 epochs looks similar to the batch gradient descent with Adaline.
If we want to update our model—for example, in an on-line learning scenario with streaming data—we could simply call the partial_fit method on individual samples—for instance, ada.partial_fit(X_std[0, :], y[0]).
Summary
In this chapter, we gained a good understanding of the basic concepts of linear classifiers for supervised learning. After we implemented a perceptron, we saw how we can train adaptive linear neurons efficiently via a vectorized implementation of gradient descent and on-line learning via stochastic gradient descent. Now that we have seen how to implement simple classifiers in Python, we are ready to move on to the next chapter where we will use the Python scikit-learn machine learning library to get access to more advanced and powerful off-the-shelf machine learning classifiers that are commonly used in academia as well as in industry.