As in batch learning, there are no shortcuts in out-of-core algorithms when searching for the best combination of hyperparameters; you need to try a certain number of combinations to find a possibly optimal solution, and use an out-of-sample error measure to evaluate their performance.
Because you do not know in advance whether your prediction problem has a simple, smooth, convex loss or a more complicated one, nor exactly how your hyperparameters interact with each other, it is very easy to get stuck in a sub-optimal local minimum if you do not try enough combinations. Unfortunately, at the moment Scikit-learn offers no specialized optimization procedures for out-of-core algorithms. Given the necessarily long time it takes to train an SGD model on a long stream, tuning the hyperparameters can become a real bottleneck when building a model on your data using such techniques.
Here, we present a few rules of thumb that can help you save time and effort and achieve the best results.
First, you can tune your parameters on a window or a sample of your data that can
As Léon Bottou from Microsoft Research remarked in his technical paper, Stochastic Gradient Descent Tricks:
"The mathematics of stochastic gradient descent are amazingly independent of the training set size."
This holds for all the key parameters, but especially for the learning rate: the learning rate that works best on a sample will also work best on the full data. In addition, the ideal number of passes over the data can mostly be guessed by running to convergence on a small sampled dataset. As a rule of thumb, we report the indicative figure of 10**6 examples examined by the algorithm, as pointed out by the Scikit-learn documentation. We have often found this number accurate, though the ideal number of iterations may change depending on the regularization parameters.
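The rule of thumb above translates into a first guess for the number of passes: divide 10**6 by the number of training examples and round up. The following is only a minimal sketch; the function name and the dataset sizes are ours, chosen for illustration:

```python
import math

def suggested_passes(n_examples, target_examples=10**6):
    """First guess for the number of passes over the data, so that the
    learner examines roughly `target_examples` examples in total."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return max(1, int(math.ceil(float(target_examples) / n_examples)))

# Hypothetical dataset sizes, from a small sample to a long stream
for n in (5000, 20000, 2000000):
    print('%i examples -> %i passes' % (n, suggested_passes(n)))
```

Treat the result as a starting point only: as noted, stronger or weaker regularization may require more or fewer passes.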
Though most of the work can be done at a relatively small scale when using SGD, we have to define how to approach the problem of fixing multiple parameters.
Traditionally, manual search and grid search have been the most widely used approaches. Grid search solves the problem by systematically testing all the combinations of the candidate values of each parameter, usually spaced on a log scale (for instance, successive powers of 10 or of 2).
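To make the cost of a systematic search concrete, here is a small sketch that enumerates a log-scale grid using only the standard library. The hyperparameter names and ranges are illustrative assumptions, not recommendations:

```python
from itertools import product

# Candidate values on a log scale: powers of 10 for alpha, powers of 2
# for eta0. Names and ranges are illustrative only.
grid = {
    'alpha': [10.0 ** -k for k in range(2, 6)],  # 0.01 down to 0.00001
    'eta0': [2.0 ** -k for k in range(1, 5)],    # 0.5 down to 0.0625
    'penalty': ['l1', 'l2'],
}

# Cartesian product of all the candidate values
names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

print('%i combinations to test' % len(combinations))
```

The number of combinations is the product of the number of candidates per parameter (here 4 x 4 x 2), so each additional hyperparameter multiplies the number of models to train.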
Recently, James Bergstra and Yoshua Bengio, in their paper Random Search for Hyper-Parameter Optimization, pointed out a different approach based on the random sampling of the values of the hyperparameters. Such an approach, though based on random choices, often matches the results of grid search with fewer runs when the number of hyperparameters is low, and can exceed the performance of a systematic search when there are many parameters and not all of them are relevant to the algorithm's performance.
We leave it to the reader to discover more reasons why this simple and appealing approach works so well in theory by referring to the previously mentioned paper by Bergstra and Bengio. In practice, having experienced its superiority with respect to other approaches, we propose an approach that works well for streams, based on Scikit-learn's ParameterSampler class, in the following example code snippet.
ParameterSampler can randomly sample different sets of hyperparameters (either from distribution functions or from lists of discrete values), which are then applied to your SGD learner by means of the set_params method:
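One intuition behind this result can be sketched with the standard library alone: given the same budget of trials, a grid tests only a handful of distinct values for each hyperparameter, whereas random sampling tests a new value at every trial. The ranges, the budget, and the hyperparameter names below are arbitrary assumptions made for illustration:

```python
import random
from itertools import product

budget = 9  # number of models we can afford to train

# Grid search: 3 values of alpha x 3 values of eta0 = 9 trials
alphas = [1e-2, 1e-3, 1e-4]
eta0s = [0.1, 0.01, 0.001]
grid_trials = [{'alpha': a, 'eta0': e} for a, e in product(alphas, eta0s)]

# Random search: 9 trials, each drawing both values log-uniformly
# from the same ranges
rng = random.Random(5)
random_trials = [{'alpha': 10.0 ** rng.uniform(-4, -2),
                  'eta0': 10.0 ** rng.uniform(-3, -1)}
                 for _ in range(budget)]

def distinct(trials, name):
    """Count how many different values of one hyperparameter were tried."""
    return len(set(trial[name] for trial in trials))

print('grid:   %i distinct alpha values' % distinct(grid_trials, 'alpha'))
print('random: %i distinct alpha values' % distinct(random_trials, 'alpha'))
```

If only alpha really matters for the problem at hand, the grid has effectively explored three settings with nine training runs, while the random schedule has explored nine.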
In:
import os
import numpy as np
from sklearn.linear_model import SGDRegressor
# In Scikit-learn versions before 0.18, import from sklearn.grid_search instead
from sklearn.model_selection import ParameterSampler

source = '\\bikesharing\\hour.csv'
local_path = os.getcwd()
# (the following line is truncated in the source)
b_vars = ['holiday','hr','mnth', 'season','weathersit','weekday','wor
n_vars = ['hum', 'temp', 'atemp', 'windspeed']
# (the following line is truncated in the source)
std_row, min_max = explore(target_file=local_path+'\\'+source, binary_

def apply_log(x):
    return np.log(x + 1.0)

def apply_exp(x):
    return np.exp(x) - 1.0

param_grid = {'penalty':['l1', 'l2'], 'alpha': 10.0**-np.arange(2,5)}
random_tests = 3
search_schedule = list(ParameterSampler(param_grid, n_iter=random_tests, random_state=5))
results = dict()

for search in search_schedule:
    SGD = SGDRegressor(loss='epsilon_insensitive', epsilon=0.001, penalty=None, random_state=1, average=True)
    # Override the default parameters with the sampled ones
    params = SGD.get_params()
    new_params = {p:params[p] if p not in search else search[p] for p in params}
    SGD.set_params(**new_params)
    print(str(search)[1:-1])
    for iterations in range(200):
        # (the following line is truncated in the source)
        for x,y,n in pull_examples(target_file=local_path+'\\'+source, vectorizer=std_row, min_max=min_
        # [... the rest of this call and the body of the loop are missing from the source ...]
        print_rmse = (val_rmse / examples)**0.5
        print_rmsle = (val_rmsle / examples)**0.5
        if iterations == 0:
            print('Iteration %i - RMSE: %0.3f - RMSLE: %0.3f' % (iterations+1, print_rmse, print_rmsle))
        if iterations > 0:
            # Stop when the RMSLE improved by 1% or less over the previous pass
            if tmp_rmsle / print_rmsle <= 1.01:
                print('Iteration %i - RMSE: %0.3f - RMSLE: %0.3f\n' % (iterations+1, print_rmse, print_rmsle))
                results[str(search)]= {'rmse':float(print_rmse), 'rmsle':float(print_rmsle)}
                break
        tmp_rmsle = print_rmsle

Out:
'penalty': 'l2', 'alpha': 0.001
Iteration 1 - RMSE: 216.170 - RMSLE: 1.440
Iteration 20 - RMSE: 152.175 - RMSLE: 0.857

'penalty': 'l2', 'alpha': 0.0001
Iteration 1 - RMSE: 714.071 - RMSLE: 4.096
Iteration 31 - RMSE: 184.677 - RMSLE: 1.053

'penalty': 'l1', 'alpha': 0.01
Iteration 1 - RMSE: 1050.809 - RMSLE: 6.044
Iteration 36 - RMSE: 225.036 - RMSLE: 1.298
The code leverages the fact that the bike-sharing dataset is quite small and doesn't require any sampling. In other contexts, it makes sense to limit the number of rows processed, or to first create a smaller sample by means of reservoir sampling or one of the other sampling techniques for streams seen so far. If you would like to explore optimization in more depth, you can change the random_tests variable, which fixes the number of sampled hyperparameter combinations to be tested. Then, you can modify the if tmp_rmsle / print_rmsle <= 1.01 condition using a number nearer to 1.0, or even 1.0 itself, thus letting the algorithm keep training for as long as some gain in predictive power is still feasible.
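The effect of the stopping threshold can be seen in isolation with a short sketch. The helper name and the error values are made up for illustration (they are not results from the bike-sharing data); only the ratio test mirrors the condition used in the snippet:

```python
def stopping_iteration(errors, tolerance=1.01):
    """Return the 1-based iteration at which training would stop, that is,
    the first one where previous_error / current_error <= tolerance.
    Return None if the condition is never met."""
    previous = None
    for iteration, current in enumerate(errors, start=1):
        if previous is not None and previous / current <= tolerance:
            return iteration
        previous = current
    return None

# Hypothetical validation errors after each pass over the data
errors = [1.440, 1.200, 1.050, 0.950, 0.900, 0.880, 0.875, 0.874, 0.874]

for tolerance in (1.01, 1.0):
    print('tolerance %0.2f -> stop at iteration %s'
          % (tolerance, stopping_iteration(errors, tolerance)))
```

With a tolerance of 1.01 the loop stops as soon as a pass improves the error by 1% or less; with 1.0 it stops only when the error no longer decreases at all, at the price of more passes over the data.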
Though it is recommended to use distribution functions rather than picking from lists of values, you can still make good use of the hyperparameter ranges that we suggested before by simply enlarging the number of values that can be picked from the lists. For instance, for alpha in L1 and L2 regularization, you could use NumPy's arange function with a small step, such as 10.0**-np.arange(1, 7, step=0.1), or NumPy's logspace function with a high value for the num parameter: 1.0/np.logspace(1,7,num=50).
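As a sketch of how such dense lists plug into the sampler, the following builds both candidate lists and draws a few combinations from one of them. It assumes a Scikit-learn version that provides sklearn.model_selection; the number of draws and the random seed are arbitrary:

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Two dense lists of candidate alphas on a log scale
alphas_arange = (10.0 ** -np.arange(1, 7, step=0.1)).tolist()
alphas_logspace = (1.0 / np.logspace(1, 7, num=50)).tolist()  # 0.1 down to 1e-7

# Sample a few combinations from the denser list of values
param_grid = {'penalty': ['l1', 'l2'], 'alpha': alphas_arange}
search_schedule = list(ParameterSampler(param_grid, n_iter=5, random_state=5))

for search in search_schedule:
    print(search)
```

Because every entry in param_grid is a list, the sampler draws combinations without replacement, so no combination is tested twice.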