Feature selection by regularization - Large Scale Machine Learning with Python

In a batch context, feature selection is commonly carried out as follows:

• A preliminary filtering based on completeness (incidence of missing values), variance, and high multicollinearity between variables in order to have a cleaner dataset of relevant and operable features.

• Another initial filtering based on the univariate association (chi-squared test, F-value, or simple linear regression) between each feature and the response variable, in order to immediately remove the features that are of no use for the predictive task because they are weakly related or unrelated to the response.

• During modeling, a recursive approach that inserts and/or excludes features on the basis of their capability to improve the predictive power of the algorithm, as tested on a holdout sample. Using a smaller subset of only relevant features makes the machine learning algorithm less prone to overfitting caused by noisy variables and by the excess parameters that come with high-dimensional data.
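The first two filtering steps above can be sketched on a small in-memory dataset. This is only a minimal illustration using NumPy: the synthetic features, their names, and the three thresholds (1e-8, 0.95, 0.2) are assumptions made for the example, not values recommended by the text.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 500
signal = rng.normal(size=n)                           # related to the response
noise = rng.normal(size=n)                            # unrelated to the response
constant = np.full(n, 3.0)                            # zero-variance feature
duplicate = signal + rng.normal(scale=0.01, size=n)   # nearly collinear with signal
X = np.column_stack([signal, noise, constant, duplicate])
y = 2.0 * signal + rng.normal(scale=0.5, size=n)
names = ['signal', 'noise', 'constant', 'duplicate']

# Step 1: drop features with (near) zero variance
keep = [j for j in range(X.shape[1]) if X[:, j].var() > 1e-8]

# Step 2: drop a feature if it is highly correlated with one already kept
selected = []
for j in keep:
    if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < 0.95 for k in selected):
        selected.append(j)

# Step 3: keep only features with a sufficient univariate association with y
final = [j for j in selected if abs(np.corrcoef(X[:, j], y)[0, 1]) > 0.2]
selected_names = [names[j] for j in final]
print(selected_names)
```

Every step needs the whole matrix in memory and at least one full pass over it, which is why this style of selection suits a batch setting better than a stream.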

Applying such approaches in an online setting is still possible, but quite expensive in terms of time because of the quantity of data that has to be streamed to complete a single model. Recursive approaches based on a large number of iterations and tests require a nimble dataset that fits in memory. As mentioned earlier, in such a case subsampling is a good option for working out the features and models to be applied later at a larger scale.
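One way to obtain such a subsample in a single pass over a stream of unknown length is reservoir sampling. The text does not prescribe a specific subsampling method, so the following is just one possible sketch; the function name reservoir_sample and its parameters are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rnd = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rnd.randint(0, i)      # random slot in [0, i], both ends included
            if j < k:
                sample[j] = item       # replace with probability k / (i + 1)
    return sample

# Example: reduce a stream of 100,000 items to 1,000 that fit in memory
subsample = reservoir_sample(range(100000), k=1000)
print(len(subsample))
```

The resulting sample can then be used with the in-memory, recursive selection approaches described above.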

Keeping to our out-of-core approach, regularization is the ideal way to select variables while streaming and to filter out noisy or redundant features.

Regularization works well with online algorithms because it operates while the online machine learning algorithm is working and fitting its coefficients from the examples, without any need to run additional passes over the stream for the purpose of selection. Regularization is, in fact, just a penalty term that is added to the optimization of the learning process. It depends on the features' coefficients and on a parameter named alpha, which sets the impact of regularization. Regularization intervenes when the model updates the coefficients' weights: at that time, it reduces the resulting weights if the value of the update is not large enough.
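The effect of the penalty at update time can be illustrated with a single, simplified SGD step. This is a sketch under stated assumptions, not Scikit-learn's actual implementation: the function sgd_step, the learning rate eta, and the simple shrink-and-clip rule used for L1 are all chosen for illustration.

```python
import numpy as np

def sgd_step(w, grad, eta, alpha, penalty):
    """One simplified SGD update of the weights w, given the loss gradient."""
    w = w - eta * grad                      # plain gradient step
    if penalty == 'l2':
        w = w * (1.0 - eta * alpha)         # shrink every weight proportionally
    elif penalty == 'l1':
        shrink = eta * alpha
        # move each weight toward zero by `shrink`, clipping at zero
        w = np.sign(w) * np.maximum(np.abs(w) - shrink, 0.0)
    return w

w = np.array([0.5, 0.001, -0.3])
grad = np.zeros(3)                          # the current example brings no update
w_l2 = sgd_step(w, grad, eta=0.1, alpha=0.1, penalty='l2')
w_l1 = sgd_step(w, grad, eta=0.1, alpha=0.1, penalty='l1')
print(w_l2)
print(w_l1)
```

When the update coming from the data is too small to offset the shrinkage, the L2 rule makes the weight a little smaller, whereas the L1 rule can set a small weight exactly to zero.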

The trick of excluding or attenuating redundant variables is achieved through the regularization alpha parameter, which has to be set empirically at the right magnitude to get the best result on each specific dataset.

SGD implements the same regularization strategies found in batch algorithms:

• L1 penalty, pushing the coefficients of redundant and less important variables to zero

• L2 penalty, reducing the weight of less important features

• Elastic Net, mixing the effects of L1 and L2 regularization

L1 regularization is the perfect strategy when there are unusual and redundant variables as it will push the coefficients of such features to zero, making them irrelevant when calculating the prediction.

L2 is suitable when there are many correlations between the variables as its strategy is just to reduce the weights of the features whose variation is less important for the loss function minimization. With L2, all the variables keep on contributing to the prediction, though some less so.

Elastic Net mixes L1 and L2 using a weighted sum. This solution is interesting because L1 regularization is sometimes unstable when dealing with highly correlated variables, picking one or the other depending on the examples seen. Using Elastic Net, many unusual features will still be pushed to zero as in L1 regularization, but correlated ones will be attenuated as in L2.
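The three penalties can be written compactly as follows. The notation is an assumption made here for illustration: L is the loss, R the penalty, α the regularization strength, and ρ the share of L1 in the mix (the role played by l1_ratio). To the best of our understanding, this matches the formulation used in the Scikit-learn SGD documentation, which states the mix in terms of 1 - l1_ratio.

```latex
\[
E(w) = \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) + \alpha\, R(w)
\]
\[
R_{L1}(w) = \sum_{j} |w_j|, \qquad
R_{L2}(w) = \frac{1}{2}\sum_{j} w_j^{2}, \qquad
R_{EN}(w) = \rho \sum_{j} |w_j| + \frac{1-\rho}{2}\sum_{j} w_j^{2}
\]
```

With ρ = 1 the Elastic Net penalty reduces to L1, and with ρ = 0 it reduces to L2.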

Both SGDClassifier and SGDRegressor can implement L1, L2, and Elastic Net regularization using the penalty, alpha, and l1_ratio parameters.
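As a rough illustration of how the three penalties differ in practice, the following sketch fits SGDClassifier on a small synthetic dataset and counts the coefficients that end up exactly at zero. The dataset from make_classification and the value alpha=0.01 are assumptions chosen for the example; the actual counts depend on the data and on alpha.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data: 30 features, only a few of them informative
X, y = make_classification(n_samples=2000, n_features=30, n_informative=5,
                           n_redundant=5, random_state=1)

zero_counts = {}
for penalty in ('l2', 'l1', 'elasticnet'):
    # l1_ratio only has an effect when penalty='elasticnet'
    clf = SGDClassifier(loss='hinge', penalty=penalty, alpha=0.01,
                        l1_ratio=0.5, random_state=1)
    clf.fit(X, y)
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0.0))

print(zero_counts)
```

One would typically expect the L1 and Elastic Net penalties to zero out some coefficients, while L2 only shrinks them.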

The alpha parameter is the most critical one after deciding on the kind of penalty or on the mix of the two. Ideally, you can test suitable values in the range from 0.1 to 10^-6, using the list of values produced by 10.0**-np.arange(1,7).
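For reference, this is what the expression evaluates to. Since np.arange(1,7) stops at 6, the grid holds six values, from 1e-1 down to 1e-6; the variable names below are illustrative.

```python
import numpy as np

# Six candidate values on a logarithmic scale: 1e-1, 1e-2, ..., 1e-6
alphas = 10.0 ** -np.arange(1, 7)
print(alphas)

# To also include 1e-7, extend the upper bound of arange by one
alphas_extended = 10.0 ** -np.arange(1, 8)
print(len(alphas_extended))
```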

If penalty determines what kind of regularization is chosen, alpha, as mentioned, determines its strength. As alpha is a constant that multiplies the penalization term, low alpha values will have little influence on the final coefficients, whereas high values will significantly affect them. Finally, when penalty='elasticnet', l1_ratio represents the share of the L1 penalization with respect to L2.

Setting regularization with SGD is very easy. For instance, you may try changing the previous code example by inserting an L2 penalty into SGDClassifier:

SGD = SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, random_state=1, average=True)

Fast SVM Implementations

If you prefer to test an Elastic Net mixing the effects of the two regularization approaches, all you have to do is specify the ratio between L1 and L2 by setting l1_ratio:

SGD = SGDClassifier(loss='hinge', penalty='elasticnet', alpha=0.001, l1_ratio=0.5, random_state=1, average=True)

As the success of regularization depends on choosing the right kind of penalty and the best alpha, regularization will be seen in action in our examples when dealing with the problem of hyperparameter optimization.

Including non-linearity in SGD

The fastest way to insert non-linearity into a linear SGD learner (and basically a no-brainer) is to transform the vector of the example received from the stream into a new vector that includes both power transformations and combinations of the features up to a certain degree.

Combinations can represent interactions between the features (making explicit when two features act together to have a special impact on the response), thus helping the linear SVM model to include a certain amount of non-linearity. For instance, a two-way interaction is made by multiplying two features. A three-way interaction is made by multiplying three features, and so on, creating even more complex interactions for higher-degree expansions.
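To make the idea concrete, here is a minimal pure-Python sketch of such an expansion. The function name expand is hypothetical; it simply multiplies the features together in every combination (with repetition) up to the requested degree.

```python
from itertools import combinations_with_replacement

def expand(features, degree):
    """Return all products of the features up to the given degree (no constant term)."""
    expanded = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(features)), d):
            term = 1.0
            for idx in combo:
                term *= features[idx]   # multiply the chosen features together
            expanded.append(term)
    return expanded

row = [2.0, 3.0]                        # two features, a and b
print(expand(row, 2))                   # a, b, a*a, a*b, b*b
```

For two features a and b and degree 2, the result contains the original values, the two squares (power transformations), and the two-way interaction a*b. The number of terms grows quickly: three features expanded to degree 3 already produce 19 terms.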

In Scikit-learn, the preprocessing module contains the PolynomialFeatures class, which can automatically transform the vector of features by polynomial expansion of the desired degree:

In: import os
import time
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import PolynomialFeatures

source = '\\bikesharing\\hour.csv'
local_path = os.getcwd()
b_vars = ['holiday','hr','mnth', 'season','weathersit','weekday','workingday','yr']
n_vars = ['hum', 'temp', 'atemp', 'windspeed']
std_row, min_max = explore(target_file=local_path+'\\'+source,
                           binary_features=b_vars, numeric_features=n_vars)

def apply_log(x): return np.log(x + 1.0)
def apply_exp(x): return np.exp(x) - 1.0

for x,y,n in pull_examples(target_file=local_path+'\\'+source,
                           vectorizer=std_row, min_max=min_max,
                           sparse=False, binary_features=b_vars,
                           numeric_features=n_vars, target='cnt'):
    y_log = apply_log(y)
    # Extract only quantitative features and expand them
    num_index = [j for j, i in enumerate(std_row.feature_names_) if i in n_vars]
    x_poly = poly.fit_transform(x[:,num_index])[:,len(num_index):]
    new_x = np.concatenate((x, x_poly), axis=1)
    # MACHINE LEARNING

print('%s FINAL holdout RMSE: %0.3f' % (time.strftime('%X'), (val_rmse / float(n-predictions_start+1))**0.5))
print('%s FINAL holdout RMSLE: %0.3f' % (time.strftime('%X'), (val_rmsle / float(n-predictions_start+1))**0.5))
Out: ...

21:49:24 FINAL holdout RMSE: 219.191


PolynomialFeatures expects a dense matrix as input, not a sparse one. Our pull_examples function has a sparse parameter that is normally set to True; setting it to False makes the function return a dense matrix instead.
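The expansion and slicing step used in the listing can be tried in isolation on a single dense example. This is a minimal sketch: the settings degree=2, interaction_only=False, and include_bias=False are assumptions, since the listing does not show how poly is configured.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])                   # one dense example, two numeric features
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
expanded = poly.fit_transform(x)             # columns: a, b, a^2, a*b, b^2
x_poly = expanded[:, x.shape[1]:]            # drop the original columns, keep new terms
new_x = np.concatenate((x, x_poly), axis=1)  # original features plus the expansion
print(new_x)
```

With include_bias=False, the first columns of the output repeat the input features, which is why the slice keeps only the columns after them before concatenating. If the input arrives as a SciPy sparse matrix, calling its toarray() method returns a dense array.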

From the document Large Scale Machine Learning with Python (pages 127-131)