The hashing trick - Large Scale Machine Learning with Python

If, among your features, there are categories (encoded in values or left in textual form), things can get a bit trickier. Normally, in batch learning, you would one-hot encode the categories and get as many new binary features as categories that you have. Unfortunately, in a stream, you do not know in advance how many categories you will deal with, and not even by sampling can you be sure of their number because rare categories may appear really late in the stream or require a too large sample to be discovered. You will have to first stream all the data and take a record of every category that appears. Anyway, streams can be ephemeral and sometimes the number of classes can be so large that they cannot be stored in-memory. The online advertising data is such an example because of its high volumes that are difficult to store away and because the stream cannot be passed over more than once.

Scalable Learning in Scikit-learn

Working with texts makes the problem even more strikingly evident because you cannot anticipate what kind of word could be part of the text that you will be analyzing. In a bag-of-words model—where for each text the present words are counted and their frequency value pasted in an element in the feature vector specific to each word—you should be able to map each word to an index in advance. Even if you can manage this, you'll always have to handle the situation when an unknown word (therefore never mapped before) will pop up during testing or when the predictor is in production. Marginally, it should also be added that, being a spoken language, dictionaries made of hundreds of thousands or even millions of different terms are not unusual at all.

To recap, if you can handle knowing in advance the classes in your features, you can deal with them using the one-hot encoder from Scikit-learn (http://Scikit-learn.

org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.

html). We actually won't illustrate it here but basically the approach is not at all different from what you would apply when working with batch learning. What we want to illustrate to you is when you cannot really apply one-hot encoding.

There is a solution called the hashing trick because it is based on the hash function and can deal with both text and categorical variables in integer or string form. It can also work with categorical variables mixed with numeric values from quantitative features. The core problem with one-hot encoding is that it assigns a position to a value in the feature vector after having mapped its feature to that position. The hashing trick can univocally map a value to its position without any prior need to evaluate the feature because it leverages the core characteristic of a hashing function—to transform a value or string into an integer value deterministically.

Therefore, the only necessary preparation before applying it is creating a sparse vector large enough to represent the complexity of the data (potentially containing from 2**19 to 2**30 elements, depending on the available memory, bus architecture of your computer, and type of hash function that you are using). If you are working on some text, you'll also need a tokenizer, that is, a function that will split your text into single words and removes punctuation.

A simple toy example will make this clear. We will be using two specialized functions from the Scikit-learn package: HashingVectorizer, a transformer based on the hashing trick that works on textual data, and FeatureHasher, which is another transformer, specialized in converting a data row expressed as a Python dictionary to a sparse vector of features.

Chapter 2 As the first example, we will turn a phrase into a vector:

In: from sklearn.feature_extraction.text import HashingVectorizer h = HashingVectorizer(n_features=1000, binary=True, norm=None)

sparse_vector = h.transform(['A simple toy example will make clear how it works.'])

print(sparse_vector) Out:

(0, 61) 1.0 (0, 271) 1.0 (0, 287) 1.0 (0, 452) 1.0 (0, 462) 1.0 (0, 539) 1.0 (0, 605) 1.0 (0, 726) 1.0 (0, 918) 1.0

The resulting vector has unit values only at certain indexes, pointing out an association between a token in the phrase (a word) and a certain position in the vector. Unfortunately, the association cannot be reversed unless we map the hash value for each token in an external Python dictionary. Though possible, such a mapping would be indeed memory consuming because dictionaries can prove large, in the range of millions of items or even more, depending on the language and topics. Actually, we do not need to keep such tracking because hash functions guarantee to always produce the same index from the same token.

A real problem with the hashing trick is the eventuality of a collision, which happens when two different tokens are associated to the same index. This is a rare but possible occurrence when working with large dictionaries of words. On the other hand, in a model composed of millions of coefficients, there are very few that are influential. Consequently, if a collision happens, probably it will involve two unimportant tokens. When using the hashing trick, probability is on your side because with a large enough output vector (for instance, the number of elements is above 2^24), though collisions are always possible, it will be highly unlikely that they will involve important elements of the model.

Scalable Learning in Scikit-learn

The hashing trick can be applied to normal feature vectors, especially when there are categorical variables. Here is an example with FeatureHasher:

In: from sklearn.feature_extraction import FeatureHasher h = FeatureHasher(n_features=1000, non_negative=True)

example_row = {'numeric feature':3, 'another numeric feature':2, 'Categorical feature = 3':1, 'f1*f2*f3':1*2*3}

print (example_row)

Out: {'another numeric feature': 2, 'f1*f2*f3': 6, 'numeric feature':

3, 'Categorical feature = 3': 1}

If your Python dictionary contains the feature names for numeric values and a composition of feature name and value for any categorical variable, the dictionary's values will be mapped using the hashed index of the keys creating a one-hot encoded feature vector, ready to be learned by an SGD algorithm:

In: print (h.transform([example_row])) Out:

(0, 16) 2.0 (0, 373) 1.0 (0, 884) 6.0 (0, 945) 3.0

在文檔中 Large Scale Machine Learning with Python (頁 88-91)