Our achievements and goals

Our current text pre-processing phase includes the following steps:

1. Tokenizing the text.

2. Throwing away words that occur too often to be of any help in detecting relevant posts.

3. Throwing away words that occur so seldom that there is little chance they will occur in future posts.

4. Counting the remaining words.

5. Calculating TF-IDF values from the counts, considering the whole text corpus.
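To make the five steps concrete, here is a minimal, library-free sketch. The function names `tokenize` and `preprocess`, the thresholds `max_df` and `min_df`, and the four toy posts are assumptions made for illustration. The TF-IDF formula used is one common variant (term frequency times the logarithm of the inverse document frequency), not necessarily the exact one a given library implements.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess(posts, max_df=0.9, min_df=2):
    """Return (vocabulary, one {word: tf-idf} dict per post).

    max_df and min_df are illustrative thresholds (assumptions):
    drop words found in more than max_df of all posts, or in
    fewer than min_df posts.
    """
    tokenized = [tokenize(p) for p in posts]
    n_posts = len(posts)

    # Document frequency: in how many posts does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))

    # Steps 2 and 3: keep words that are neither too common nor too rare.
    vocabulary = sorted(
        w for w, n in df.items() if min_df <= n <= max_df * n_posts
    )
    vocab_set = set(vocabulary)

    vectors = []
    for tokens in tokenized:
        # Step 4: count the remaining words.
        counts = Counter(t for t in tokens if t in vocab_set)
        total = sum(counts.values())
        # Step 5: term frequency times inverse document frequency.
        vectors.append({
            w: (c / total) * math.log(n_posts / df[w])
            for w, c in counts.items()
        })
    return vocabulary, vectors


posts = [
    "the database crashed",
    "the database is slow",
    "the image is large",
    "the kernel panicked",
]
vocabulary, vectors = preprocess(posts)
print(vocabulary)  # ['database', 'is']
```

Here "the" is dropped because it occurs in every post, and words such as "crashed" are dropped because they occur in only one. The last post ends up with an empty vector, which shows how aggressive pruning can leave short posts without any features.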

Again, we can congratulate ourselves. With this process, we are able to convert a bunch of noisy text into a concise representation of feature values.

But, as simple and powerful as the bag-of-words approach with its extensions is, it has some drawbacks that we should be aware of:

• It does not cover word relations: With the aforementioned vectorization approach, the text "Car hits wall" and "Wall hits car" will both have the same feature vector.

• It does not capture negations correctly: For instance, the texts "I will eat ice cream" and "I will not eat ice cream" will have very similar feature vectors, although they carry opposite meanings. This problem, however, can be easily mitigated by counting not only individual words, also called "unigrams", but also bigrams (pairs of words) or trigrams (three words in a row).

• It totally fails with misspelled words: Although it is clear to human readers that "database" and "databas" convey the same meaning, our approach will treat them as totally different words.
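The first two drawbacks can be reproduced in a few lines. The following is a minimal sketch using only the standard library; the helper `ngram_counts` and its parameters `n_min` and `n_max` are hypothetical names chosen for this illustration. (scikit-learn's `CountVectorizer` offers a comparable `ngram_range` parameter.)

```python
from collections import Counter


def ngram_counts(text, n_min=1, n_max=1):
    """Count all word n-grams with n_min <= n <= n_max."""
    words = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


a, b = "Car hits wall", "Wall hits car"

# Unigrams only: word order is lost, so both texts look identical.
print(ngram_counts(a) == ngram_counts(b))  # True

# Unigrams plus bigrams: "car hits" and "wall hits" tell them apart.
print(ngram_counts(a, 1, 2) == ngram_counts(b, 1, 2))  # False

# Bigrams also surface the negation as a feature of its own.
print("not eat" in ngram_counts("I will not eat ice cream", 1, 2))  # True
```

The price for the extra expressiveness is a much larger vocabulary, since every distinct word pair becomes its own feature.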

For brevity's sake, let's nevertheless stick with the current approach, which we can now use to build clusters efficiently.

Clustering

Finally, we have our vectors, which we believe capture the posts to a sufficient degree.

Not surprisingly, there are many ways to group them together. Most clustering algorithms fall into one of two categories: flat and hierarchical clustering.

Flat clustering divides the posts into a set of clusters without relating the clusters to each other. The goal is simply to come up with a partitioning such that all posts in one cluster are most similar to each other while being dissimilar from the posts in all other clusters. Many flat clustering algorithms require the number of clusters to be specified up front.

In hierarchical clustering, the number of clusters does not have to be specified.

Instead, hierarchical clustering creates a hierarchy of clusters. While similar posts are grouped into one cluster, similar clusters are again grouped into one uber-cluster.

This is done recursively, until only one cluster is left that contains everything. In this hierarchy, one can then choose the desired number of clusters after the fact.

However, this comes at the cost of lower efficiency.
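To illustrate the idea of building the hierarchy first and choosing the number of clusters afterwards, here is a minimal sketch of bottom-up (agglomerative) clustering. The functions `agglomerate` and `cut`, the toy points, and the use of single linkage (distance between the two closest members) as cluster distance are all assumptions for this example. Other linkage criteria are equally valid.

```python
import math


def agglomerate(points):
    """Merge the two closest clusters until only one is left.

    Returns every partition seen along the way, from len(points)
    clusters down to 1. Each cluster is a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Compare every pair of clusters: this is the expensive part.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        levels.append([list(c) for c in clusters])
    return levels


def cut(levels, num_clusters):
    """Pick the partition that has the requested number of clusters."""
    return next(level for level in levels if len(level) == num_clusters)


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
levels = agglomerate(points)
print(sorted(sorted(c) for c in cut(levels, 3)))  # [[0, 1], [2, 3], [4]]
print(sorted(sorted(c) for c in cut(levels, 2)))  # [[0, 1, 2, 3], [4]]
```

The nested loops over all cluster pairs in every merge step hint at where the lower efficiency comes from: this naive version does far more distance computations than a flat method needs per iteration.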

SciKit provides a wide range of clustering approaches in the sklearn.cluster package. You can get a quick overview of advantages and drawbacks of each of them at http://scikit-learn.org/dev/modules/clustering.html.

In the following sections, we will use the flat clustering method K-means and play a bit with the desired number of clusters.

K-means

K-means is the most widely used flat clustering algorithm. After initializing it with the desired number of clusters, num_clusters, it maintains that number of so-called cluster centroids. Initially, it will pick any num_clusters posts and set the centroids to their feature vectors. Then it will go through all other posts and assign each of them to the nearest centroid, which defines its current cluster. Following this, it will move each centroid into the middle of all the vectors of that particular cluster. This changes, of course, the cluster assignment: some posts are now nearer to another centroid. So it will update the assignments for those posts. This is repeated as long as the centroids move considerably. After some iterations, the movements will fall below a threshold and we consider the clustering to have converged.
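The loop just described can be sketched in plain Python as follows. This is a simplified illustration, not SciKit's implementation: the names `kmeans`, `mean`, `tol`, and `max_iter` are assumptions, the first num_clusters posts serve as starting centroids to keep the run deterministic, and convergence is tested with a plain absolute threshold on the largest centroid movement.

```python
import math


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(xs) / len(vectors) for xs in zip(*vectors))


def kmeans(posts, num_clusters, tol=1e-4, max_iter=100):
    # Start from the first num_clusters posts (picking them at random
    # is the more usual choice; this keeps the sketch deterministic).
    centroids = [tuple(p) for p in posts[:num_clusters]]
    for _ in range(max_iter):
        # Assign every post to its nearest centroid.
        labels = [
            min(range(num_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in posts
        ]
        # Move each centroid into the middle of its assigned posts.
        new_centroids = []
        for k in range(num_clusters):
            members = [p for p, label in zip(posts, labels) if label == k]
            # An empty cluster keeps its old centroid.
            new_centroids.append(mean(members) if members else centroids[k])
        # Stop once no centroid moved by more than the threshold.
        shift = max(math.dist(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels


posts = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, labels = kmeans(posts, num_clusters=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

On this well-separated toy data the assignments are already stable after the first pass, so the second pass detects zero movement and stops.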

Let's play this through with a toy example of posts containing only two words. Each point in the following chart represents one document:

After running one iteration of K-means, that is, taking any two vectors as starting points, assigning the labels to the rest and updating the cluster centers to now be the center point of all points in that cluster, we get the following clustering:

Because the cluster centers moved, we have to reassign the cluster labels and recalculate the cluster centers. After iteration 2, we get the following clustering:

The arrows show the movements of the cluster centers. After five iterations in this example, the cluster centers don't move noticeably any more (SciKit's tolerance threshold is 0.0001 by default).

After the clustering has settled, we just need to note down the cluster centers and their identity. We then have to vectorize each new document that comes in and compare it against all cluster centers. The cluster center with the smallest distance to our new post vector determines the cluster we assign to the new post.
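Assigning a new post is then a nearest-neighbor lookup among the stored centers. A minimal sketch, in which the function `nearest_cluster`, the two centroids, and the new post vector are made-up values for illustration:

```python
import math


def nearest_cluster(post_vector, centroids):
    """Return the index of the centroid closest to post_vector."""
    return min(
        range(len(centroids)),
        key=lambda k: math.dist(post_vector, centroids[k]),
    )


# Hypothetical centroids noted down after clustering two-word posts.
centroids = [(1.2, 1.2), (8.7, 8.5)]

# A new post, already vectorized with the same vocabulary as before.
new_post_vector = (7.5, 9.0)
print(nearest_cluster(new_post_vector, centroids))  # 1
```

Note that the new post must be vectorized with exactly the same vocabulary and weighting as the posts used for clustering; otherwise the distances are meaningless.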
