Comparing documents by topics

Topics can be useful on their own to build the sort of small vignettes with words that are shown in the previous screenshot. These visualizations can be used to navigate a large collection of documents. For example, a website can display the different topics as different word clouds, allowing a user to click through to the documents. In fact, they have been used in just this way to analyze large collections of documents.

However, topics are often just an intermediate tool to another end. Now that we have an estimate for each document of how much of that document comes from each topic, we can compare the documents in topic space. This simply means that instead of comparing word to word, we say that two documents are similar if they talk about the same topics.

This can be very powerful as two text documents that share few words may actually refer to the same topic! They may just refer to it using different constructions (for example, one document may read "the President of the United States" while the other will use the name "Barack Obama").
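To make the contrast concrete, here is a minimal sketch in plain Python. The two word lists and their topic weights are made-up values for illustration (they are not output from a fitted model), and `jaccard` and `euclidean` are hypothetical helper names.

```python
import math

# Two toy documents that share no words (illustrative, not from the corpus).
doc_a = ["president", "united", "states", "signed", "bill"]
doc_b = ["barack", "obama", "approved", "legislation"]

# Hypothetical topic weights for three topics: (politics, sports, medicine).
# In practice these would come from a fitted topic model.
topics_a = [0.9, 0.05, 0.05]
topics_b = [0.85, 0.05, 0.10]

def jaccard(words1, words2):
    """Fraction of shared words between two documents (word-space similarity)."""
    s1, s2 = set(words1), set(words2)
    return len(s1 & s2) / len(s1 | s2)

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

word_overlap = jaccard(doc_a, doc_b)            # the documents share no words
topic_distance = euclidean(topics_a, topics_b)  # yet their topic vectors are close
print(word_overlap, topic_distance)
```

In word space the two documents look unrelated, while in topic space they sit close together because both are dominated by the same topic.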

Topic models are good on their own to build visualizations and explore data. They are also very useful as an intermediate step in many other tasks.

At this point, we can redo the exercise we performed in the last chapter and look for the most similar post to an input query, by using the topics to define similarity.

Whereas earlier we compared two documents by comparing their word vectors directly, we can now compare two documents by comparing their topic vectors.

For this, we are going to project the documents to the topic space. That is, we want to have a vector of topics that summarizes the document. How to perform these types of dimensionality reduction in general is an important task in itself, and we have a chapter entirely devoted to it. For the moment, we just show how topic models can be used for exactly this purpose; once topics have been computed for each document, we can perform operations on its topic vector and forget about the original words. If the topics are meaningful, they will potentially be more informative than the raw words. Additionally, this may bring computational advantages, as it is much faster to compare vectors of 100 topic weights than vectors the size of the vocabulary (which will contain thousands of terms).

Using gensim, we have seen earlier how to compute the topics corresponding to all the documents in the corpus. We will now compute these for all the documents, store them in a NumPy array, and compute all pairwise distances:

>>> from gensim import matutils
>>> # corpus2dense returns one column per document; transpose so that each row is a document
>>> topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics).T
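As a rough illustration of what such a dense matrix holds, the sketch below builds one by hand in plain Python. It relies on the fact that gensim reports each document's topics as a list of (topic_id, weight) pairs; the toy values and the helper name `to_dense` are made up for this example. The result is laid out with one row per document and one column per topic.

```python
def to_dense(sparse_docs, num_topics):
    """Turn per-document lists of (topic_id, weight) pairs into dense rows.

    Returns a list with one row per document and one column per topic;
    topics that a document does not mention get weight 0.0.
    """
    dense = []
    for doc in sparse_docs:
        row = [0.0] * num_topics
        for topic_id, weight in doc:
            row[topic_id] = weight
        dense.append(row)
    return dense

# Toy model output for three documents and four topics (illustrative values).
sparse_docs = [
    [(0, 0.7), (2, 0.3)],
    [(1, 1.0)],
    [(0, 0.2), (3, 0.8)],
]
dense = to_dense(sparse_docs, num_topics=4)
for row in dense:
    print(row)
```

Most entries are zero, because a typical document only touches a handful of topics.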

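Now, topics is a matrix of topic weights, with one topic vector per document. We can use the pdist function in SciPy to compute all pairwise distances. That is, with a single function call, we compute all the values of sqrt(sum((topics[ti] - topics[tj])**2)), the Euclidean distance between each pair of topic vectors: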

>>> from scipy.spatial import distance

>>> pairwise = distance.squareform(distance.pdist(topics))
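For intuition, the nested loops below spell out the Euclidean distances that pdist and squareform compute in a single call. This is only a plain-Python sketch on a toy matrix with made-up weights; `pairwise_distances` is a hypothetical helper name, and the SciPy version is the one to use on real data.

```python
import math

def pairwise_distances(rows):
    """Return a square matrix of Euclidean distances between all rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
            dist[i][j] = d  # the matrix is symmetric,
            dist[j][i] = d  # so fill both halves
    return dist

# Toy topic matrix: three documents, two topics (illustrative values).
toy_topics = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
]
toy_pairwise = pairwise_distances(toy_topics)
for row in toy_pairwise:
    print(row)
```

The diagonal is zero because every document is at distance zero from itself, which is why the next step is needed.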

Now, we will employ one last little trick; we will set the diagonal elements of the distance matrix to a high value (it just needs to be larger than the other values in the matrix):

>>> largest = pairwise.max()
>>> for ti in range(len(topics)):
...     pairwise[ti,ti] = largest+1

And we are done! For each document, we can look up the closest element easily (this is a form of nearest neighbor search):

>>> def closest_to(doc_id):
...     return pairwise[doc_id].argmin()

Note that this would not work if we had not set the diagonal elements to a large value: the function would always return the query document itself, as each document is most similar to itself (except in the rare case where two documents have exactly the same topic distribution, which almost never happens unless the documents are identical).
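The same idea can be sketched without touching the diagonal, by skipping the query index while searching. This is an alternative shown only for illustration, in plain Python on a made-up distance matrix; `closest_to_skipping_self` is a hypothetical name, not a function from the text.

```python
def closest_to_skipping_self(dist, doc_id):
    """Index of the document nearest to doc_id, never returning doc_id itself."""
    candidates = [j for j in range(len(dist)) if j != doc_id]
    return min(candidates, key=lambda j: dist[doc_id][j])

# Toy symmetric distance matrix for four documents (illustrative values).
toy_dist = [
    [0.0, 0.9, 0.2, 0.7],
    [0.9, 0.0, 0.8, 0.1],
    [0.2, 0.8, 0.0, 0.6],
    [0.7, 0.1, 0.6, 0.0],
]

# A plain minimum over the row finds the zero on the diagonal: the query itself.
naive = min(range(len(toy_dist)), key=lambda j: toy_dist[0][j])

# Skipping the query index gives the real nearest neighbor of each document.
neighbors = [closest_to_skipping_self(toy_dist, i) for i in range(len(toy_dist))]
print(naive, neighbors)
```

Overwriting the diagonal once, as in the text, is the cheaper choice when many lookups follow; skipping the index leaves the distance matrix unchanged.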

For example, here is one possible query document (it is the second document in our collection):

From: [email protected] (Gordon Banks)

Subject: Re: request for information on "essential tremor" and Indrol?

In article <[email protected]> [email protected] writes:

Essential tremor is a progressive hereditary tremor that gets worse

when the patient tries to use the effected member. All limbs, vocal

cords, and head can be involved. Inderal is a beta-blocker and is usually effective in diminishing the tremor. Alcohol and mysoline

are also effective, but alcohol is too toxic to use as a treatment.

[email protected] | it is shameful to surrender it too soon."

---

If we ask for the most similar document by calling closest_to(1), we receive the following document as a result:

From: [email protected] (Gordon Banks)

Subject: Re: High Prolactin

In article <[email protected]> [email protected] (John E. Rodway) writes:

>Any comments on the use of the drug Parlodel for high prolactin in the blood?

It can suppress secretion of prolactin. Is useful in cases of galactorrhea.

Some adenomas of the pituitary secret too much.

---

Gordon Banks N3JXP | "Skepticism is the chastity of the intellect, and

[email protected] | it is shameful to surrender it too soon."

The system returns a post by the same author discussing medications.

Modeling the whole of Wikipedia

While the initial LDA implementations were slow, which limited their use to small document collections, modern algorithms work well with very large collections of data. Following the documentation of gensim, we are going to build a topic model for the whole of the English-language Wikipedia. This takes hours, but can be done even with just a laptop! With a cluster of machines, we can make it go much faster, but we will look at that sort of processing environment in a later chapter.

First, we download the whole Wikipedia dump from http://dumps.wikimedia.org. This is a large file (currently over 10 GB), so it may take a while, unless your Internet connection is very fast. Then, we will index it with a gensim tool:

python -m gensim.scripts.make_wiki \
    enwiki-latest-pages-articles.xml.bz2 wiki_en_output

Run the previous line on the command shell, not on the Python shell. After a few hours, the index will be saved in the same directory. At this point, we can build the final topic model. This process looks exactly like what we did for the small AP dataset. We first import a few packages:

>>> import logging, gensim

Now, we set up logging, using the standard Python logging module (which gensim uses to print out status messages). This step is not strictly necessary, but it is nice to have a little more output to know what is happening:

>>> logging.basicConfig(
...     format='%(asctime)s : %(levelname)s : %(message)s',
...     level=logging.INFO)

Now we load the preprocessed data:

>>> id2word = gensim.corpora.Dictionary.load_from_text(
...     'wiki_en_output_wordids.txt')
>>> mm = gensim.corpora.MmCorpus('wiki_en_output_tfidf.mm')

Finally, we build the LDA model as we did earlier:

>>> model = gensim.models.ldamodel.LdaModel(

corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)

This will again take a couple of hours. You will see the progress on your console, which can give you an indication of how long you still have to wait.

Once it is done, we can save the topic model to a file, so we don't have to redo it:

>>> model.save('wiki_lda.pkl')

If you exit your session and come back later, you can load the model again using the following command (after the appropriate imports, naturally):

>>> model = gensim.models.ldamodel.LdaModel.load('wiki_lda.pkl')

The model object can be used to explore the collection of documents, and build the topics matrix as we did earlier.

We can see that this is still a sparse model, even though we have many more documents than we had earlier (over 4 million as we are writing this):

>>> import numpy as np
>>> lens = (topics > 0).sum(axis=1)  # topics has one row per document
>>> print(np.mean(lens))
6.41
>>> print(np.mean(lens <= 10))
0.941

So, the average document mentions 6.4 topics and 94 percent of them mention 10 or fewer topics.
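To see what these two statistics measure, here is the same computation in plain Python on a tiny made-up matrix with one row per document. The numbers are illustrative only and unrelated to the Wikipedia results; the threshold of 2 stands in for the 10 used above.

```python
# Toy topic matrix: one row per document, one column per topic (made-up values).
toy_topics = [
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.3, 0.3, 0.2, 0.2],
    [0.0, 0.5, 0.5, 0.0],
]

# Number of topics with non-zero weight in each document.
lens = [sum(1 for w in row if w > 0) for row in toy_topics]

# Average number of topics mentioned per document.
mean_topics = sum(lens) / len(lens)

# Fraction of documents that mention at most 2 topics.
frac_small = sum(1 for n in lens if n <= 2) / len(lens)

print(lens, mean_topics, frac_small)
```

A low average relative to the total number of topics is what makes the model sparse: each document is described by only a few topics.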

We can ask what the most talked about topic in Wikipedia is. We will first compute the total weight for each topic (by summing up the weights from all the documents) and then retrieve the words corresponding to the most highly weighted topic. This is performed using the following code:

>>> weights = topics.sum(axis=0)

>>> words = model.show_topic(weights.argmax(), 64)
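The column sum and the lookup of the extreme topics can be sketched in plain Python as follows. The matrix is made up (one row per document, one column per topic), and `argmax` and `argmin` are small hypothetical helpers that mirror the NumPy methods of the same name.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def argmin(values):
    """Index of the smallest value (the first one wins on ties)."""
    return min(range(len(values)), key=lambda i: values[i])

# Toy topic matrix: three documents, three topics (illustrative values).
toy_topics = [
    [0.6, 0.4, 0.0],
    [0.1, 0.9, 0.0],
    [0.5, 0.3, 0.2],
]

# Sum each column: the total weight of each topic across all documents.
weights = [sum(column) for column in zip(*toy_topics)]

most_talked = argmax(weights)   # topic with the largest total weight
least_talked = argmin(weights)  # topic with the smallest total weight
print(weights, most_talked, least_talked)
```

The resulting index is what gets passed to model.show_topic to retrieve the words of that topic.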

Using the same tools as we did earlier to build up a visualization, we can see that the most talked about topic is related to music and is a very coherent topic. A full 18 percent of Wikipedia pages are partially related to this topic (5.5 percent of all the words in Wikipedia are assigned to this topic). Take a look at the following screenshot:

These plots and numbers were obtained when the book was being written. As Wikipedia keeps changing, your results will be different. We expect that the trends will be similar, but the details may vary.

Alternatively, we can look at the least talked about topic:

>>> words = model.show_topic(weights.argmin(), 64)

The least talked about topic is harder to interpret, but many of its top words refer to airports in eastern countries. Just 1.6 percent of documents touch upon it, and it represents just 0.1 percent of the words.
