
Chapter 7 A Word Embedding Framework for Summarization

7.3 Probabilistic Word Embeddings

7.3.2 Probabilistic Skip-gram (PSG) Model

Figure 7.1 A running toy example for learning disparate distributional representations of a specific word w5, where the training corpus contains three documents, the vocabulary size is 9 (i.e., words w1, …, w9), and the context window size is 1 (i.e., c = 1).

In contrast to the PBOW model, which learns the representation of each word wi by estimating the probability distributions of its context words that collectively generate wi, an alternative approach is to obtain an appropriate word representation by considering the predictive ability of a given word occurring at an arbitrary position of the training corpus (denoted by wt) to predict its surrounding context words. To realize this idea, we define the objective function of such a model as

$$
\mathcal{L}_{\mathrm{PSG}}
= \prod_{t=1}^{T} \prod_{\substack{-c \le j \le c \\ j \ne 0}} P(w_{t+j} \mid w_t)
= \prod_{t=1}^{T} \prod_{\substack{-c \le j \le c \\ j \ne 0}}
\frac{\sum_{k=1}^{K} M_{w_t k}\, W_{w_{t+j} k}}
     {\sum_{v=1}^{V} \sum_{k=1}^{K} M_{w_t k}\, W_{v k}},
$$

where $T$ denotes the length of the training corpus, $c$ the context window size, $K$ the number of latent dimensions, and $V$ the vocabulary size.

Again, since we assume that each column of W is a multinomial distribution, the terms in the denominator sum to one, and the denominator can thus be omitted.

This model is similar in spirit to SG and can be regarded as its probabilistic counterpart; we will therefore term the resulting model the probabilistic skip-gram (PSG) model hereafter. In a similar vein to PBOW, the component distributions of PSG can also be estimated with the EM algorithm. A running example for the two proposed models is schematically depicted in Figure 7.1.
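To make the PSG formulation concrete, the following is a minimal sketch under assumed conventions (a V-by-K matrix W whose columns are multinomial distributions over the vocabulary, and rows of M that are distributions over K latent factors) of how the predictive probability P(w_{t+j} | w_t) and the corresponding corpus log-likelihood could be evaluated on a toy corpus; it illustrates only the objective being maximized, not the EM estimation itself.

```python
import numpy as np

# Minimal, self-contained sketch of the PSG predictive probability and the
# corpus log-likelihood it maximizes.  The matrix shapes and the tiny corpus
# below are illustrative assumptions, not the setup used in this thesis.

rng = np.random.default_rng(0)

V, K, c = 9, 4, 1            # vocabulary size, latent dimensions, window size
corpus = [0, 3, 4, 5, 4, 2]  # a toy corpus of word indices (w1..w9 -> 0..8)

# M[w] is the representation of word w: a distribution over K latent factors.
M = rng.random((V, K))
M /= M.sum(axis=1, keepdims=True)

# Each column of W is a multinomial distribution over the V vocabulary words.
W = rng.random((V, K))
W /= W.sum(axis=0, keepdims=True)

def psg_prob(center, context):
    """P(context | center) = sum_k M[center, k] * W[context, k]."""
    return float(M[center] @ W[context])

def psg_log_likelihood(corpus, c):
    """Log of the PSG objective: sum of log P(w_{t+j} | w_t) over the corpus."""
    ll = 0.0
    for t, center in enumerate(corpus):
        for j in range(-c, c + 1):
            if j == 0 or not (0 <= t + j < len(corpus)):
                continue
            ll += np.log(psg_prob(center, corpus[t + j]))
    return ll

print(psg_log_likelihood(corpus, c))
```

Because each column of W sums to one (and each row of M is itself a distribution), the probabilities produced by psg_prob already sum to one over the vocabulary, which is the property that allows the denominator to be omitted in the objective above.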

7.3.3 Analytic Comparisons

CBOW, SG, GloVe, PBOW, and PSG can be analyzed from several critical perspectives.

First, the training objectives of all these models aim at maximizing the collection likelihood, but their respective update formulations differ. The model parameters of CBOW, SG and GloVe are updated by variants of the stochastic gradient descent (SGD) algorithm [55][142], while those of PBOW and PSG are estimated by the expectation-maximization (EM) algorithm. It is worth noting that GloVe is closely related to the classic weighted matrix factorization approach; the major difference is that the former concentrates on decomposing the word-by-word co-occurrence matrix, whereas the latter is concerned with decomposing the word-by-document matrix [32][65]. Second, since the parameters (word representations) of CBOW and SG are trained sequentially (i.e., the so-called on-line learning strategy), the order of the training corpus may dramatically affect the resulting models. In contrast, GloVe, PBOW and PSG first accumulate statistics over the entire training corpus; the corresponding model parameters are then updated at once based on these accumulated statistics (i.e., the so-called batch-mode learning strategy). Finally, because in our models (PBOW and PSG) each row of the matrix W is designated as a multinomial distribution, a by-product is that the columns of W can collectively be thought of as forming a latent semantic space whose meaning can be explained by the component multinomial distributions. Therefore, the word representations learned by our proposed models (i.e., M) can readily be understood and interpreted by referring to the matrix W. More formally, the word vectors learned by PBOW and PSG are distributional representations, whereas CBOW, SG and GloVe represent each word by a distributed representation. To the best of our knowledge, this is the first study to offer such an interpretation when learning word representations.
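As an illustration of this interpretability argument, the sketch below uses random stand-in matrices rather than an actually trained PBOW/PSG model: each latent dimension is "named" by the most probable words of its component multinomial distribution in W, and a word's representation in M is then read as a weighting over those named dimensions. The helper functions and the toy vocabulary are hypothetical.

```python
import numpy as np

# Sketch of the interpretability claim: component multinomial distributions in W
# label the latent dimensions, and a row of M weights those labeled dimensions.
# M, W and vocab are assumed to come from a trained PBOW/PSG model; here they
# are random stand-ins for illustration only.

rng = np.random.default_rng(1)
vocab = [f"w{i}" for i in range(1, 10)]          # w1..w9, as in Figure 7.1
V, K = len(vocab), 4
M = rng.random((V, K)); M /= M.sum(axis=1, keepdims=True)
W = rng.random((V, K)); W /= W.sum(axis=0, keepdims=True)

def describe_dimension(k, top_n=3):
    """Return the top-n vocabulary words of the k-th component distribution."""
    top = np.argsort(W[:, k])[::-1][:top_n]
    return [vocab[v] for v in top]

def interpret_word(w_idx, top_dims=2):
    """Explain a word's representation by its most heavily weighted dimensions."""
    dims = np.argsort(M[w_idx])[::-1][:top_dims]
    return {int(k): describe_dimension(k) for k in dims}

print(interpret_word(vocab.index("w5")))
```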

7.3.4 Experimental Results

We now turn to investigate the utility of three state-of-the-art word embedding methods (i.e., CBOW, SG and GloVe) and our proposed methods (i.e., PBOW and PSG), each working in conjunction with the cosine similarity measure for speech summarization. The corresponding results are shown in Table 7.4, where HSM denotes the condition in which the model parameters were obtained with the hierarchical softmax algorithm, while NS denotes learning with the negative sampling algorithm.
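For orientation, the following is a hedged sketch of how learned word embeddings might be combined with the cosine similarity measure to rank sentences for extractive summarization. Representing a sentence or document by the average of its word vectors is an illustrative choice here, not necessarily the sentence representation used in this work, and the helper functions are hypothetical.

```python
import numpy as np

# Hedged sketch of extractive ranking with word embeddings and cosine similarity.
# `embeddings` maps each word to a learned vector; sentences are scored by the
# cosine similarity between their averaged word vectors and the document vector.

def avg_vector(words, embeddings, dim):
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def rank_sentences(sentences, embeddings, dim):
    """Return sentence indices sorted by similarity to the whole document."""
    doc_vec = avg_vector([w for s in sentences for w in s], embeddings, dim)
    scores = [cosine(avg_vector(s, embeddings, dim), doc_vec) for s in sentences]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

# Toy usage with random stand-in embeddings for the Figure 7.1 vocabulary.
rng = np.random.default_rng(2)
emb = {f"w{i}": rng.random(4) for i in range(1, 10)}
doc = [["w1", "w5", "w3"], ["w5", "w7"], ["w2", "w8", "w9"]]
print(rank_sentences(doc, emb, dim=4))
```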

Several observations can be made from the experimental results. First, all three state-of-the-art word embedding methods, though based on disparate model structures and learning strategies, achieve results competitive with one another for both the TD and SD cases. Although these methods outperform the conventional VSM model, they achieve almost the same level of performance as LSA and MMR, which are considered two enhanced versions of VSM (cf. Table 3.5). It should be noted that the proposed methods not only outperform CBOW, SG and GloVe, but are also better than LSA and MMR for most of the TD and SD cases (cf. Table 3.5).

Method      |   Text Documents (TD)     |  Spoken Documents (SD)
            | ROUGE-1  ROUGE-2  ROUGE-L | ROUGE-1  ROUGE-2  ROUGE-L
GloVe       |  0.366    0.244    0.310  |  0.363    0.214    0.310
CBOW (HSM)  |  0.360    0.199    0.294  |  0.357    0.185    0.293
CBOW (NS)   |  0.359    0.200    0.293  |  0.363    0.193    0.300
SG (HSM)    |  0.370    0.209    0.305  |  0.346    0.180    0.283
SG (NS)     |  0.366    0.211    0.306  |  0.345    0.179    0.282
PBOW        |  0.397    0.283    0.346  |  0.376    0.233    0.326
PSG         |  0.403    0.281    0.351  |  0.380    0.234    0.330

Table 7.4 Summarization results achieved by various word-embedding methods in conjunction with the cosine similarity measure.

Method      |   Text Documents (TD)     |  Spoken Documents (SD)
            | ROUGE-1  ROUGE-2  ROUGE-L | ROUGE-1  ROUGE-2  ROUGE-L
GloVe       |  0.422    0.309    0.372  |  0.380    0.239    0.332
CBOW (HSM)  |  0.472    0.364    0.417  |  0.372    0.226    0.316
CBOW (NS)   |  0.456    0.342    0.398  |  0.385    0.237    0.333
SG (HSM)    |  0.436    0.323    0.385  |  0.372    0.223    0.323
SG (NS)     |  0.436    0.320    0.385  |  0.371    0.225    0.322
PBOW        |  0.437    0.331    0.387  |  0.386    0.241    0.332
PSG         |  0.434    0.333    0.389  |  0.375    0.244    0.331

Table 7.5 Summarization results achieved by various word-embedding methods in conjunction with the document likelihood measure.

In the next set of experiments, we evaluate the various word embedding methods paired with the document likelihood measure for extractive speech summarization. The deduced sentence-based language models were combined, respectively, with ULM for computing document likelihoods [202]; the corresponding results are shown in Table 7.5. Compared to the results of these word embedding methods paired with the cosine similarity measure (cf. Table 7.4), it is evident that the document likelihood measure seems to be a preferable vehicle for leveraging word-embedding methods in speech summarization. Looking into the detailed results of Table 7.5, we notice two particularities. On one hand, CBOW seems to perform better than the others in the TD case, whereas this superiority does not seem to carry over to the SD case. On the other hand, if we compare the results with those of the other state-of-the-art summarization methods (cf. Table 3.5), the word embedding methods with the document likelihood measure still outperform them by a margin for most of the TD and SD cases.
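To illustrate the document likelihood measure in skeletal form, the sketch below scores each candidate sentence by the likelihood that its sentence-level language model, linearly interpolated with a background unigram language model (ULM), assigns to the whole document. How the sentence model is actually derived from the word embeddings in this work, as well as the interpolation weight, are assumptions for illustration only.

```python
import math
from collections import Counter

# Hedged sketch of the document likelihood measure: each candidate sentence
# induces a sentence-level language model, which is interpolated with a
# background unigram language model (ULM) before scoring the document.
# Using the sentence's own maximum-likelihood unigram model is a stand-in for
# the embedding-derived sentence model; `lam` is an illustrative weight.

def unigram_lm(words):
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def doc_log_likelihood(doc_words, sent_lm, ulm, lam=0.7, floor=1e-12):
    """log P(D | S) with P(w | S) interpolated against the background ULM."""
    ll = 0.0
    for w in doc_words:
        p = lam * sent_lm.get(w, 0.0) + (1.0 - lam) * ulm.get(w, 0.0)
        ll += math.log(max(p, floor))
    return ll

# Toy usage: rank the sentences of a small document by the likelihood each
# sentence model assigns to the document as a whole.
sentences = [["w1", "w5", "w3"], ["w5", "w7"], ["w2", "w8", "w9"]]
doc_words = [w for s in sentences for w in s]
ulm = unigram_lm(doc_words)
scores = [doc_log_likelihood(doc_words, unigram_lm(s), ulm) for s in sentences]
print(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True))
```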
