

Chapter 5 An I-vector based Language Modeling Framework for Retrieval

5.1 I-vector based Language Modeling

The i-vector framework [64][133] is a simplified variant of the joint factor analysis (JFA) approach [89][90], and both are well-known approaches for LID and SR. Their major contribution is to provide an elegant way to convert the cepstral coefficient vector sequence of a variable-length utterance into a low-dimensional vector representation. To do so, first, a Gaussian mixture model is used to collect the Baum-Welch statistics from the utterance. Then, the first-order statistics from each mixture component are concatenated to form a high-dimensional "supervector" S, which is assumed to obey an affine linear model [89][90][185]:

S = m + T·φ_S    (5.1)

where T is a total variability matrix, φ_S is an utterance-specific latent variable, and m denotes a global statistics vector. In detail, the column vectors of T form a set of bases spanning a subspace covering the important variability, e.g., the language-specific evidence for LID or the speaker-specific evidence for SR, and the utterance-specific variable φ_S indicates the combination of the variability of the utterance. In this way, a variable-length utterance is represented by a low-dimensional vector φ_S. Finally, the low-dimensional vector is passed to well-developed post-processing techniques, such as PLDA, for LID and SR. Since the i-vector framework can be trained in an unsupervised manner, whereas JFA must be trained with manual annotation information, the former has recently become one of the state-of-the-art approaches for LID and SR. In this chapter, we investigate the same idea in the context of spoken document retrieval.
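To make the affine model in Eq. (5.1) concrete, the following minimal sketch generates a synthetic supervector and recovers a low-dimensional representation by ordinary least squares. The dimensions and random data are purely illustrative assumptions, and a real i-vector extractor would additionally weight the statistics by per-component posteriors, which is omitted here.

```python
# A minimal, illustrative sketch of the affine model S = m + T * phi of Eq. (5.1).
# Dimensions and data are synthetic; this is not a full i-vector extractor.
import numpy as np

rng = np.random.default_rng(0)
sv_dim, iv_dim = 1024, 40                        # supervector / latent dimensions (hypothetical)
T = rng.standard_normal((sv_dim, iv_dim))        # total variability matrix
m = rng.standard_normal(sv_dim)                  # global statistics vector

S = m + T @ rng.standard_normal(iv_dim)          # a synthetic supervector
phi, *_ = np.linalg.lstsq(T, S - m, rcond=None)  # low-dimensional utterance representation
print(phi.shape)                                 # (40,)
```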

Specifically, each document D is first represented by a high-dimensional feature vector ν_D ∈ ℝ^β. All of the representative (e.g., lexical-, semantic-, and structure-specific) statistics are encoded in this β-dimensional vector, which obeys an affine linear model:

ν_D = m + T·φ_D    (5.2)

where T ∈ ℝ^(β×γ) is a total variability matrix, γ is a desired value (γ ≪ β), and m ∈ ℝ^β denotes a global statistics vector. Similarly, the column vectors of T span a subspace covering the important characteristics of documents. Moreover, each document has a document-specific variable φ_D ∈ ℝ^γ, which indicates the combination of the variability of the document. Based on this methodology, a disengaged version is to characterize the representative information of a document by words only. Consequently, each element of the β-dimensional vector corresponds to a distinct word, and the probability of a word w occurring in a document D can be defined as a log-linear function:

P(w|D) = exp(T_w·φ_D + m_w) / Σ_{w′∈V} exp(T_{w′}·φ_D + m_{w′})    (5.3)

where T_w denotes the row vector of T corresponding to word w, m_w is the statistics value of m corresponding to word w, and V denotes the vocabulary inventory of the language. We name this model the i-vector based language model (IVLM).
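As a quick illustration, the sketch below evaluates Eq. (5.3) for one document, computing P(w|D) as a softmax over T·φ_D + m. The vocabulary size, latent dimension, and random parameters are assumptions for illustration only.

```python
# Minimal sketch of Eq. (5.3): P(w|D) as a log-linear (softmax) function of T, phi_D and m.
import numpy as np

def ivlm_word_probs(T, m, phi_d):
    """Return the |V|-dimensional distribution P(.|D) for one document."""
    logits = T @ phi_d + m                   # one score per vocabulary word
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
V, gamma = 5000, 64                          # |V| and latent dimension (hypothetical)
T = 0.01 * rng.standard_normal((V, gamma))
m = np.zeros(V)
phi_d = rng.standard_normal(gamma)
print(ivlm_word_probs(T, m, phi_d).sum())    # ~1.0
```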

Based on Eqs. (5.2) and (5.3), the model parameters (i.e., T, φ_D, and m) of the proposed IVLM can be estimated by maximizing the total log-likelihood L over all training documents:

L = Σ_D Σ_{w∈V} c(w,D)·log P(w|D)    (5.4)

where c(w,D) denotes the number of times the word w occurs in document D. Since estimating all the parameters jointly is intractable, we estimate them through an iterative process, i.e., we estimate T and m with φ_D fixed, and then estimate φ_D with T and m fixed, by taking gradient steps on Eq. (5.4):

φ_D ← φ_D + η·∂L/∂φ_D,   T ← T + η·∂L/∂T,   m ← m + η·∂L/∂m

where η denotes the step size, which can be set empirically or by calculating the Hessian matrix [129][185].
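The sketch below illustrates the alternating gradient-ascent estimation described above for the objective in Eq. (5.4). For a softmax model, the gradient of the log-likelihood with respect to the logits reduces to c(w,D) − n_D·P(w|D), where n_D is the document length; the step size eta, the number of iterations, and the initialization are illustrative assumptions rather than the settings used in this thesis.

```python
# A minimal sketch of the alternating gradient-ascent estimation of the IVLM parameters.
import numpy as np

def _residual(counts, T, m, Phi, doc_len):
    """c(w,D) - n_D * P(w|D): the gradient of the log-likelihood w.r.t. the logits."""
    logits = Phi @ T.T + m
    logits -= logits.max(1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(1, keepdims=True)
    return counts - doc_len * P

def train_ivlm(counts, gamma=32, eta=0.1, iters=50, seed=0):
    """counts: (num_docs, |V|) array holding c(w, D).  Returns T, m and Phi."""
    rng = np.random.default_rng(seed)
    n_docs, vocab = counts.shape
    T = 0.01 * rng.standard_normal((vocab, gamma))
    m = np.log((counts.sum(0) + 1.0) / (counts.sum() + vocab))  # unigram-style initialization
    Phi = np.zeros((n_docs, gamma))
    doc_len = counts.sum(1, keepdims=True)                      # n_D

    for _ in range(iters):
        # estimate T and m with phi_D held fixed
        err = _residual(counts, T, m, Phi, doc_len)
        T += eta * (err.T @ Phi) / n_docs
        m += eta * err.mean(0)
        # estimate phi_D with T and m held fixed
        err = _residual(counts, T, m, Phi, doc_len)
        Phi += eta * (err @ T)
    return T, m, Phi
```

The `counts` matrix is simply the document-term count table of the training collection, and convergence can be monitored by evaluating the log-likelihood of Eq. (5.4) after each iteration.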

In the retrieval phase, each document D has its own IVLM, consisting of the document-specific variable φ_D together with the shared T and m. As such, the probability of word w occurring in document D computed by the IVLM in Eq. (5.3) can be linearly combined with, or used to replace, P(w|D) in the query-likelihood measure to distinguish relevant documents from irrelevant ones.
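As a sketch of how this could be used at retrieval time, the function below interpolates the IVLM distribution with the maximum-likelihood unigram estimate inside a query-likelihood score. The interpolation weight alpha and the smoothing constant are illustrative assumptions, not the settings used in the experiments.

```python
# Minimal sketch: interpolating the IVLM probability of Eq. (5.3) with a
# maximum-likelihood unigram P(w|D) inside the query-likelihood measure.
import numpy as np

def score_document(query_ids, counts_d, p_ivlm_d, alpha=0.5):
    """Log query-likelihood of one document.

    query_ids : word indices of the query
    counts_d  : |V| count vector c(w, D) of the document
    p_ivlm_d  : |V| vector P_IVLM(w|D) from Eq. (5.3)
    """
    p_ml = counts_d / max(counts_d.sum(), 1)            # literal unigram estimate
    p_mix = alpha * p_ml + (1.0 - alpha) * p_ivlm_d     # linear combination
    return float(np.sum(np.log(p_mix[query_ids] + 1e-12)))
```

Documents are then ranked by this score; setting alpha to 0 corresponds to replacing P(w|D) by the IVLM probability, while intermediate values realize the linear combination mentioned above.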

The concept of the proposed IVLM is similar to that of LSA, RLSI, and PLSA, but there are clear differences among them. First, IVLM and PLSA are probabilistic models while LSA and RLSI are not. Second, IVLM not only has a different formulation from PLSA, but it also does not assume that the total variability is governed by some distribution. Since the parameters of IVLM are real numbers, rather than being restricted to positive real numbers as in PLSA, IVLM is more flexible and general than PLSA. Moreover, the parameters of IVLM can be estimated in parallel, while the parameters of PLSA have to be estimated in a batch mode. It is worth noting that IVLM is a special (disengaged) case of the proposed i-vector based language modeling framework for SDR; we will try to discover and incorporate more representative information in future work.

5.1.1 Experimental Results

First, we compare the use of inductive and transductive learning strategies [160] in IVLM. Inductive learning means that the models are trained from an external document collection; after training, T and m are used to fold in each document d of the collection to be retrieved so as to obtain the corresponding document-specific variable φ_d (a minimal sketch of this fold-in step follows this comparison). Transductive learning instead uses the document collection to be retrieved to train the models; after training, the φ_d of each document d is used directly in the retrieval phase. Table 5.1 reports the retrieval results of the proposed IVLM approach for both short and long queries with respect to the two learning strategies, using word- or subword-level index features. We use a set of Chinese news stories from the Xinhua News Agency as a contemporaneous external document set for inductive learning. It is generally believed that transductive learning should be better than inductive learning. However, as can be seen from Table 5.1, inductive learning achieves slightly better performance than transductive learning in most cases, the exception being word-level index features with short queries.

               Inductive              Transductive
IVLM           Word      Subword      Word      Subword
short          0.336     0.360        0.382     0.350
long           0.582     0.584        0.563     0.574

Table 5.1 Retrieval results (in MAP) of IVLM with word- and subword-level index features for short and long queries using inductive and transductive learning strategies.

Since the document collection to be retrieved (2,265 documents in total) is much smaller than the external collection (18,461 documents in total), transductive learning may suffer from the data sparseness problem while inductive learning can obtain more robust model parameters from a larger set of contemporaneous documents.
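For completeness, here is a minimal sketch of the fold-in step used under inductive learning: T and m, trained on the external collection, are held fixed, and only φ_d is estimated for each document in the collection to be retrieved. It reuses the hypothetical _residual helper from the training sketch above, and the step size and iteration count are again illustrative assumptions.

```python
# Minimal sketch of the inductive fold-in step: only phi_d is estimated,
# while T and m (trained on the external collection) are held fixed.
import numpy as np

def fold_in(counts_new, T, m, eta=0.1, iters=50):
    """counts_new: (num_docs, |V|) counts of the collection to be retrieved."""
    Phi = np.zeros((counts_new.shape[0], T.shape[1]))
    doc_len = counts_new.sum(1, keepdims=True)
    for _ in range(iters):
        err = _residual(counts_new, T, m, Phi, doc_len)
        Phi += eta * (err @ T)               # only phi_d is updated; T and m stay fixed
    return Phi
```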

Next, the proposed IVLM approach is compared with several well-known non-probabilistic and probabilistic approaches, namely VSM, LSA, SCI, and ULM, as well as topic models such as PLSA and LDA. To bypass the impact of the data sparseness problem, all approaches are trained by inductive learning. The results obtained with word- and subword-level index features are shown in Table 5.2. At first glance, it can be seen that the proposed IVLM framework outperforms all the non-probabilistic approaches (cf. VSM, LSA, and SCI) and the probabilistic approaches (cf. ULM, PLSA, and LDA) in most cases. The reason why it does not perform as well with subword-level index features for short queries is not clear and is worthy of further study.

          Word                 Subword
          short     long       short     long
VSM       0.273     0.484      0.257     0.499
LSA       0.296     0.364      0.384     0.527
SCI       0.270     0.413      0.270     0.349
ULM       0.321     0.563      0.329     0.570
PLSA      0.328     0.567      0.376     0.584
LDA       0.328     0.566      0.377     0.584
IVLM      0.336     0.582      0.360     0.584

Table 5.2 Retrieval results (in MAP) of different approaches with word- and subword-level index features for short and long queries.

The results indicate that the proposed IVLM approach is a novel alternative for SDR. In addition, it can also be seen that most IR approaches seem to benefit more from subword-level index features than from word-level index features, probably because subword-level index units can mitigate the impact of imperfect speech recognition results.

Moreover, two general observations can be made from the results. First, probabilistic approaches in general outperform non-probabilistic approaches. This indicates that probabilistic approaches are a family of simple but powerful methods for SDR, and that there is still room for improving the non-probabilistic approaches. It should also be noted that the frequency count of a word is weighted by the standard IDF method for the non-probabilistic approaches, while the probabilistic approaches (including IVLM) only take the raw frequency count of a word into account. Second, a topic modeling approach outperforms its non-topic-modeling counterpart (e.g., LSA vs. VSM, IVLM vs. ULM). This indicates that the relevance between a query and a document should not be estimated only on the basis of literal term matching; concept-level information is useful and should be considered in SDR.
