Example: Unsupervised Word Segmentation

A.3 Utility-Bias Tradeoff

A.3.4 Example: Unsupervised Word Segmentation

For any language of concern, let C denote the set of tokens or characters, i.e., the alphabet, and let W denote the set of words. Generally, every w ∈ W is a string over C. Suppose that we have observed a set of utterances {O1, O2, ..., OM} in this language, in which each utterance Ok is a string over C. For simplicity, we write O = O1O2...OM for the concatenation of all the utterances, assuming that consecutive utterances are separated by some delimiting token.

Suppose that O is a sequence of observations (C1, C2, ..., CN) drawn from a stochastic process C ∈ C. In unsupervised word segmentation, we seek to induce a sequence of words (W₁, W₂, ..., W_L), in which each W_j is a draw from a latent stochastic process W ∈ W. In this respect, unsupervised word segmentation is connected to the decoding problem discussed in Section A.3.3 and therefore has a corresponding utility-bias solution. In the following paragraphs, we motivate the definitions for a utility-bias construction for unsupervised word segmentation.

Hypothesis An ideal construction of the hypothesis space would cover all the possible ways to segment the observations O, but this strategy is infeasible to implement in practice. Instead, we propose defining the hypothesis space H in terms of ordered rulesets:

$$H = \{ h_R \mid R \text{ is an ordered ruleset} \}, \qquad \text{(A.9)}$$

in which each hypothesis h_R : C∗ → W∗ with respect to some ordered ruleset R is a function that reduces a sequence of tokens to a sequence of words by applying each translation rule in R in order.

An ordered ruleset R is a sequence of translation rules. Generally, every translation rule in R takes the form

w → c_1 c_2 c_3 ... c_n

for some n, where w ∈ W represents a word and each c_i ∈ C represents a token.

Applying this translation rule to a string in C∗ is equivalent to replacing every occurrence of c_1 c_2 ... c_n in the string with the corresponding word w.

In this light, a hypothesis h_R is well-defined since we require that the translation rules in R be applied in order. For completeness, any token c in the string that is not covered by any rule during the reduction is automatically reduced afterward by an implicit rule c → c, since c can also be a member of W.
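The reduction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(word, pattern)` pair encoding of a rule and the names `apply_rule` and `apply_ruleset` are hypothetical, and overlapping occurrences are resolved by a left-to-right, non-overlapping scan, which the text does not specify.

```python
from typing import List, Sequence, Tuple

# Assumed encoding: the rule "w -> c_1 c_2 ... c_n" is stored as (w, (c_1, ..., c_n)).
Rule = Tuple[str, Tuple[str, ...]]


def apply_rule(seq: Sequence[str], rule: Rule) -> List[str]:
    """Replace occurrences of the rule's pattern in seq with the rule's word.

    Assumption: occurrences are matched left to right without overlap.
    """
    word, pattern = rule
    n = len(pattern)
    out: List[str] = []
    i = 0
    while i < len(seq):
        if n > 0 and tuple(seq[i:i + n]) == pattern:
            out.append(word)
            i += n
        else:
            # Symbols not covered by the rule are kept unchanged, which
            # plays the role of the implicit rule c -> c.
            out.append(seq[i])
            i += 1
    return out


def apply_ruleset(seq: Sequence[str], rules: Sequence[Rule]) -> List[str]:
    """Apply every rule of an ordered ruleset, in order, to seq."""
    out = list(seq)
    for rule in rules:
        out = apply_rule(out, rule)
    return out


tokens = list("thethecat")
# Rule order matters: once "t h" has been rewritten, "t h e" no longer matches.
print(apply_ruleset(tokens, [("the", ("t", "h", "e")), ("cat", ("c", "a", "t"))]))
print(apply_ruleset(tokens, [("th", ("t", "h")), ("the", ("t", "h", "e"))]))
```

The two calls differ only in the rules and their order, which shows why the ordering requirement is needed for h_R to be a well-defined function.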

Utility We define the utility in terms of an estimate g for the lexical cohesion that we gain by generalizing O into h(O). In linguistics, cohesion is the connection that holds a piece of text together, one broader sense of which is the meaning carried by the text. When this connection is made in the lexical context, it refers to repetitive uses or collocation of a text. In this light, it seems reasonable to assume that highly cohesive subsequences are more likely to be true words.

The lexical cohesion is usually defined with respect to an entire word type w ∈ W instead of individual occurrences of the word type. In the word segmentation problem, we consider the following three kinds of estimates for lexical cohesion for any word type w:

1. Frequency, by which we define g(w) = n_w × (|w| − 1), where n_w denotes the number of occurrences of w and |w| denotes the number of tokens in w.

2. Pointwise mutual information, by which we define g(w) as the pointwise mutual information for w.

3. Branching entropy, by which we define g(w) as the branching entropy for w.
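As an illustration of the first estimate only, the frequency-based cohesion can be computed as sketched below. The representation of a word as a tuple of tokens (so that |w| is simply its length) and the function name `frequency_cohesion` are assumptions made for this sketch; the other two estimates are not covered here because their exact definitions are not given in this section.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Word = Tuple[str, ...]  # assumed representation: a word is the tuple of its tokens


def frequency_cohesion(segmented: Sequence[Word]) -> Dict[Word, int]:
    """Return g(w) = n_w * (|w| - 1) for every word type w in the segmented sequence."""
    counts = Counter(segmented)
    return {w: n * (len(w) - 1) for w, n in counts.items()}


segmented = [("t", "h", "e"), ("c", "a", "t"), ("t", "h", "e"), ("a",)]
g = frequency_cohesion(segmented)
print(g[("t", "h", "e")])  # 2 occurrences * (3 - 1) tokens = 4
print(g[("a",)])           # single-token words contribute 0
```

Under this definition, single-token words always score zero, so only rules that actually merge tokens can add cohesion.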

To motivate the definition of the utility using some lexical cohesion g, it suffices to consider only the unique word types in the translated sequence. In this respect, we define the utility with respect to some hypothesis h and O as:

$$\mathrm{utility}(h, O) = \sum_{w \in \mathrm{WT}(h(O))} g(w), \qquad \text{(A.10)}$$

where WT(h(O)) denotes the unique words in h(O).
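A minimal sketch of Equation (A.10) follows, assuming the frequency-based cohesion as g and a tuple-of-tokens representation for words. The names `utility` and `frequency_g` are hypothetical; any other cohesion estimate could be passed in the same way.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Word = Tuple[str, ...]  # assumed representation: a word is the tuple of its tokens


def utility(segmented: Sequence[Word], g: Callable[[Word], float]) -> float:
    """Sum g(w) over the unique word types in the segmented sequence h(O)."""
    return sum(g(w) for w in set(segmented))


def frequency_g(segmented: Sequence[Word]) -> Callable[[Word], float]:
    """Build g(w) = n_w * (|w| - 1) from occurrence counts in the segmented sequence."""
    counts = Counter(segmented)
    return lambda w: counts[w] * (len(w) - 1)


segmented = [("t", "h", "e"), ("c", "a", "t"), ("t", "h", "e"), ("a",)]
print(utility(segmented, frequency_g(segmented)))  # 4 + 2 + 0 = 6
```

Each word type is visited once, so repeated occurrences affect the result only through g itself.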

Bias The bias is defined as the absolute value of the difference between the entropy rates for C and for W, as in:

$$\mathrm{bias}(h, O) = \left| \tilde{H}_{h(O)}(W) - \tilde{H}_{O}(C) \right|, \qquad \text{(A.11)}$$

where $\tilde{H}_{h(O)}(W)$ and $\tilde{H}_{O}(C)$ denote the empirical entropy rate for W, estimated on h(O), and the empirical entropy rate for C, estimated on O, respectively.

This definition attempts to quantify the change in model predictability after the perturbation, since the perturbation affects not only the observations but also the support and density of the underlying probabilistic process. It can be shown that this definition is connected to the difference between the perplexity estimates for the language models constructed on the word representation (i.e., h(O)) and on the token representation (i.e., O), respectively.
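To make Equation (A.11) concrete, the sketch below uses the order-0 (unigram) plug-in entropy, in bits per symbol, as a stand-in for the empirical entropy rate. This choice is an assumption: the section does not state which estimator is used, and the unigram entropy equals the entropy rate only if the symbols are treated as independent draws. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def unigram_entropy(seq: Sequence[Hashable]) -> float:
    """Plug-in entropy (bits per symbol) of the empirical unigram distribution of seq.

    Assumption: used here as a simple proxy for the empirical entropy rate.
    """
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def bias(tokens: Sequence[Hashable], words: Sequence[Hashable]) -> float:
    """Absolute difference between the entropy estimates on h(O) and on O."""
    return abs(unigram_entropy(words) - unigram_entropy(tokens))


tokens = list("abab")   # two equally likely tokens: 1 bit per symbol
words = ["ab", "ab"]    # a single word type: 0 bits per symbol
print(bias(tokens, words))  # 1.0
```

A higher-order estimator would slot into the same `bias` function without other changes.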

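Combining Equations (A.10) and (A.11), we write out the scalarized utility-bias solution based on Equation (A.7) as follows:

$$\begin{aligned} \text{maximize} \quad & \lambda \sum_{w \in \mathrm{WT}(h(O))} g(w) + (1 - \lambda) \left| \tilde{H}_{h(O)}(W) - \tilde{H}_{O}(C) \right| \\ \text{subject to} \quad & h \in H \end{aligned} \qquad \text{(A.12)}$$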

Iterative Approximation Globally optimizing Equation (A.12) is infeasible since we need to consider all the possible translation rulesets. To alleviate this problem, we develop an iterative algorithm that allows us to explore the search space more efficiently.

The search strategy we take is called greedy inclusion. Since each hypothesis in H is defined with respect to some ordered ruleset R, we maintain the best ruleset B that we have obtained so far and greedily expand B by appending new rules to it. Formally, in each iteration, we seek to maximize the objective over all the possible rules r by forming the corresponding candidate solution, i.e., B with r appended; once the best rule is found, we add it to the end of B. This procedure is repeated until the terminal condition is satisfied.

This algorithm can be made more efficient with two changes: (1) at the end of each iteration, we rewrite the sequence O by applying the rule r, right before updating the ruleset B, and (2) in each iteration, we optimize over only the possible new rules r. This is valid in a greedy inclusion algorithm because, in each iteration, we effectively form a partial solution from all the existing rules in B and then optimize over all the possible new rules r with respect to this partial solution.

A brief sketch of the algorithm is given as follows:

1. Let B be an empty ordered ruleset, and let O⁽⁰⁾ be the original sequence of tokens.

2. Repeat the following steps for each i ∈ N, starting from i = 1, until the terminal condition is satisfied.

(a) Find a rule r of the form w → c_1 c_2 ... c_n for some n, where w ∈ W and each c_j ∈ C, such that the objective in Equation (A.12) is maximized with respect to h = h_R with R = {r} and O = O⁽ⁱ⁻¹⁾.

(b) Form a new token sequence O⁽ⁱ⁾ by applying the rule r to O⁽ⁱ⁻¹⁾.

(c) Add r to the end of B.

3. Output B.
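The steps above can be sketched for the bigram-only case as follows. This is a naive illustration, not the efficient implementation referred to in the text, and it rests on several assumptions: the terminal condition is a fixed rule budget (`max_rules`), g is the frequency-based cohesion, the entropy rate is approximated by the unigram plug-in entropy, and the objective keeps the form and sign of Equation (A.12) as printed. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Symbol = Tuple[str, ...]      # a word, stored as the tuple of original tokens it spans
Rule = Tuple[Symbol, Symbol]  # bigram rule: merge this adjacent pair into one word


def merge(seq: Sequence[Symbol], rule: Rule) -> List[Symbol]:
    """Apply one bigram rule, scanning left to right without overlap (assumed)."""
    left, right = rule
    out: List[Symbol] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def entropy(seq: Sequence[Symbol]) -> float:
    """Unigram plug-in entropy in bits; an assumed proxy for the entropy rate."""
    total = len(seq)
    return -sum(n / total * math.log2(n / total) for n in Counter(seq).values())


def objective(before: Sequence[Symbol], after: Sequence[Symbol], lam: float) -> float:
    """Scalarized objective with frequency cohesion, in the form printed in (A.12)."""
    utility = sum(n * (len(w) - 1) for w, n in Counter(after).items())
    bias = abs(entropy(after) - entropy(before))
    return lam * utility + (1 - lam) * bias


def greedy_inclusion(tokens: Sequence[str], lam: float = 0.5,
                     max_rules: int = 10) -> Tuple[List[Rule], List[Symbol]]:
    """Greedily build an ordered ruleset B and return it with the rewritten sequence."""
    seq: List[Symbol] = [(t,) for t in tokens]  # O^(0)
    rules: List[Rule] = []                      # B
    for _ in range(max_rules):                  # assumed terminal condition
        candidates = sorted(set(zip(seq, seq[1:])))
        if not candidates:
            break
        # Step (a): pick the rule that maximizes the objective on O^(i-1).
        best = max(candidates, key=lambda r: objective(seq, merge(seq, r), lam))
        seq = merge(seq, best)                  # step (b): form O^(i)
        rules.append(best)                      # step (c): append r to B
    return rules, seq


rules, seq = greedy_inclusion(list("abababab"), lam=1.0, max_rules=1)
print(rules)  # [(('a',), ('b',))]
print(seq)    # [('a', 'b'), ('a', 'b'), ('a', 'b'), ('a', 'b')]
```

Every iteration re-evaluates each candidate pair on the full sequence, so the cost per iteration grows with the number of distinct adjacent pairs times the sequence length.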

In later subsections, we show that there exists an efficient implementation for this algorithm in a simplified case when we consider only bigram translation rules.

In document: Information Preservation and Its Applications to Natural Language Processing (pages 105-109)