Good Approximation Method - Inference in Large System

4.3 Inference in Large System

4.3.2 Good Approximation Method

As discussed above, when the graph is complex, the computation time becomes intractable. There are two ways to reduce it: simplifying the graph structure, or reducing the size of the local loop cut-sets. The first would require changing the structure learning step; here we assume that the structure is given, so we focus on the second.

The local loop cut-set for node X is found by identifying each loop on X and selecting the top node in each loop. If we instantiate the top node, the loop breaks and the other nodes in the loop that lie on different paths become independent, including the parents of X. In this situation, the local structure can be treated as a tree, and message propagation gives a consistent result. However, there are other situations that can make the parents of the node nearly independent: the loop may be so long that the top node has almost no effect on the parents of the bottom node. Such a structure is shown in Figure 4.7(a). The level in (a) is the number of nodes between the top node and the bottom node on the left path. To illustrate the concept, we use an analogy and view a belief network as a family tree. At the top of the family tree, the relationship between the family members is close. However, once every family member has a family of their own, we are no longer familiar with the other members of the family tree, except those in generations close to ours. Therefore, we can say that we have almost no relationship with the other members except our close family members.

Figure 4.5: (a) Maximum eigenvalue of structures with ten nodes and different numbers of edges. A greater number of edges leads to more loops in the structure and a larger maximum eigenvalue. The maximum eigenvalue of every ten-node structure lies between that of the line structure (left) and that of the fully connected structure (right). (b) Maximum eigenvalue of structures with nine edges and different numbers of nodes. A greater number of nodes leads to simpler structures, and the maximum eigenvalue decreases.

[Figure 4.6 plot: x-axis, graph complexity; y-axis, logarithm of computation time, ln(time) in seconds; legend: data, linear.]

Figure 4.6: Computation time increases exponentially with graph complexity. The value on the y-axis is the natural logarithm.

The same holds for Bayesian networks. A loop can be viewed as a family tree: when a node in the loop is far from the top node of the loop (the ancestor), the relationship between them mostly disappears. Therefore, the parents of the bottom node in the clique are nearly independent if the loop is sufficiently long.

The Markov property can explain this phenomenon. The relationship between two nodes is represented by the conditional probability table (CPT). For simplicity, suppose that nodes A and B each have two states, so the CPT is a 2 × 2 matrix. The CPT matrix is a Markov matrix, since each row sums to one. The two nodes are independent when the CPT matrix is singular, which for a 2 × 2 Markov matrix means that its rows are identical: no matter what the probability of node A is, the probability of node B is always the same. If the CPT matrix is the identity matrix, then the probability of node B is the same as that of node A, and nodes A and B therefore have a strong relationship. The singular and identity cases are reflected in the determinant of the matrix, and since the CPT is a Markov matrix, the absolute value of its determinant is always smaller than or equal to one. We can therefore judge the strength of the relationship from the smallest eigenvalue. Now suppose that two nodes X_i and X_j are connected by only one path from X_i to X_j. The CPT of P(X_j | X_i) is then the product of all the CPTs on the path, i.e.,

$$P(X_j \mid X_i) = \prod_{k=i}^{j} P(X_k \mid \Pi_{X_k}).$$

Therefore, the determinant of this matrix is also the product of the determinants of all the CPTs on the path, and it decreases toward zero as the path grows. If the path is sufficiently long, we can say that node X_i is approximately independent of node X_j.
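The decay of the determinant along a path can be checked numerically. The following is a minimal sketch, assuming two-state nodes and a path whose links all share one hypothetical CPT (`LINK`, chosen only for illustration); the helper names are not from the original work.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def path_cpt(link_cpts):
    """CPT of the last node given the first: the product of the link CPTs."""
    result = [[1.0, 0.0], [0.0, 1.0]]  # identity = path of length zero
    for cpt in link_cpts:
        result = matmul2(result, cpt)
    return result


# Hypothetical link CPT (each row sums to one); its determinant is 0.5.
LINK = [[0.8, 0.2], [0.3, 0.7]]

for length in (1, 2, 5, 10):
    m = path_cpt([LINK] * length)
    # When the two rows coincide, the end node no longer depends on the start node.
    row_gap = abs(m[0][0] - m[1][0])
    print(f"length={length:2d}  det={det2(m):.6f}  row gap={row_gap:.6f}")
```

For this particular CPT, both the determinant and the gap between the two rows shrink by the same factor with every added link, which is the behaviour the argument above relies on.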

Consider the following example. Figure 4.7(a) shows a loop structure with different levels from the top node X₁ to the bottom node X₄. We add nodes to the left path to verify that, when the path is sufficiently long, the parent node X₂ becomes nearly independent of X₁. Figure 4.7(b) shows the KL-divergence between the real joint probability and the estimated joint probability of nodes X₁ and X₂. The estimated joint probability is calculated by assuming that node X₁ is independent of X₂. A small KL-divergence means that X₁ and X₂ are nearly independent. The KL-divergence decreases rapidly as the level increases. From the blue line in (b), we conclude that X₁ and X₂ are nearly independent when the level is more than two, where the KL-divergence is smaller than 10⁻². Figure 4.7(c) shows another structure in which an outside node is added to the left path. From the green line in (b), we conclude that X₁ and X₂ are nearly independent at a smaller level than for the blue line.
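The KL-divergence test described above can be sketched for the simplest case. This is an illustration under stated assumptions, not the experiment behind Figure 4.7: it uses a single chain rather than the full loop, two-state nodes, a hypothetical link CPT and prior, and it counts the level as the number of links between the two end nodes.

```python
import math


def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def kl_from_independence(prior, link_cpt, level):
    """KL divergence between the true joint of the two end nodes of a chain
    and the joint estimated by assuming the end nodes are independent."""
    cond = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(level):
        cond = matmul2(cond, link_cpt)  # P(end | start) after `level` links
    joint = [[prior[a] * cond[a][b] for b in range(2)] for a in range(2)]
    marg_end = [joint[0][b] + joint[1][b] for b in range(2)]
    kl = 0.0
    for a in range(2):
        for b in range(2):
            if joint[a][b] > 0.0:
                estimate = prior[a] * marg_end[b]  # independence assumption
                kl += joint[a][b] * math.log(joint[a][b] / estimate)
    return kl


# Hypothetical values chosen only for illustration.
PRIOR = [0.5, 0.5]
LINK = [[0.8, 0.2], [0.3, 0.7]]

for level in range(1, 9):
    print(level, kl_from_independence(PRIOR, LINK, level))
```

With these illustrative numbers the divergence falls monotonically as the level grows, in line with the qualitative trend reported for Figure 4.7(b); the actual values depend on the CPTs of the model.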

This makes sense because the outside node shares dependencies with the original parents of the node, so the dependence is not as strong as in the original structure. In practice, structures with outside nodes are common, and hence the approximate inference obtained by this approach is usually good.

In any directed loop, there must be at least two paths from the top node to the bottom node. To ensure that all parents of the bottom node are nearly independent, the shortest path should be sufficiently long. We can then ignore this loop, since all parents of the bottom node are nearly independent. Therefore, the local loop cut-set can be reduced; the precision depends on how many loops are discarded. In Figure 4.8, we identify the local loop cut-set of node X₁₂, which consists of nodes X₁, X₂ and X₃; these nodes are at different levels from node X₁₂.

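[Figure 4.7 panels: (a) loop structure; (b) plot of KL-divergence (y-axis, logarithmic scale from 10^-14 to 10^0) against the length of the loop level (x-axis, 1 to 9), with legend entries "All nodes are in the loop." and "Add outside node arc to loop node."; (c) loop structure with an outside node.]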

Figure 4.7: (a) Distance level from the parents of the bottom node X₄ to the top node X₁ of the loop. (b) KL-divergence between the real joint probability and the estimated joint probability of the parents of node X₄ in the loop. The estimated joint probability assumes that nodes X₂ and X₃ are independent of each other. A low KL-divergence implies that the parents of node X₄ are closer to being independent. The blue line indicates a loop structure without an outside node, see (a). The green line indicates a structure with an outside node added to parent X₃, shown in (c), and has a lower KL-divergence than the blue line. (c) Addition of an outside

To save computing time, we can use the approximate method to remove the loop with the largest level. Thus, node X₁ is removed from the local loop cut-set since it is three levels away. The local loop cut-set is reduced to a set with only two members.
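The reduction step can be written as a simple filter. The sketch below assumes that the level of each cut-set node has already been computed; the function name and the dictionary layout are hypothetical, while the levels are those stated for Figure 4.8.

```python
def reduce_cut_set(levels, max_level):
    """Keep only the cut-set nodes whose level does not exceed max_level.

    `levels` maps each node of the local loop cut-set to its level from
    the target node; loops with longer levels are ignored.
    """
    return {node for node, level in levels.items() if level <= max_level}


# Levels of the local loop cut-set of X12, as given for Figure 4.8.
levels_to_x12 = {"X1": 3, "X2": 2, "X3": 2}

print(sorted(reduce_cut_set(levels_to_x12, max_level=2)))  # ['X2', 'X3']
```

Raising `max_level` to 3 keeps all three nodes, which corresponds to exact inference for this node.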

In the above example, if the CPT is close to the identity matrix, the loop must be longer than ten levels (the shortest path must pass through more than ten nodes) before we can say that the parents of the node are nearly independent. Different models therefore give different results, and the user has to perform model selection to choose how many levels of the loop to keep. If we keep all the loops, we obtain exact inference, but the computation time is intractable in a complex system.
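One rough way to see why stronger CPTs need longer loops is to ask how many links it takes for the determinant of the path CPT to fall below a tolerance. The sketch below is only a heuristic under assumptions that are not from the original work: every link has the same determinant `link_det`, and the determinant magnitude is used as a stand-in for "nearly independent" (the text itself uses KL-divergence). The function name and tolerance are hypothetical.

```python
def min_levels_for_tolerance(link_det, tol, max_levels=1000):
    """Smallest number of links n with |link_det| ** n <= tol.

    Returns None when the determinant never decays (|link_det| >= 1,
    e.g. an identity CPT) or when max_levels is exceeded.
    """
    magnitude = abs(link_det)
    if magnitude >= 1.0:
        return None
    level = 1
    while magnitude ** level > tol:
        level += 1
        if level > max_levels:
            return None
    return level


# A weakly coupled link decays quickly; a near-identity link needs many levels.
print(min_levels_for_tolerance(0.5, 1e-2))
print(min_levels_for_tolerance(0.9, 1e-2))
print(min_levels_for_tolerance(1.0, 1e-2))
```

The closer the link determinant is to one, the more levels must be kept, which is why the choice of cut-off is left to model selection.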

Figure 4.8: Yellow nodes {X₁, X₂, X₃} are the local loop cut-set of node X₁₂. X₁ is three levels from X₁₂, and X₂ and X₃ are two levels from X₁₂. We can just keep two levels for approximation, and the reduced local loop cut-set has only two nodes, {X₂, X₃}.

In the document: A Time and Accuracy Trade-off Algorithm for Inference in Large Bayesian Networks (pages 59-64)