2.4.1 Threshold Selection
The thresholds affect the algorithm in two places. The first is in Phase I, where we keep the pairs of nodes whose mutual information (MI) is greater than a threshold ε1. The second is in Phases II and III, in the conditional independence tests on pairs of nodes: if I(Xi, Xj|C) is smaller than a threshold ε2, then Xi and Xj are considered conditionally independent. When ε1 is large, only a few edges are considered and passed on to the conditional independence tests, and the resulting graph may be too simple to explain the data. On the other hand, if ε1 is small, many edges are examined; we can then determine a relatively more accurate structure, but at a higher computational cost. For the threshold ε2 of the conditional independence test, a large ε2 leads to the omission of a considerable number of edges, and the graph structure will be too simple. A small ε2, however, leads to a complex graph that does not explain the data well.
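The effect of ε1 in Phase I can be illustrated with a minimal sketch. It assumes discrete data stored as a dictionary of value lists and uses the empirical mutual information in nats; the function names, the data layout, and the toy values are hypothetical and are not part of the BNPC implementation.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def candidate_edges(data, eps1):
    """Return the variable pairs whose MI exceeds eps1, strongest first.

    `data` maps a variable name to its list of observed values
    (a hypothetical layout chosen for this sketch).
    """
    pairs = []
    for a, b in combinations(sorted(data), 2):
        mi = mutual_information(data[a], data[b])
        if mi > eps1:
            pairs.append((a, b, mi))
    return sorted(pairs, key=lambda p: -p[2])

# Toy data: B copies A exactly, C is only weakly related to both.
data = {
    "A": [0, 0, 1, 1, 0, 0, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [0, 1, 0, 1, 0, 1, 1, 1],
}
loose = candidate_edges(data, eps1=0.01)   # small threshold: all pairs survive
strict = candidate_edges(data, eps1=0.1)   # large threshold: only A-B survives
```

With the small threshold, all three pairs go on to the later phases; with the large one, only the strongly dependent pair does, which is the trade-off between cost and structural detail described above.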
In practice, the optimal threshold is unknown beforehand. Moreover, the optimal threshold is problem- and data-dependent, i.e., it depends, on one hand, on the database and its size and, on the other hand, on the variables and the numbers of their states. It is therefore not possible to set a default threshold value that will accurately determine conditional independence for any database or problem. Instead, we score the structures learned using different thresholds by a likelihood-based criterion evaluated on the training set, and select the threshold leading to the structure that achieves the highest score. The most commonly used scoring function is the Bayesian information criterion (BIC), which is derived on the basis of the principles stated in [23]. The BIC is a criterion for model selection among a class of parametric models with different numbers of parameters and has the following formulation:
$$\mathrm{BIC}(G, D) = \log P(D \mid G, \theta_{ML}) - \frac{1}{2}\,\mathrm{Dim}(G) \cdot \log N, \tag{2.11}$$
where D is the data set, θ_ML denotes the parameter values obtained by likelihood maximization, N is the number of samples in D, and the network dimension Dim(G) penalizes complex graphs in order to prevent over-fitting. Other measures, such as the test error, are also good indicators.
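As an illustration of the score in (2.11), the following sketch computes the BIC of a discrete network from counts. It assumes that Dim(G) is the number of free parameters, i.e., (r − 1) per parent configuration for a node with r states, and that every state appears in the data; the data layout, the function name, and the two candidate structures keyed by threshold are hypothetical.

```python
import math
from collections import Counter

def bic_score(parents, data):
    """BIC(G, D) = log P(D | G, theta_ML) - 0.5 * Dim(G) * log N for discrete data.

    `parents` maps each node to a tuple of its parents (the graph G);
    `data` is a list of dicts mapping node -> observed state.
    """
    n = len(data)
    # Assumption: the states of a variable are those observed in the data.
    states = {v: {row[v] for row in data} for v in parents}
    log_lik, dim = 0.0, 0
    for node, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        pa_counts = Counter(tuple(row[p] for p in pa) for row in data)
        # Maximum-likelihood CPT entries are count(node, pa) / count(pa).
        log_lik += sum(c * math.log(c / pa_counts[cfg]) for (cfg, _), c in joint.items())
        # Assumed dimension: (r - 1) free parameters per parent configuration.
        q = math.prod(len(states[p]) for p in pa)
        dim += (len(states[node]) - 1) * q
    return log_lik - 0.5 * dim * math.log(n)

# Toy data in which Y always equals X.
data = [{"X": v, "Y": v} for v in (0, 1) * 4]

# Hypothetical structures that two different thresholds might produce.
candidates = {
    0.5: {"X": (), "Y": ()},        # large threshold: no edge kept
    0.05: {"X": (), "Y": ("X",)},   # small threshold: edge X -> Y kept
}
best_eps = max(candidates, key=lambda eps: bic_score(candidates[eps], data))
```

Here the structure with the edge has one more parameter but a much higher likelihood, so its score is higher and the corresponding threshold is selected.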
2.4.2 Orientation Problem
In the last phase of the BNPC algorithm, we orient the edges. The procedure starts with the identification of colliders, since the other edge orientations are essentially derived from the identified colliders. Colliders can be found by conditional independence tests. For any two nodes Xi and Xj that are not directly connected and that are dependent given the smallest condition-set C, d-separation implies that Xi and Xj are parents of the nodes in C, i.e., Xi → C ← Xj.
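A minimal sketch of collider identification is given here. Note that it uses the common separating-set formulation (orient Xi → Z ← Xj when Xi and Xj are non-adjacent, share the neighbour Z, and Z is not in the set that separated them), which may differ in detail from the BNPC procedure. The skeleton, the separating sets, and all names are hypothetical.

```python
from itertools import combinations

def orient_colliders(skeleton, sepset):
    """Orient X -> Z <- Y for every unshielded triple whose middle node Z
    is absent from the set that separated X and Y.

    `skeleton` maps node -> set of neighbours (undirected);
    `sepset` maps frozenset({X, Y}) -> separating set recorded during the
    conditional independence tests. Both layouts are assumptions.
    """
    directed = set()
    for z, nbrs in skeleton.items():
        for x, y in combinations(sorted(nbrs), 2):
            if y in skeleton[x]:
                continue  # X and Y are adjacent, so the triple is shielded
            # A missing entry is treated as an empty separating set.
            if z not in sepset.get(frozenset((x, y)), set()):
                directed.add((x, z))
                directed.add((y, z))
    undirected = {
        frozenset((a, b))
        for a, nbrs in skeleton.items() for b in nbrs
        if (a, b) not in directed and (b, a) not in directed
    }
    return directed, undirected

# Hypothetical skeleton: a chain A - B - C joined to a collider C - E - D.
skeleton = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "E"},
    "D": {"E"}, "E": {"C", "D"},
}
sepset = {
    frozenset(("A", "C")): {"B"},   # A and C are separated by B
    frozenset(("B", "E")): {"C"},   # B and E are separated by C
    frozenset(("C", "D")): set(),   # C and D are marginally independent
}
directed, undirected = orient_colliders(skeleton, sepset)
```

In this toy case only the edges into E are oriented; the chain A - B - C contains no collider and stays undirected.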
However, in practice, we cannot always orient all the edges. In a special case, if part of the structure is a line structure, meaning that the nodes are connected one after another in a chain, then there are no identifiable colliders. Therefore, we cannot know the exact direction of the edges in that part of the structure. For example, in Figure 2.2(d), we can only identify the collider E; C and D are the parents of E. The other edges cannot be oriented since we do not have any further information.
Therefore, expert knowledge is required in this situation. In the worst case, there is no expert knowledge and we cannot orient all the edges. We then assign the variable order arbitrarily, so that variables earlier in the order become ancestors of those later in the order. Consequently, a directed edge in the resulting graph does not represent a causal relationship between the two connected nodes, but only a conditional dependence between them.
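The fallback of orienting by an arbitrary variable order can be sketched as follows. The function name and the edge representation are hypothetical; the sketch also assumes that the chosen order does not contradict any edges that were already oriented.

```python
def orient_by_order(undirected_edges, order):
    """Direct each undirected edge from the earlier to the later variable.

    `order` is an arbitrary total order of the variables. Because every edge
    points forward in that order, the result cannot contain a directed cycle.
    """
    rank = {v: i for i, v in enumerate(order)}
    return sorted(
        (a, b) if rank[a] < rank[b] else (b, a)
        for a, b in undirected_edges
    )

# A chain whose edges could not be oriented from colliders.
edges = [("B", "A"), ("C", "B")]
oriented = orient_by_order(edges, order=["A", "B", "C"])
reversed_choice = orient_by_order(edges, order=["C", "B", "A"])
```

Both orders yield a valid directed chain over the same skeleton, which is why the chosen directions should not be read as causal.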
Chapter 3
Bayesian Network Inference
3.1 Introduction
In the previous chapter, we described an algorithm for building a Bayesian network structure. In this chapter, our goal is to draw conclusions when new information, or evidence, is observed. For example, in the field of meteorology, the objective of a meteorological system is to forecast weather with some observed climate data (e.g., temperature, cloud, and humidity data). The mechanism of drawing conclusions in Bayesian networks is called an inference engine, or the propagation of evidence. The inference engine consists of updating the probability distributions of variables (e.g., weather forecast) according to the newly available evidence (e.g., climate data).
Two types of algorithms for an inference engine are available, namely, exact and approximate. The basic concept of exact inference is based on Bayes' theorem. By exploiting the dependency relationships among the variables encoded in the network, more efficient methods can be used to update the probabilities. In recent years, several efficient methods have been developed, such as variable elimination [29], junction tree, and belief propagation [20]. However, all the exact algorithms suffer from a combinatorial explosion when dealing with large and complex networks.
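The basic idea of exact inference can be shown on a very small example. The sketch below applies Bayes' theorem to a hypothetical two-node network Cloud → Rain; the variables and all probabilities are invented for illustration, and exact fractions are used so that the posterior is computed without rounding.

```python
from fractions import Fraction as F

# Hypothetical two-node network Cloud -> Rain with made-up probabilities.
p_cloud = {True: F(2, 5), False: F(3, 5)}               # P(Cloud)
p_rain_given_cloud = {True: F(3, 4), False: F(1, 10)}   # P(Rain=True | Cloud)

def posterior_cloud(rain_observed):
    """P(Cloud | Rain = rain_observed) by Bayes' theorem."""
    joint = {}
    for cloud, prior in p_cloud.items():
        like = p_rain_given_cloud[cloud]
        joint[cloud] = prior * (like if rain_observed else 1 - like)
    evidence = sum(joint.values())  # P(Rain = rain_observed)
    return {cloud: p / evidence for cloud, p in joint.items()}

post = posterior_cloud(True)  # post[True] is 5/6, post[False] is 1/6
```

The loop enumerates every state of the unobserved variable. With many variables, the number of joint states grows exponentially, which is the combinatorial explosion that the more structured exact methods try to limit.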
Approximate inference methods are used in large-scale systems when exact methods are computationally inefficient. The basic idea of approximate inference is to generate a sample from the joint probability distribution of the variables, and then use the generated sample to compute approximate values for the probabilities of certain events given the evidence. Different approximate methods attempt to improve the quality or efficiency of the approximations. The problem with approximate inference is the requirement of an adequate sample size. In large systems, large amounts of data are required to ensure reasonable results.
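The sampling idea can be sketched with rejection sampling, one simple approximate method among many. The two-node network Cloud → Rain and its probabilities are hypothetical; for these numbers, Bayes' theorem gives the exact value P(Cloud | Rain) = 0.3 / 0.36 = 5/6, which the estimate should approach as the sample grows.

```python
import random

# Hypothetical network Cloud -> Rain with made-up probabilities.
P_CLOUD = 0.4
P_RAIN_GIVEN_CLOUD = {True: 0.75, False: 0.1}

def estimate_cloud_given_rain(n_samples, seed=0):
    """Estimate P(Cloud=True | Rain=True) by rejection sampling."""
    rng = random.Random(seed)
    kept = cloudy = 0
    for _ in range(n_samples):
        # Sample the joint distribution in topological order.
        cloud = rng.random() < P_CLOUD
        rain = rng.random() < P_RAIN_GIVEN_CLOUD[cloud]
        if rain:  # keep only samples consistent with the evidence
            kept += 1
            cloudy += cloud
    return cloudy / kept if kept else float("nan")

small = estimate_cloud_given_rain(50)       # few samples: noisy estimate
large = estimate_cloud_given_rain(20000)    # many samples: close to 5/6
```

Samples that contradict the evidence are discarded, so rare evidence wastes most of the sample; this is one reason why an adequate sample size matters.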
Although approximate methods can handle large-scale systems, exact methods are used when a better inference result is required. We have reason to believe that exact methods can be made efficient for large, complex systems. Therefore, we focus on exact inference algorithms in this chapter.
This chapter is organized as follows: Section 3.2 presents an approach for estimating the parameters of Bayesian networks. Once both the structure and the parameters are provided, the inference engine can be launched. In Section 3.3, the exact inference algorithms used for Bayesian networks are considered, namely the junction tree algorithm and the belief propagation algorithm. Both algorithms will be used for developing a new algorithm in the next chapter. Section 3.4 discusses the advantages and shortcomings of both algorithms and explains why we need to develop a new algorithm for Bayesian networks.