Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 21

Paola Sebastiani, Maria M. Abad, and Marco F. Ramoni

Several exact algorithms exist to perform this inference when the network variables are all discrete, all continuous and modeled with Gaussian distributions, or when the network topology is constrained to particular structures (Castillo et al., 1997, Lauritzen and Spiegelhalter, 1988, Pearl, 1988). The most common approaches to evidence propagation in Bayesian networks can be summarized along four lines:

Polytrees. When the topology of a Bayesian network is restricted to a polytree structure (a directed acyclic graph with only one path linking any two nodes in the graph), we can exploit the fact that every node in the network divides the polytree into two disjoint sub-trees. In this way, propagation can be performed locally and very efficiently.

Conditioning. The intuition underlying the Conditioning approach is that network structures more complex than polytrees can be reduced to a set of polytrees when a subset of their nodes, known as the loop cutset, is instantiated. In this way, we can efficiently propagate evidence in each polytree and then combine the results of these propagations. The source of complexity of these algorithms is the identification of the loop cutset (Cooper, 1990).

Clustering. The algorithms developed following the Clustering approach (Lauritzen and Spiegelhalter, 1988) transform the graphical structure of a Bayesian network into an alternative graph with a polytree structure, called the junction tree, by appropriately merging some variables in the network. This mapping consists first of transforming the directed graph into an undirected graph by joining the unlinked parents and triangulating the graph. The nodes in the junction tree cluster sets of nodes in the undirected graph into cliques, which are defined as maximal and complete sets of nodes.
Completeness ensures that there are links between every pair of nodes in the clique, while maximality guarantees that the set of nodes is not a proper subset of any other clique. The joint probability of the network variables can then be mapped into a probability distribution over the clique sets with some factorization properties.

Goal-Oriented. This approach differs from the Conditioning and the Clustering approaches in that it does not transform the entire network into an alternative structure in order to compute the posterior probability of all variables simultaneously. Instead, it queries the probability distribution of a single variable and tailors the transformation of the network to the queried variable. The intuition is to identify the network variables that are irrelevant to the computation of the posterior probability of a particular variable (Shachter, 1986).

For general network topologies and nonstandard distributions, we need to resort to stochastic simulation (Cheng and Druzdzel, 2000). Among the several stochastic simulation methods currently available, Gibbs sampling (Geman and Geman, 1984, Thomas et al., 1992) is particularly appropriate for Bayesian network reasoning because it can exploit the graphical decomposition of joint multivariate distributions to improve computational efficiency. Gibbs sampling is also useful for probabilistic reasoning in Gaussian networks, as it avoids computations with joint multivariate distributions. Gibbs sampling is a Markov chain Monte Carlo (MCMC) method that generates a sample from the joint distribution of the nodes in the network. The procedure works by generating an ergodic Markov chain

$$\begin{pmatrix} y_{10} \\ \vdots \\ y_{v0} \end{pmatrix} \rightarrow \begin{pmatrix} y_{11} \\ \vdots \\ y_{v1} \end{pmatrix} \rightarrow \begin{pmatrix} y_{12} \\ \vdots \\ y_{v2} \end{pmatrix} \rightarrow \cdots$$

that, under regularity conditions, converges to a stationary distribution. At each step $k$ of the chain, the algorithm generates $y_{ik}$ from the conditional distribution of $Y_i$ given the current values of all the other nodes.
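The sampling step can be illustrated with a minimal Python sketch. It assumes a hypothetical three-node binary network Y1 -> Y3 <- Y2 with made-up conditional probability tables (none of these numbers come from the chapter). Each full conditional is obtained simply by normalising the joint probability over the two values of the node being updated; the sketch also discards an initial burn-in segment and holds observed nodes fixed, and it checks the estimate against exact enumeration, which is only feasible because the network is tiny.

```python
import random

# Hypothetical CPTs for a three-node network Y1 -> Y3 <- Y2 (binary values 0/1).
P_Y1 = [0.7, 0.3]                                   # p(Y1 = 0), p(Y1 = 1)
P_Y2 = [0.6, 0.4]                                   # p(Y2 = 0), p(Y2 = 1)
P_Y3_1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # p(Y3 = 1 | y1, y2)


def joint(y1, y2, y3):
    """Joint probability from the network factorisation."""
    p3 = P_Y3_1[(y1, y2)]
    return P_Y1[y1] * P_Y2[y2] * (p3 if y3 == 1 else 1.0 - p3)


def sample_full_conditional(state, i, rng):
    """Draw node i from p(y_i | all other nodes) by normalising the joint."""
    weights = []
    for value in (0, 1):
        candidate = list(state)
        candidate[i] = value
        weights.append(joint(*candidate))
    p_one = weights[1] / (weights[0] + weights[1])
    return 1 if rng.random() < p_one else 0


def gibbs(evidence, n_iter=20000, burn_in=2000, seed=0):
    """Estimate p(Y_i = 1 | evidence) for every node; evidence maps index -> value."""
    rng = random.Random(seed)
    free = [i for i in range(3) if i not in evidence]
    state = [evidence.get(i, 0) for i in range(3)]
    counts = [0, 0, 0]
    for t in range(n_iter):
        for i in free:                      # observed nodes are never resampled
            state[i] = sample_full_conditional(state, i, rng)
        if t >= burn_in:                    # keep only post-burn-in draws
            for i in range(3):
                counts[i] += state[i]
    kept = n_iter - burn_in
    return [c / kept for c in counts]


def exact_posterior_y1(y3):
    """Exact p(Y1 = 1 | Y3 = y3) by enumeration, for comparison."""
    num = sum(joint(1, y2, y3) for y2 in (0, 1))
    den = sum(joint(y1, y2, y3) for y1 in (0, 1) for y2 in (0, 1))
    return num / den


estimate = gibbs({2: 1})                    # observe Y3 = 1
print(estimate[0], exact_posterior_y1(1))
```

The chain length, burn-in length, and seed are arbitrary choices for this sketch; in practice they would be set after inspecting convergence diagnostics.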
To derive the marginal distribution of each node, the initial burn-in is discarded, and the remaining values simulated for each node form a sample from its marginal distribution. When one or more nodes in the network are observed, they are held fixed in the simulation, so that the sample for each node comes from the conditional distribution of the node given the observed nodes in the network. Gibbs sampling in directed graphical models exploits the global Markov property: to simulate from the conditional distribution of a node $Y_i$ given the current values of the other nodes, the algorithm only needs to simulate from the conditional probability/density

$$p(y_i \mid y \setminus y_i) \propto p(y_i \mid pa(y_i)) \prod_h p(c(y_i)_h \mid pa(c(y_i)_h))$$

where $y$ denotes a set of values of all network variables, $pa(y_i)$ and $c(y_i)$ are values of the parents and children of $Y_i$, $pa(c(y_i)_h)$ are values of the parents of the $h$th child of $Y_i$, and the symbol $\setminus$ denotes the set difference.

10.4 Learning

Learning a Bayesian network from data consists of the induction of its two different components:

1) The graphical structure of conditional dependencies (model selection);
2) The conditional distributions quantifying the dependency structure (parameter estimation).

While parameter estimation follows quite standard statistical techniques (see Ramoni and Sebastiani, 2003), the automatic identification of the graphical model that best fits the data is a more challenging task. This automatic identification requires two components: a scoring metric to select the best model and a search strategy to explore the space of possible alternative models. This section describes these two components, model selection and model search, and also outlines some methods to validate a graphical model once it has been induced from a data set.

10.4.1 Scoring Metrics

We describe the traditional Bayesian approach to model selection, which treats the problem as one of hypothesis testing.
Other approaches based on independence tests or variants of the Bayesian metric, such as the minimum description length (MDL) score or the Bayesian information criterion (BIC), are described in (Lauritzen, 1996, Spirtes et al., 1993, Whittaker, 1990). Suppose we have a set $\mathcal{M} = \{M_0, M_1, \ldots, M_g\}$ of Bayesian networks, each network describing a hypothesis on the dependency structure of the random variables $Y_1, \ldots, Y_v$. Our task is to choose one network after observing a sample of data $D = \{y_{1k}, \ldots, y_{vk}\}$, for $k = 1, \ldots, n$. By Bayes' theorem, the data $D$ are used to revise the prior probability $p(M_h)$ of each model into the posterior probability, which is calculated as

$$p(M_h \mid D) \propto p(M_h)\, p(D \mid M_h)$$

and the Bayesian solution consists of choosing the network with maximum posterior probability. The quantity $p(D \mid M_h)$ is called the marginal likelihood and is computed by averaging out $\theta_h$ from the likelihood function $p(D \mid \theta_h)$, where $\Theta_h$ is the vector parameterizing the distribution of $Y_1, \ldots, Y_v$, conditional on $M_h$. Note that, in a Bayesian setting, $\Theta_h$ is regarded as a random vector, with a prior density $p(\theta_h)$ that encodes any prior knowledge about the parameters of the model $M_h$. The likelihood function, on the other hand, encodes the knowledge about the mechanism underlying the data generation. In our framework, the data generation mechanism is represented by a network of dependencies, and the parameters are usually a measure of the strength of these dependencies. By averaging out the parameters, the marginal likelihood provides an overall measure of the data generation mechanism that is independent of the values of the parameters. Formally, the marginal likelihood is the solution of the integral

$$p(D \mid M_h) = \int p(D \mid \theta_h)\, p(\theta_h)\, d\theta_h.$$
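To make the averaging concrete, the following Python sketch evaluates this integral for the simplest possible model: a single binary variable with success probability theta and a Beta(a, b) prior. This toy model, the counts (4 successes, 3 failures), and the function names are assumptions chosen for illustration, not part of the chapter. For this model the integral has the well-known closed form B(a + s, b + f) / B(a, b), which the sketch compares against a plain midpoint-rule approximation of likelihood times prior.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_likelihood_closed_form(successes, failures, a=1.0, b=1.0):
    """Closed-form marginal likelihood of a Bernoulli sample under a Beta(a, b) prior."""
    return math.exp(log_beta(a + successes, b + failures) - log_beta(a, b))


def marginal_likelihood_numeric(successes, failures, a=1.0, b=1.0, grid=100000):
    """Midpoint-rule approximation of the integral of likelihood times prior over (0, 1)."""
    total = 0.0
    for m in range(grid):
        theta = (m + 0.5) / grid
        likelihood = theta ** successes * (1.0 - theta) ** failures
        log_prior = ((a - 1.0) * math.log(theta)
                     + (b - 1.0) * math.log(1.0 - theta)
                     - log_beta(a, b))
        total += likelihood * math.exp(log_prior)
    return total / grid


closed = marginal_likelihood_closed_form(4, 3)
numeric = marginal_likelihood_numeric(4, 3)
print(closed, numeric)
```

With a uniform prior (a = b = 1) both values agree with 4! 3! / 8! = 1/280, which shows that the marginal likelihood depends on the data and the prior but not on any particular value of theta.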
The computation of the marginal likelihood requires the specification of a parameterization of each model $M_h$, which is used to compute the likelihood function $p(D \mid \theta_h)$, and the elicitation of a prior distribution for $\Theta_h$. The local Markov properties encoded by the network $M_h$ imply that the joint density/probability of a case $k$ in the data set can be written as

$$p(y_{1k}, \ldots, y_{vk} \mid \theta_h) = \prod_i p(y_{ik} \mid pa(y_i)_k, \theta_h). \qquad (10.2)$$

Here, $y_{1k}, \ldots, y_{vk}$ is the set of values (configuration) of the variables for the $k$th case, and $pa(y_i)_k$ is the configuration of the parents of $Y_i$ in case $k$. By assuming exchangeability of the data, that is, that cases are independent given the model parameters, the overall likelihood is given by the product

$$p(D \mid \theta_h) = \prod_{ik} p(y_{ik} \mid pa(y_i)_k, \theta_h).$$

Computational efficiency is gained by using priors for $\Theta_h$ that obey the Directed Hyper-Markov law (Dawid and Lauritzen, 1993). Under this assumption, the prior density $p(\theta_h)$ admits the same factorization as the likelihood function, namely $p(\theta_h) = \prod_i p(\theta_{hi})$, where $\theta_{hi}$ is the subset of parameters used to describe the dependency of $Y_i$ on its parents. This parallel factorization of the likelihood function and the prior density allows us to write

$$p(D \mid M_h) = \prod_i \int \prod_k p(y_{ik} \mid pa(y_i)_k, \theta_{hi})\, p(\theta_{hi})\, d\theta_{hi} = \prod_i p(D \mid M_{hi})$$

where

$$p(D \mid M_{hi}) = \int \prod_k p(y_{ik} \mid pa(y_i)_k, \theta_{hi})\, p(\theta_{hi})\, d\theta_{hi}.$$

By further assuming decomposable network prior probabilities that factorize as $p(M_h) = \prod_i p(M_{hi})$ (Heckerman et al., 1995), the posterior probability of a model $M_h$ is the product

$$p(M_h \mid D) = \prod_i p(M_{hi} \mid D).$$

Here $p(M_{hi} \mid D)$ is the posterior probability weighting the dependency of $Y_i$ on the set of parents specified by the model $M_h$.
Decomposable network prior probabilities are encoded by exploiting the modularity of a Bayesian network, and are based on the assumption that the prior probability of a local structure $M_{hi}$ is independent of the other local dependencies $M_{hj}$ for $j \neq i$. By setting $p(M_{hi}) = (g + 1)^{-1/v}$, where $g + 1$ is the cardinality of the model space and $v$ is the cardinality of the set of variables, it follows that uniform priors are also decomposable.

An important consequence of the likelihood modularity is that, in the comparison of models that differ in the parent structure of a variable $Y_i$, only the local marginal likelihood matters. Therefore, two local network structures that specify different parents for the variable $Y_i$ can be compared by simply evaluating the product of the local Bayes factor $BF_{hk} = p(D \mid M_{hi}) / p(D \mid M_{ki})$ and the prior odds $p(M_h)/p(M_k)$, which gives the posterior odds of one model versus the other: $p(M_{hi} \mid D) / p(M_{ki} \mid D)$. The posterior odds provide an intuitive and widely used measure of fit. Another important consequence of the likelihood modularity is that, when the models are a priori equally likely, we can learn a model locally by maximizing the marginal likelihood node by node.

When there are no missing data, the marginal likelihood $p(D \mid M_h)$ can be calculated in closed form under the assumption that all variables are discrete, or that all variables follow Gaussian distributions and the dependencies between children and parents are linear. These two cases are described in the next examples. We conclude by noting that the marginal likelihood of the data is the essential component for the calculation of the Bayesian estimate of the parameter $\theta_h$, which is given by the expected value of the posterior distribution:

$$p(\theta_h \mid D) = \frac{p(D \mid \theta_h)\, p(\theta_h)}{p(D \mid M_h)} = \prod_i \frac{p(D \mid \theta_{hi})\, p(\theta_{hi})}{p(D \mid M_{hi})}.$$

Fig. 10.4.
A simple Bayesian network describing the dependency of $Y_3$ on $Y_1$ and $Y_2$, which are marginally independent. The table on the left describes the parameters $\theta_{3jk}$ ($j = 1, \ldots, 4$ and $k = 1, 2$) used to define the conditional distributions of $Y_3 = y_{3k} \mid pa(y_3)_j$, assuming all variables are binary. The two tables on the right describe a simple database of seven cases, and the frequencies $n_{3jk}$. The full joint distribution is defined by the parameters $\theta_{3jk}$, and the parameters $\theta_{1k}$ and $\theta_{2k}$ that specify the marginal distributions of $Y_1$ and $Y_2$.

Discrete Variable Networks

Suppose the variables $Y_1, \ldots, Y_v$ are all discrete, and denote by $c_i$ the number of categories of $Y_i$. The dependency of each variable $Y_i$ on its parents is represented by a set of multinomial distributions that describe the conditional distribution of $Y_i$ given the configuration $j$ of the parent variables $Pa(Y_i)$. This representation leads to writing the likelihood function as

$$p(D \mid \theta_h) = \prod_{ijk} \theta_{ijk}^{n_{ijk}}$$

where the parameter $\theta_{ijk}$ denotes the conditional probability $p(y_{ik} \mid pa(y_i)_j)$, $n_{ijk}$ is the sample frequency of $(y_{ik}, pa(y_i)_j)$, and $n_{ij} = \sum_k n_{ijk}$ is the marginal frequency of $pa(y_i)_j$. Figure 10.4 shows an example of the notation for a network with three variables. With the data in this example, the likelihood function is written as

$$\{\theta_{11}^{4}\theta_{12}^{3}\}\{\theta_{21}^{3}\theta_{22}^{4}\}\{\theta_{311}^{1}\theta_{312}^{1} \times \theta_{321}^{1}\theta_{322}^{0} \times \theta_{331}^{2}\theta_{332}^{0} \times \theta_{341}^{1}\theta_{342}^{1}\}.$$

The first two terms in the product are the contributions of nodes $Y_1$ and $Y_2$ to the likelihood, while the last term is the contribution of the node $Y_3$, with factors corresponding to the four conditional distributions of $Y_3$ given each of the four parent configurations. The hyper Dirichlet distribution with parameters $\alpha_{ijk}$ is the conjugate Hyper Markov law (Dawid and Lauritzen, 1993), and it is defined by a density function proportional to the product $\prod_{ijk} \theta_{ijk}^{\alpha_{ijk} - 1}$.
This distribution encodes the assumption that the parameters $\theta_{ij}$ and $\theta_{i'j'}$ are independent for $i \neq i'$ and $j \neq j'$. These assumptions are known as global and local parameter independence (Spiegelhalter and Lauritzen, 1990), and are valid only if the hyper-parameters $\alpha_{ijk}$ satisfy the consistency rule $\sum_j \alpha_{ij} = \alpha$ for all $i$, where $\alpha_{ij} = \sum_k \alpha_{ijk}$ (Good, 1968, Geiger and Heckerman, 1997). Symmetric Dirichlet distributions easily satisfy this constraint by setting $\alpha_{ijk} = \alpha/(c_i q_i)$, where $q_i$ is the number of states of the parents of $Y_i$. One advantage of adopting symmetric hyper Dirichlet priors in model selection is that, if we fix $\alpha$ constant for all models, then the comparison of posterior probabilities of different models is done conditionally on the same quantity $\alpha$. With this parameterization and choice of prior distributions, the marginal likelihood is given by the equation

$$\prod_i p(D \mid M_{hi}) = \prod_{ij} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_k \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}$$

where $\Gamma(\cdot)$ denotes the Gamma function, and the Bayesian estimate of the parameter $\theta_{ijk}$ is the posterior mean

$$E(\theta_{ijk} \mid D) = \frac{\alpha_{ijk} + n_{ijk}}{\alpha_{ij} + n_{ij}}. \qquad (10.3)$$

More details are in (Ramoni and Sebastiani, 2003).

Linear Gaussian Networks

Suppose now that the variables $Y_1, \ldots, Y_v$ are all continuous, and that the conditional distribution of each variable $Y_i$ given its parents $Pa(y_i) \equiv \{Y_{i1}, \ldots, Y_{ip(i)}\}$ follows a Gaussian distribution with a mean that is a linear function of the parent variables, and conditional variance $\sigma_i^2 = 1/\tau_i$. The parameter $\tau_i$ is called the precision. The dependency of each variable on its parents is represented by the linear regression equation

$$\mu_i = \beta_{i0} + \sum_j \beta_{ij} y_{ij}$$

that models the conditional mean of $Y_i$ given the parent values $y_{ij}$. Note that the regression equation is additive (there are no interactions between the parent variables) to ensure that the model is graphical (Lauritzen, 1996).
In this way, the dependency of $Y_i$ on a parent $Y_{ij}$ is equivalent to having the regression coefficient $\beta_{ij} \neq 0$. Given a set of exchangeable observations $D$, the likelihood function is

$$p(D \mid \theta_h) = \prod_i \left(\frac{\tau_i}{2\pi}\right)^{n/2} \prod_k \exp\left[-\tau_i (y_{ik} - \mu_{ik})^2 / 2\right]$$

where $\mu_{ik}$ denotes the value of the conditional mean of $Y_i$ in case $k$, and the vector $\theta_h$ denotes the set of parameters $\tau_i, \beta_{ij}$. It is usually more convenient to use matrix notation: we use the $n \times (p(i)+1)$ matrix $X_i$ to denote the design matrix of the regression, with $k$th row given by $(1, y_{i1k}, y_{i2k}, \ldots, y_{ip(i)k})$, $\beta_i$ to denote the vector of parameters $(\beta_{i0}, \beta_{i1}, \ldots, \beta_{ip(i)})^T$ associated with $Y_i$ and, in this example, $y_i$ to denote the vector of observations $(y_{i1}, \ldots, y_{in})^T$. With this notation, the likelihood can be written in a more compact form:

$$p(D \mid \theta_h) = \prod_i \left(\frac{\tau_i}{2\pi}\right)^{n/2} \exp\left[-\tau_i (y_i - X_i\beta_i)^T (y_i - X_i\beta_i)/2\right].$$

There are several choices to model the prior distribution on the parameters $\tau_i$ and $\beta_i$. For example, the conditional variance can be further parameterized as

$$\sigma_i^2 = V(Y_i) - \mathrm{cov}(Y_i, Pa(y_i))\, V(Pa(y_i))^{-1}\, \mathrm{cov}(Pa(y_i), Y_i)$$

where $V(Y_i)$ is the marginal variance of $Y_i$, $V(Pa(y_i))$ is the variance-covariance matrix of the parents of $Y_i$, and $\mathrm{cov}(Y_i, Pa(y_i))$ ($\mathrm{cov}(Pa(y_i), Y_i)$) is the row (column) vector of covariances between $Y_i$ and each parent $Y_{ij}$. With this parameterization, the prior on $\tau_i$ is usually a hyper-Wishart distribution for the joint variance-covariance matrix of $Y_i, Pa(y_i)$ (Cowell et al., 1999). The Wishart distribution is the multivariate generalization of a Gamma distribution. An alternative approach is to work directly with the conditional variance of $Y_i$. In this case, we estimate the conditional variance of each parents-child dependency, and the joint multivariate distribution needed by the reasoning algorithms is then derived by multiplication.
More details are described, for example, in (Whittaker, 1990) and (Geiger and Heckerman, 1994). We focus on this second approach and again use global parameter independence (Spiegelhalter and Lauritzen, 1990) to assign independent prior distributions to each set of parameters $\tau_i, \beta_i$ that quantify the dependency of the variable $Y_i$ on its parents. In each set, we use the standard hierarchical prior distribution that consists of a marginal distribution for the precision parameter $\tau_i$ and a conditional distribution for the parameter vector $\beta_i$, given $\tau_i$. The standard conjugate prior for $\tau_i$ is a Gamma distribution

$$\tau_i \sim \mathrm{Gamma}(\alpha_{i1}, \alpha_{i2}), \qquad p(\tau_i) = \frac{1}{\alpha_{i2}^{\alpha_{i1}}\, \Gamma(\alpha_{i1})}\, \tau_i^{\alpha_{i1} - 1} e^{-\tau_i/\alpha_{i2}}$$

where

$$\alpha_{i1} = \frac{\nu_{io}}{2}, \qquad \alpha_{i2} = \frac{2}{\nu_{io}\sigma_{io}^2}.$$

This is the traditional Gamma prior for $\tau_i$, with hyper-parameters $\nu_{io}$ and $\sigma_{io}^2$ that can be given the following interpretation. The marginal expectation of $\tau_i$ is $E(\tau_i) = \alpha_{i1}\alpha_{i2} = 1/\sigma_{io}^2$, and

$$E(1/\tau_i) = \frac{1}{(\alpha_{i1} - 1)\alpha_{i2}} = \frac{\nu_{io}\sigma_{io}^2}{\nu_{io} - 2}$$

is the prior expectation of the population variance. Because the ratio $\nu_{io}\sigma_{io}^2/(\nu_{io} - 2)$ is similar to the estimate of the variance in a sample of size $\nu_{io}$, $\sigma_{io}^2$ can be interpreted as the prior population variance, based on $\nu_{io}$ cases seen in the past. Conditionally on $\tau_i$, the prior density of the parameter vector $\beta_i$ is assumed to be multivariate Gaussian:

$$\beta_i \mid \tau_i \sim N(\beta_{io}, (\tau_i R_{io})^{-1})$$

where $\beta_{io} = E(\beta_i \mid \tau_i)$. The matrix $(\tau_i R_{io})^{-1}$ is the prior variance-covariance matrix of $\beta_i \mid \tau_i$, and $R_{io}$ is the identity matrix, so that the regression coefficients are a priori independent, conditionally on $\tau_i$. The density function of $\beta_i$ is

$$p(\beta_i \mid \tau_i) = \frac{\tau_i^{(p(i)+1)/2} \det(R_{io})^{1/2}}{(2\pi)^{(p(i)+1)/2}} \exp\left[-\frac{\tau_i}{2} (\beta_i - \beta_{io})^T R_{io} (\beta_i - \beta_{io})\right].$$
With these prior specifications, it can be shown that the marginal likelihood $p(D \mid M_h)$ can be written in the product form $\prod_i p(D \mid M_{hi})$, where each factor is given by the quantity

$$p(D \mid M_{hi}) = \frac{1}{(2\pi)^{n/2}}\, \frac{\det R_{io}^{1/2}}{\det R_{in}^{1/2}}\, \frac{\Gamma(\nu_{in}/2)}{\Gamma(\nu_{io}/2)}\, \frac{(\nu_{io}\sigma_{io}^2/2)^{\nu_{io}/2}}{(\nu_{in}\sigma_{in}^2/2)^{\nu_{in}/2}}$$

and the parameters are specified by the following updating rules:

$$\alpha_{i1n} = \nu_{io}/2 + n/2$$
$$1/\alpha_{i2n} = (-\beta_{in}^T R_{in}\beta_{in} + y_i^T y_i + \beta_{io}^T R_{io}\beta_{io})/2 + 1/\alpha_{i2}$$
$$\nu_{in} = \nu_{io} + n$$
$$\sigma_{in}^2 = 2/(\nu_{in}\alpha_{i2n})$$
$$R_{in} = R_{io} + X_i^T X_i$$
$$\beta_{in} = R_{in}^{-1}(R_{io}\beta_{io} + X_i^T y_i)$$

The Bayesian estimates of the parameters are given by the posterior expectations

$$E(\tau_i \mid y_i) = \alpha_{i1n}\alpha_{i2n} = 1/\sigma_{in}^2, \qquad E(\beta_i \mid y_i) = \beta_{in},$$

and the estimate of $\sigma_i^2$ is $\nu_{in}\sigma_{in}^2/(\nu_{in} - 2)$. More controversial is the use of improper prior distributions that describe lack of prior knowledge about the network parameters by uniform distributions (Hagan, 1994). In this case, we set $p(\beta_i, \tau_i) \propto \tau_i^{-c}$, so that $\nu_{io} = 2(1 - c)$ and $\beta_{io} = 0$. The updated hyper-parameters are:

$$\nu_{in} = \nu_{io} + n$$
$$R_{in} = X_i^T X_i$$
$$\beta_{in} = (X_i^T X_i)^{-1} X_i^T y_i \quad \text{(least squares estimate of } \beta\text{)}$$
$$\sigma_{in}^2 = RSS_i/\nu_{in}$$
$$RSS_i = y_i^T y_i - y_i^T X_i (X_i^T X_i)^{-1} X_i^T y_i \quad \text{(residual sum of squares)}$$

and the marginal likelihood of each local dependency is

$$p(D \mid M_{hi}) = \frac{1}{(2\pi)^{(n - p(i) - 1)/2}}\, \frac{\Gamma((n - p(i) - 2c + 1)/2)\, (RSS_i/2)^{-(n - p(i) - 2c + 1)/2}}{\det(X_i^T X_i)^{1/2}}.$$

A very special case is $c = 1$, which corresponds to $\nu_{io} = 0$. In this case, the local marginal likelihood simplifies to

$$p(D \mid M_{hi}) = \frac{1}{(2\pi)^{(n - p(i) - 1)/2}}\, \frac{\Gamma((n - p(i) - 1)/2)\, (RSS_i/2)^{-(n - p(i) - 1)/2}}{\det(X_i^T X_i)^{1/2}}.$$

The estimates of the parameters $\sigma_i^2$ and $\beta_i$ become the traditional least squares estimates $RSS_i/(\nu_{in} - 2)$ and $\beta_{in}$.
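The special case c = 1 is simple enough to sketch in a few lines of Python. The code below is an illustration under stated assumptions rather than a reference implementation: the data values, the function names, and the small Gaussian-elimination helper are all hypothetical, and only the standard library is used. It builds the design matrix with a leading column of ones, computes the least squares estimate and the residual sum of squares, and evaluates the logarithm of the local marginal likelihood for c = 1.

```python
import math


def solve_and_logdet(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Returns the solution and log|det a| (sum of the log absolute pivots).
    """
    n = len(a)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    logdet = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        logdet += math.log(abs(m[col][col]))
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x, logdet


def local_log_marginal_likelihood(y, parents):
    """Log of the local marginal likelihood for the improper prior with c = 1.

    y is the list of child values; parents is a list of parent columns.
    Returns (log marginal likelihood, least squares coefficients, RSS).
    """
    n, p = len(y), len(parents)
    rows = [[1.0] + [col[k] for col in parents] for k in range(n)]  # design matrix
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p + 1)] for a in range(p + 1)]
    xty = [sum(r[a] * y[k] for k, r in enumerate(rows)) for a in range(p + 1)]
    beta, logdet = solve_and_logdet(xtx, xty)
    rss = sum((y[k] - sum(b * v for b, v in zip(beta, r))) ** 2
              for k, r in enumerate(rows))
    half = (n - p - 1) / 2.0
    log_ml = (-half * math.log(2.0 * math.pi) + math.lgamma(half)
              - half * math.log(rss / 2.0) - 0.5 * logdet)
    return log_ml, beta, rss


# Hypothetical data: y is roughly 1 + 2 x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
log_ml_with, beta, rss = local_log_marginal_likelihood(y, [x])
log_ml_without, _, _ = local_log_marginal_likelihood(y, [])
print(beta, rss, log_ml_with - log_ml_without)
```

The difference of the two log marginal likelihoods is the log of the local Bayes factor for the model with the parent against the model without it. Because the prior is improper, this comparison inherits the controversy noted above and should be read as illustrative only.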
This approach can be extended to model an unknown variance-covariance structure of the regression parameters, using Normal-Wishart priors (Geiger and Heckerman, 1994).

10.4.2 Model Search

The likelihood modularity allows local model selection and reduces the complexity of model search. Still, the space of possible sets of parents for each variable grows exponentially with the number of candidate parents, and successful heuristic search procedures (both deterministic and stochastic) have been proposed to render the task feasible (Cooper and Herskovitz, 1992, Larranaga et al., 1996, Singh and Valtorta, 1995, Zhou and Sakane, 2002). The aim of these heuristic search procedures is to impose some restrictions on the search space so as to capitalize on the decomposability of the posterior probability of each Bayesian network $M_h$. One suggestion, put forward by (Cooper and Herskovitz, 1992), is to restrict the model search to a subset of all possible networks that are consistent with an ordering relation $\succ$ on the variables $\{Y_1, \ldots, Y_v\}$. This ordering relation is defined by $Y_j \succ Y_i$ if $Y_i$ cannot be a parent of $Y_j$. In other words, rather than exploring networks with arcs in all possible directions, this order limits the search to a subset of networks in which only a subset of directed associations is allowed. At first glance, the requirement for an order among the variables could appear to be a serious restriction on the applicability of this search strategy, and indeed this approach has been criticized in the artificial intelligence community because it limits the automation of model search. From a modeling point of view, however, specifying this order is equivalent to specifying the hypotheses that need to be tested, and some careful screening of the variables in the data set may avoid the effort of exploring implausible models.
For example, we have successfully applied this approach to model survey data (Sebastiani et al., 2000, Sebastiani and Ramoni, 2001C) and more recently genotype data (1). Recent results have shown that restricting the search space by imposing an order among the variables yields a more regular space over the network structures (Friedman and Koller, 2003). Other search strategies based on genetic algorithms (Larranaga et al., 1996), "ad hoc" stochastic methods (Singh and Valtorta, 1995) or Markov chain Monte Carlo methods (Friedman and Koller, 2003) can also be used. An alternative approach to limiting the search space is to define classes of equivalent directed graphical models (Chickering, 2002).

The order imposed on the variables defines a set of candidate parents for each variable $Y_i$, and one way to proceed is to carry out an independent model selection for each variable $Y_i$ and then link together the local models selected for each variable. A further reduction is obtained using the greedy search strategy deployed by the K2 algorithm (Cooper and Herskovitz, 1992). The K2 algorithm is a bottom-up strategy that starts by evaluating the marginal likelihood of the model in which $Y_i$ has no parents. The next step is to evaluate the marginal likelihood of each model with one parent only. If the maximum marginal likelihood of these models is larger than the marginal likelihood of the independence model, the parent that increases the likelihood most is accepted, and the algorithm proceeds to evaluate models with two parents. If none of the models has a marginal likelihood that exceeds that of the independence model, the search stops. The K2 algorithm is implemented in Bayesware Discoverer (http://www.bayesware.com) and in the R package Deal (Bottcher and Dethlefsen, 2003).
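The greedy strategy can be sketched in Python for discrete data. This is a minimal illustration, not the code of Bayesware Discoverer or Deal: the function names, the toy data set, the optional `max_parents` cap, and the choice of counting parent states from the observed data are all assumptions of the sketch. The local score is the logarithm of the closed-form marginal likelihood for discrete networks with a symmetric Dirichlet prior, and the models are taken to be a priori equally likely so that scores can be compared directly.

```python
import math
from collections import Counter


def local_log_score(data, child, parents, alpha=1.0):
    """Log marginal likelihood of `child` given `parents` (symmetric Dirichlet prior)."""
    child_states = sorted({row[child] for row in data})
    c = len(child_states)
    q = 1
    for p in parents:                       # number of parent configurations
        q *= len({row[p] for row in data})
    a_ij, a_ijk = alpha / q, alpha / (c * q)
    n_ijk = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    n_ij = Counter(tuple(row[p] for p in parents) for row in data)
    score = 0.0
    for config, n in n_ij.items():          # unobserved configurations contribute 0
        score += math.lgamma(a_ij) - math.lgamma(a_ij + n)
        for state in child_states:
            score += math.lgamma(a_ijk + n_ijk[(config, state)]) - math.lgamma(a_ijk)
    return score


def k2(data, order, max_parents=2, alpha=1.0):
    """Greedy parent selection for each variable, respecting the given order."""
    parents = {}
    for pos, child in enumerate(order):
        chosen = []
        best = local_log_score(data, child, chosen, alpha)   # independence model
        candidates = list(order[:pos])      # only predecessors may be parents
        while candidates and len(chosen) < max_parents:
            scored = [(local_log_score(data, child, chosen + [c], alpha), c)
                      for c in candidates]
            top_score, top = max(scored)
            if top_score <= best:           # no candidate improves the score: stop
                break
            best = top_score
            chosen.append(top)
            candidates.remove(top)
        parents[child] = chosen
    return parents


# Hypothetical data: A and B are independent, C is a copy of A.
data = [{"A": a, "B": b, "C": a} for a in (0, 1) for b in (0, 1)] * 5
structure = k2(data, ["A", "B", "C"])
print(structure)
```

On this toy data set the search keeps A and B without parents and selects A as the only parent of C, since adding B to the parents of C lowers the local score.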
Greedy search can be trapped in local maxima and induce spurious dependencies; a variant of this search that limits spurious dependencies is stepwise regression (Madigan and Raftery, 1994). However, there is evidence that the K2 algorithm performs as well as other search algorithms (Yu et al., 2002).

10.4.3 Validation

The automation of model selection is not without problems, and both diagnostic and predictive tools are necessary to validate a multivariate dependency model extracted from data. There are two main approaches to model validation: one addresses the goodness of fit of the network selected from data, and the other assesses the predictive accuracy of the network in some predictive/diagnostic tests. The intuition underlying goodness of fit measures is to check the accuracy of the fitted model against the data. In regression models with only one dependent variable, the goodness of fit is typically based on some summary of the residuals, which are defined as the difference between the observed data and the data reproduced by the fitted model. Because a Bayesian network describes a multivariate dependency model in which all nodes represent random variables, we developed blanket residuals (Sebastiani and Ramoni, 2003) as follows. Given the network induced from data, for each case $k$ in the database we compute the value fitted for each node $Y_i$, given all the other values. Denote this fitted value by $\hat{y}_{ik}$ and note that, by the global Markov property, only the configuration of the Markov blanket of the node $Y_i$ is used to compute the fitted value. For categorical variables, the fitted value $\hat{y}_{ik}$ is the most likely category of $Y_i$ given the configuration of its Markov blanket, while for numerical variables the fitted value $\hat{y}_{ik}$ can be either the expected value of $Y_i$ given the Markov blanket, or the modal value. In both cases, the fitted values are computed by using one of the algorithms for probabilistic reasoning described in Section 10.2.