10 ICA by Minimization of Mutual Information

An important approach for independent component analysis (ICA) estimation, inspired by information theory, is minimization of mutual information. The motivation of this approach is that it may not be very realistic in many cases to assume that the data follows the ICA model. Therefore, we would like to develop an approach that does not assume anything about the data. What we want to have is a general-purpose measure of the dependence of the components of a random vector. Using such a measure, we could define ICA as a linear decomposition that minimizes that dependence measure. Such an approach can be developed using mutual information, which is a well-motivated information-theoretic measure of statistical dependence. One of the main utilities of mutual information is that it serves as a unifying framework for many estimation principles, in particular maximum likelihood (ML) estimation and maximization of nongaussianity. In particular, this approach gives a rigorous justification for the heuristic principle of nongaussianity.

10.1 DEFINING ICA BY MUTUAL INFORMATION

10.1.1 Information-theoretic concepts

The information-theoretic concepts needed in this chapter were explained in Chapter 5. Readers not familiar with information theory are advised to read that chapter before this one.

We recall here very briefly the basic definitions of information theory. The differential entropy H of a random vector y with density p(y) is defined as

H(y) = -\int p(y) \log p(y) \, dy    (10.1)

Entropy is closely related to the code length of the random vector. A normalized version of entropy is given by negentropy J, which is defined as follows:

J(y) = H(y_{\mathrm{gauss}}) - H(y)    (10.2)

where y_{gauss} is a gaussian random vector of the same covariance (or correlation) matrix as y. Negentropy is always nonnegative, and zero only for gaussian random vectors. Mutual information I between m (scalar) random variables y_i, i = 1, ..., m, is defined as follows:

I(y_1, y_2, \ldots, y_m) = \sum_{i=1}^m H(y_i) - H(y)    (10.3)

10.1.2 Mutual information as measure of dependence

We have seen earlier (Chapter 5) that mutual information is a natural measure of the dependence between random variables. It is always nonnegative, and zero if and only if the variables are statistically independent. Mutual information takes into account the whole dependence structure of the variables, and not just the covariance, like principal component analysis (PCA) and related methods. Therefore, we can use mutual information as the criterion for finding the ICA representation. This approach is an alternative to the model estimation approach. We define the ICA of a random vector x as an invertible transformation

s = Bx    (10.4)

where the matrix B is determined so that the mutual information of the transformed components s_i is minimized. If the data follows the ICA model, this allows estimation of the data model. On the other hand, in this definition we do not need to assume that the data follows the model. In any case, minimization of mutual information can be interpreted as giving the maximally independent components.
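For the gaussian case, the entropies in (10.3) have closed forms: a gaussian vector with covariance matrix C has differential entropy (1/2) \log \det(2\pi e C), so (10.3) reduces to I = (1/2)(\sum_i \log C_{ii} - \log \det C), which is nonnegative and vanishes exactly when C is diagonal. The short Python sketch below is our own illustration, not part of the original text (function and variable names are arbitrary); the same closed form reappears in Problem 10.4 at the end of the chapter.

```python
import numpy as np

def gaussian_mutual_information(C):
    """Mutual information (10.3) for a gaussian random vector with
    covariance C, using H(y) = 0.5*log det(2*pi*e*C) for the joint
    entropy and H(y_i) = 0.5*log(2*pi*e*C_ii) for the marginals."""
    C = np.asarray(C, dtype=float)
    sum_of_marginals = 0.5 * np.sum(np.log(2 * np.pi * np.e * np.diag(C)))
    joint = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * C))
    return sum_of_marginals - joint   # = 0.5*(sum_i log C_ii - log det C)

# Uncorrelated gaussian components are independent, so I is (essentially) zero:
print(gaussian_mutual_information([[3.0, 0.0], [0.0, 2.0]]))   # ~0 up to rounding
# Correlated components give I > 0:
print(gaussian_mutual_information([[3.0, 1.0], [1.0, 2.0]]))   # 0.5*log(6/5), about 0.091
```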
10.2 MUTUAL INFORMATION AND NONGAUSSIANITY

Using the formula for the differential entropy of a transformation, as given in (5.13) of Chapter 5, we obtain a corresponding result for mutual information. For an invertible linear transformation y = Bx we have

I(y_1, y_2, \ldots, y_n) = \sum_i H(y_i) - H(x) - \log|\det B|    (10.5)

Now, let us consider what happens if we constrain the y_i to be uncorrelated and of unit variance. This means E\{yy^T\} = B E\{xx^T\} B^T = I, which implies

\det I = 1 = \det(B E\{xx^T\} B^T) = (\det B)(\det E\{xx^T\})(\det B^T)    (10.6)

and this implies that \det B must be constant, since \det E\{xx^T\} does not depend on B. Moreover, for y_i of unit variance, entropy and negentropy differ only by a constant and the sign, as can be seen from (10.2). Thus we obtain

I(y_1, y_2, \ldots, y_n) = \text{const.} - \sum_i J(y_i)    (10.7)

where the constant term does not depend on B. This shows the fundamental relation between negentropy and mutual information.

We see from (10.7) that finding an invertible linear transformation B that minimizes the mutual information is roughly equivalent to finding directions in which the negentropy is maximized. We have seen previously that negentropy is a measure of nongaussianity. Thus, (10.7) shows that ICA estimation by minimization of mutual information is equivalent to maximizing the sum of nongaussianities of the estimates of the independent components, when the estimates are constrained to be uncorrelated. Thus, we see that the formulation of ICA as minimization of mutual information gives another rigorous justification of our more heuristically introduced idea of finding maximally nongaussian directions, as used in Chapter 8. In practice, however, there are also some important differences between these two criteria.

1. Negentropy, and other measures of nongaussianity, enable the deflationary, i.e., one-by-one, estimation of the independent components, since we can look for the maxima of nongaussianity of a single projection b^T x. This is not possible with mutual information or most other criteria, like the likelihood.

2. A smaller difference is that in using nongaussianity, we force the estimates of the independent components to be uncorrelated. This is not necessary when using mutual information, because we could use the form in (10.5) directly, as will be seen in the next section. Thus the optimization space is slightly reduced.

10.3 MUTUAL INFORMATION AND LIKELIHOOD

Mutual information and likelihood are intimately connected. To see the connection, consider the expectation of the log-likelihood in (9.5):

\frac{1}{T} E\{\log L(B)\} = \sum_{i=1}^n E\{\log p_i(b_i^T x)\} + \log|\det B|    (10.8)

If the p_i were equal to the actual pdf's of b_i^T x, the first term would be equal to -\sum_i H(b_i^T x). Thus the likelihood would be equal, up to an additive constant given by the total entropy of x, to the negative of the mutual information as given in Eq. (10.5).

In practice, the connection may be just as strong, or even stronger. This is because in practice we do not know the distributions of the independent components that are needed in ML estimation. A reasonable approach would be to estimate the density of b_i^T x as part of the ML estimation method, and use this as an approximation of the density of s_i. This is what we did in Chapter 9. Then the p_i in this approximation of the likelihood are indeed equal to the actual pdf's of b_i^T x, and the equivalence really holds.
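To make the correspondence concrete, the normalized log-likelihood (10.8) can be evaluated directly from a sample; when the assumed densities p_i match the true densities of the b_i^T x, its negative equals the mutual information (10.5) up to the constant H(x), which does not depend on B. The sketch below is our own illustration (the function name and the log cosh density model are our choices, not prescribed by the text).

```python
import numpy as np

def neg_log_likelihood(B, X, log_pdf):
    """Sample version of the normalized log-likelihood (10.8), negated:
    -(1/T) log L(B) = -sum_i E{log p_i(b_i^T x)} - log|det B|,
    with expectations replaced by sample averages.
    B: candidate separating matrix, shape (n, n)
    X: data matrix, shape (n, T), one observation per column
    log_pdf: assumed log-density of the components"""
    Y = B @ X                                           # y = Bx for each sample
    avg_loglik = np.mean(np.sum(log_pdf(Y), axis=0))    # sum_i E{log p_i(b_i^T x)}
    return -(avg_loglik + np.log(abs(np.linalg.det(B))))

# Example usage with a whitened data matrix X and a common supergaussian model:
#   value = neg_log_likelihood(B, X, lambda y: -2.0 * np.log(np.cosh(y)))
```

As discussed next, the same quantity also serves directly as an approximation of mutual information.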
Conversely, to approximate mutual information, we could take a fixed approximation of the densities of the y_i, and plug it into the definition of entropy. Denote the logarithms of the approximating densities by G_i(y_i) = \log p_i(y_i). Then we could approximate (10.5) as

I(y_1, y_2, \ldots, y_n) = -\sum_i E\{G_i(y_i)\} - \log|\det B| - H(x)    (10.9)

Now we see that this approximation is equal to the approximation of the likelihood used in Chapter 9 (except, again, for the global sign and the additive constant given by H(x)). This also gives an alternative method of approximating mutual information that is different from the approximation that uses the negentropy approximations.

10.4 ALGORITHMS FOR MINIMIZATION OF MUTUAL INFORMATION

To use mutual information in practice, we need some method of estimating or approximating it from real data. Earlier, we saw two methods for approximating mutual information. The first one was based on the negentropy approximations introduced in Section 5.6. The second one was based on using more or less fixed approximations for the densities of the ICs, as in Chapter 9. Thus, using mutual information leads essentially to the same algorithms as used for maximization of nongaussianity in Chapter 8, or for maximum likelihood estimation in Chapter 9. In the case of maximization of nongaussianity, the corresponding algorithms are those that use symmetric orthogonalization, since we are maximizing the sum of nongaussianities, so that no order exists between the components. Thus, we do not present any new algorithms in this chapter; the reader is referred to the two preceding chapters.

10.5 EXAMPLES

Here we show the results of applying minimization of mutual information to the two mixtures introduced in Chapter 7. We use here the whitened mixtures, and the FastICA algorithm (which is essentially identical whichever approximation of mutual information is used). For illustration purposes, the algorithm was always initialized so that W was the identity matrix. The function G was chosen as G_1 in (8.26).

First, we used the data consisting of two mixtures of two subgaussian (uniformly distributed) independent components. To demonstrate the convergence of the algorithm, the mutual information of the components at each iteration step is plotted in Fig. 10.1. This was obtained by the negentropy-based approximation. At convergence, after two iterations, mutual information was practically equal to zero.

[Fig. 10.1 The convergence of FastICA for ICs with uniform distributions. The value of mutual information is shown as a function of the iteration count.]

The corresponding results for two supergaussian independent components are shown in Fig. 10.2. Convergence was obtained after three iterations, after which mutual information was practically zero.

[Fig. 10.2 The convergence of FastICA for ICs with supergaussian distributions. The value of mutual information is shown as a function of the iteration count.]
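Since the chapter refers back to the fixed-point algorithms of the preceding chapters, here is a minimal sketch of a FastICA iteration with symmetric orthogonalization, the kind of algorithm used for the examples above. This is our own illustration, not code from the book: it assumes already whitened data Z and uses the tanh nonlinearity, which corresponds to a log cosh contrast such as G_1 with its parameter set to one; the iteration count and initialization are arbitrary.

```python
import numpy as np

def fastica_symmetric(Z, n_iter=20, seed=0):
    """Minimal FastICA sketch with symmetric orthogonalization.
    Z: whitened data of shape (n, T). Returns a separating matrix W
    whose rows estimate the independent components as y = W z."""
    n, T = Z.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))

    def sym_orth(W):
        # symmetric orthogonalization: W <- (W W^T)^(-1/2) W
        d, E = np.linalg.eigh(W @ W.T)
        return E @ np.diag(d ** -0.5) @ E.T @ W

    W = sym_orth(W)
    for _ in range(n_iter):
        Y = W @ Z                            # current component estimates
        gY = np.tanh(Y)                      # g(y) for G(y) = log cosh(y)
        g_prime = 1.0 - gY ** 2              # g'(y)
        # fixed-point update for all rows at once:
        # w_i <- E{z g(w_i^T z)} - E{g'(w_i^T z)} w_i
        W = (gY @ Z.T) / T - np.diag(np.mean(g_prime, axis=1)) @ W
        W = sym_orth(W)                      # keep the estimates uncorrelated
    return W
```

Convergence curves like those in Figs. 10.1 and 10.2 could then be produced by evaluating one of the mutual information approximations on W @ Z after each iteration.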
10.6 CONCLUDING REMARKS AND REFERENCES

A rigorous approach to ICA that is different from the maximum likelihood approach is given by minimization of mutual information. Mutual information is a natural information-theoretic measure of dependence, and therefore it is natural to estimate the independent components by minimizing the mutual information of their estimates. Mutual information gives a rigorous justification of the principle of searching for maximally nongaussian directions, and in the end it turns out to be very similar to the likelihood as well. Mutual information can be approximated by the same methods used to approximate negentropy; alternatively, it can be approximated in the same way as the likelihood. Therefore, we find here very much the same objective functions and algorithms as in maximization of nongaussianity and maximum likelihood. The same gradient and fixed-point algorithms can be used to optimize mutual information.

Estimation of ICA by minimization of mutual information was probably first proposed in [89], where an approximation based on cumulants was derived. The idea has, however, a longer history in the context of neural network research, where it has been proposed as a sensory coding strategy. It was proposed in [26, 28, 30, 18] that decomposing sensory data into features that are maximally independent is useful as a preprocessing step. Our approach follows that of [197] for the negentropy approximations. A nonparametric algorithm for minimization of mutual information was proposed in [175], and an approach based on order statistics was proposed in [369]. See [322, 468] for a detailed analysis of the connection between mutual information and infomax or maximum likelihood. A more general framework was proposed in [377].

Problems

10.1 Derive the formula in (10.5).

10.2 Compute the constant in (10.7).

10.3 If the variances of the y_i are not constrained to unity, does this constant change?

10.4 Compute the mutual information for a gaussian random vector with covariance matrix C.

Computer assignments

10.1 Create a sample of 2-D gaussian data with the two covariance matrices

\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}    (10.10)

Estimate numerically the mutual information using the definition. (Divide the data into bins, i.e., boxes of fixed size, and estimate the density at each bin by computing the number of data points that belong to that bin and dividing it by the size of the bin. This elementary density approximation can then be used in the definition.)
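One possible way to carry out this assignment is to normalize the bin counts into probabilities and plug them into the definition of mutual information; the bin widths cancel between the joint and marginal entropy terms. The sketch below is only an illustration under these assumptions (the function name, the bin count, and the use of NumPy's histogram routine are our own choices).

```python
import numpy as np

def mutual_information_hist(X, n_bins=20):
    """Rough estimate of the mutual information (10.3) of a 2-D sample by
    binning, as described in Computer assignment 10.1. X has shape (2, T)."""
    x, y = X[0], X[1]
    # joint histogram normalized to a probability mass function over bins
    joint, _, _ = np.histogram2d(x, y, bins=n_bins)
    joint /= joint.sum()
    px = joint.sum(axis=1)              # marginal of the first component
    py = joint.sum(axis=0)              # marginal of the second component
    nz = joint > 0                      # skip empty bins to avoid log 0
    # I = sum p(x,y) log[ p(x,y) / (p(x) p(y)) ],
    # i.e. sum_i H(y_i) - H(y) computed from the binned densities
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))

# example with the two covariance matrices in (10.10)
rng = np.random.default_rng(0)
for C in ([[3.0, 0.0], [0.0, 2.0]], [[3.0, 1.0], [1.0, 2.0]]):
    X = rng.multivariate_normal([0.0, 0.0], C, size=20000).T
    print(mutual_information_hist(X))
# The first value should be near zero and the second near the closed form
# 0.5*log(6/5) (about 0.09), up to binning and finite-sample error.
```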
