Document: Independent component analysis P3 pdf

Document information

3 Gradients and Optimization Methods

Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja. Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)

The main task in the independent component analysis (ICA) problem, formulated in Chapter 1, is to estimate a separating matrix $\mathbf{W}$ that will give us the independent components. It also became clear that $\mathbf{W}$ cannot generally be solved in closed form; that is, we cannot write it as some function of the sample or training set whose value could be directly evaluated. Instead, the solution method is based on cost functions, also called objective functions or contrast functions. Solutions $\mathbf{W}$ to ICA are found at the minima or maxima of these functions. Several possible ICA cost functions will be given and discussed in detail in Parts II and III of this book. In general, statistical estimation is largely based on optimization of cost or objective functions, as will be seen in Chapter 4.

Minimization of multivariate functions, possibly under some constraints on the solutions, is the subject of optimization theory. In this chapter, we discuss some typical iterative optimization algorithms and their properties. Mostly, the algorithms are based on the gradients of the cost functions. Therefore, vector and matrix gradients are reviewed first, followed by the most typical ways to solve unconstrained and constrained optimization problems with gradient-type learning algorithms.

3.1 VECTOR AND MATRIX GRADIENTS

3.1.1 Vector gradient

Consider a scalar-valued function $g$ of $m$ variables

$$g = g(w_1, \ldots, w_m) = g(\mathbf{w})$$

where we have used the notation $\mathbf{w} = (w_1, \ldots, w_m)^T$. By convention, we define $\mathbf{w}$ as a column vector. Assuming the function $g$ is differentiable, its vector gradient with respect to $\mathbf{w}$ is the $m$-dimensional column vector of partial derivatives

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} \dfrac{\partial g}{\partial w_1} \\ \vdots \\ \dfrac{\partial g}{\partial w_m} \end{pmatrix} \tag{3.1}$$
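The notation $\partial g / \partial \mathbf{w}$ is just shorthand for the gradient; it should be understood that it does not imply any kind of division by a vector, which is not a well-defined concept. Another commonly used notation would be $\nabla g$ or $\nabla_{\mathbf{w}} g$.

In some iteration methods, we also have reason to use second-order gradients. We define the second-order gradient of a function $g$ with respect to $\mathbf{w}$ as

$$\frac{\partial^2 g}{\partial \mathbf{w}^2} = \begin{pmatrix} \dfrac{\partial^2 g}{\partial w_1^2} & \cdots & \dfrac{\partial^2 g}{\partial w_1 \partial w_m} \\ \vdots & & \vdots \\ \dfrac{\partial^2 g}{\partial w_m \partial w_1} & \cdots & \dfrac{\partial^2 g}{\partial w_m^2} \end{pmatrix} \tag{3.2}$$

This is an $m \times m$ matrix whose elements are second-order partial derivatives. It is called the Hessian matrix of the function $g(\mathbf{w})$. It is easy to see that it is always symmetric (provided the second partial derivatives are continuous, so that the order of differentiation can be exchanged).

These concepts generalize to vector-valued functions; this means an $n$-element vector

$$\mathbf{g}(\mathbf{w}) = \begin{pmatrix} g_1(\mathbf{w}) \\ \vdots \\ g_n(\mathbf{w}) \end{pmatrix} \tag{3.3}$$

whose elements $g_i(\mathbf{w})$ are themselves functions of $\mathbf{w}$. The Jacobian matrix of $\mathbf{g}$ with respect to $\mathbf{w}$ is

$$\frac{\partial \mathbf{g}}{\partial \mathbf{w}} = \begin{pmatrix} \dfrac{\partial g_1}{\partial w_1} & \cdots & \dfrac{\partial g_n}{\partial w_1} \\ \vdots & & \vdots \\ \dfrac{\partial g_1}{\partial w_m} & \cdots & \dfrac{\partial g_n}{\partial w_m} \end{pmatrix} \tag{3.4}$$

Thus the $i$th column of the Jacobian matrix is the gradient vector of $g_i(\mathbf{w})$ with respect to $\mathbf{w}$. The Jacobian matrix is sometimes denoted by $J\mathbf{g}$.

For computing the gradients of products and quotients of functions, as well as of composite functions, the same rules apply as for ordinary functions of one variable. Thus

$$\frac{\partial [f(\mathbf{w}) g(\mathbf{w})]}{\partial \mathbf{w}} = \frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} g(\mathbf{w}) + f(\mathbf{w}) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}} \tag{3.5}$$

$$\frac{\partial [f(\mathbf{w}) / g(\mathbf{w})]}{\partial \mathbf{w}} = \frac{1}{g^2(\mathbf{w})} \left[ \frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} g(\mathbf{w}) - f(\mathbf{w}) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}} \right] \tag{3.6}$$

$$\frac{\partial f(g(\mathbf{w}))}{\partial \mathbf{w}} = f'(g(\mathbf{w})) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}} \tag{3.7}$$

The gradient of the composite function $f(g(\mathbf{w}))$ can be generalized to any number of nested functions, giving the same chain rule of differentiation that is valid for functions of one variable.

3.1.2 Matrix gradient

In many of the algorithms encountered in this book, we have to consider scalar-valued functions $g$ of the elements of an $m \times n$ matrix $\mathbf{W} = (w_{ij})$:

$$g = g(\mathbf{W}) = g(w_{11}, \ldots, w_{ij}, \ldots, w_{mn}) \tag{3.8}$$

A typical function of this kind is the determinant of $\mathbf{W}$.

Of course, any matrix can be trivially represented as a vector by scanning the elements row by row into a vector and reindexing. Thus, when considering the gradient of $g$ with respect to the matrix elements, it would suffice to use the notion of vector gradient reviewed earlier. However, using the separate concept of matrix gradient gives some advantages in terms of a simplified notation and sometimes intuitively appealing results.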
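The row-by-row scanning mentioned above can be illustrated with a small sketch. It is not taken from the book: the helper names (`flatten`, `unflatten`, `numerical_gradient`) and the test function $g(\mathbf{W}) = \sum_{i,j} w_{ij}^2$ are assumptions chosen for illustration, and the gradient is only approximated by central differences rather than computed analytically.

```python
def flatten(W):
    """Scan the matrix row by row into one long list."""
    return [x for row in W for x in row]


def unflatten(v, n_cols):
    """Inverse of flatten: cut the list back into rows of length n_cols."""
    return [v[k:k + n_cols] for k in range(0, len(v), n_cols)]


def numerical_gradient(g, w, eps=1e-6):
    """Central-difference estimate of the vector gradient of g at w."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((g(up) - g(down)) / (2 * eps))
    return grad


def g_matrix(W):
    """Illustrative scalar function of a matrix: sum of squared elements."""
    return sum(x * x for row in W for x in row)


W = [[1.0, -2.0, 0.5], [3.0, 0.0, 4.0]]   # a 2 x 3 example matrix
n_cols = len(W[0])

# Treat g as a function of the flattened vector, differentiate, reshape back.
grad_vec = numerical_gradient(lambda v: g_matrix(unflatten(v, n_cols)), flatten(W))
grad_mat = unflatten(grad_vec, n_cols)
```

For this particular $g$, reshaping the vector gradient back to the shape of $\mathbf{W}$ should give approximately $2\mathbf{W}$, which is the kind of compact matrix-shaped result that motivates a separate notion of matrix gradient.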
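In analogy with the vector gradient, the matrix gradient means a matrix of the same size $m \times n$ as matrix $\mathbf{W}$, whose $ij$th element is the partial derivative of $g$ with respect to $w_{ij}$. Formally we can write

$$\frac{\partial g}{\partial \mathbf{W}} = \begin{pmatrix} \dfrac{\partial g}{\partial w_{11}} & \cdots & \dfrac{\partial g}{\partial w_{1n}} \\ \vdots & & \vdots \\ \dfrac{\partial g}{\partial w_{m1}} & \cdots & \dfrac{\partial g}{\partial w_{mn}} \end{pmatrix} \tag{3.9}$$

Again, the notation $\partial g / \partial \mathbf{W}$ is just shorthand for the matrix gradient.

Let us look next at some examples of vector and matrix gradients. The formulas presented in these examples will be frequently needed later in this book.

3.1.3 Examples of gradients

Example 3.1 Consider the simple linear functional of $\mathbf{w}$, or inner product

$$g(\mathbf{w}) = \sum_{i=1}^{m} a_i w_i = \mathbf{a}^T \mathbf{w}$$

where $\mathbf{a} = (a_1, \ldots, a_m)^T$ is a constant vector. The gradient is, according to (3.1),

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} a_1 \\ \vdots \\ a_m \end{pmatrix} \tag{3.10}$$

which is the vector $\mathbf{a}$. We can write

$$\frac{\partial \mathbf{a}^T \mathbf{w}}{\partial \mathbf{w}} = \mathbf{a}$$

Because the gradient is constant (independent of $\mathbf{w}$), the Hessian matrix of $g(\mathbf{w}) = \mathbf{a}^T \mathbf{w}$ is zero.

Example 3.2 Next consider the quadratic form

$$g(\mathbf{w}) = \mathbf{w}^T \mathbf{A} \mathbf{w} = \sum_{i=1}^{m} \sum_{j=1}^{m} w_i w_j a_{ij} \tag{3.11}$$

where $\mathbf{A} = (a_{ij})$ is a square $m \times m$ matrix. We have

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} \sum_{j=1}^{m} w_j a_{1j} + \sum_{i=1}^{m} w_i a_{i1} \\ \vdots \\ \sum_{j=1}^{m} w_j a_{mj} + \sum_{i=1}^{m} w_i a_{im} \end{pmatrix} \tag{3.12}$$

which is equal to the vector $\mathbf{A}\mathbf{w} + \mathbf{A}^T \mathbf{w}$. So,

$$\frac{\partial \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = \mathbf{A}\mathbf{w} + \mathbf{A}^T \mathbf{w}$$

For symmetric $\mathbf{A}$, this becomes $2\mathbf{A}\mathbf{w}$. The second-order gradient or Hessian becomes

$$\frac{\partial^2 \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}^2} = \begin{pmatrix} 2a_{11} & \cdots & a_{1m} + a_{m1} \\ \vdots & & \vdots \\ a_{m1} + a_{1m} & \cdots & 2a_{mm} \end{pmatrix} \tag{3.13}$$

which is equal to the matrix $\mathbf{A} + \mathbf{A}^T$. If $\mathbf{A}$ is symmetric, then the Hessian of $\mathbf{w}^T \mathbf{A} \mathbf{w}$ is equal to $2\mathbf{A}$.

Example 3.3 For the quadratic form (3.11), we might quite as well take the gradient with respect to $\mathbf{A}$, assuming now that $\mathbf{w}$ is a constant vector. Then $\partial (\mathbf{w}^T \mathbf{A} \mathbf{w}) / \partial a_{ij} = w_i w_j$. Compiling this into matrix form, we notice that the matrix gradient is the $m \times m$ matrix $\mathbf{w}\mathbf{w}^T$.

Example 3.4 In some ICA models, we must compute the matrix gradient of the determinant of a matrix. The determinant is a scalar function of the matrix elements consisting of multiplications and summations, and therefore its partial derivatives are relatively simple to compute. Let us prove the following: If $\mathbf{W}$ is an invertible square $m \times m$ matrix whose determinant is denoted $\det \mathbf{W}$, then

$$\frac{\partial}{\partial \mathbf{W}} \det \mathbf{W} = (\mathbf{W}^T)^{-1} \det \mathbf{W} \tag{3.14}$$

This is a good example for showing that a compact formula is obtained using the matrix gradient; if $\mathbf{W}$ were stacked into a long vector, and only the vector gradient were used, this result could not be expressed so simply.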
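Before the proof, the claim that the matrix gradient of the determinant equals $(\mathbf{W}^T)^{-1} \det \mathbf{W}$ can be sanity-checked numerically. The sketch below is an illustration only, restricted to the $2 \times 2$ case so that the determinant and inverse can be written out by hand; the function names and the example matrix are assumptions, not anything defined in the book.

```python
def det2(W):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = W
    return a * d - b * c


def inverse2(W):
    """Inverse of an invertible 2 x 2 matrix."""
    (a, b), (c, d) = W
    det = det2(W)
    return [[d / det, -b / det], [-c / det, a / det]]


def transpose(W):
    """Transpose of a matrix given as nested lists."""
    return [list(col) for col in zip(*W)]


def numerical_matrix_gradient(g, W, eps=1e-6):
    """Central-difference estimate of the matrix gradient of g at W."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            up = [row[:] for row in W]
            down = [row[:] for row in W]
            up[i][j] += eps
            down[i][j] -= eps
            grad[i][j] = (g(up) - g(down)) / (2 * eps)
    return grad


W = [[2.0, 1.0], [-1.0, 3.0]]   # arbitrary invertible example
numeric = numerical_matrix_gradient(det2, W)

# Right-hand side of the formula: (W^T)^{-1} det W
analytic = [[det2(W) * x for x in row] for row in inverse2(transpose(W))]

max_diff = max(abs(numeric[i][j] - analytic[i][j])
               for i in range(2) for j in range(2))
```

Here `max_diff` should be close to zero. A numerical check of this kind does not replace the proof, but it is a cheap way to catch a transposition or sign error when such formulas are coded.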
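Instead of starting from scratch, we employ a well-known result from matrix algebra (see, e.g., [159]), stating that the inverse of a matrix $\mathbf{W}$ is obtained as

$$\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}} \operatorname{adj}(\mathbf{W}) \tag{3.15}$$

with $\operatorname{adj}(\mathbf{W})$ the so-called adjoint of $\mathbf{W}$. The adjoint is the matrix

$$\operatorname{adj}(\mathbf{W}) = \begin{pmatrix} W_{11} & \cdots & W_{m1} \\ \vdots & & \vdots \\ W_{1m} & \cdots & W_{mm} \end{pmatrix} \tag{3.16}$$

where the scalar numbers $W_{ij}$ are the so-called cofactors. The cofactor $W_{ij}$ is obtained by first taking the $(m-1) \times (m-1)$ submatrix of $\mathbf{W}$ that remains when the $i$th row and $j$th column are removed, then computing the determinant of this submatrix, and finally multiplying by $(-1)^{i+j}$.

The determinant $\det \mathbf{W}$ can also be expressed in terms of the cofactors:

$$\det \mathbf{W} = \sum_{k=1}^{m} w_{ik} W_{ik} \tag{3.17}$$

Row $i$ can be any row, and the result is always the same. In the cofactors $W_{ik}$, none of the matrix elements of the $i$th row appear, so the determinant is a linear function of these elements. Taking now a partial derivative of (3.17) with respect to one of the elements, say, $w_{ij}$, gives

$$\frac{\partial \det \mathbf{W}}{\partial w_{ij}} = W_{ij}$$

By definitions (3.9) and (3.16), this implies directly that

$$\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = \operatorname{adj}(\mathbf{W})^T$$

But $\operatorname{adj}(\mathbf{W})^T$ is equal to $(\det \mathbf{W})(\mathbf{W}^T)^{-1}$ by (3.15), so we have shown our result (3.14).

This also implies that

$$\frac{\partial \log |\det \mathbf{W}|}{\partial \mathbf{W}} = \frac{1}{|\det \mathbf{W}|} \frac{\partial |\det \mathbf{W}|}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \tag{3.18}$$

see (3.15). This is an example of the matrix gradient of a composite function consisting of the $\log$, absolute value, and $\det$ functions. This result will be needed when the ICA problem is solved by maximum likelihood estimation in Chapter 9.

3.1.4 Taylor series expansions of multivariate functions

In deriving some of the gradient-type learning algorithms, we have to resort to Taylor series expansions of multivariate functions. In analogy with the well-known Taylor series expansion of a function $g(w)$ of a scalar variable $w$,

$$g(w') = g(w) + \frac{dg}{dw}(w' - w) + \frac{1}{2} \frac{d^2 g}{dw^2}(w' - w)^2 + \cdots \tag{3.19}$$

we can do a similar expansion for a function $g(\mathbf{w}) = g(w_1, \ldots, w_m)$ of $m$ variables. We have

$$g(\mathbf{w}') = g(\mathbf{w}) + \left( \frac{\partial g}{\partial \mathbf{w}} \right)^T (\mathbf{w}' - \mathbf{w}) + \frac{1}{2} (\mathbf{w}' - \mathbf{w})^T \frac{\partial^2 g}{\partial \mathbf{w}^2} (\mathbf{w}' - \mathbf{w}) + \cdots \tag{3.20}$$

where the derivatives are evaluated at the point $\mathbf{w}$. The second term is the inner product of the gradient vector with the vector $\mathbf{w}' - \mathbf{w}$, and the third term is a quadratic form with the symmetric Hessian matrix $\partial^2 g / \partial \mathbf{w}^2$. The truncation error depends on the distance $\|\mathbf{w}' - \mathbf{w}\|$; the distance has to be small if $g(\mathbf{w}')$ is approximated using only the first- and second-order terms.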
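The dependence of the truncation error on the distance between the two points can be made concrete with a small numerical sketch. The function $g(\mathbf{w}) = e^{w_1} + w_1 w_2^2$, the expansion point, and the direction below are arbitrary assumptions chosen for illustration; the gradient and Hessian were derived by hand for this particular $g$.

```python
import math


def g(w):
    """Illustrative function g(w) = exp(w1) + w1 * w2**2."""
    w1, w2 = w
    return math.exp(w1) + w1 * w2 ** 2


def gradient(w):
    """Vector gradient of g, derived by hand."""
    w1, w2 = w
    return [math.exp(w1) + w2 ** 2, 2 * w1 * w2]


def hessian(w):
    """Hessian of g, derived by hand; note that it is symmetric."""
    w1, w2 = w
    return [[math.exp(w1), 2 * w2], [2 * w2, 2 * w1]]


def taylor2(w, w_new):
    """Second-order Taylor approximation of g(w_new), expanded around w."""
    d = [a - b for a, b in zip(w_new, w)]
    grad, H = gradient(w), hessian(w)
    first = sum(gi * di for gi, di in zip(grad, d))
    second = 0.5 * sum(d[i] * H[i][j] * d[j]
                       for i in range(2) for j in range(2))
    return g(w) + first + second


w = [0.5, -1.0]
direction = [1.0, 2.0]

# Truncation error for a larger and a smaller step along the same direction.
errors = []
for step in (0.1, 0.01):
    w_new = [wi + step * di for wi, di in zip(w, direction)]
    errors.append(abs(g(w_new) - taylor2(w, w_new)))
```

Because the first neglected term is of third order in the step, shrinking the step by a factor of 10 should shrink the error by roughly a factor of 1000 in this example.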
The same expansion can be made for a scalar function of a matrix variable. The second-order term already becomes complicated because the second-order gradient is a four-dimensional tensor. But we can easily extend the first-order term in (3.20), the inner product of the gradient with the vector $\mathbf{w}' - \mathbf{w}$, to the matrix case. Remember that the vector inner product is defined as

$$\left( \frac{\partial g}{\partial \mathbf{w}} \right)^T (\mathbf{w}' - \mathbf{w}) = \sum_{i=1}^{m} \left( \frac{\partial g}{\partial \mathbf{w}} \right)_i (w'_i - w_i)$$

For the matrix case, this must become the sum

$$\sum_{i=1}^{m} \sum_{j=1}^{n} \left( \frac{\partial g}{\partial \mathbf{W}} \right)_{ij} (w'_{ij} - w_{ij})$$

This is the sum of the products of corresponding elements, just like in the vectorial inner product. This can be nicely presented in matrix form when we remember that for any two matrices of the same size, say, $\mathbf{A}$ and $\mathbf{B}$,

$$\operatorname{trace}(\mathbf{A}^T \mathbf{B}) = \sum_{i} (\mathbf{A}^T \mathbf{B})_{ii} = \sum_{i} \sum_{j} (\mathbf{A})_{ji} (\mathbf{B})_{ji}$$

with obvious notation. So, we have

$$g(\mathbf{W}') = g(\mathbf{W}) + \operatorname{trace}\left[ \left( \frac{\partial g}{\partial \mathbf{W}} \right)^T (\mathbf{W}' - \mathbf{W}) \right] + \cdots \tag{3.21}$$

for the first two terms in the Taylor series of a function $g$ of a matrix variable.

3.2 LEARNING RULES FOR UNCONSTRAINED OPTIMIZATION

3.2.1 Gradient descent

Many of the ICA criteria have the basic form of minimizing a cost function $J(\mathbf{W})$ with respect to a parameter matrix $\mathbf{W}$, or possibly with respect to one of its columns $\mathbf{w}$. In many cases, there are also constraints that restrict the set of possible solutions. A typical constraint is to require that the solution vector must have a bounded norm, or that the solution matrix has orthonormal columns.

For the unconstrained problem of minimizing a multivariate function, the most classic approach is steepest descent or gradient descent. Let us consider in more detail the case when the solution is a vector $\mathbf{w}$; the matrix case goes through in a completely analogous fashion.

In gradient descent, we minimize a function $J(\mathbf{w})$ iteratively by starting from some initial point $\mathbf{w}(0)$, computing the gradient of $J(\mathbf{w})$ at this point, and then moving in the direction of the negative gradient or the steepest descent by a suitable distance. Once there, we repeat the same procedure at the new point, and so on. For $t = 1, 2, \ldots$ we have the update rule

$$\mathbf{w}(t) = \mathbf{w}(t-1) - \alpha(t) \left. \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}(t-1)} \tag{3.22}$$

with the gradient taken at the point $\mathbf{w}(t-1)$. The parameter $\alpha(t)$ gives the length of the step in the negative gradient direction.
It is often called the step size or learning rate. Iteration (3.22) is continued until it converges, which in practice happens when the Euclidean distance between two consecutive solutions $\|\mathbf{w}(t) - \mathbf{w}(t-1)\|$ goes below some small tolerance level.

If there is no reason to emphasize the time or iteration step, a convenient shorthand notation will be used throughout this book in presenting update rules of the preceding type. Denote the difference between the new and old value by

$$\Delta \mathbf{w} = \mathbf{w}(t) - \mathbf{w}(t-1) \tag{3.23}$$

We can then write the rule (3.22) either as

$$\Delta \mathbf{w} = -\alpha \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$

or even shorter as

$$\Delta \mathbf{w} \propto -\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$

The symbol $\propto$ is read "is proportional to"; it is then understood that the vector on the left-hand side, $\Delta \mathbf{w}$, has the same direction as the vector on the right-hand side (the negative gradient), but there is a positive scalar coefficient by which the length can be adjusted. In the upper version of the update rule, this coefficient is denoted by $\alpha$. In many cases, this learning rate can and should in fact be time dependent. Yet a third very convenient way to write such update rules, in conformity with programming languages, is

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$

where the symbol $\leftarrow$ means substitution, i.e., the value of the right-hand side is computed and substituted in $\mathbf{w}$.

Geometrically, a gradient descent step as in (3.22) means going downhill. The graph of $J(\mathbf{w})$ is the multidimensional equivalent of mountain terrain, and we are always moving downwards in the steepest direction. This also immediately shows the disadvantage of steepest descent: unless the function $J(\mathbf{w})$ is very simple and smooth, steepest descent will lead to the closest local minimum instead of a global minimum. As such, the method offers no way to escape from a local minimum. Nonquadratic cost functions may have many local maxima and minima. Therefore, good initial values are important in initializing the algorithm.

[Fig. 3.1 Contour plot of a cost function with a local minimum. Labels in the figure: local minimum, global minimum, gradient vector.]

As an example, consider the case of Fig. 3.1. A function $J(\mathbf{w})$ is shown there as a contour plot.
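In the region shown in the figure, there is one local minimum and one global minimum. From the initial point chosen there, where the gradient vector has been plotted, it is very likely that the algorithm will converge to the local minimum.

Generally, the speed of convergence can be quite low close to the minimum point, because the gradient approaches zero there. The speed can be analyzed as follows. Let us denote by $\mathbf{w}^*$ the local or global minimum point where the algorithm will eventually converge. From (3.22) we have

$$\mathbf{w}(t) - \mathbf{w}^* = \mathbf{w}(t-1) - \mathbf{w}^* - \alpha(t) \left. \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}(t-1)} \tag{3.24}$$

Let us expand the gradient vector $\partial J(\mathbf{w}) / \partial \mathbf{w}$ element by element as a Taylor series around the point $\mathbf{w}^*$, as explained in Section 3.1.4. Using only the zeroth- and first-order terms, we have for the $i$th element

$$\left. \frac{\partial J(\mathbf{w})}{\partial w_i} \right|_{\mathbf{w} = \mathbf{w}(t-1)} = \left. \frac{\partial J(\mathbf{w})}{\partial w_i} \right|_{\mathbf{w} = \mathbf{w}^*} + \sum_{j=1}^{m} \left. \frac{\partial^2 J(\mathbf{w})}{\partial w_i \partial w_j} \right|_{\mathbf{w} = \mathbf{w}^*} [w_j(t-1) - w_j^*] + \cdots$$

Now, because $\mathbf{w}^*$ is the point of convergence, the partial derivatives of the cost function must be zero at $\mathbf{w}^*$. Using this result, and compiling the above expansion into vector form, yields

$$\left. \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}(t-1)} = \mathbf{H}(\mathbf{w}^*) [\mathbf{w}(t-1) - \mathbf{w}^*] + \cdots$$

where $\mathbf{H}(\mathbf{w}^*)$ is the Hessian matrix computed at the point $\mathbf{w} = \mathbf{w}^*$. Substituting this in (3.24) gives

$$\mathbf{w}(t) - \mathbf{w}^* \approx [\mathbf{I} - \alpha(t) \mathbf{H}(\mathbf{w}^*)] [\mathbf{w}(t-1) - \mathbf{w}^*]$$

This kind of convergence, which is essentially equivalent to multiplying a matrix many times with itself, is called linear. The speed of convergence depends on the learning rate and the size of the Hessian matrix. If the cost function $J(\mathbf{w})$ is very flat at the minimum, with second partial derivatives also small, then the Hessian is small and the convergence is slow (for fixed $\alpha(t)$). Usually, we cannot influence the shape of the cost function, and we have to choose $\alpha(t)$, given a fixed cost function.

The choice of an appropriate step length or learning rate $\alpha(t)$ is thus essential: too small a value will lead to slow convergence. The value cannot be too large either: too large a value will lead to overshooting and instability, which prevents convergence altogether. In Fig. 3.1, too large a learning rate will cause the solution point to zigzag around the local minimum.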
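The effect of the learning rate can be seen in a minimal sketch of the gradient descent iteration with a constant step size. Everything specific here is an assumption made for illustration: the quadratic cost $J(\mathbf{w}) = \frac{1}{2}(h_1 w_1^2 + h_2 w_2^2)$ with a diagonal Hessian, the starting point, the tolerance, and the names `grad_J` and `gradient_descent`.

```python
import math

H_DIAG = (1.0, 10.0)   # diagonal of the (assumed) Hessian of J


def grad_J(w):
    """Gradient of J(w) = 0.5 * (h1 * w1**2 + h2 * w2**2)."""
    return [h * wi for h, wi in zip(H_DIAG, w)]


def gradient_descent(w0, alpha, tol=1e-8, max_steps=200):
    """Iterate w <- w - alpha * grad J(w) with a constant learning rate.

    Stops when two consecutive points are closer than tol (Euclidean
    distance) or after max_steps. Returns (w, steps, converged).
    """
    w = list(w0)
    for step in range(1, max_steps + 1):
        w_new = [wi - alpha * gi for wi, gi in zip(w, grad_J(w))]
        dist = math.sqrt(sum((a - b) * (a - b) for a, b in zip(w_new, w)))
        w = w_new
        if dist < tol:
            return w, step, True
    return w, max_steps, False


# A moderate learning rate converges towards the minimum at the origin ...
w_ok, steps_ok, conv_ok = gradient_descent([1.0, 1.0], alpha=0.15)

# ... while a too large one overshoots along the steep direction and diverges.
w_bad, steps_bad, conv_bad = gradient_descent([1.0, 1.0], alpha=0.25)
```

For this cost the minimum is at the origin and the Hessian is constant, so each error component is multiplied by $1 - \alpha h_i$ per step. The iteration is therefore stable only when $|1 - \alpha h_i| < 1$ for both components, that is, for $\alpha < 0.2$ with the values assumed above; with $\alpha = 0.25$ the second component changes sign and grows at every step.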
The problem is that we do not know the Hessian matrix and therefore determining a good value for the learning rate is difficult.

A simple extension to the basic gradient descent, popular in neural network learning rules like the back-propagation algorithm, is to use a two-step iteration instead of just one step like in (3.22), leading to the so-called momentum method. Neural network literature has produced a large number of tricks for boosting steepest descent learning by adjustable learning rates, clever choice of the initial value, etc. However, in ICA, many of the most popular algorithms are still straightforward gradient descent methods, in which the gradient of an appropriate contrast function is computed and used as such in the algorithm.

3.2.2 Second-order learning

In numerical analysis, a large number of methods that are more efficient than plain gradient descent have been introduced for minimizing or maximizing a multivariate scalar function. They could be immediately used for the ICA problem. Their advantage is faster convergence in terms of the number of iterations required, but the disadvantage quite often is increased computational complexity per iteration. Here we consider second-order methods, which means that we also use the information contained in the second-order derivatives of the cost function. Obviously, this information relates to the curvature of the optimization terrain and should help in finding a better direction for the next step in the iteration than just plain gradient descent.

A good starting point is the multivariate Taylor series; see Section 3.1.4. Let us develop the function $J(\mathbf{w})$ in Taylor series around a point $\mathbf{w}$ as

$$J(\mathbf{w}') = J(\mathbf{w}) + \left[ \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right]^T (\mathbf{w}' - \mathbf{w}) + \frac{1}{2} (\mathbf{w}' - \mathbf{w})^T \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} (\mathbf{w}' - \mathbf{w}) + \cdots \tag{3.25}$$

In trying to minimize the function $J(\mathbf{w})$, we ask what choice of the new point $\mathbf{w}'$ gives us the largest decrease in the value of $J(\mathbf{w})$. We can write $\mathbf{w}' - \mathbf{w} = \Delta \mathbf{w}$ and minimize the function

$$J(\mathbf{w}') - J(\mathbf{w}) = \left[ \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right]^T \Delta \mathbf{w} + \frac{1}{2} \Delta \mathbf{w}^T \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \Delta \mathbf{w}$$

with respect to $\Delta \mathbf{w}$.
The gradient of this function with respect to $\Delta \mathbf{w}$ is (see Example 3.2) equal to

$$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} + \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \Delta \mathbf{w}$$

note that the Hessian matrix is symmetric. If the Hessian is also positive definite, then the function will have a parabolic shape and the minimum is given by the zero of the gradient. Setting the gradient to zero gives

$$\Delta \mathbf{w} = -\left[ \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \right]^{-1} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$

From this, the following second-order iteration rule emerges:

$$\mathbf{w}' = \mathbf{w} - \left[ \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \right]^{-1} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \tag{3.26}$$

where we have to compute the gradient and Hessian on the right-hand side at the point $\mathbf{w}$.

Algorithm (3.26) is called Newton's method, and it is one of the most efficient ways for function minimization. It is, in fact, a special case of the well-known Newton's method for solving an equation; here it solves the equation that says that the gradient is zero.

Newton's method provides fast convergence in the vicinity of the minimum, if the Hessian matrix is positive definite there, but the method may perform poorly farther away. A complete convergence analysis is given in [284]. It is also shown there that the convergence of Newton's method is quadratic; if $\mathbf{w}^*$ is the limit of convergence, then

$$\|\mathbf{w}' - \mathbf{w}^*\| \le \gamma \|\mathbf{w} - \mathbf{w}^*\|^2$$

where $\gamma$ is a constant. This is a very strong mode of convergence. When the error on the right-hand side is relatively small, its square can be orders of magnitude smaller. (If the exponent is 3, the convergence is called cubic, which is somewhat better than quadratic, although the difference is not as large as the difference between linear and quadratic convergence.)

On the other hand, Newton's method is computationally much more demanding per iteration than the steepest descent method. The inverse of the Hessian has to be computed at each step, which is prohibitively heavy for many practical cost functions in high dimensions. It may also happen that the Hessian matrix becomes ill-conditioned or close to a singular matrix at some step of the algorithm, which induces numerical errors into the iteration. One possible remedy for this is [...]
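The Newton iteration and its quadratic convergence can be illustrated with a small two-dimensional sketch. The cost function $J(\mathbf{w}) = e^{w_1} - w_1 + \frac{1}{2}(w_2 - w_1)^2$, its hand-derived gradient and Hessian, the starting point, and the helper names are all assumptions chosen for illustration. Instead of forming the inverse Hessian explicitly, the sketch solves the linear system $\mathbf{H} \Delta\mathbf{w} = -\partial J / \partial \mathbf{w}$, which gives the same step.

```python
import math


def grad_J(w):
    """Gradient of J(w) = exp(w1) - w1 + 0.5 * (w2 - w1)**2."""
    w1, w2 = w
    return [math.exp(w1) - 1.0 - (w2 - w1), w2 - w1]


def hessian_J(w):
    """Hessian of J; its determinant is exp(w1) > 0, so it is positive definite."""
    w1, _ = w
    return [[math.exp(w1) + 1.0, -1.0], [-1.0, 1.0]]


def solve2(H, b):
    """Solve the 2 x 2 linear system H d = b by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    d0 = (b[0] * H[1][1] - H[0][1] * b[1]) / det
    d1 = (H[0][0] * b[1] - H[1][0] * b[0]) / det
    return [d0, d1]


def newton_step(w):
    """One Newton iteration: w' = w - H^{-1} grad J, via a linear solve."""
    g = grad_J(w)
    d = solve2(hessian_J(w), [-g[0], -g[1]])
    return [wi + di for wi, di in zip(w, d)]


w = [1.0, -1.0]
w_star = [0.0, 0.0]   # known minimizer of this particular J

# Record the distance to the minimizer before and after each Newton step.
errors = [math.hypot(w[0] - w_star[0], w[1] - w_star[1])]
for _ in range(5):
    w = newton_step(w)
    errors.append(math.hypot(w[0] - w_star[0], w[1] - w_star[1]))
```

In this example each recorded error should be no larger than the square of the previous one, so the number of correct digits roughly doubles per iteration until floating-point precision is reached. For comparison, a plain gradient descent run on a typical cost reduces the error only by a constant factor per step.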
... encountered later in this book. Several learning algorithms of principal component analysis networks and the well-known least-mean-squares algorithm, for example, are instantaneous stochastic gradient algorithms.

Example 3.5 We assume that $\mathbf{x}$ satisfies the ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$, with the elements of the source vector $\mathbf{s}$ statistically independent and $\mathbf{A}$ the mixing matrix. The problem is to solve $\mathbf{s}$ and $\mathbf{A}$, knowing ...

... are the corresponding eigenvalues. The principal components of a random vector $\mathbf{x}$ are defined in terms of the eigenvectors, as discussed in Chapter 6. With a somewhat deeper analysis, it can be shown [324] that the only asymptotically stable fixed point is the eigenvector corresponding to the largest eigenvalue, which gives the first principal component. The example shows how an intractable stochastic ...

... because the randomness causes fluctuations that never die out unless they are deliberately frozen by letting the learning rate go to zero. The analysis of stochastic algorithms like (3.32) is the subject of stochastic approximation; see, e.g., [253]. In brief, the analysis is based on the averaged differential equation that is obtained from (3.32) by taking averages over $\mathbf{x}$ on the right-hand side: the differential ...

... points, i.e., roots of the right-hand side, because these are the points where the change in $\mathbf{w}$ over time becomes zero. It is also well known how, by linearizing the right-hand side with respect to $\mathbf{w}$, a stability analysis of these fixed points can be accomplished. Especially important are the so-called asymptotically stable fixed points that are local points of attraction. Now, if the learning rate $\alpha(t)$ is a suitably ...
... is the eigenvector corresponding to the largest eigenvalue, which gives the first principal component. The example shows how an intractable stochastic on-line rule can be nicely analyzed by the powerful analysis tools existing for ODEs.

3.3 LEARNING RULES FOR CONSTRAINED OPTIMIZATION

In many cases we have to minimize or maximize a function $J(\mathbf{w})$ under some ...

... covered in [172]. Constrained optimization has been extensively discussed in [284]. Projection on the unit sphere and the short-cut approximation for normalization have been discussed in [323, 324]. A rigorous analysis of the convergence of the stochastic on-line algorithms is discussed in [253].

Problems

3.1 Show that the Jacobian matrix of the gradient vector $\partial g / \partial \mathbf{w}$ with respect to $\mathbf{w}$ is equal to the Hessian of $g$ ...

... function. Formulate (a) the corresponding batch learning rule, (b) the averaged differential equation. Consider a stationary point of (a) and (b). Show that if $\mathbf{W}$ is such that the elements of $\mathbf{y}$ are zero-mean and independent, then $\mathbf{W}$ is a stationary point.

3.9 Assume that we want to maximize a function $F(\mathbf{w})$ on the unit sphere, i.e., under the constraint $\|\mathbf{w}\| = 1$. Prove that at the maximum, the gradient of $F$ must point ...

Date posted: 21/01/2014, 06:20
