Adaptive Filtering and Change Detection, Part 7

Adaptive Filtering and Change Detection
Fredrik Gustafsson
Copyright © 2000 John Wiley & Sons, Ltd
ISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)

7 Change detection based on filter banks

7.1 Basics
7.2 Problem setup
    7.2.1 The changing regression model
    7.2.2 Notation
7.3 Statistical criteria
    7.3.1 The MML estimator
    7.3.2 The a posteriori probabilities
    7.3.3 On the choice of priors
7.4 Information based criteria
    7.4.1 The MGL estimator
    7.4.2 MGL with penalty term
    7.4.3 Relation to MML
7.5 On-line local search for optimum
    7.5.1 Local tree search
    7.5.2 Design parameters
7.6 Off-line global search for optimum
7.7 Applications
    7.7.1 Storing EKG signals
    7.7.2 Speech segmentation
    7.7.3 Segmentation of a car's driven path
7.A Two inequalities for likelihoods
    7.A.1 The first inequality
    7.A.2 The second inequality
    7.A.3 The exact pruning algorithm
7.B The posterior probabilities of a jump sequence
    7.B.1 Main theorems

7.1 Basics

Let us start by considering change detection in linear regressions as an off-line problem, which will be referred to as segmentation. The goal is to find a sequence of time indices $k^n = (k_1, k_2, \dots, k_n)$, where both the number $n$ and the locations $k_i$ are unknown, such that a linear regression model with piecewise constant parameters,

$$y_t = \varphi_t^T \theta(i) + e_t, \qquad \mathrm{Cov}(e_t) = \lambda(i) R_t, \qquad k_{i-1} < t \le k_i,$$

is a good description of the observed signal $y_t$. In this chapter the measurements may be vector valued, the nominal covariance matrix of the noise is $R_t$, and $\lambda(i)$ is a possibly unknown scaling, which is piecewise constant. One way to guarantee that the best possible solution is found is to consider all possible segmentations $k^n$, estimate one linear regression model in each segment, and then choose the particular $k^n$ that minimizes an optimality criterion,

$$\hat{k}^n = \arg\min_{n \ge 1,\; 0 < k_1 < \cdots < k_n} V(k^n).$$
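To make this formulation concrete, the sketch below finds the exact minimizer of a per-segment least-squares cost plus a fixed penalty per segment, by dynamic programming over the last change time. It is an illustration of the search space only, not one of this chapter's algorithms: the quadratic cost, the penalty value, and all function names are my own choices.

```python
import numpy as np

def segment_cost(y, phi, i, j):
    """Least-squares cost of modeling y[i:j] (j exclusive) with a single
    linear regression y_t = phi_t^T theta + e_t."""
    Phi, Y = phi[i:j], y[i:j]
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    r = Y - Phi @ theta
    return float(r @ r)

def best_segmentation(y, phi, penalty=10.0, min_len=3):
    """Exact minimizer of sum_i [segment_cost(i) + penalty] over all
    segmentations, via dynamic programming over the last change time."""
    N = len(y)
    V = np.full(N + 1, np.inf)           # V[j] = best cost for y[0:j]
    V[0] = 0.0
    last = np.zeros(N + 1, dtype=int)    # last[j] = previous change time
    for j in range(min_len, N + 1):
        for i in range(0, j - min_len + 1):
            c = V[i] + segment_cost(y, phi, i, j) + penalty
            if c < V[j]:
                V[j], last[j] = c, i
    ks, j = [], N                        # backtrack k_1 < ... < k_n = N
    while j > 0:
        ks.append(j)
        j = last[j]
    return sorted(ks)

# Toy usage: a scalar signal whose mean jumps at t = 50.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(4.0, 1.0, 50)])
phi = np.ones((100, 1))                  # regression on a constant mean
print(best_segmentation(y, phi))         # expect change times near [50, 100]
```

The penalty plays the role of the statistical terms derived later in the chapter: without it, the raw least-squares criterion is minimized by putting every sample in its own segment.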
...

The main part of this chapter is devoted to the second approach, which provides a solution to adaptive filtering, which is an on-line problem.

7.2 Problem setup

7.2.1 The changing regression model

...

Both derivations require sufficiently many data per segment: $N(i)p - d > 0$ in (7.14) and $N(i)p - d - 2 > 0$ in (7.15). The segments must therefore be forced to be long enough.

7.3.3 On the choice of priors

The Gaussian assumption on the noise is a standard one, partly because it gives analytical expressions and partly because it has proven to work well in practice. Other alternatives are rarely seen. The Laplacian distribution is shown in Wu and Fitzgerald (1995) to also give an analytical solution in the case of unknown mean models, and it was there found to be less sensitive to large measurement errors.

The standard approach used here for marginalization is to consider both a Gaussian and a non-informative prior in parallel. We often give priority to a non-informative prior on $\theta$, using a flat density function, in our aim to have as few non-intuitive design parameters as possible. That is, $p(\theta \mid k^n) = C$ is an arbitrary constant in (7.10). The use of non-informative priors, and especially improper ones, is sometimes criticized; see Aitken (1991) for an interesting discussion. Specifically, the flat prior here introduces an arbitrary term $n \log C$ in the log likelihood.

The idea of using a flat, or non-informative, prior in marginalization is perhaps best explained by an example.

Example 7.1: Marginalized likelihood for variance estimation

Suppose we have $t$ observations from a Gaussian distribution, $y_t \in N(\mu, \lambda)$, so the likelihood $p(y^t \mid \mu, \lambda)$ is Gaussian. We want to compute the likelihood conditioned on just $\lambda$ using marginalization, $p(y^t \mid \lambda) = \int p(y^t \mid \mu, \lambda)\, p(\mu)\, d\mu$. Two alternative priors are a Gaussian, $\mu \in N(\mu_0, P_0)$, and a flat prior, $p(\mu) = C$. In both cases, we end up with an inverse Wishart density function (3.54), with maximum at

$$\hat{\lambda} = \frac{1}{t-1} \sum_{k=1}^{t} (y_k - \bar{y})^2,$$

where $\bar{y}$ is the sample average. Note the scaling factor $1/(t-1)$, which makes the estimate unbiased. The joint likelihood estimate of both mean and variance gives a variance estimator with scaling factor $1/t$, which induces a bias in the estimate; the flat prior thus eliminates the bias. We remark that the likelihood interpreted as a conditional density function is proper, and that it does not depend upon the constant $C$.

The use of a flat prior can be motivated as follows: the data dependent terms in the log likelihood increase like $\log N$. That is, whatever the choice of $C$, the prior dependent term will be insignificant for a large amount of data.

A suitable choice of $C$ can be shown to give approximately the same likelihood as a proper informative Gaussian prior would give if the true parameters were known and used in the prior; see Gustafsson (1996), where an example is given. More precisely, with the prior $N(\theta_0, P_0)$, where $\theta_0$ is the true value of $\theta(i)$, the constant should be chosen as $C = \det P_0$. The uncertainty about $\theta_0$ reflected in $P_0$ should be much larger than the data information in $P(i)$ if one wants the data to speak for themselves. Still, the choice of $P_0$ is ambiguous: the larger its value, the higher the penalty on a large number of segments. This is exactly Lindley's paradox (Lindley, 1957):

    Lindley's paradox: the more non-informative the prior, the more the null hypothesis is favored.

Thus, the prior should be chosen as informative as possible without interfering with the data. For auto-regressions and other regressions where the parameters are scaled to be around or less than one, the choice $P_0 = I$ is appropriate. Since the true value $\theta_0$ is not known, this discussion seems to validate the use of a flat prior with the choice $C = 1$, which has also been confirmed to work well in simulations. An unknown noise variance is assigned a flat prior as well, with the same pragmatic motivation.

Example 7.2: Lindley's paradox

Consider the hypothesis test

$$H_0: y \in N(0, 1), \qquad H_1: y \in N(\theta, 1),$$

and assume that the prior on $\theta$ is $N(\theta_0, P_0)$. Equation (5.98) gives the posterior probabilities for scalar measurements. Here we have $N = 1$, $P_1 = (P_0^{-1} + 1)^{-1}$ and $\hat{\theta}_1 = P_1 (P_0^{-1} \theta_0 + y)$. The likelihood ratio between $H_1$ and $H_0$ then tends to zero as $P_0$ grows, since the whole expression behaves like $1/\sqrt{P_0}$. This fact is not influenced by the number of data, by what the true mean is, or by what $\theta_0$ is. That is, the more non-informative the prior, the more $H_0$ is favored!
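As a quick numerical check of the paradox (my own toy example, not from the text): marginalizing $\theta$ under $H_1$ with the prior $N(\theta_0, P_0)$ gives $y \in N(\theta_0, 1 + P_0)$, so the Bayes factor in favor of $H_0$ grows like $\sqrt{P_0}$ no matter what $y$ is.

```python
import numpy as np
from scipy.stats import norm

def log_bayes_factor(y, theta0=0.0, P0=1.0):
    """log p(y|H0) - log p(y|H1), with H0: y ~ N(0,1) and H1: y ~ N(theta,1),
    theta ~ N(theta0, P0); marginalizing theta gives y|H1 ~ N(theta0, 1+P0)."""
    return norm.logpdf(y, 0.0, 1.0) - norm.logpdf(y, theta0, np.sqrt(1.0 + P0))

y = 2.0                                  # two standard deviations away from H0
for P0 in [1.0, 1e2, 1e4, 1e6]:
    print(f"P0 = {P0:>9g}: log BF(H0) = {log_bayes_factor(y, P0=P0):.2f}")
# The log Bayes factor grows by about 0.5*log(P0): H0 is favored ever more,
# regardless of the observed y.
```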
7.4 Information based criteria

The information based approach of this section can be called a penalized Maximum Generalized Likelihood (MGL) approach.

7.4.1 The MGL estimator

It is straightforward to show that the minimum of (7.9) with respect to $\theta^n$, assuming a known $\lambda(i)$, is

$$\mathrm{MGL}(k^n) = -2 \log p(y^N \mid k^n, \hat{\theta}^n, \lambda^n) = Np \log(2\pi) + \sum_{t=1}^{N} \log\det(R_t) + \sum_{i=1}^{n} \left( \frac{V(i)}{\lambda(i)} + N(i) \log(\lambda(i)^p) \right). \tag{7.17}$$

Minimizing the right-hand side of (7.17) with respect to a constant unknown noise scaling $\lambda(i) = \lambda$ gives (7.18), and finally, for a changing noise scaling,

$$\mathrm{MGL}(k^n) = -2 \log p(y^N \mid k^n, \hat{\theta}^n, \hat{\lambda}^n) = Np \log(2\pi) + \sum_{t=1}^{N} \log\det(R_t) + \sum_{i=1}^{n} N(i)\, p \left( 1 + \log \hat{\lambda}(i) \right), \qquad \hat{\lambda}(i) = \frac{V(i)}{N(i)\, p}. \tag{7.19}$$
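Since (7.17) and (7.19) are only partly legible in this reproduction, the sketch below should be read as an illustration of the criterion as reconstructed above, not a transcription of the book's method. Scalar measurements ($p = 1$), $R_t = 1$, and per-segment ML noise scalings are assumed, and all function names are mine.

```python
import numpy as np

def mgl(y, phi, ks):
    """-2 log generalized likelihood of a segmentation ks = [k_1, ..., k_n]
    (with k_n = len(y)), for scalar measurements, R_t = 1, and the noise
    scaling lambda(i) estimated by ML separately in each segment."""
    crit, start = len(y) * np.log(2.0 * np.pi), 0
    for k in ks:
        Phi, Y = phi[start:k], y[start:k]
        theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
        r = Y - Phi @ theta
        lam = max(float(r @ r) / (k - start), 1e-12)   # ML noise scaling
        crit += (k - start) * (1.0 + np.log(lam))      # N(i) * (1 + log lam)
        start = k
    return crit

# Comparing two candidate segmentations of a signal with a jump at t = 50:
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(4.0, 1.0, 50)])
phi = np.ones((100, 1))
print(mgl(y, phi, [100]), mgl(y, phi, [50, 100]))      # the latter is smaller
```

Note that the maximized likelihood can only improve when more segments are allowed, since the split model contains the unsplit one as a special case; this is why the penalized versions of Section 7.4.2 are needed.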
...

7.5.2 Design parameters

It has already been noted that the segments have to be longer than a minimum segment length; otherwise, the derivations of (7.14) and (7.15) in Appendix 7.B are not valid. Consider Theorem 7.5: since it contains a term $\Gamma((N(i)p - d - 2)/2)$, and the gamma function $\Gamma(z)$ has poles at $z = 0, -1, -2, \dots$, the segment lengths must be larger than $(2 + d)/p$. This is intuitively logical, since $d$ data points are required to estimate $\theta$ and two more to estimate $\lambda$.

That it could be wise to use a minimum lifelength for the sequences can be seen as follows. Suppose the regression has a third order model structure. Then at least three measurements are needed to estimate the parameters, and more are needed to judge the fit of the model to the data; that is, only after at least four samples can something intelligent be said about the data fit. Thus, the choice of a minimum lifelength is related to the identifiability of the model, and it should be chosen larger than $\dim(\theta)$.

It is interesting to point out the possibility of forcing the algorithm to give the exact MAP estimate, by setting the minimum lifelength and the number of sequences to $N$. In this way, only the first rule is actually performed (which is the first step in Algorithm 7.3), and the MAP estimate is found in quadratic time. Finally, the jump probability $q$ is used to tune the number of segments.

7.6 Off-line global search for optimum

Numerical approximations that have been suggested include dynamic programming (Djuric, 1992), batch-wise processing where only a small number of jump times is considered (Kitagawa and Akaike, 1978), and MCMC methods, but it is fairly easy to construct examples where these approaches have shortcomings, as demonstrated in Section 4.4. Algorithm 4.2 for signal estimation is straightforward to generalize to the parameter estimation problem. This more general form is given in Fitzgerald et al. (1994), and it is a combination of Gibbs sampling and the Metropolis algorithm.

Algorithm 7.2: MCMC segmentation

1. Decide the number of changes $n$, and choose which likelihood to use. The options are the a posteriori probabilities in Theorems 7.3, 7.4 or 7.5 with $q = 0.5$.
2. Iterate over Monte Carlo runs $i$.
3. In each run, iterate the Gibbs sampler over the components $j$ of $k^n$, drawing a random number from $p(k_j \mid k^n \text{ except } k_j)$. Denote the new candidate sequence $\bar{k}^n$. The proposal distribution may be taken as flat, or as a Gaussian centered around the previous estimate.
4. The candidate is accepted if the likelihood increases, $p(\bar{k}^n) > p(k^n)$. Otherwise, the candidate is accepted (the Metropolis step) if a random number from a uniform distribution is less than the likelihood ratio.

After the burn-in (convergence) time, the distribution of the change times can be computed by Monte Carlo techniques. The last step of random rejection sampling is what defines the Metropolis algorithm: a candidate is rejected with large probability if its value is unlikely. We refer to Section 4.4 for illustrative examples and to Section 7.7.3 for an application.
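The following is a minimal sketch in the spirit of Algorithm 7.2, with my own simplifications throughout: the marginalized likelihoods of Theorems 7.3-7.5 are replaced by a crude per-segment Gaussian ML likelihood, and the Gibbs proposal is a simple random walk whose width is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def loglik(y, ks):
    """Stand-in log likelihood of a jump sequence: i.i.d. Gaussian segments
    with ML mean and variance estimated per segment."""
    ll, edges = 0.0, [0, *ks, len(y)]
    for a, b in zip(edges[:-1], edges[1:]):
        if b - a < 2:
            return -np.inf                 # enforce a minimum segment length
        lam = max(y[a:b].var(), 1e-12)
        ll += -0.5 * (b - a) * (np.log(2 * np.pi * lam) + 1.0)
    return ll

def mcmc_segmentation(y, n_changes, n_iter=2000, width=5):
    """Gibbs sweeps over the components of k^n with a Metropolis
    accept/reject step, as in Algorithm 7.2."""
    N = len(y)
    ks = sorted(rng.choice(np.arange(2, N - 2), size=n_changes,
                           replace=False).tolist())
    ll, samples = loglik(y, ks), []
    for it in range(n_iter):
        for j in range(n_changes):         # one Gibbs component at a time
            prop = int(np.clip(ks[j] + rng.integers(-width, width + 1), 2, N - 2))
            cand = sorted(set(ks[:j] + [prop] + ks[j + 1:]))
            if len(cand) != n_changes:
                continue                   # proposal collided with another k_j
            llc = loglik(y, cand)
            # Metropolis: accept improvements, otherwise accept with
            # probability equal to the likelihood ratio.
            if llc >= ll or rng.uniform() < np.exp(llc - ll):
                ks, ll = cand, llc
        if it >= n_iter // 2:              # discard the burn-in half
            samples.append(tuple(ks))
    return samples

y = np.concatenate([rng.normal(0, 1, 60), rng.normal(3, 2, 60)])
samples = mcmc_segmentation(y, n_changes=1)
print(max(set(samples), key=samples.count))   # most visited change time
```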
7.7 Applications

The first application uses segmentation as a means for signal compression, modeling an EKG signal as a piecewise constant polynomial. In the second application, the proposed method is compared to existing segmentation methods. In an attempt to be as fair as possible, we choose a test signal that has been examined before in the literature, so that the algorithms under comparison can be assumed to be tuned as well as possible. The last application concerns real-time estimation for navigation in a car.

7.7.1 Storing EKG signals

The EKG compression problem defined in Section 2.6.2 is here approached by segmentation. Algorithm 7.1 is used with 10 parallel filters and a fixed noise variance $\sigma^2 = 0.01$. The assumption of a fixed variance gives us a tool to control the accuracy of the compression, and to trade it off against the compression rate. Figure 7.2 shows the EKG signal and a possible segmentation.

[Figure 7.2: An EKG signal and a piecewise constant linear model (a) and quadratic model (b), respectively.]

For evaluation, the following statistics are interesting:

[Table: model type (linear and quadratic regression), number of parameters, and compression rate (%).]

With this algorithm, the linear model gives far less error at almost the same compression rate. Numerical resolution is the reason for the poor performance of the quadratic model, which includes the linear one as a special case: if a lower value of $\sigma^2$ is supplied, the performance degrades substantially. The remedy seems to be to use another basis for the quadratic polynomial.
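To illustrate how a fixed noise variance turns segmentation into a compression knob, here is a minimal greedy sketch. It is not the book's Algorithm 7.1, it runs on synthetic data rather than the EKG signal, and the growth rule and all names are mine.

```python
import numpy as np

def compress(y, order=1, sigma2=0.01):
    """Greedy piecewise-polynomial coder: extend the current segment as
    long as the polynomial fit stays within the assumed noise variance,
    storing only change times and polynomial coefficients."""
    N, start, book = len(y), 0, []
    while start < N:
        end = min(start + order + 2, N)           # minimum samples to judge fit
        t = np.arange(start, end)
        coef = np.polyfit(t, y[start:end], min(order, end - start - 1))
        while end < N:                             # grow while the fit holds
            t = np.arange(start, end + 1)
            c = np.polyfit(t, y[start:end + 1], order)
            if np.mean((y[start:end + 1] - np.polyval(c, t)) ** 2) > sigma2:
                break                              # fit got too poor: close segment
            coef, end = c, end + 1
        book.append((end, coef))                   # change time + coefficients
        start = end
    return book

rng = np.random.default_rng(2)
y = np.concatenate([np.linspace(0, 1, 40), np.ones(40)]) + rng.normal(0, 0.05, 80)
book = compress(y, order=1, sigma2=0.01)
stored = sum(1 + len(c) for _, c in book)          # one change time + coefficients
print(len(book), "segments; compression rate:", stored / len(y))
```

Each segment costs one change time plus order + 1 coefficients, which is where the compression comes from; lowering $\sigma^2$ buys accuracy at the price of more segments.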
7.7.2 Speech segmentation

The speech signal under consideration was recorded inside a car by the French National Agency for Telecommunications, as described in Andre-Obrecht (1988). (The author would like to thank Michele Basseville and Regine Andre-Obrecht for sharing the speech signals used in this application.) This example is a continuation of Example 6.7, and the performance of the filter bank will be compared to the consistency tests examined in Chapter 6. To get a direct comparison with the segmentation result in Basseville and Nikiforov (1993), a second order AR model is used. The approximate ML estimate is derived (q = 1/2), using 10 parallel filters where each new segment has a guaranteed lifelength of seven. The estimated change times are summarized in Table 7.1.

Table 7.1 Estimated change times for different methods.

    Noisy signal
      Divergence test:  451,  611, 1450, 1900, 2125, 2830, 3626
      Brandt's GLR:     451,  611, 1450, 1900, 2125, 2830, 3626
      Brandt's GLR:     451,  593, 1450, 2125, 2830, 3626
      Approx. ML:       451,  593, 1608, 2116, 2741, 2822, 3626

    Pre-filtered signal
      Divergence test:  445,  645, 1550, 1800, 2151, 2797, 3626
      Brandt's GLR:     445,  645, 1550, 1800, 2151, 2797, 3626
      Brandt's GLR:     445,  645, 1550, 1750, 2151, 2797, 3400, 3626
      Approx. ML:       445,  626, 1609, 2151, 2797, 3627

The following points should be stressed:

- The resemblance with the results of Brandt's GLR test and the divergence test presented in Basseville and Nikiforov (1993) is striking.

- No tuning parameters are involved (although q != 1/2 can be used to influence the number of segments, if the result is not satisfactory). This should be compared with the tricky choice of threshold, window size and drift parameter in the divergence test and Brandt's GLR test, which, furthermore, should be different for voiced and unvoiced zones. Presumably, a considerable tuning effort was required in Andre-Obrecht (1988) to obtain a result similar to the one the proposed method gave on the first try, using default parameters tuned on simple test signals.

- The drawback compared to the two previously mentioned methods is a somewhat higher computational complexity. Using the same implementation of the required RLS filters, the number of floating point operations for AR(2) models was 1.6*10^6 for Brandt's GLR test and 5.8*10^6 for the approximate ML method.

- The design parameters of the search scheme are not very critical. There is a certain lower bound below which the performance deteriorates drastically, but there is no trade-off of the kind that is common for design parameters.

- With the chosen search strategy, the algorithm is recursive, and the estimated change points are delivered with a time delay of 10 samples. This is much faster than the other methods, due to their sliding window of width 160. For instance, the change at time 2741 in the noisy signal, where the noise variance increases by a large factor (see below), is only 80 samples away from a more significant change, and cannot be distinguished with the chosen sliding window.

- Much of the power of the algorithm is due to the model with changing noise variance. A speech signal has very large variations in the driving noise. For these two signals, the sequences of noise variances are estimated to 10^5 * (0.035, 0.13, 1.6, 0.37, 1.3, 0.058, 1.7) and 10^5 * (0.038, 0.11, 1.6, 0.38, 1.6, 0.54, 0.055, 1.8), respectively. Note that the noise variance differs by as much as a factor of 50; no algorithm based on a fixed noise variance can handle that.

Therefore, the proposed algorithm seems to be an efficient tool for getting a quick and reliable result, and its lack of design parameters makes it very suitable for general purpose software implementations.

7.7.3 Segmentation of a car's driven path

We will here study the case described in Section 2.6.1.

Signal model

The model is that the heading angle $\psi_t$ is piecewise constant or piecewise linear, corresponding to straight paths and to bends or roundabouts, respectively. The changing regression model is of the form (7.1),

$$\psi_t = \varphi_t^T \theta(i) + e_t, \qquad \mathrm{E}\, e_t^2 = \lambda_t,$$

with $\varphi_t = 1$ in the piecewise constant case and $\varphi_t = (1, t)^T$ in the piecewise linear case. The approximate MAP estimate of the change times can now be computed in real time (the sampling interval is 100 times larger than what the computations require). The number of parallel filters is 10, the minimum allowed segment length is ..., and each new jump hypothesis is guaranteed to survive at least six samples. The prior probability of a jump is 0.05 at each time instant.

Local search

Segmentation with a fixed accuracy of the model, using a fixed noise variance $\lambda = 0.05$, gives the result in Figure 7.3, which also shows the segmentation where the noise variance is unknown and changing over the segments. In both cases, the roundabout is perfectly modeled by one segment and the bends are detected. The seemingly bad performance after the first turn is actually a proof of the power of this approach: little data are available, and there is no good model for them, so why waste segments? It is more logical to tolerate larger errors here and just use one model. This shows that the adaptive noise variance works for this application as well. In any case, the model with fixed noise scaling seems to be the most appropriate for this application; the main reason is to exclude small, though significant, changes, like lane changes on highways.

Optimal search

The optimal segmentation using Algorithm 7.3 gives almost the same result. The estimated change time sequences are

$$\hat{k}^n = (20, 46, 83, 111, 130, 173),$$
$$\hat{k}^n_{\mathrm{rec,marg}} = (18, 42, 66, 79, 89, 121, 136, 165, 173, 180),$$
$$\hat{k}^n_{\mathrm{opt,marg}} = (18, 42, 65, 79, 88, 110, 133, 162, 173, 180),$$

respectively.

Robustness to design parameters

The robustness with respect to the design parameters of the approximation is as follows:

- The exact MAP estimate is almost identical to the approximation. The number of segments is correct, but three jumps differ slightly.

- A number of different values of the jump probability q were examined. Any value between q = 0.001 and q = 0.1 gives the same number of segments; a q between 0.1 and 0.5 gives one more segment, just at the entrance to the roundabout.

- A smaller number of filters than M = 10 gives one or two more segments.

That is, a reasonable performance is obtained for almost any choice of design parameters.
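As a closing illustration, here is a heavily simplified sketch of the kind of recursive filter-bank search used in these applications: M parallel jump hypotheses, a guaranteed lifelength for newborn hypotheses, and a jump prior q. It is not Algorithm 7.1 from the book: segments are modeled as Gaussians with unknown mean and variance instead of RLS-filtered regressions, only the currently best branch spawns a jump, and all names are my own.

```python
import numpy as np

def seg_ll(seg):
    """ML Gaussian log likelihood of one segment (unknown mean and
    variance), a stand-in for the per-segment RLS filters."""
    if len(seg) < 2:
        return 0.0
    lam = max(seg.var(), 1e-12)
    return -0.5 * len(seg) * (np.log(2 * np.pi * lam) + 1.0)

def local_search(y, M=10, life=6, q=0.05):
    """Recursive local tree search: keep a bank of jump sequences, spawn a
    'jump now' branch each step, protect newborn branches for `life`
    samples, and otherwise prune down to the M most probable."""
    N = len(y)
    hyps = {(): 0}                             # jump sequence -> birth time

    def score(jumps, t):                       # posterior score up to time t
        edges = [0, *jumps, t]
        ll = sum(seg_ll(y[a:b]) for a, b in zip(edges[:-1], edges[1:]))
        return ll + len(jumps) * np.log(q / (1.0 - q))   # jump prior

    for t in range(2, N):
        best = max(hyps, key=lambda j: score(j, t))
        hyps.setdefault((*best, t), t)         # spawn one new jump branch
        ranked = sorted(hyps, key=lambda j: score(j, t + 1), reverse=True)
        kept = {j for j in ranked if t - hyps[j] < life}  # protected young
        for j in ranked:                       # fill up with the best rest
            if len(kept) >= M:
                break
            kept.add(j)
        hyps = {j: hyps[j] for j in kept}
    return max(hyps, key=lambda j: score(j, N))

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 40), rng.normal(5, 1, 40)])
print(local_search(y))                         # expect a jump near t = 40
```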
