Mammone, R.J. and Zhang, X., "Robust Speech Processing as an Inverse Problem," in Digital Signal Processing Handbook, Madisetti, V.K. and Williams, D.B., Eds., Boca Raton: CRC Press LLC, 1999.

27 Robust Speech Processing as an Inverse Problem

Richard J. Mammone, Rutgers University
Xiaoyu Zhang, Rutgers University

27.1 Introduction
27.2 Speech Production and Spectrum-Related Parameterization
27.3 Template-Based Speech Processing
27.4 Robust Speech Processing
27.5 Affine Transform
27.6 Transformation of Predictor Coefficients
    Deterministic Convolutional Channel as a Linear Transform • Additive Noise as a Linear Transform
27.7 Affine Transform of Cepstral Coefficients
27.8 Parameters of Affine Transform
27.9 Correspondence of Cepstral Vectors
References

27.1 Introduction

This section addresses the inverse problem in robust speech processing. A problem that speaker and speech recognition systems regularly encounter in commercial applications is the dramatic degradation of performance due to the mismatch between the training and operating environments. The mismatch generally results from the diversity of the operating environments. For applications over the telephone network, the operating environments may vary from offices and laboratories to households and airports. The problem becomes worse when speech is transmitted over a wireless network, where the system experiences cross-channel interference in addition to the channel and noise degradations that exist in the regular telephone network. The key issue in robust speech processing is to obtain good performance regardless of the mismatch in environmental conditions. The inverse problem in this sense refers to the process of modeling the mismatch in the form of a transformation and resolving it via an inverse transformation. In this section, we introduce the method of modeling the mismatch as an affine transformation.

Before getting into the details of the inverse problem in robust speech processing, we give a brief review of the mechanism of speech production, as well as the extraction of information from speech that is useful for recognition.

27.2 Speech Production and Spectrum-Related Parameterization

The speech signal consists of time-varying acoustic waveforms produced as a result of acoustical excitation of the vocal tract. It is nonstationary in that the vocal tract configuration changes over time. A time-varying digital filter is generally used to describe the vocal tract characteristics. The steady-state system function of the filter is of the form [1, 2]:

\[ S(z) = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}} = \frac{G}{\prod_{i=1}^{p} \left( 1 - z_i z^{-1} \right)} , \]  (27.1)

where p is the order of the system and the z_i denote the poles of the transfer function. The time-domain representation of this filter is

\[ s(n) = \sum_{i=1}^{p} a_i s(n-i) + G u(n) . \]  (27.2)

The speech sample s(n) is predicted as a linear combination of the previous p samples plus the excitation Gu(n), where G is the gain factor. The factor G is generally ignored in recognition-type tasks to allow for robustness to variations in the energy of speech signals. This speech production model is often referred to as the linear prediction (LP) model, or the autoregressive model, and the coefficients a_i are called the predictor coefficients.

The cepstrum of the speech signal s(n) is defined as

\[ c(n) = \int_{-\pi}^{\pi} \log \left| S \left( e^{j\omega} \right) \right| e^{j\omega n} \, \frac{d\omega}{2\pi} . \]  (27.3)

It is simply the inverse Fourier transform of the logarithm of the magnitude of the Fourier transform S(e^{jω}) of the signal s(n).
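As a concrete illustration of Eq. (27.3), the integral can be approximated by an inverse DFT of the log magnitude spectrum. The sketch below assumes NumPy; the FFT size, frame length, and synthetic test signal are illustrative choices, not values from the chapter.

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Discrete approximation of Eq. (27.3): inverse DFT of the
    log magnitude spectrum of a frame."""
    spectrum = np.fft.rfft(frame, n=n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    return np.fft.irfft(log_mag, n=n_fft)       # log|S| is real and even

# Illustrative voiced-like frame: 120 Hz "pitch" plus a formant-like tone, 8 kHz
fs = 8000
t = np.arange(240) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 700 * t)
c = real_cepstrum(frame * np.hamming(len(frame)))
print(c[1:6])  # low-order cepstral coefficients
```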
From the definition of the cepstrum in Eq. (27.3), we have

\[ \sum_{n=-\infty}^{\infty} c(n) e^{-j\omega n} = \log \left| S \left( e^{j\omega} \right) \right| = \log \left| \frac{1}{1 - \sum_{n=1}^{p} a_n e^{-j\omega n}} \right| . \]  (27.4)

If we differentiate both sides of the equation with respect to ω and equate the coefficients of like powers of e^{jω}, the following recursion is obtained:

\[ c(n) = \begin{cases} \log G , & n = 0 \\ a(n) + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n-1} i \, c(i) \, a(n-i) , & n > 0 \end{cases} \]  (27.5)

The cepstral coefficients can be calculated using this recursion once the predictor coefficients are solved. The zeroth-order cepstral coefficient is generally ignored in speech and speaker recognition due to its sensitivity to the gain factor G.

An alternative solution for the cepstral coefficients is given by

\[ c(n) = \frac{1}{n} \sum_{i=1}^{p} z_i^n . \]  (27.6)

It is obtained by equating the terms of like powers of z^{-1} in the following equation:

\[ \sum_{n=-\infty}^{\infty} c(n) z^{-n} = \log \frac{1}{\prod_{i=1}^{p} \left( 1 - z_i z^{-1} \right)} = - \sum_{i=1}^{p} \log \left( 1 - z_i z^{-1} \right) , \]  (27.7)

where the logarithm terms can be written as a power series expansion given by

\[ \log \left( 1 - z_i z^{-1} \right) = - \sum_{k=1}^{\infty} \frac{1}{k} z_i^k z^{-k} . \]  (27.8)

There are two standard methods of solving for the predictor coefficients a_i, namely, the autocorrelation method and the covariance method [3, 4, 5, 6]. Both approaches are based on minimizing the mean square value of the estimation error e(n) given by

\[ e(n) = s(n) - \sum_{i=1}^{p} a_i s(n-i) . \]  (27.9)

The two methods differ with respect to the details of numerical implementation. The autocorrelation method assumes that the speech samples are zero outside the processing interval of N samples. This results in a nonzero prediction error, e(n), outside the interval. The covariance method fixes the interval over which the prediction error is computed and places no constraints on the sample values outside the interval. The autocorrelation method is computationally simpler than the covariance approach and assures a stable system where all poles of the transfer function lie within the unit circle.

A brief description of the autocorrelation method is as follows. The autocorrelation of the signal s(n) is defined as

\[ r_s(k) = \sum_{n=0}^{N-1-k} s(n) s(n+k) = s(n) \otimes s(-n) , \]  (27.10)

where N is the number of samples in the sequence s(n) and the sign ⊗ denotes the convolution operation. The definition of autocorrelation implies that r_s(k) is an even function. The predictor coefficients a_i can therefore be obtained by solving the following set of equations:

\[ \begin{bmatrix} r_s(0) & r_s(1) & \cdots & r_s(p-1) \\ r_s(1) & r_s(0) & \cdots & r_s(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_s(p-1) & r_s(p-2) & \cdots & r_s(0) \end{bmatrix} \begin{bmatrix} a_1 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} r_s(1) \\ \vdots \\ r_s(p) \end{bmatrix} . \]

Denoting the p × p Toeplitz autocorrelation matrix on the left-hand side by R_s, the predictor coefficient vector by a, and the vector of autocorrelation coefficients by r_s, we have

\[ R_s a = r_s . \]  (27.11)

The predictor coefficient vector a can be obtained from the inverse relation a = R_s^{-1} r_s. This equation will be used throughout the analysis in the rest of this article. Since the matrix R_s is Toeplitz, a computationally efficient algorithm known as the Levinson-Durbin recursion can be used to solve for a [3].
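The full parameterization pipeline described above can be sketched as follows, assuming NumPy: the biased autocorrelation of Eq. (27.10), a Levinson-Durbin solver for Eq. (27.11), and the recursion of Eq. (27.5) with the gain term c(0) skipped. The function names and the order p = 10 are illustrative.

```python
import numpy as np

def autocorrelation(s, p):
    """Biased autocorrelation r_s(0..p) of Eq. (27.10)."""
    N = len(s)
    return np.array([np.dot(s[:N - k], s[k:]) for k in range(p + 1)])

def levinson_durbin(r, p):
    """Solve the Toeplitz system R_s a = r_s of Eq. (27.11)
    for the predictor coefficients a_1..a_p."""
    a = np.zeros(p)
    err = r[0]
    for i in range(p):
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        new_a = a.copy()
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k          # prediction error shrinks at each order
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Recursion of Eq. (27.5) for n > 0, with a(n) = 0 for n > p."""
    p = len(a)
    c = np.zeros(n_ceps + 1)        # c[0] (the log-gain term) is ignored
    for n in range(1, n_ceps + 1):
        acc = sum(i * c[i] * a[n - i - 1] for i in range(1, n) if n - i <= p)
        c[n] = (a[n - 1] if n <= p else 0.0) + acc / n
    return c[1:]

# Illustrative use on a random windowed "frame"
frame = np.random.default_rng(0).normal(size=240) * np.hamming(240)
a = levinson_durbin(autocorrelation(frame, 10), 10)
print(lpc_to_cepstrum(a, 12))
```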
27.3 Template-Based Speech Processing

The template-based matching algorithms for speech processing generally rely on the similarity of the vocal tract characteristics embodied in the spectrum of a particular speech sound. There are two types of speech sounds, namely, voiced and unvoiced sounds.

Figure 27.1 shows the speech waveforms, the spectra, and the spectral envelopes of voiced and unvoiced sounds. Voiced sounds such as the vowel /a/ and the nasal sound /n/ are produced by the passage of a quasi-periodic air wave through the vocal tract, which creates resonances in the speech waveform known as formants. The quasi-periodic air wave is generated by the vibration of the vocal cords, and the fundamental frequency of the vibration is known as the pitch. In the case of fricative sounds such as /sh/, the vocal tract is excited by random noise, resulting in speech waveforms that exhibit no periodicity, as can be seen in Fig. 27.1. The spectra of voiced sounds therefore consistently exhibit the pitch as well as three to five formants when the sampling rate is 8 kHz, whereas the spectra of unvoiced sounds reveal no pitch or formant characteristics. In addition, the formants of different voiced sounds differ with respect to their shape and the location of their center frequencies. This is due to the unique shape of the vocal tract formed to produce a particular sound. Thus, different sounds can be distinguished based on attributes of the spectral envelope.

FIGURE 27.1: Illustration of voiced/unvoiced speech.

The cepstral distance given by

\[ d = \sum_{n=-\infty}^{\infty} \left( c(n) - c'(n) \right)^2 \]  (27.12)

is one of the metrics for measuring the similarity of two spectral envelopes. The reason is as follows. From the definition of the cepstrum, we have

\[ \sum_{n=-\infty}^{\infty} \left( c(n) - c'(n) \right) e^{j\omega n} = \log \left| S \left( e^{j\omega} \right) \right| - \log \left| S' \left( e^{j\omega} \right) \right| = \log \frac{\left| S \left( e^{j\omega} \right) \right|}{\left| S' \left( e^{j\omega} \right) \right|} . \]  (27.13)

The Fourier transform of the difference between a pair of cepstra is equal to the difference between the corresponding log spectra. By applying Parseval's theorem, the cepstral distance can be related to the log spectral distance as

\[ d = \sum_{n=-\infty}^{\infty} \left( c(n) - c'(n) \right)^2 = \int_{-\pi}^{\pi} \left( \log \left| S \left( e^{j\omega} \right) \right| - \log \left| S' \left( e^{j\omega} \right) \right| \right)^2 \frac{d\omega}{2\pi} . \]  (27.14)

The cepstral distance is usually approximated by the distance between the first few lower-order cepstral coefficients, the reason being that the magnitude of the higher-order cepstral coefficients is small and contributes negligibly to the cepstral distance.
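A minimal sketch of the truncated distance, assuming NumPy; the truncation order q = 12 is a typical choice rather than one prescribed by the chapter.

```python
import numpy as np

def cepstral_distance(c, c_prime, q=12):
    """Truncated cepstral distance of Eq. (27.12): higher-order terms
    are dropped because their magnitudes are negligible."""
    c, c_prime = np.asarray(c[:q]), np.asarray(c_prime[:q])
    return float(np.sum((c - c_prime) ** 2))

# Frames of the same sound should score lower than frames of different sounds,
# e.g. cepstral_distance(ceps_a1, ceps_a2) < cepstral_distance(ceps_a1, ceps_sh)
```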
27.4 Robust Speech Processing

Robust speech processing attempts to maintain the performance of speaker and speech recognition systems when variations in the operating environment are encountered. This can be accomplished if the similarity in vocal tract structures of the same sound can be recovered under adverse conditions.

Figure 27.2 illustrates how the deterministic channel and random noise contaminate a speech signal during the recording and transmission of the signal.

FIGURE 27.2: The speech acquisition system.

First, at the front end of the speech acquisition system, additive background noise N_1(ω) from the speaking environment distorts the speech waveform. Adverse background conditions are also found to put stress on the speech production system and change the characteristics of the vocal tract, which is equivalent to performing a linear filtering of the speech. That problem is addressed in another chapter and will not be discussed here. After being sampled and quantized, the speech samples corrupted by the background noise N_1(ω) are passed through a transmission channel, such as a telephone network, to reach the receiver's site.

The transmission channel generally involves two types of degradation sources: a deterministic convolutional filter with transfer function H(ω), and additive noise, denoted by N_2(ω) in Fig. 27.2. The signal observed at the output of the system is, therefore,

\[ Y(\omega) = H(\omega) \left[ X(\omega) + N_1(\omega) \right] + N_2(\omega) . \]  (27.15)

The spectrum of the output signal is corrupted by both additive and multiplicative interference. The multiplicative interference due to the linear channel H(ω) is sometimes referred to as multiplicative noise.
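A minimal time-domain simulation of Eq. (27.15), assuming NumPy; the channel impulse response and SNR values are arbitrary illustrations, not the measured channel of Fig. 27.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def white_noise_at_snr(signal, snr_db):
    """White Gaussian noise scaled to the requested SNR against `signal`."""
    noise_power = np.mean(signal ** 2) / (10.0 ** (snr_db / 10.0))
    return rng.normal(0.0, np.sqrt(noise_power), len(signal))

def degrade(x, h, snr1_db=20.0, snr2_db=25.0):
    """Time-domain counterpart of Y(w) = H(w)[X(w) + N1(w)] + N2(w)."""
    n1 = white_noise_at_snr(x, snr1_db)           # background noise N1
    v = np.convolve(x + n1, h)[: len(x)]          # deterministic channel H
    return v + white_noise_at_snr(v, snr2_db)     # channel noise N2

x = rng.normal(size=8000)                 # stand-in for a clean speech signal
h = np.array([1.0, 0.4, 0.15, 0.05])      # illustrative channel impulse response
y = degrade(x, h)
```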
The various sources of degradation cause distortions of the predictor coefficients and the cepstral coefficients. Figure 27.4 shows the change in the spatial clustering of the cepstral coefficients due to interference from a linear channel, from white noise, and from the composite effect of both.

• When the speech is filtered by a linear bandpass channel, the frequency response of which is shown in Fig. 27.3, a translation of the cepstral clusters is observed, as shown in Fig. 27.4(b).

• When the speech is corrupted by Gaussian white noise at 15 dB SNR, a shrinkage of the cepstral vectors results. This is shown in Fig. 27.4(c), where it can be seen that the cepstral clusters move toward the origin.

• When the speech is degraded by both the linear channel and Gaussian white noise, the cepstral vectors are translated and scaled simultaneously.

FIGURE 27.3: The simulated environmental interference. (a) Medium voiced channel and (b) Gaussian white noise.

FIGURE 27.4: The spatial distribution of cepstral coefficients under various conditions, "∗" for the sound /a/, "o" for the sound /n/, and "+" for the sound /sh/. (a) Cepstrum of the clean speech; (b) cepstrum of signals filtered by the continental U.S. mid-voice channel (CMV); (c) cepstrum of signals at 15 dB SNR with additive white Gaussian (AWG) noise; (d) cepstrum of speech corrupted by both the CMV channel and AWG noise at 15 dB SNR.

There are three underlying ideas behind the various solutions to robust speech processing. The first is to recover the speech signal from the noisy observation by removing an estimate of the noise from the signal. This is also known as the speech enhancement approach. Methods that operate in the speech sample domain include noise suppression [7] and noise masking [8]. Other speech enhancement methods are carried out in the feature domain, for example, cepstral mean subtraction (CMS) and pole-filtered cepstral mean subtraction (PFCMS). The second category is feature enhancement: here the key to the problem is to find feature sets that are invariant¹ to changes of the transmission channel and environmental noise. The liftered cepstrum [9] and the adaptive component weighted (ACW) cepstrum [10] are examples of the feature enhancement approach. A third category consists of methods that match the testing features with the models after adaptation to the environmental conditions [11, 12, 13, 14]. In this case, the presence of noise in the training and testing environments is tolerable as long as an adaptation algorithm can be found to match the conditions. The adaptation can be performed in either of two directions, i.e., adapt the training data to the testing environment, or adapt the testing data to the training environment.

¹ In practice, it is difficult to find a set of features invariant to environmental changes. The robust features currently used are mostly just less sensitive to environmental changes.

The focus of the following discussion will be on viewing robust speech processing as an inverse problem. We utilize the fact that both deterministic and non-deterministic noise introduce a sound-dependent linear transformation of the predictor coefficients of speech. This can be approximated by an affine transformation in the cepstrum domain. The mismatch can, therefore, be resolved by solving for the inverse affine transformation of the cepstral coefficients.

27.5 Affine Transform

An affine transform y of a vector x is defined as

\[ y = A x + b , \quad b \neq 0 . \]  (27.16)

The matrix A represents the linear transformation of the vector x, and b is a nonzero vector representing the translation of the vector. Note that the addition of the vector b causes the transform to become nonlinear. The singular value decomposition (SVD) of the matrix A can be used to gain some insight into the geometry of an affine transform, i.e.,

\[ y = U \Sigma V^T x + b , \]  (27.17)

where U and V^T are unitary matrices and Σ is a diagonal matrix. The geometric interpretation is thus seen to be that x is rotated by the unitary matrix V^T, rescaled by the diagonal matrix Σ, rotated again by the unitary matrix U, and finally translated by the vector b.

27.6 Transformation of Predictor Coefficients

It will be proved in this section that the contamination of a speech signal by a stationary convolutional channel and random white noise is equivalent to a signal-dependent linear transformation of the predictor coefficients. The conclusion drawn here will be used in the next section to show that the effect of environmental interference is equivalent to an affine transform in the cepstrum domain.

27.6.1 Deterministic Convolutional Channel as a Linear Transform

When a sample sequence is passed through a convolutional channel with impulse response h(n), the filtered signal s'(n) obtained at the output of the channel is

\[ s'(n) = h(n) \otimes s(n) . \]  (27.18)

If the power spectra of the signals s(n) and s'(n) are denoted S_s(ω) and S_{s'}(ω), respectively, then

\[ S_{s'}(\omega) = |H(\omega)|^2 S_s(\omega) . \]  (27.19)

Therefore, in the time domain,

\[ r_{s'}(k) = \left[ h(n) \otimes h(-n) \right] \otimes r_s(k) = r_h(k) \otimes r_s(k) , \]  (27.20)

where r_s(k) and r_{s'}(k) are the autocorrelations of the input and output signals. The autocorrelation of the impulse response h(n) is denoted r_h(k) and by definition,

\[ r_h(k) = h(n) \otimes h(-n) . \]  (27.21)

If the impulse response h(n) is assumed to be zero outside the interval [0, p−1], then

\[ r_h(k) = 0 \quad \text{for } |k| > p - 1 . \]  (27.22)

Equation (27.20) can therefore be rewritten in matrix form as

\[ \begin{bmatrix} r_{s'}(0) \\ r_{s'}(1) \\ \vdots \\ r_{s'}(p-1) \end{bmatrix} = \begin{bmatrix} r_h(0) & r_h(1) & r_h(2) & \cdots & r_h(p-1) \\ r_h(1) & r_h(0) & r_h(1) & \cdots & r_h(p-2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_h(p-1) & r_h(p-2) & r_h(p-3) & \cdots & r_h(0) \end{bmatrix} \begin{bmatrix} r_s(0) \\ r_s(1) \\ \vdots \\ r_s(p-1) \end{bmatrix} = R_{h1} r_s . \]  (27.23)

R_{h1} refers to the autocorrelation matrix of the impulse response of the channel on the right-hand side of the above equation. The autocorrelation matrix R_{s'} of the filtered signal s'(n) is then

\[ R_{s'} = \begin{bmatrix} r_{s'}(0) & r_{s'}(1) & \cdots & r_{s'}(p-1) \\ r_{s'}(1) & r_{s'}(0) & \cdots & r_{s'}(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_{s'}(p-1) & r_{s'}(p-2) & \cdots & r_{s'}(0) \end{bmatrix} = R_{h1} R_s . \]  (27.24)

[...] Note that the transformation in Eq. (27.26) is sound dependent, as the estimates of the autocorrelation matrices assume stationarity.
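The identity of Eq. (27.20) can be checked numerically. A quick sketch, assuming NumPy; the signal and the channel impulse response are arbitrary test data.

```python
import numpy as np

rng = np.random.default_rng(1)

def autocorr_full(x):
    """Full autocorrelation sequence, i.e. x(n) convolved with x(-n)."""
    return np.correlate(x, x, mode="full")

s = rng.normal(size=4096)        # stand-in for a speech segment
h = np.array([1.0, 0.5, 0.2])    # illustrative channel impulse response
s_filt = np.convolve(s, h)       # s'(n) = h(n) (x) s(n), Eq. (27.18)

lhs = autocorr_full(s_filt)                              # r_{s'}(k)
rhs = np.convolve(autocorr_full(h), autocorr_full(s))    # r_h(k) (x) r_s(k)
print(np.allclose(lhs, rhs))     # True: Eq. (27.20) holds exactly
```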
27.6.2 Additive Noise as a Linear Transform

The random noise arising from the background and the fluctuation of the transmission channel is generally assumed to be additive white noise (AWN). The resulting noisy observation of the original speech signal is given by

\[ s'(n) = s(n) + w(n) , \]  (27.27)

where w(n) denotes the noise. [...]

• When the speech is corrupted by additive noise, the autocorrelation matrix R_s can also be written as

\[ R_s = H H^T . \]  (27.50)

Equating the right-hand sides of Eqs. (27.34) and (27.50) yields

\[ H = U \left( \Sigma + \sigma^2 I \right)^{1/2} , \]  (27.51)

where H is an explicit function of the training data and the noise. At this point, we can conclude that [...]

27.7 Affine Transform of Cepstral Coefficients

[...]

27.8 Parameters of Affine Transform

[...] If the matrix A is assumed to be diagonal,

\[ A = \begin{bmatrix} \alpha_{11} & & \\ & \ddots & \\ & & \alpha_{qq} \end{bmatrix} , \]  (27.58)

the solutions of α_ij in Eq. (27.55) can be simplified as

\[ \alpha_{jj} = \frac{E\left[\gamma_j \gamma'_j\right] - E\left[\gamma_j\right] E\left[\gamma'_j\right]}{E\left[\gamma_j^2\right] - E^2\left[\gamma_j\right]} = \frac{\mathrm{COV}\left(\gamma_j, \gamma'_j\right)}{\mathrm{VAR}\left(\gamma_j\right)} \]  (27.59)

and

\[ b_j = \frac{1}{N} \sum_{i=1}^{N} \left( \gamma'_{ij} - \alpha_{jj} \gamma_{ij} \right) = E\left[\gamma'_j\right] - \alpha_{jj} E\left[\gamma_j\right] , \]  (27.60)

where the expectations are estimated by sample averages over the N vectors, e.g., E[γ_j] = (1/N) Σ_{i=1}^{N} γ_ij. Here, E[·] is the expected value operator, and VAR[·] and COV[·] represent the variance and covariance, respectively. As can be seen from Eq. (27.60), the diagonal entries α_jj are the weighted covariance of the model and the testing vector, and the value of b_j is equal to the weighted difference between the mean of the training vectors and that of the testing vectors. There are three cases of interest:

1. If the training and operating conditions are matched, then

\[ E\left[\gamma'_j\right] = E\left[\gamma_j\right] \quad \text{and} \quad \mathrm{COV}\left(\gamma_j, \gamma'_j\right) = \mathrm{VAR}\left(\gamma_j\right) , \]  (27.61)

so that α_jj = 1 and b_j = 0. [...]

2. If the mismatch is caused by the channel alone, the testing vectors are purely translated: α_jj = 1 and b_j = \bar{c}_j, for j = 1, 2, …, q, so that

\[ \hat{c} = c - b_c . \]  (27.65)

This is equivalent to the method of cepstral mean subtraction (CMS) [6].

3. If the mismatch is caused by both channel and random noise, the testing vector is translated as well as shrunk. The shrinkage is measured by α_jj and the translation by b_j. The smaller the covariance of the model and the testing data, the greater the scaling of the testing vectors.

[...] the z scores of the training and testing cepstral vectors. The z score of a set of vectors, c_i, is defined as

\[ z_{c_i} = \frac{c_i - \mu_c}{\sigma_c} , \]  (27.66)

where μ_c is the mean of the vectors c_i and σ_c is the variance. Thus, we could form

\[ \hat{c}'_i = \sigma_c \, \frac{c'_i - \mu_{c'}}{\sigma_{c'}} + \mu_c , \]  (27.67)

which maps the testing vectors onto the statistics of the training vectors.

In the above analysis, we show that the cepstrum domain distortions due to channel and noise interference can be modeled as an affine transformation. The parameters [...]
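The two feature-domain compensations just described can be sketched as follows, assuming NumPy: cepstral mean subtraction for Eq. (27.65) and mean/variance matching for Eqs. (27.66) and (27.67). The array shapes and the synthetic distortion are illustrative.

```python
import numpy as np

def cms(ceps):
    """Cepstral mean subtraction, Eq. (27.65): remove the per-dimension
    mean, which cancels a purely convolutional (translational) channel."""
    return ceps - ceps.mean(axis=0)

def zscore_match(test_ceps, train_mu, train_sigma):
    """Eqs. (27.66)-(27.67): map the testing vectors onto the mean and
    spread of the training vectors by matching z scores."""
    mu, sigma = test_ceps.mean(axis=0), test_ceps.std(axis=0)
    return train_sigma * (test_ceps - mu) / sigma + train_mu

# Cepstral feature matrices are (frames, q) arrays
train = np.random.default_rng(2).normal(size=(500, 12))
test = 0.7 * train + 0.3             # shrunk and translated, as in case 3
restored = zscore_match(test, train.mean(axis=0), train.std(axis=0))
print(np.allclose(restored, train))  # True for this exact affine distortion
```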
27.9 Correspondence of Cepstral Vectors

[...]

\[ P\left( c_j^C \mid c_i \right) = \frac{1}{\sqrt{2\pi}\, |\Sigma|^{1/2}} \, [\ldots] \]  (27.70)

where Σ is the variance matrix. If we assume that every cepstral coefficient has a unit variance, namely, Σ = I, where I is the identity matrix, then the maximization of the likelihood probability is equivalent to finding the cepstral vector in the VQ codebook C that has the minimum Euclidean distance to the affine-transformed vector c_j.

References

[1] Flanagan, J.L., Speech Analysis, [...]
[2] [...]
[3] [...]
[4] [...]
[5] [...]
[6] Furui, S., Cepstral analysis technique for automatic speaker verification, IEEE Trans. Acoust., Speech, Signal Processing, 29, 254–272, April 1981.
[7] Boll, S.F., Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust., Speech, Signal Processing, 27, 113–120, April 1979.
[8] Klatt, D.H., A digital filter bank for spectral matching, ICASSP, 573–576, 1976.
[9] Juang, B.H., Rabiner, L.R., and Wilpon, J.G., On the use of bandpass liftering in speech recognition, IEEE Trans. Acoust., Speech, Signal Processing, 35, 947–954, July 1987.
[10] Assaleh, K.T. and Mammone, R.J., New LP-derived features for speaker identification, IEEE Trans. Speech Audio Processing, 2, 630–638, October 1994.
[11] Sankar, A. and Lee, C.H., Robust speech recognition based on stochastic matching, ICASSP, 1, 121–124, 1995.
[12] Neumeyer, L. and Weintraub, M., Probabilistic optimum filtering for robust speech recognition, Proc. IEEE Intl. Conf. Acoust., Speech, Signal Processing, 1, 417–420, 1994.
[13] Nadas, A., Nahamoo, D., and Picheny, M.A., Adaptive labeling: Normalization of speech by adaptive transformation based on vector quantization, Proc. IEEE Intl. Conf. Acoust., Speech, Signal Processing, 521–524, 1988.
[14] Gish, H., Ng, K., and Rohlicek, J.R., Robust mapping [...]
