Acoustic keyword spotting in speech with applications to data mining


Speech and Audio Research Laboratory of the SAIVT program
Centre for Built Environment and Engineering Research

ACOUSTIC KEYWORD SPOTTING IN SPEECH WITH APPLICATIONS TO DATA MINING

A. J. Kishan Thambiratnam, BE(Electronics)/BInfTech

Submitted as a requirement of the degree of Doctor of Philosophy at Queensland University of Technology, Brisbane, Queensland, March 2005.

Keywords

Keyword Spotting, Wordspotting, Data Mining, Audio Indexing, Keyword Verification, Confidence Scoring, Speech Recognition, Utterance Verification

Abstract

Keyword spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability.

This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting.

The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high-level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, in particular for the problematic short-duration target word class.

The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains significant improvement in detection rate over lattice-based audio document indexing while
still maintaining extremely fast search speeds.

The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.

The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.

Contents

Keywords
Abstract
List of Tables
List of Figures
List of Abbreviations
Authorship
Acknowledgments

1 Introduction
  1.1 Overview
    1.1.1 Aims and Objectives
    1.1.2 Research Scope
  1.2 Thesis Organisation
  1.3 Major Contributions of this Research
  1.4 List of Publications

2 A Review of Keyword Spotting
  2.1 Introduction
  2.2 The keyword spotting problem
  2.3 Applications of keyword spotting
    2.3.1 Keyword monitoring applications
    2.3.2 Audio document indexing
    2.3.3 Command controlled devices
    2.3.4 Dialogue systems
  2.4 The development of keyword spotting
    2.4.1 Sliding window approaches
    2.4.2 Non-keyword model approaches
    2.4.3 Hidden Markov Model approaches
    2.4.4 Further developments
  2.5 Performance Measures
    2.5.1 The reference and result sets
    2.5.2 The hit operator
    2.5.3 Miss rate
    2.5.4 False alarm rate
    2.5.5 False acceptance rate
    2.5.6 Execution time
    2.5.7 Figure of Merit
    2.5.8 Equal Error Rate
    2.5.9 Receiver Operating Characteristic Curves
    2.5.10 Detection Error Trade-off Plots
  2.6 Unconstrained vocabulary spotting
    2.6.1 HMM-based approach
    2.6.2 Neural Network Approaches
  2.7 Approaches to non-keyword modeling
    2.7.1 Speech background model
    2.7.2 Phone models
    2.7.3 Uniform distribution
    2.7.4 Online garbage model
  2.8 Constrained vocabulary spotting
    2.8.1 Language model approaches
    2.8.2 Event spotting
  2.9 Keyword verification
    2.9.1 A formal definition
    2.9.2 Combining keyword spotting and verification
    2.9.3 The problem of short duration keywords
    2.9.4 Likelihood ratio based approaches
    2.9.5 Alternate Information Sources
  2.10 Audio Document Indexing
    2.10.1 Limitations of the Speech-to-Text Transcription approach
    2.10.2 Reverse dictionary lookup searches
    2.10.3 Indexed reverse dictionary lookup searches
    2.10.4 Lattice based searches

3 HMM-based spotting and verification
  3.1 Introduction
  3.2 The confusability circle framework
  3.3 Analysis of non-keyword models
    3.3.1 All-speech models
    3.3.2 SBM methods
    3.3.3 Phone-set methods
    3.3.4 Target-word-excluding methods
  3.4 Evaluation of keyword spotting techniques
    3.4.1 Experiment setup
    3.4.2 Results
  3.5 Tuning the phone set non-keyword model
  3.6 Output score thresholding for SBM spotting
  3.7 Performance across keyword length
    3.7.1 Evaluation sets
    3.7.2 Results
  3.8 HMM-based keyword verification
    3.8.1 Evaluation set
    3.8.2 Evaluation procedure
    3.8.3 Results
  3.9 Discriminative background model KV
    3.9.1 System architecture
    3.9.2 Results
  3.10 Summary and Conclusions

4 Cohort word keyword verification
  4.1 Introduction
  4.2 Foundational concepts
    4.2.1 Cohort-based scoring
    4.2.2 The use of language information
  4.3 Overview of the cohort word technique
  4.4 Cohort word set construction
    4.4.1 The choice of dmin and dmax
    4.4.2 Cohort word set downsampling
    4.4.3 Distance function
  4.5 Classification approach
    4.5.1 2-class classification approach

Chapter 7: Summary, Conclusions and Future Work

• Multiple formulations of the cohort word confidence score were presented and investigated. Of these, it was found that the N-class hybrid approach
provided the best compromise between error rate and execution speed.

• The large number of cohort word parameters were rationalised through a detailed analysis of their effects. It was found that the main parameters of importance were dmax and the amount of cohort word set downsampling. Other parameters provided only minor changes in performance.

7.2.2 Future Work

• It was demonstrated that considerable gains in performance could be obtained by using a more robust decision boundary estimate, such as a neural network classifier. Future work could examine the application of discriminative decision boundary estimates to the cohort word confidence score as a means of further improving performance.

• Execution speeds for cohort word verification were not reported in this thesis. However, this is an important metric that needs to be considered when applying this method to speed-critical tasks such as audio document indexing. Speed improvements may be obtained in a variety of ways, for example, through the use of aggressive cohort word set size downsampling or tighter decoding pruning beamwidths. A study of the execution speed of cohort word verification, and the investigation of techniques for improving it, is a possible avenue for future research.

7.3 Dynamic Match Lattice Spotting

The chapter presented a novel technique for fast and accurate unrestricted vocabulary audio document indexing. This method was evaluated on conversational telephone speech and found to provide significant improvements over conventional lattice-based and HMM-based techniques.

7.3.1 Conclusions

• The chapter presented a novel unrestricted vocabulary audio document indexing method named Dynamic Match Lattice Spotting. Through experimentation, it was demonstrated that this method was capable of searching hours of data using only seconds of processing time, while maintaining excellent detection performance. The proposed method provided significant improvements in
detection rate and execution speed over the baseline conventional lattice-based and SBM-based systems.

• The lack of robustness to erroneous lattice realisations was identified as a weakness in conventional lattice-based techniques. Experiments reported in this chapter highlighted that significant gains in miss rate performance could be obtained by incorporating robustness to such errors within the lattice search process.

• Two methods of improving the speed of DMLS were investigated and implemented. It was found that these provided considerable gains in search speed without affecting miss or false alarm rates. As a result, a DMLS system was constructed that could search at speeds of 33 hours per minute with good miss and false alarm rates.

• Individual dynamic match rules were evaluated within the context of the proposed technique. It was found that even the simplest of rules provided some tangible gains in performance over the conventional lattice-based technique. In particular, however, vowel-substitution and closure/stop substitution rules provided dramatic gains, though at the expense of increased false alarm rates.

• An analysis of the parameters of DMLS demonstrated that the technique could be easily tuned to obtain low miss rates or low false alarm rates while maintaining its fast execution speed.

7.3.2 Future Work

• The dynamic match rules that were proposed and evaluated were derived empirically. Although these rules provided good performance, they are unlikely to be optimal. Future work could examine the use of probabilistic rules, for example, derived from the phone recogniser confusion matrix.

• The MED score used in DMLS was a discrete variable, and as such thresholding on this value resulted in a discontinuous tuning curve. Smoother tuning would be possible if a continuous probabilistic output score could be derived. One possible solution is to use a combination of the MED score and the acoustic score of the putative
occurrence as estimated from the lattice. Fusion of these values may result in an output score that is more useful for continuous tuning.

7.4 Non-English Spotting

The chapter examined the application of keyword spotting to non-English languages and assessed the impact of limited training data on system performance.

7.4.1 Conclusions

• It was found that the sensitivity of keyword spotting performance to training database size was considerably less than that previously reported for speech transcription. This finding supports the argument for the development of speech applications that use keyword spotting instead of speech transcription to satisfy the immediate need for non-English speech-enabled applications.

• Analysis of the experimental results demonstrated that keyword spotters trained on limited amounts of training data could achieve performances well within an order of magnitude of systems trained on very large databases.

• It was demonstrated that monophone-based systems were more effective than triphone-based systems in terms of miss rate when using limited amounts of training data. This was because triphone models suffered from data sparsity issues for small training databases.

7.5 Final Comments

A number of novel contributions to the field of keyword spotting have been generated by this research. A considerable amount of this research has been used in the development of data mining applications that are being actively trialled by external bodies. It is believed that this demonstrates that the work is not only theoretically sound but practically viable.

Appendix A: The Levenstein Distance
A.1 Introduction

The Levenstein distance measures the minimum cost of transforming one string into another. Transformation is performed by successive applications of one of four operations: matching, substitution, insertion and deletion. Typically, each operation has an associated cost, and hence the Levenstein algorithm must implicitly discover which sequence of operations results in the cheapest total transformation cost.

A.2 Applications

Applications of the Levenstein distance, also known as the Minimum Edit Distance (MED), span a plethora of fields. In biology, the algorithm is used to identify similar sequences of nucleic acids in DNA or amino acids in proteins. Web search engines have used this method for detecting similarity in phrases and query terms. Less obvious is the use of the edit distance to discover similarities between documents for the purpose of detecting plagiarism.

In speech research, the Levenstein distance is particularly useful in the analysis of phonetic and word sequences. For example, the word error rate of a speech transcription system can be calculated using this method. The phonetic similarity between two pronunciations can also be measured using the Levenstein distance, for example, for the purpose of finding similarly pronounced words.

A.3 Algorithm

A basic implementation of the Levenstein algorithm uses a cost matrix to accumulate transformation costs. A recursive process is used to update successive elements of this matrix in order to discover the overall minimum transformation cost.

Let the sequence P = (p1, p2, ..., pM) be defined as the source sequence and the sequence Q = (q1, q2, ..., qN) be defined as the target sequence. Additionally, three transformation cost functions are defined:

• Cs(x, y) - the cost of substituting symbol x in P with symbol y in Q. Typically this has a cost of 0 if x = y, i.e. a match operation.

• Ci(y) - the cost of inserting the symbol y into sequence P.

• Cd(x) - the cost of deleting the symbol x from sequence P.

The element at row i and column j of the cost matrix represents the minimum cost of transforming the subsequence (p1, ..., pi) to (q1, ..., qj). Hence the bottom-right element of the cost matrix represents the total minimum cost of transforming the entire source sequence P to the target sequence Q.

The basic premise of the Levenstein algorithm is that the minimum cost of transforming (p1, ..., pi) to (q1, ..., qj) is the cheapest of:

1. the cost of transforming (p1, ..., pi) to (q1, ..., q(j-1)) plus the cost of inserting qj;

2. the cost of transforming (p1, ..., p(i-1)) to (q1, ..., qj) plus the cost of deleting pi;

3. the cost of transforming (p1, ..., p(i-1)) to (q1, ..., q(j-1)) plus the cost of substituting pi with qj. If pi = qj then this is usually taken to have a cost of 0.

In this way, the cost matrix can be filled from the top-left corner to the bottom-right corner in an iterative fashion. The Levenstein algorithm is then as follows:

1. Initialise an (M + 1) × (N + 1) matrix Ω. This is called the Levenstein cost matrix. The top-left element Ω(0,0) represents the cost of transforming the empty sequence to the empty sequence, and is therefore initialised to 0.

2. The first row of the cost matrix represents a sequence of successive insertions, and hence can be initialised as

   Ω(0,j) = j × Ci(qj)                                        (A.1)

   The first column of the cost matrix represents successive deletions. It can therefore also be immediately initialised as

   Ω(i,0) = i × Cd(pi)                                        (A.2)

3. Update the elements of the cost matrix from the top-left down to the bottom-right using the Levenstein update equation

   Ω(i,j) = min( Ω(i,j-1) + Ci(qj),
                 Ω(i-1,j) + Cd(pi),
                 Ω(i-1,j-1) + Cs(pi, qj) )                    (A.3)

Figure A.1 shows an example of the cost matrix obtained using the MED method for transforming the word "deranged" to the word "hanged" using constant transformation cost functions. It shows that the cheapest transformation cost is 3. There are multiple ways of obtaining this minimum cost. For example, both the operation sequences (del, del, subst, match, match, match, match, match) and (subst, del, del, match, match, match, match, match) have a cost of 3.

        h   a   n   g   e   d
    0   1   2   3   4   5   6
d   1   1   2   3   4   5   5
e   2   2   2   3   4   4   5
r   3   3   3   3   4   5   5
a   4   4   3   4   4   5   6
n   5   5   4   3   4   5   6
g   6   6   5   4   3   4   5
e   7   7   6   5   4   3   4
d   8   8   7   6   5   4   3

Figure A.1: Example of the cost matrix calculated using the Levenstein algorithm for transforming "deranged" to "hanged". The cost of substitutions, deletions and insertions is fixed at 1; the cost of a match is fixed at 0.
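The cost-matrix recursion described above can be sketched directly in code. The following minimal Python implementation is an illustrative sketch rather than code from the thesis; the function name and the default unit costs are assumptions, chosen to match the constant-cost setting of Figure A.1.

```python
def levenstein(source, target,
               sub_cost=lambda x, y: 0 if x == y else 1,  # Cs: 0 for a match
               ins_cost=lambda y: 1,                       # Ci
               del_cost=lambda x: 1):                      # Cd
    """Minimum cost of transforming `source` into `target`.

    omega[i][j] holds the minimum cost of transforming source[:i]
    into target[:j], as in the Levenstein cost matrix.
    """
    M, N = len(source), len(target)
    omega = [[0] * (N + 1) for _ in range(M + 1)]
    # First row: successive insertions (equation A.1).
    for j in range(1, N + 1):
        omega[0][j] = omega[0][j - 1] + ins_cost(target[j - 1])
    # First column: successive deletions (equation A.2).
    for i in range(1, M + 1):
        omega[i][0] = omega[i - 1][0] + del_cost(source[i - 1])
    # Fill from top-left to bottom-right (equation A.3).
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            omega[i][j] = min(
                omega[i][j - 1] + ins_cost(target[j - 1]),   # insert qj
                omega[i - 1][j] + del_cost(source[i - 1]),   # delete pi
                omega[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return omega[M][N]

print(levenstein("deranged", "hanged"))  # 3, as in Figure A.1
```

Because the function only indexes its arguments, it also accepts lists of phone labels, which is the form in which a minimum edit distance is used in Dynamic Match Lattice Spotting; phonetically motivated substitution rules of the kind discussed in Chapter 7 could then be supplied through the sub_cost parameter.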
