Multimodal Music Information Retrieval: From Content Analysis to Multimodal Fusion

MULTIMODAL MUSIC INFORMATION RETRIEVAL: FROM CONTENT ANALYSIS TO MULTIMODAL FUSION

ZHONGHUA LI

Under the Supervision of Prof. YE WANG

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE, SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2013

Acknowledgments

During my stay in the Sound and Music Computing (SMC) group, I had the fortune to experience an atmosphere of motivation, support, and encouragement that was crucial for progress in my research activities as well as my personal growth.

First and foremost, I would like to express my sincere gratitude to my supervisor, Prof. Ye Wang, who has led me to the research world and continued to guide me at every step of my PhD journey. His great passion for research and life, deep knowledge, and endless patience have been my strongest support, boosting my confidence not only in research but also in life.

I would also like to thank all who were directly or indirectly involved in my research projects. I thank Prof. Ichiro Fujinaga, Prof. Jialie Shen, Prof. Hsin-Min Wang, Jason Hockman, Qiaoliang Xiang, Jianqing Yang, Yu Yi, Bingjun Zhang, Yi Yu, Ju-Chiang Wang, Jingli Cai, and Zhiyan Duan for their collaboration and feedback. I especially wish to thank my lab mates in the SMC group, Zhendong Zhao, Yinsheng Zhou, Graham Percival, Xinxi Wang, Lian He, Shenggao Zhu, Haotian Fang, Zhe Xing, Shant Sagar, and many others, who worked alongside me, debated with me, laughed with me, and made my PhD years unforgettably colorful. My gratitude also goes to Mukesh Kumar Saini, Xiangyu Wang, and all my other friends for their helpful discussions and suggestions.

I also want to thank the School of Computing, National University of Singapore, for giving me the opportunity to study here and for providing me with financial support.

Finally, I would like to express my deepest appreciation for my family, who have always supported and encouraged me in my study and daily life.

Abstract

With the explosive growth of
online music data over the past decade, music information retrieval has become increasingly important in helping users find their desired music information. Under different application scenarios, users generally need to search for music in various ways with different information needs. Moreover, music is inherently multi-faceted and contains heterogeneous types of music data (e.g., metadata, audio content). For effective multimodal music retrieval, therefore, it is essential to discover users’ information needs and to appropriately combine the multiple facets.

Most existing music search engines are intended for general search using textual metadata or example tracks. Thus, they fail to address the needs of many specific domains, where the required music dimensions or query methods may differ from those covered by general search engines. Content analysis of these music dimensions (e.g., ethnic styles, audio quality) is also not well addressed. In addition, existing methods for fusing multiple music dimensions and modalities tend to associate the fusion weights only with queries, and therefore cannot achieve the optimal fusion strategy.

My research studies and improves multimodal music retrieval systems from several aspects. First, I study multimodal music retrieval in a specific domain where queries are restricted to certain music dimensions (e.g., tempo). Novel query input methods are proposed to capture users’ information needs. Then, effective audio content analysis is performed to improve the unimodal music retrieval performance. Audio quality, an important but overlooked music dimension for online music search, is also studied. Given that multiple music dimensions in different modalities are related to a given query, effective fusion methods to combine different modalities are also investigated. For the first time, document dependence is introduced into fusion weight derivation, and its efficacy is verified. A general multimodal fusion framework, query-document-dependent fusion, is then
proposed to extend existing work by deriving the optimal fusion strategy for each query-document pair. This enables each document to combine its modalities in the optimal way and unleashes the power of different modalities in the retrieval process. Besides using existing datasets, several datasets are also constructed for the related research. Comprehensive experiments and user studies have been carried out and have validated the efficacy of both the proposed approaches and systems.

List of Publications

Zhonghua Li, Bingjun Zhang, Yi Yu, Jialie Shen, Ye Wang. “Query-Document-Dependent Fusion: A Case Study of Multimodal Music Retrieval”. IEEE Transactions on Multimedia (to appear), 2014.
Zhonghua Li, Ju-Chiang Wang, Jingli Cai, Zhiyan Duan, Hsin-Min Wang, Ye Wang. “Non-Reference Audio Quality Assessment for Online Live Music Recordings”. ACM Multimedia, October 21 - 25, 2013, Barcelona, Spain.
Zhonghua Li, Ye Wang. “A Domain-Specific Music Search Engine for Gait Training”. ACM Multimedia, October 29 - November 2, 2012, Nara, Japan.
Zhonghua Li, Bingjun Zhang, Ye Wang. “Document Dependent Fusion in Multimodal Music Retrieval”. ACM Multimedia, November 28 - December 1, 2011, Scottsdale, Arizona, USA.
Zhonghua Li, Qiaoliang Xiang, Jason Hockman, Jianqing Yang, Yu Yi, Ichiro Fujinaga, Ye Wang. “A Music Search Engine for Therapeutic Gait Training”. ACM Multimedia, October 25 - 29, 2010, Firenze, Italy.
Zhendong Zhao, Xinxi Wang, Qiaoliang Xiang, Andy Sarroff, Zhonghua Li, Ye Wang. “Large-scale Music Tag Recommendation with Explicit Multiple Attributes”. ACM Multimedia, October 25 - 29, 2010, Firenze, Italy.
Yinsheng Zhou, Zhonghua Li, Dillion Tan, Graham Percival, Ye Wang. “MOGFUN: Musical mObile Group for FUN”. ACM Multimedia, October 19 - 24, 2009, Beijing, China.

List of Figures

3.1 System architecture
3.2 Histogram of ground truth tempo values
3.3 Tempo estimation errors made by different tempo estimation algorithms
3.4 Tempi distributions of songs annotated as slow or fast by a
user randomly selected from the subjects
3.5 Accuracy comparison between original tempo estimation algorithms (AC1o and AC2o) and user perception-based methods (AC1u and AC2u)
3.6 Estimated beat strength distributions within three human-labeled levels: Weak, Medium, and Strong. The x-axis represents the estimated beat strength.
3.7 System snapshot
3.8 Average precision for different query types. Single, double, and triple queries represent queries with one, two, and three required music dimensions, respectively. All the queries include all three types of queries.
4.1 The system framework
4.2 The flowchart for collecting the live music videos
4.3 A snapshot of the song assignment for human annotation
4.4 A snapshot of the interface for annotating the audio and visual quality
4.5 The designed interface for labeling the ranking. The index of each live recording is dragged and dropped into a desired ranking position.
4.6 Ranking performance based on overall audio quality using the binary and ranking labels on ADB-H
4.7 Ranking performance based on overall audio quality using the binary and ranking labels of ADB-S
4.8 Ranking performance on ADB-H using SVM-Rank with all types of labels and different audio feature sets. Except for the overall quality aspect, the others are all numerical labels.
4.9 Ranking performance of SVM-Rank on ADB-S using different audio feature sets
5.1 A retrieval and fusion example
5.2 Transitions between different multimodal fusion approaches
5.3 Dual-phase fusion weight learning diagram
5.4 Regression-based fusion weight learning diagram
5.5 Retrieval accuracy (MAP) comparison of D-QDDF-LNR approaches (trained with the dataset of 200K) using different β
5.6 Retrieval accuracy (MAP) comparison between D-QDDF and QDF approaches. (a) to (d) show how much the D-QDDF approaches improve upon the four QDF approaches by integrating document dependence. (e) and (f) compare the performance
of D-QDDF-MUL and D-QDDF-LNR approaches based on different QDF approaches.
5.7 Retrieval accuracy (MAP) of R-QDDF (trained with the dataset of 200K) using different regularization factors (λ) and numbers of chosen samples per iteration (m)
5.8 Comparison of retrieval accuracy (MAP) given different query complexities when R-QDDF is trained with datasets of different sizes

... scope of music retrieval and explore approaches and challenges of multimodal music retrieval in specific domains;
• assess audio quality automatically based on music content analysis to ensure...
users’ information needs, how to represent and analyze music content, and how to effectively retrieve the desired music are key problems of music retrieval. Much of MIR research has been geared towards...