Auto-annotation of multimedia contents: theory and application

Chapter 1 Introduction

With the steady progress in image/video compression and communication technologies, many home users now have high-bandwidth cable connections that let them view images and DVD-quality videos. Many users are putting large amounts of digital images and videos online, and more and more media content providers are delivering live or on-demand images and videos over the Internet. While the amount of image/video data, including image/video collections, is increasing rapidly, multimedia applications are still very limited in their content management capabilities. There is a growing demand for new techniques that can efficiently process, model and manage image/video contents.

There are two main approaches to browsing and searching for images and videos in large multimedia collections (Smith et al. 2003). One is query by example (QBE), in which an image or a video is used as the query. Almost all QBE systems use visual content features such as color, texture and shape as the basis for retrieval. These low-level content features are inadequate to model the contents of images and videos effectively. Moreover, it is difficult to formulate precise queries using visual features or examples. As a result, QBE is not very effective and is not readily accepted by ordinary users.

The other approach is query by concepts or keywords (QBK), which essentially retrieves images based on the text annotations attached to images/videos. (Throughout the thesis, we use the terms "concept" and "keyword" interchangeably.) The QBK approach is easy to use and is readily accepted by ordinary users because humans think in terms of semantics. However, for QBK to be effective, good annotations for images and videos are needed. As most current image/video collections come with no annotations, or with few and incomplete ones, effective techniques must be developed to annotate images. Most commercial image collections are annotated manually.
As image/video collections are large, on the order of 10^4 to 10^7 items or more, manually annotating or labeling them is tedious, time consuming and error prone. Recently, supervised statistical learning approaches have been developed to perform automatic and semi-automatic annotation in order to reduce human effort. However, the performance of such supervised learning approaches is still low. Moreover, they need large amounts of labeled training samples to learn the target concepts. These problems have motivated our research to explore machine learning approaches that perform auto-annotation of large image/video collections.

Throughout this thesis, we use the term image to denote both image and video. We also loosely use the terms keyword and concept interchangeably to denote text annotations of images. In a way, the problem is reduced to learning a set of classifiers, one for each predefined keyword or concept, in order to automatically annotate the contents of an image. Although the title of the thesis refers to multimedia contents, this thesis is about visual contents, namely images and videos.

1.1 Motivation

There are several factors that motivate our research:
(1) There are large collections of images/videos that need annotation: such collections typically come with incomplete annotations or with none at all. However, for effective searching and browsing, users prefer to use semantic concepts or keywords.
(2) Supervised learning approaches still need large amounts of training data for effective learning: manually annotating a large amount of training data is error prone, and the errors could affect the final learning performance. Because of subjective judgment and perceptual differences, different users/experts may assign different annotations to the same image/video. It is therefore important to minimize the amount of manually labeled data required to learn the target concepts.
(3) The need to develop effective techniques for auto-annotation of image collections: the goal is to find an effective way to auto-annotate image/video collections based on a predefined list of concepts while requiring as little training data as possible. The annotated image/video collections can then be used as the basis to support keyword-based search and filtering of images.

1.2 Our Approaches

In this dissertation, we propose a scalable and flexible framework to automatically annotate large collections of images and videos based on a predefined list of concepts. The framework is based on the idea of hierarchical learning, i.e., performing auto-annotation from the image region level to the image level, which is consistent with human cognition. In particular, at the region level, we assign one or more predefined concepts to a region based on the association between visual contents and concepts, while at the image level, we make use of contextual relationships among the concepts and regions to disambiguate the concepts learned. The framework is open and can incorporate different base learners, ranging from traditional single-view learners to multi-view learners. In the multi-view learning approach, two learners, representing two orthogonal views of the problem, are trained to solve the problem collaboratively. In addition, it is well known that labeling training data for machine learning is tedious, time consuming and error prone, especially for multimedia data. Consequently, it is of utmost importance to minimize the amount of labeled data needed to train the classifiers for the target concepts.

Based on the framework, we implement three learning approaches to auto-annotate image collections, with the aim of contrasting the scalability, flexibility and effectiveness of the different learning approaches, and of assessing the extensibility and efficiency of the framework. The three learning approaches are as follows:
(1) The fully supervised single-view learning-based approach.
In this approach, we consider the use of fully supervised SVM classifiers, one for each concept, to associate the visual contents of segmented regions with concepts. In order to alleviate the unreliability of current segmentation methods, we employ two different segmentation methods to derive two sets of segmented regions for each image. We then employ a contextual model to disambiguate the concepts learned from multiple overlapping regions. We also evaluate the performance of classifiers built on two types of SVM: (a) hard SVM, which returns a binary decision for each concept; and (b) soft SVM, which returns a probability value for each concept.
(2) A bootstrapping scheme with two view-independent learners. Here we develop two sets of view-independent SVM classifiers by using two disjoint subsets of content features for each region. The two sets of view-independent classifiers are then used in a co-training framework to learn the concepts for each region collaboratively. As with the single-view approach, we investigate: (a) the annotation of regions generated by two different segmentation methods; (b) the use of a contextual model to disambiguate the concepts learned from multiple overlapping regions; and (c) the performance of both hard and soft SVM models for training the classifiers. We compare the performance of the bootstrapping approach with that of the fully supervised single-view approach. We expect the performance of the bootstrapping approach to be comparable to or better than that of the fully supervised single-view approach, while requiring a much smaller set of training samples. In addition, we investigate the role of active learning, in which users/experts participate in the learning loop by manually judging the classes of samples selected by the system. We aim to demonstrate that the resulting bootstrapping cum active learning framework is scalable and requires a much smaller set of training samples to kick-start the learning process.
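The co-training idea behind approach (2) can be illustrated with a small, self-contained sketch. It is not the implementation used in this thesis: a nearest-centroid learner stands in for the SVM classifiers, each view is a plain tuple of numbers, and all names (CentroidLearner, train_view, co_train, per_round) are hypothetical. In each round, the learner trained on one view labels the unlabeled samples it is most confident about and adds them to the shared training pool, so that the learner for the other view can benefit from them.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class CentroidLearner:
    """Stand-in base learner (an assumption, not the SVM of the thesis)."""

    def fit(self, vectors, labels):
        # The training data must contain both classes (labels 0 and 1).
        self.pos = centroid([v for v, t in zip(vectors, labels) if t == 1])
        self.neg = centroid([v for v, t in zip(vectors, labels) if t == 0])
        return self

    def score(self, v):
        # Positive when v lies closer to the positive centroid; the
        # magnitude serves as a crude confidence value.
        return distance(v, self.neg) - distance(v, self.pos)


def train_view(labeled, view):
    """Train one learner on a single view of the labeled samples."""
    return CentroidLearner().fit([s[view] for s, _ in labeled],
                                 [t for _, t in labeled])


def co_train(seed, unlabeled, rounds=2, per_round=1):
    """Grow the labeled set for one concept using two feature views.

    seed:      list of ((view_a, view_b), label) with label in {0, 1}
    unlabeled: list of (view_a, view_b)
    """
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            learner = train_view(labeled, view)
            # The most confident predictions of this view come first.
            pool.sort(key=lambda s: abs(learner.score(s[view])), reverse=True)
            for sample in pool[:per_round]:
                label = 1 if learner.score(sample[view]) > 0 else 0
                labeled.append((sample, label))
            pool = pool[per_round:]
    return train_view(labeled, 0), train_view(labeled, 1), labeled


# Toy data: view_a and view_b are two disjoint 1-D feature subsets.
seed = [(((1.0,), (1.0,)), 1), (((0.0,), (0.0,)), 0)]
unlabeled = [((0.9,), (0.8,)), ((0.1,), (0.2,)),
             ((0.8,), (0.9,)), ((0.2,), (0.1,))]
learner_a, learner_b, grown = co_train(seed, unlabeled)
```

Under the same assumptions, an active-learning variant would route the least confident samples to a human judge instead of accepting the learner's own label.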
(3) A bootstrapping scheme for Web image/video mining. In order to evaluate the flexibility and extensibility of our bootstrapping cum active learning framework, we apply the framework to web image/video annotation and retrieval. Web images possess both intrinsic visual contents (visual view) and text annotations derived from the associated HTML pages (text view). We develop two sets of view-independent SVM-based classifiers based on these two orthogonal views. For effective learning, we also incorporate a language model into our framework.

1.3 Contributions

In this thesis, we make the following three contributions:
1. We introduce the two-level framework for the auto-annotation of images/videos. The framework generates multiple sets of segmented regions and concepts and employs a contextual model at the image level to disambiguate the concepts. It is designed to incorporate different base learners.
2. We propose and implement a bootstrapping cum active learning approach and incorporate it into our framework, which reduces the amount of labeled samples required for effective learning.
3. We extend the bootstrapping approach to handle the heterogeneous media on the Internet and explore web image annotation and retrieval by incorporating both visual and textual features.

1.4 Thesis Overview

The dissertation is organized as follows:
Chapter 2 discusses the basic question of what auto-annotation of images/videos is. The chapter also motivates the need for such an approach and discusses how we may measure the performance of learning approaches.
Chapter 3 overviews the state-of-the-art research on image/video retrieval, including feature extraction, feature dimension reduction, indexing and retrieval.
Chapter 4 overviews existing research on statistical learning, its principles and related applications. We also introduce the ideas of supervised, unsupervised, semi-supervised and active learning schemes.
Chapter 5 discusses our proposed hierarchical image/video semantic concept structure and learning framework. Based on the hierarchical lexicon (structure), we introduce the learning framework, which is open and can incorporate different base learners.
In Chapter 6, we propose and implement an approach to auto-annotate image/video collections using the framework introduced in Chapter 5. In this chapter, we evaluate and verify the framework with a traditional learning approach, i.e., a single-view learner.
Chapter 7 extends the work in Chapter 6 by proposing a bootstrapping approach to annotate large image/video collections. Our bootstrapping cum active learning approach performs image annotation by using a small set of labeled data and a large set of unlabeled data.
Chapter 8 applies the bootstrapping framework to annotate and retrieve WWW images. Here we explore the integration of an image's visual features (visual view) and its associated textual features (textual view) to collaboratively learn the semantic concepts for web images. We also evaluate different combinations of visual and textual features to find an effective combination for the auto image annotation task.
Finally, Chapter 9 concludes the thesis with a discussion of future research.

Chapter 2 Auto-annotation of Images

In this chapter, we discuss what we mean by auto-annotation of images, why one needs to perform auto-annotation of images, why it is difficult, some possible approaches, why we use machine learning techniques, and which characteristics make auto-annotation of images difficult for machine learning. We also discuss several evaluation criteria for machine learning approaches.

2.1 What is Auto-annotation of Images?

The purpose of auto-annotation of images is to assign the appropriate concept labels (or keywords) to each image. In fact, concepts are categories that describe the main contents of images (one image may belong to multiple categories).
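As an illustration only, the following sketch treats annotation as multi-label classification with one binary decision function per predefined concept. The concept names, the three-dimensional feature vector and the hand-set linear weights are hypothetical; in practice each decision function would be a trained classifier.

```python
from typing import Callable, Dict, List, Sequence

# A concept learner maps a feature vector to a yes/no decision.
ConceptLearner = Callable[[Sequence[float]], bool]


def linear_learner(weights: Sequence[float], bias: float) -> ConceptLearner:
    """Build a toy binary decision function for one concept."""
    def decide(features: Sequence[float]) -> bool:
        return sum(w * x for w, x in zip(weights, features)) + bias > 0
    return decide


def annotate(features: Sequence[float],
             learners: Dict[str, ConceptLearner]) -> List[str]:
    """Pass the image through every concept learner in turn and
    collect the concepts whose learner accepts it."""
    return [concept for concept, learner in learners.items()
            if learner(features)]


# Hypothetical features: (blueness, greenness, texture roughness).
learners = {
    "sky": linear_learner([1.0, 0.0, -0.5], -0.5),
    "grass": linear_learner([0.0, 1.0, 0.0], -0.5),
    "rock": linear_learner([-0.2, -0.2, 1.0], -0.3),
}

labels = annotate([0.9, 0.8, 0.1], learners)  # -> ['sky', 'grass']
```

An image may receive several concepts, exactly one, or none at all, depending on how many of the per-concept decisions are positive.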
Auto-annotation of images is thus basically a classification or pattern recognition problem. Our aim is to learn the annotation of as many concepts as possible instead of just the main theme of the image. The list of concepts is predetermined. Each image can be assigned one or more concepts. We can imagine that an image in a collection enters a concept channel composed of concept learners arranged in tandem; when the image exits, it has been assigned the concepts that represent its contents. The process by which the different learners determine the concepts of the image draws on many aspects of pattern recognition and natural language analysis, as there are interactions among the concepts and between the context of the regions and the content of the image, as we will see in later chapters.

2.2 Why Do We Need Auto-annotation of Images?

Semantic concepts of images are very important for multimedia retrieval because humans think in terms of semantics. Although content-based image/video retrieval (CBIR) has been developed and has achieved a sufficiently high level of retrieval effectiveness, it has been limited to the research community and is not readily accepted by ordinary users. One important reason is that it is difficult for ordinary users to master and understand the relationship between what they want and the low-level visual features used in CBIR systems. With semantic concepts attached to images, this problem can be handled easily. Auto-annotation of images can be used for:
1. Routing/filtering images/videos of interest, such as animals, plants, vehicles, etc.
2. Image/video retrieval: assigning semantic concepts to images so that ordinary users can master and understand retrieval for their own use.
3. Using the assigned semantic concepts as the initial retrieval step for a content-based image/video retrieval system, and then performing relevance feedback so that the user actively participates in the process.
4. Combining with web techniques to enrich web content search and to facilitate image search in the way text search engines do.

However, auto-annotation of images is a very difficult and challenging task. There are many reasons:

References

Y. Cao, H. Li & L. Lian. (2003). Uncertainty Reduction in Collaborative Bootstrapping: Measure and Algorithm. in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Japan.
C. Carson, M. Thomas, S. Belongie et al. (1999). BlobWorld: A System for Region-based Image Indexing and Retrieval. in Int. Conf. Visual Info. Sys.
R. M. Centor. (1991). The Use of ROC Curves and Their Analyses. Medical Decision Making 11:102-6.
Edward Chang, King-Shy Goh, Gerard Sychay et al. (2003). CBSA: Content-based Soft Annotation for Multimodal Image Retrieval Using Bayes Point Machines. IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Conceptual and Dynamical Aspects of Multimedia Content Description 13:26-38.
Edward Chang & B. Li. (2003). MEGA: The Maximizing Expected Generalization Algorithm for Learning Complex Query Concepts. ACM Transactions on Information Systems (TOIS).
N. S. Chang & K. S. Fu. (1980). Query-by-Pictorial-Example. IEEE Transactions on Software Engineering 6:519-524.
Shi-Kuo Chang & Arding Hsu. (1992). Image information systems: where do we go from here? IEEE Transactions on Knowledge and Data Engineering 4:441-442.
Shih-Fu Chang. (2002). The Holy Grail of Content-based Media Analysis. IEEE Multimedia.
Shih-Fu Chang, A. Eleftheriadis & Robert McClintock. (1998). Next-generation content representation, creation and searching for new media applications in education. IEEE Proceedings 86:602-615.
T. S. Chua, Y. L. Zhao, L. Chaisorn et al. (2003a). TREC 2003 Video Retrieval and Story Segmentation Task at NUS PRIS. in http://www-nlpir.nist.gov/projects/tv.pubs.org.
Tat-Seng Chua & C. X. Chu. (1998). Pseudo Object Method for Image Retrieval with Relevance Feedback.
in 1st International Conference on Advanced Multimedia Content Processing, Osaka, Japan.
Tat-Seng Chua, Chun-Xin Chu & Mohan Kankanhalli. (1999). Relevance Feedback Techniques for Image Retrieval Using Multiple Attributes. in Proceedings of IEEE International Conference on MM Computing & Systems (ICMCS'99), Florence, Italy.
Tat-Seng Chua, HuaMin Feng & A. Chandra. (2003b). Shot detection with support vector machines. in Proceedings of IEEE ICASSP, Hong Kong.
Tat-Seng Chua, S. K. Lim & H. K. Pung. (1994). Content-based Retrieval of Segmented Images. Pages 211-218 in Proceedings of ACM Multimedia, San Francisco.
Tat-Seng Chua & Jimin Liu. (2002). Learning Pattern Rules for Chinese Named-entity Extraction. Pages 411-418 in AAAI'2002, Edmonton, Canada.
Tat-Seng Chua, Wai-Chee Low & Chun-Xin Chu. (1998). Relevance Feedback Techniques for Color-based Image Retrieval. Pages 24-31 in Proceedings of Multimedia Modeling, Lausanne, Switzerland.
Tat-Seng Chua, K. L. Tan & B. C. Ooi. (1997). Fast Signature-based Color-Spatial Image Retrieval. Pages 362-369 in Proceedings of IEEE International Conference on Multimedia Computing and Systems, Ontario, Canada.
J. M. Coggins & A. K. Jain. (1985). A Spatial Filtering Approach to Texture Analysis. Pattern Recognition Letters 3:195-203.
David A. Cohn, Zoubin Ghahramani & Michael I. Jordan. (1996). Active Learning with Statistical Models. Pages 129-145.
M. Collins & Y. Singer. (1999). Unsupervised Models for Named Entity Classification. in Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
G. Cortelazzo, G. A. Mian, G. Vezzi et al. (1994). Trademark Shapes Description by String Matching Techniques. Pattern Recognition 27:1005-1018.
Ingemar J. Cox, Matt L. Miller, Stephen M. Omohundro et al. (1996). PicHunter: Bayesian Relevance Feedback for Image Retrieval.
Nello Cristianini & John Shawe-Taylor. (2003).
An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
I. Dagan & S. Engelson. (1995). Committee-based Sampling for Training Probabilistic Classifiers. Pages 150-170 in Proceedings of the Twelfth International Conference on Machine Learning.
L. S. Davis. (1979). Shape Matching Using Relaxation Techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1:60-72.
Y. Deng & B. S. Manjunath. (2001). Unsupervised Segmentation of Color-texture Regions in Images and Video. IEEE Transactions on Pattern Analysis and Machine Intelligence 23:800-810.
Nevenka Dimitrova. (2003). Multimedia Content Analysis: The Next Wave. Lecture Notes in Computer Science 2728:8-17.
Richard O. Duda & Peter E. Hart. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons.
Richard O. Duda, Peter E. Hart & David G. Stork. (2000). Pattern Classification. John Wiley & Sons, Inc.
S. Dumais, J. Platt, D. Heckerman et al. (1998). Inductive Learning Algorithms and Representations for Text Classification. in Proceedings of the 7th International Conference on Information and Knowledge Management.
P. Enser & C. Sandom. (2003). Towards a Comprehensive Survey of the Semantic Gap in Visual Image Retrieval. International Conference on Image and Video Retrieval 2728:291-299. Springer.
C. Faloutsos, M. Flickner et al. (1993). Efficient and Effective Querying by Image Content. IBM Research Report.
Christos Faloutsos & King-Ip David Lin. (1995). FASTMAP: A Fast Algorithm for Indexing, Data Mining and Visualization of Traditional and Multimedia Datasets. Pages 163-174 in Proceedings of SIGMOD.
Christiane Fellbaum. (1997). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Mass.
HuaMin Feng & Tat-Seng Chua.
(2003). A Bootstrapping Approach to Annotating Large Image Collection. in 5th International Workshop on Multimedia Information Retrieval.
HuaMin Feng & Tat-Seng Chua. (2004). A Learning-based Approach for Annotating Large On-Line Image Collection. in Multimedia Modeling, Brisbane, Australia.
M. Flickner, Harpreet Sawhney, Wayne Niblack et al. (1995). Query by Image and Video Content: The QBIC System. IEEE Computer Magazine 28:23-32.
Y. Freund & R. Schapire. (1996). Experiments with a New Boosting Algorithm. Pages 148-156 in Proceedings of the Thirteenth International Conference on Machine Learning.
Yoav Freund, H. Sebastian Seung & Eli Shamir. (1997). Selective Sampling Using the Query by Committee Algorithm. Machine Learning. Pages 133-168.
B. Furht. (1998). The Handbook of Multimedia Computing: Chapter 13 - Content-based Image Indexing and Retrieval. CRC Press LLC.
King-Shy Goh, Edward Chang & Kwang-Ting Cheng. (2001). SVM Binary Classifier Ensembles for Image Classification. Pages 395-402 in Proceedings of the Tenth International Conference on Information and Knowledge Management, Atlanta, Georgia, USA.
Amarnath Gupta & Ramesh Jain. (1997). Visual Information Retrieval. Communications of the ACM 40:71-79.
R. Hall. (1989). Illumination and Color in Computer Generated Imagery. Springer-Verlag, New York.
A. Hauptmann, R. V. Baron, M.-Y. Chen et al. (2003). Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video. in http://www-nlpir.nist.org/projects/tv.pubs.org.
Ralf Herbrich. (2002). Learning Kernel Classifiers: Theory and Algorithms. The MIT Press, Cambridge, Massachusetts, London, England.
R. Herbrich, T. Graepel & C. Campbell. (2001). Bayes Point Machines. Journal of Machine Learning Research 1:245-279.
W. S. Hsu, Tat-Seng Chua & H. K. Pung. (1995).
Integrated Color-spatial Approach to Content-based Image Retrieval. Pages 305-313 in Proceedings of ACM Multimedia.
P. W. Huang & Y. R. Jean. (1994). Using 2D C+-strings as Spatial Knowledge Representation for Image Database Systems. Pattern Recognition 27:1249-1257.
Thomas S. Huang, Sharad Mehrotra & Kannan Ramchandran. (1996). Multimedia Analysis and Retrieval System (MARS) Project. in Proceedings of 33rd Annual Clinic on Library Application of Data Processing - Digital Image Access and Retrieval.
D. P. Huttenlocher, G. A. Klanderman & W. J. Rucklidge. (1993). Comparing Images Using the Hausdorff Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15:850-863.
J. Mao & A. K. Jain. (1992). Texture Classification and Segmentation Using Multiresolution Simultaneous Autoregressive Models. Pattern Recognition 25:173-188.
T. Jaakkola & D. Haussler. (1999). Probabilistic Kernel Regression Models. in Proceedings of the 1999 Conference on AI and Statistics.
Tommi Jaakkola & David Haussler. (1998). Exploiting Generative Models in Discriminative Classifiers. Advances in Neural Information Processing Systems 11.
A. K. Jain & F. Farrokhnia. (1991). Unsupervised Texture Segmentation Using Gabor Filters. Pattern Recognition 24:1167-1186.
A. K. Jain & A. Vailaya. (1998). Shape-based Retrieval: A Case Study with Trademark Image Databases. Pattern Recognition 31:1369-1390.
Ramesh Jain. (1995). Workshop Report: NSF Workshop on Visual Information Management System. in Proceedings of SPIE Storage and Retrieval for Image and Video Databases.
Ramesh Jain, R. Kasturi & B. Schunck. (1995). Machine Vision. MIT Press, New York.
T. Joachims. (1998). Making Large-scale SVM Learning Practical. Cambridge: MIT Press.
T. Joachims. (1999a). Transductive Inference for Text Classification using Support Vector Machines. Pages 200-209 in Proceedings of the Sixteenth International Conference on Machine Learning.
Thorsten Joachims. (1999b).
Transductive Inference for Text Classification using Support Vector Machines. Pages 200-209 in Proceedings of ICML-99, 16th International Conference on Machine Learning.
D. J. Kahl, A. Rosenfeld & A. Danker. (1980). Some Experiments in Point Pattern Matching. IEEE Transactions on Systems, Man and Cybernetics 10:105-116.
M. Kearns, M. Li & L. Valiant. (1994). Learning Boolean Formulae. ACM.
Y. LeCun, L. D. Jackel, L. Bottou et al. (1995). Comparison of Learning Algorithms for Handwritten Digit Recognition. Pages 53-60 in Artificial Neural Networks, Paris.
D. D. Lewis & W. A. Gale. (1994). A Sequential Algorithm for Training Text Classifiers. Pages 3-12 in Proceedings of ACM SIGIR, London, U.K.
Beitao Li, Wei-Cheng Lai, Edward Chang et al. (2001). Mining Image Features for Efficient Query Processing. in IEEE Data Mining.
Library of Congress. (2002). Thesaurus for Graphic Materials I and II. Library of Congress. http://www.loc.gov/rr/print/tgm1.
Rainer Lienhart & Alex Hartmann. (2002). Classifying Images on the Web Automatically. Journal of Electronic Imaging 11.
R. Liere & P. Tadepalli. (1997). Active Learning with Committees for Text Classification. Pages 591-596 in Proceedings of AAAI.
Ray Liere. (1999). Active Learning with Committees: An Approach to Efficient Learning in Text Categorization Using Linear Threshold Algorithms.
Wenyin Liu, Susan Dumais, Yanfeng Sun et al. (2001). Semi-Automatic Image Annotation. Pages 326-333 in IEEE Conference on HCI.
W. Y. Ma & B. S. Manjunath. (1995). A Comparison of Wavelet Transform Features for Texture Image Annotation. in Proceedings of IEEE International Conference on Image Processing.
W. Y. Ma & B. S. Manjunath. (1996). Texture Features and Learning Similarity.
in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition.
W. Y. Ma & B. S. Manjunath. (1997a). Edge Flow: A Framework for Boundary Detection and Image Segmentation. in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition.
W. Y. Ma & B. S. Manjunath. (1997b). Netra: A Toolbox for Navigating Large Image Databases. in Proceedings of IEEE International Conference on Image Processing.
J. Malik & P. Perona. (1990). Preattentive Texture Discrimination with Early Vision Mechanisms. J. Opt. Soc. Am. A 7:923-932.
S. G. Mallat & Z. F. Zhang. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 41:3397-3415.
B. S. Manjunath & W. Y. Ma. (1996). Texture Features for Browsing and Retrieval of Image Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 18:837-842.
B. S. Manjunath & W. Y. Ma. (1997). Image Indexing Using a Texture Dictionary. in Proceedings of SPIE Storage and Retrieval for Image and Video Databases.
B. S. Manjunath, Philippe Salembier & Thomas Sikora. (2002). Introduction to MPEG-7: Multimedia Content Description Language. John Wiley, Chichester.
Andrew McCallum & Kamal Nigam. (1998). Employing EM in Pool-Based Active Learning for Text Classification. in Proceedings of ICML-98, 15th International Conference on Machine Learning.
Sharad Mehrotra, Kaushik Chakrabarti, Mike Ortega et al. (1997). Multimedia Analysis and Retrieval System. in Proceedings of 3rd International Workshop on Information Retrieval Systems.
Milind R. Naphade, T. Kristjansson, B. Frey et al. (1998). Probabilistic Multimedia Objects (Multijects): A Novel Approach to Indexing and Retrieval in Multimedia Systems. Pages 536-540 in Proceedings of IEEE International Conference on Image Processing (ICIP).
T. P. Minka & R. W. Picard. (1996). Interactive learning using a "society of models". in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
Tom M. Mitchell.
(1997). Machine Learning. McGraw-Hill.
Y. Mori, H. Takahashi & R. Oka. (1999). Image-to-word Transformation Based on Dividing and Vector Quantizing Images With Words. in First International Workshop on Multimedia Intelligent Storage and Retrieval Management.
R. Morris, X. Descombes & J. Zerubia. (1997). Fully Bayesian Image Segmentation - An Engineering Perspective. in Proceedings of International Conference on Image Processing, Santa Barbara, CA.
I. Muslea, S. Minton & C. A. Knoblock. (2000). Selective Sampling with Co-testing. in CRM Workshop on Combining and Selecting Multiple Models with Machine Learning, Montreal, QC, Canada.
C. Nakajima, I. Norihiko, M. Pontil et al. (2000). Object Recognition and Detection by a Combination of Support Vector Machine and Rotation Invariant Phase Only Correlation. in Proceedings of International Conference on Pattern Recognition.
Milind R. Naphade. (2001). A Probabilistic Framework for Mapping Audio-Visual Features to High-level Semantics in terms of Concepts and Context.
Milind R. Naphade & John R. Smith. (2003). A Hybrid Framework for Detecting the Semantics of Concepts and Context. 2728:196-205. Springer-Verlag, Heidelberg.
Raymond Ng & Andishe Sedighian. (1996). Evaluating Multi-dimensional Indexing Structures for Images Transformed by Principal Component Analysis. in Proceedings of SPIE Storage and Retrieval for Image and Video Databases.
K. Nigam & R. Ghani. (2000). Analyzing the Effectiveness and Applicability of Co-training. in Proceedings of the 9th International Conference on Information and Knowledge Management.
Michael Ortega, Yong Rui, Kaushik Chakrabarti et al. (1997). Supporting Similarity Queries in MARS. in Proceedings of ACM Conference on Multimedia.
Lexin Pan. (2003). Image8: An Image Search Engine for the Internet. Honors year project report, School of Computing, National University of Singapore, Singapore.
Thomas V. Papathomas, Tiffany E. Conway, I. J. Cox et al. (1998).
Psychophysical Studies of the Performance of an Image Database Retrieval System. in Proceedings of IS&T/SPIE Conference on Human Vision and Electronic Imaging III.
G. Pass, R. Zabih & J. Miller. (1996). Comparing Images Using Color Coherence Vectors. Pages 65-73 in Proceedings of ACM Multimedia, Boston, Massachusetts.
T. Pavlidis. (1978). Algorithms for Shape Analysis of Contours and Waveforms. in Proceedings of Fourth International Joint Conference on Pattern Recognition, Kyoto, Japan.
A. Pentland. (1984). Fractal-based Description of Natural Scenes. IEEE Transactions on Circ. Sys. Video Tech. 9:661-674.
A. Pentland, R. W. Picard & S. Sclaroff. (1996). Photobook: Content-based Manipulation of Image Databases. Int. J. Computer Vision 18:233-254.
R. W. Picard & T. P. Minka. (1995a). Vision Texture for Annotation. Multimedia Systems.
R. W. Picard & T. P. Minka. (1995b). Visual Texture for Annotation. ACM Multimedia System 3:1-11.
David Pierce & Claire Cardie. (2001). Limitations of Co-training for Natural Language Learning from Large Datasets. in Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing.
John C. Platt. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. in Advances in Large Margin Classifiers. MIT Press.
F. Provost, T. Fawcett & R. Kohavi. (1998). The Case against Accuracy Estimation for Comparing Induction Algorithms. in J. Shavlik, editor. Proceedings of the Fifteenth International Conference on Machine Learning (ICML98).
J. R. Quinlan. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Ross Quinlan. (2003). Data Mining Tools See5 and C5.0. http://www.rulequest.com/see5info.html.
Bhavani Raskutti, Herman Ferra et al. (2002). Using Unlabeled Data for Text Classification through Addition of Cluster Parameters. Pages 514-521 in Proceedings of 19th International Conference on Machine Learning (ICML-2002).
Richard O. Duda, Peter E. Hart & David G. Stork.
(2000). Pattern Classification. John Wiley & Sons, Inc.
Bernice E. Rogowitz, Thomas Frese, John Smith et al. (1998). Perceptual Image Similarity Experiments. in Proceedings of IS&T/SPIE Conference on Human Vision and Electronic Imaging III.
Yong Rui & Thomas S. Huang. (2000). Optimizing Learning in Image Retrieval. in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR).
Yong Rui, Thomas S. Huang & Shih-Fu Chang. (1999). Image Retrieval: Current Techniques, Promising Directions and Open Issues. Journal of Visual Communication and Image Representation.
Yong Rui, Thomas S. Huang, Michael Ortega et al. (1998). Relevance Feedback: A Powerful Tool in Interactive Content-based Multimedia Information Retrieval Systems. IEEE Transactions on Circ. Sys. Video Tech. 8:644-655.
G. Salton & M. J. McGill. (1983). Introduction to Modern Information Retrieval. McGraw Hill.
H. M. Sanderson & M. D. Dunlop. (1997). Image Retrieval by Hypertext Links. Pages 296-303 in ACM SIGIR.
R. Schettini. (1994). Multicolored Object Recognition and Location. Pattern Recognition Letters 15:1089-1097.
Nicu Sebe, Michael S. Lew, Xiang Zhou et al. (2003). The State of the Art in Image and Video Retrieval. Springer-Verlag, Heidelberg.
H. S. Seung, M. Opper & H. Sompolinsky. (1992). Query by Committee. in Computational Learning Theory.
Heng-Tao Shen, Beng-Chin Ooi & Kian-Lee Tan. (2000). Giving Meaning to WWW Images. Pages 39-47 in ACM Multimedia, LA, USA.
M. Shneier & M. Abdel-Mottaleb. (1996). Exploiting the JPEG Compression Scheme for Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 18:849-853.
Rui Shi, HuaMin Feng, Tat-Seng Chua et al. (2004). An Adaptive Image Content Representation and Segmentation Approach to Automatic Image Annotation. in Proceedings of International Conference on Image and Video Retrieval (CIVR'04), Dublin.
Shlomo Argamon-Engelson & I. Dagan. (1999).
Committee-Based Sample Selection for Probabilistic Classifiers. Journal of Artificial Intelligence Research 11:335-360. A. Smeanton, V. Kraaij & P. Over. (2003). TRECID 2003 - An Introduction. in http://www-nlpir.nist.gov/projects/tv.pubs.org. John R. Smith & S.-F. Chang. (1996a). VisualSeek: A Fully Automated Content-based Query System. Pages 87-92 in In Proc. Fourth Int. Conf. Multimedia, ACM. John R. Smith & S.-F. Chang. (1996b). VisualSeek:A Fully Automated Content-based Query System. in. John R. Smith & Shih-Fu Chang. (1996c). VisualSEEK: A Fully Automated Contentbased Image Query System. in Proceedings of ACM Multimedia. John R. Smith & Shih-Fu Chang. (1997a). Enhancing Image Search in Visual Information Environments. in IEEE 1st Multimedia Signal Processing Workshop. John R. Smith & Shih-Fu Chang. (1997b). Visually Searching the Web for Content. IEEE Multimedia 4:12-20. John R. Smith, Milind Naphade & Apostol Paul Natsev. (2003). Multimedia Semantic Indexing Using Model Vectors. in ICME 2003. 164 M. Stricker & A. Dimai. (1996). Color Indexing with Weak Spatial Constraints. Pages 29-41 in Proceedings of SPIE Storage and Retrieval for Image and Video Databases IV, San Jose. J. A. Swets & R. M. Pickett. (1982). Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press, New York. Hideyuki Tamura, Shunji Mori & Takashi Yamawaki. (1978). Texture Feature corresponding to Visual Perception. IEEE Transaction on sys, man and Cyb, SMC 8:460-473. Simon Tong & Edward Chang. (2001). Support Vector Machine Active Learning for Image Retrieval. in In Proc. of international conference on Multimedia . Simon Tong & Daphne Koller. (2000). Support Vector Machine Active Learning with Applications to Text Classification . in Proceedings of ICML-00, 17th International Conference on Machine Learning. M. Tuceryan & A. K. Jain. (1990). Texture Segmentation using Voronoi Polygons. IEEE Transaction on Pattern Analysis and Machines Intelligence 12:211-216. M. 
Tuceryan & A. K. Jain. (1993). Texture Analysis. Pages 235-276 in C.H.Chen, L.F.Pau, and P.S.P.Wang editors. Handbook of Pattern Recognition and Computer Vision. World Scientific Publishing Company. A. Vailaya, M. Feigueiredo, A. Jain et al. (1999). Content-based Hierarchical classification of Vacation Images. in Proceedings of IEEE Multimedia Systems (International Conference on Multimedia Computing and Systems), Florence, Italy. A. Vailaya, A. K. Jain & H. J. Zhang. (1998). On Image Classification: City Images vs. landscapes. Pattern Recognition 31:1921-1936. V. Vapnik. (1995). The Nature of Statistical Learning Theory. Springer, New York. Vladimir Vapnik. (1998). Statistical Learning Theory. Wiley. Vladimir. Vapnik. (1999). The Nature of Statistical Learning Theory. Springer, New York. Vladimir Vapnik. (1995). The Nature of Statistical Learning Theory. Springer, New York. James Z. Wang. (2000). SIMPLIcity: a Region-based Image Retrieval System for Picture Libraries and Biomedical Image Databases. in Proceedings of ACM Multimedia. James Z. Wang. (2001). Integrated Region-based Image Retrieval. Kluwer Academic Publishers, Boston/Dordrecht/London. James Z. Wang & Jia Li. (2002). Learning-based Linguistic Indexing of Pictures with 2D MHHMs. Pages 436-445 in The 10th ACM Int. Conference on Multimedia. Sholom M. Weiss & Casimir A. Kulikowski. (1990). Computer Systems that Learn. Morgan Kaufmann. D. White & R. Jain (1996a). Algorithms and Strategies for Similarity Retrieval. TR VCL96-101. 1996a. University of California, San Diego D. White & R. Jain. (1996b). Similarity Indexing: Algorithms and Performance. in In Proceedings of SPIE Storage and Retrieval for Image and Video Databases. Gang Wu & Edward Chang (2003). Adaptive Feature-Space Conformal Transformation for Imbalanced-data. 2003. Washington DC, Proceedings of the Twentieth International Conference on Machine Learning(ICML-2003) Y.Deng & B.S.Manjunath. (2001). 
Unsupervised Segmentation of Color-texture Regions in Images and video. IEEE Trans. on Pattern Analysis and Machine Intelligence. 23:800-810. 165 Keiji Yanai. (2003). Generic Image Classification Using Visual Knowledge on the Web. in In Proceeding of International conference on ACM multimedia. C. T. Zahn & R. Z. Roskies. (1972). Fourier Descriptors for Plane Closed Curves. IEEE Transaction on Computers C-21:269-281. Cha Zhang & Tsuhan Chen. (2002). An Active Learning Framework for Content-based Information Retrieval. IEEE Transactions on Multimedia 4:260-268. 166 Appendix A Table A.1 Hierarchical semantic concepts structure Order 1st level 2nd level 3rd level 4th level Fishes Golden fish Comments Unknown Aquatic Animal Whales Unknown Unknown* Tigers Animals Lions Bears Dogs Cats Birds Unknown People Female Male Skiing Ice skate Winter sports Ice hokey unknown Sports Football 167 Basketball Volleyball Golf Unknown Trees Grass Plants Cactus Unknown Roses Tulips Flowers Sunflower Unknown Automobiles Taxi(Cabs) Trucks Vehicles Planes Aircraft Helicopter Unknown Unknown Transportation Railroad Unknown Bridge Transportation Road(highway) 168 facilities Tunnel Known Rock(Mountains) Cliffs Land Island Beach Known Fruit Meat Food Wine Beverage Beer Unknown Known 10 Bodies of Harbors water Sea Waterfall Lake Unknown Art 11 Painting Sculpture unknown Fire Clouds 169 Snow 12 Natural Ice phenomena Light Sunrise Sunset Unknown Table 13 Furniture Bed Unknown Buildings Historic Old building building 14 Facilities Skyscrapers Modern building Gardens unknown 15 Office Computer equipment & Copying supplies machines Typewrites Known Hats Shirts 16 Clothing Costumes 170 Unknown Composite Air shows concept Composite Parade & concepts processing Composite Interviews concepts Composite War 18 concepts Events Composite Meetings concepts Composite Unknown concepts Dolls 19 Toys Puppets Known Screws Locks Metalwork Hardware 20 Keys Known Unknown Financial chart 21 chart 21 Null unknown Null* 
Note that: (1) “unknown*” means that the concept is not currently included in our lexicon; it will be added in the future, and we can then learn further from the examples in this concept. The “unknown” concept avoids re-annotating or re-labeling the whole data set when we extend the lexicon. (2) “null*” means that the image, or the object/region in the image, cannot be distinguished by human perception. “null” is different from “unknown”.

Appendix B Publications

1. Huamin Feng, Rui Shi and Tat-Seng Chua. A Bootstrapping Framework for Annotating and Retrieving WWW Images. ACM Multimedia 2004, Oct 2004, 960-967.
2. Rui Shi, Huamin Feng, Tat-Seng Chua and Chin-Hui Lee. An Adaptive Image Content Representation and Segmentation Approach to Automatic Image Annotation. International Conference on Image and Video Retrieval (CIVR’04), Dublin, Ireland, Jul 2004, 545-554.
3. Tat-Seng Chua and Huamin Feng. A Scalable Bootstrapping Framework for Auto-Annotation of Large Image Collections. In “Intelligent Multimedia Processing with Soft Computing”, edited volume by Yap-Peng Tan, Kim-Hui Yap and Lipo Wang, Springer Verlag, 2004, 75-90.
4. Huamin Feng and Tat-Seng Chua. A Learning-based Approach for Annotating Large On-Line Image Collection. The 10th International Multimedia Modeling Conference (MMM’04), Brisbane, Australia, Jan 2004, 249-256.
5. Huamin Feng and Tat-Seng Chua. Semantic Concepts Learning via Bootstrapping for Large Image Collection. IWAIT, Singapore, Jan 2004.
6. Huamin Feng and Tat-Seng Chua. A Bootstrapping Approach to Annotating Large Image Collection. ACM SIGMM International Workshop on Multimedia Information Retrieval, Berkeley, Nov 2003, 55-62.
7. Huamin Feng, Chandrashekhara A and Tat-Seng Chua. ATMRA: An Automatic Temporal Multi-resolution Analysis. The 9th International Multimedia Modeling Conference (MMM’03), Taipei, 2003.
8. Tat-Seng Chua, Huamin Feng and Chandrashekhara A. An Unified Framework for Shot Boundary Detection with Active Learning. IEEE ICASSP, Hong Kong, April 2003.
9. Tat-Seng Chua, Chandrashekhara A and Huamin Feng. A Temporal Multi-resolution Approach to Video Shot Segmentation. Handbook of Video Databases, CRC Press, 2003.
10. Chandrashekhara A, Huamin Feng and Tat-Seng Chua. An Unified Framework for Shot Detection and Keyframe Extraction. TREC (Text REtrieval Conference), Gaithersburg, Nov 2002.

... will survey the state of the art of image retrieval systems and statistical learning (machine learning theory), and pave the way for a scalable framework for auto-annotation of multimedia contents.

Chapter 3 Overview of Existing Research Work on Image Retrieval

In this chapter, we present an overview of the state of the art of image retrieval, image semantic analysis, image annotation and web image mining ...

segmentation, and (c) the image quality, etc.

2.6 Why We Need to Minimize the Required Amount of Labeled Data for Learning?

In many supervised learning approaches, such as those used in auto-annotation of images, we have thousands and thousands of images and many more segmented image regions, and for text documents from the web, the samples are even more than we can imagine. Thus labeling a reasonable set of samples ...

the use of random field models (Besag 1974), fractals (Pentland 1984), and SAR texture models (J. Mao and A. K. Jain 1992). Signal processing methods use frequency analysis of the image to classify texture. The schemes include the use of spatial domain filters (Malik and Perona 1990), Fourier domain filtering (Coggins and Jain 1985), Gabor filters and wavelet models (Jain and Farrokhnia 1991, Manjunath and Ma 1997) ...
one increases the number of input features, the performance of a machine learning system often degrades. There are at least two reasons why the curse of dimensionality affects the performance of a learner. One reason is that, in general, the number of samples required grows exponentially with the dimensionality of the feature space. This severely restricts the application of machine learning methods ...

(Chang and Fu 1980, Chang and Hsu 1992). However, there exist two major difficulties, especially when the image collection is large (tens or hundreds of thousands of images). One is the vast amount of labor required to annotate images manually. The other difficulty, which is more fundamental, results from the rich content of images and the subjectivity of human perception. The perception subjectivity and annotation ...

Fβ (Lewis and Gale 1994): Fβ is a function of precision, recall and a positive constant β, which is the ratio of the importance of recall to that of precision. β is determined by the needs of a particular user. When β = 0.0, recall is ignored and only precision is of concern; when β = 1.0, precision and recall are weighted equally; as β grows large, only recall is of interest.

$$F_\beta = \frac{(\beta^2 + 1)\,\mathrm{precision} \cdot \mathrm{recall}}{\beta^2\,\mathrm{precision} + \mathrm{recall}}$$

if precision $\neq 0$ and recall ...

... learning.

2.5 Why Auto-annotation of Images Is Difficult for Machine Learning?

Auto-annotation of images is difficult for machine learning for several reasons:
1. There is no clear mapping from a set of visual properties to its semantic concepts. Although region-based systems attempt to decompose images into constituent objects/regions, a representation composed of visual properties of regions ...
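As a numerical illustration of the Fβ measure (Lewis and Gale 1994) mentioned in the excerpt above, the following minimal Python sketch assumes the standard form Fβ = (β² + 1)·precision·recall / (β²·precision + recall). The function name `f_beta` and the choice to return 0.0 when the denominator vanishes are assumptions made for this example and are not taken from the thesis. Under this form, β = 0 returns the precision, β = 1 returns the harmonic mean of precision and recall, and a large β approaches the recall.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta in the standard form (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    denominator = beta ** 2 * precision + recall
    # Assumption (not from the thesis excerpt): return 0.0 when the
    # denominator is zero, where the ratio would otherwise be 0/0.
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


if __name__ == "__main__":
    precision, recall = 0.5, 0.25
    # Small beta favors precision, large beta favors recall.
    for beta in (0.0, 1.0, 1000.0):
        print(f"beta={beta}: F={f_beta(precision, recall, beta):.4f}")
```

With precision 0.5 and recall 0.25, the value moves from the precision toward the recall as β grows, which is the sense in which β expresses the relative importance of recall.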
be further classified as general features and domain-specific features. The former include color, texture and shape features, while the latter are application dependent and may include, for example, human faces and fingerprints. The domain-specific features are very diverse, and there is a large body of literature in the field of pattern recognition that involves the use of domain knowledge. Here we concentrate on ...

the complexity of image semantics and the lack of a “gold standard” (Wang 2001). It is very difficult to specify a single performance measure for use in all situations, since the choice of performance measure depends on the characteristics of the image collection and the needs of the user. In the thesis, we adopt two performance measures: one derives from information retrieval and the other from ...

erroneous and costly. In many cases, the labeling is not only one-to-one (one sample with one concept) but often one-to-many (one sample with at least one concept). Hence labeling also requires the expertise of trained personnel and a large amount of work. Because of these difficulties, finding a way to minimize the number of labeled samples is beneficial. Usually, the training set is chosen to be a random ...

many aspects of pattern recognition and natural language analysis, as there are interactions between the concepts and their relations between the context of regions and content of image as ...

a non-member of a concept of interest. In order to learn the target concept, the user typically provides a set of training samples, each of which consists of an instance $x \in X$ and its label $y$.
