Data mining for bioinformatics dua chowriappa 2012 11 06

Data Mining for Bioinformatics
Sumeet Dua
Pradeep Chowriappa

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2013 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Version Date: 20120725
International Standard Book Number-13: 978-1-4200-0430-4 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Contents

Preface
About the Authors

Section I

1 Introduction to Bioinformatics
  1.1 Introduction
  1.2 Transcription and Translation
    1.2.1 The Central Dogma of Molecular Biology
  1.3 The Human Genome Project
  1.4 Beyond the Human Genome Project
    1.4.1 Sequencing Technology
      1.4.1.1 Dideoxy Sequencing
      1.4.1.2 Cyclic Array Sequencing
      1.4.1.3 Sequencing by Hybridization
      1.4.1.4 Microelectrophoresis
      1.4.1.5 Mass Spectrometry
      1.4.1.6 Nanopore Sequencing
    1.4.2 Next-Generation Sequencing
      1.4.2.1 Challenges of Handling NGS Data
    1.4.3 Sequence Variation Studies
      1.4.3.1 Kinds of Genomic Variations
      1.4.3.2 SNP Characterization
    1.4.4 Functional Genomics
      1.4.4.1 Splicing and Alternative Splicing
      1.4.4.2 Microarray-Based Functional Genomics
    1.4.5 Comparative Genomics
    1.4.6 Functional Annotation
      1.4.6.1 Function Prediction Aspects
  1.5 Conclusion
  References

2 Biological Databases and Integration
  2.1 Introduction: Scientific Work Flows and Knowledge Discovery
  2.2 Biological Data Storage and Analysis
    2.2.1 Challenges of Biological Data
    2.2.2 Classification of Bioscience Databases
      2.2.2.1 Primary versus Secondary Databases
      2.2.2.2 Deep versus Broad Databases
      2.2.2.3 Point Solution versus General Solution Databases
    2.2.3 Gene Expression Omnibus (GEO) Database
    2.2.4 The Protein Data Bank (PDB)
  2.3 The Curse of Dimensionality
  2.4 Data Cleaning
    2.4.1 Problems of Data Cleaning
    2.4.2 Challenges of Handling Evolving Databases
      2.4.2.1 Problems Associated with Single-Source Techniques
      2.4.2.2 Problems Associated with Multisource Integration
    2.4.3 Data Argumentation: Cleaning at the Schema Level
    2.4.4 Knowledge-Based Framework: Cleaning at the Instance Level
    2.4.5 Data Integration
      2.4.5.1 Ensembl
      2.4.5.2 Sequence Retrieval System (SRS)
      2.4.5.3 IBM’s DiscoveryLink
      2.4.5.4 Wrappers: Customizable Database Software
      2.4.5.5 Data Warehousing: Data Management with Query Optimization
      2.4.5.6 Data Integration in the PDB
  2.5 Conclusion
  References

3 Knowledge Discovery in Databases
  3.1 Introduction
  3.2 Analysis of Data Using Large Databases
    3.2.1 Distance Metrics
    3.2.2 Data Cleaning and Data Preprocessing
  3.3 Challenges in Data Cleaning
    3.3.1 Models of Data Cleaning
      3.3.1.1 Proximity-Based Techniques
      3.3.1.2 Parametric Methods
      3.3.1.3 Nonparametric Methods
      3.3.1.4 Semiparametric Methods
      3.3.1.5 Neural Networks
      3.3.1.6 Machine Learning
      3.3.1.7 Hybrid Systems
  3.4 Data Integration
    3.4.1 Data Integration and Data Linkage
    3.4.2 Schema Integration Issues
    3.4.3 Field Matching Techniques
      3.4.3.1 Character-Based Similarity Metrics
      3.4.3.2 Token-Based Similarity Metrics
      3.4.3.3 Data Linkage/Matching Techniques
  3.5 Data Warehousing
    3.5.1 Online Analytical Processing
    3.5.2 Differences between OLAP and OLTP
    3.5.3 OLAP Tasks
    3.5.4 Life Cycle of a Data Warehouse
  3.6 Conclusion
  References

Section II

4 Feature Selection and Extraction Strategies in Data Mining
  4.1 Introduction
  4.2 Overfitting
  4.3 Data Transformation
    4.3.1 Data Smoothing by Discretization
      4.3.1.1 Discretization of Continuous Attributes
    4.3.2 Normalization and Standardization
      4.3.2.1 Min-Max Normalization
      4.3.2.2 z-Score Standardization
      4.3.2.3 Normalization by Decimal Scaling
  4.4 Features and Relevance
    4.4.1 Strongly Relevant Features
    4.4.2 Weakly Relevant to the Dataset/Distribution
    4.4.3 Pearson Correlation Coefficient
    4.4.4 Information Theoretic Ranking Criteria
  4.5 Overview of Feature Selection
    4.5.1 Filter Approaches
    4.5.2 Wrapper Approaches
  4.6 Filter Approaches for Feature Selection
    4.6.1 FOCUS Algorithm
    4.6.2 RELIEF Method: Weight-Based Approach
  4.7 Feature Subset Selection Using Forward Selection
    4.7.1 Gram-Schmidt Forward Feature Selection
  4.8 Other Nested Subset Selection Methods
  4.9 Feature Construction and Extraction
    4.9.1 Matrix Factorization
      4.9.1.1 LU Decomposition
      4.9.1.2 QR Factorization to Extract Orthogonal Features
      4.9.1.3 Eigenvalues and Eigenvectors of a Matrix
    4.9.2 Other Properties of a Matrix
    4.9.3 A Square Matrix and Matrix Diagonalization
      4.9.3.1 Symmetric Real Matrix: Spectral Theorem
      4.9.3.2 Singular Vector Decomposition (SVD)
    4.9.4 Principal Component Analysis (PCA)
      4.9.4.1 Jordan Decomposition of a Matrix
      4.9.4.2 Principal Components
    4.9.5 Partial Least-Squares-Based Dimension Reduction (PLS)
    4.9.6 Factor Analysis (FA)
    4.9.7 Independent Component Analysis (ICA)
    4.9.8 Multidimensional Scaling (MDS)
  4.10 Conclusion
  References

5 Feature Interpretation for Biological Learning
  5.1 Introduction
  5.2 Normalization Techniques for Gene Expression Analysis
    5.2.1 Normalization and Standardization Techniques
      5.2.1.1 Expression Ratios
      5.2.1.2 Intensity-Based Normalization
      5.2.1.3 Total Intensity Normalization
      5.2.1.4 Intensity-Based Filtering of Array Elements
    5.2.2 Identification of Differentially Expressed Genes
    5.2.3 Selection Bias of Gene Expression Data
  5.3 Data Preprocessing of Mass Spectrometry Data
    5.3.1 Data Transformation Techniques
      5.3.1.1 Baseline Subtraction (Smoothing)
      5.3.1.2 Normalization
      5.3.1.3 Binning
      5.3.1.4 Peak Detection
      5.3.1.5 Peak Alignment
...
... data mining in bioinformatics. It introduces the evolution of bioinformatics and the challenges that can be addressed using data mining techniques. Simplistically titled “Introduction to Bioinformatics,” ...

