data mining and business analytics with r

DATA MINING AND BUSINESS ANALYTICS WITH R

www.it-ebooks.info

Johannes Ledolter
Department of Management Sciences
Tippie College of Business
University of Iowa
Iowa City, Iowa

Copyright © 2013 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Ledolter, Johannes.
Data mining and business analytics with R / Johannes Ledolter, University of Iowa.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-44714-7 (cloth)
1. Data mining. 2. R (Computer program language) 3. Commercial statistics. I. Title.
QA76.9.D343L44 2013
006.3'12–dc23
2013000330

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

CONTENTS

Preface
Acknowledgments
1. Introduction
   Reference
2. Processing the Information and Getting to Know Your Data
   2.1 Example 1: 2006 Birth Data
   2.2 Example 2: Alumni Donations
   2.3 Example 3: Orange Juice
   References
3. Standard Linear Regression
   3.1 Estimation in R
   3.2 Example 1: Fuel Efficiency of Automobiles
   3.3 Example 2: Toyota Used-Car Prices
   Appendix 3.A The Effects of Model Overfitting on the Average Mean Square Error of the Regression Prediction
   References
4. Local Polynomial Regression: a Nonparametric Regression Approach
   4.1 Model Selection
   4.2 Application to Density Estimation and the Smoothing of Histograms
   4.3 Extension to the Multiple Regression Model
   4.4 Examples and Software
   References
5. Importance of Parsimony in Statistical Modeling
   5.1 How Do We Guard Against False Discovery
   References
6. Penalty-Based Variable Selection in Regression Models with Many Parameters (LASSO)
   6.1 Example 1: Prostate Cancer
   6.2 Example 2: Orange Juice
   References
7. Logistic Regression
   7.1 Building a Linear Model for Binary Response Data
   7.2 Interpretation of the Regression Coefficients in a Logistic Regression Model
   7.3 Statistical Inference
   7.4 Classification of New Cases
   7.5 Estimation in R
   7.6 Example 1: Death Penalty Data
   7.7 Example 2: Delayed Airplanes
   7.8 Example 3: Loan Acceptance
   7.9 Example 4: German Credit Data
   References
8. Binary Classification, Probabilities, and Evaluating Classification Performance
   8.1 Binary Classification
   8.2 Using Probabilities to Make Decisions
   8.3 Sensitivity and Specificity
   8.4 Example: German Credit Data
9. Classification Using a Nearest Neighbor Analysis
   9.1 The k-Nearest Neighbor Algorithm
   9.2 Example 1: Forensic Glass
   9.3 Example 2: German Credit Data
   Reference
10. The Naïve Bayesian Analysis: a Model for Predicting a Categorical Response from Mostly Categorical Predictor Variables
   10.1 Example: Delayed Airplanes
   Reference
11. Multinomial Logistic Regression
   11.1 Computer Software
   11.2 Example 1: Forensic Glass
   11.3 Example 2: Forensic Glass Revisited
   Appendix 11.A Specification of a Simple Triplet Matrix
   References
12. More on Classification and a Discussion on Discriminant Analysis
   12.1 Fisher’s Linear Discriminant Function
   12.2 Example 1: German Credit Data
   12.3 Example 2: Fisher Iris Data
   12.4 Example 3: Forensic Glass Data
   12.5 Example 4: MBA Admission Data
   Reference
13. Decision Trees
   13.1 Example 1: Prostate Cancer
   13.2 Example 2: Motorcycle Acceleration
   13.3 Example 3: Fisher Iris Data Revisited
14. Further Discussion on Regression and Classification Trees, Computer Software, and Other Useful Classification Methods
   14.1 R Packages for Tree Construction
   14.2 Chi-Square Automatic Interaction Detection (CHAID)
   14.3 Ensemble Methods: Bagging, Boosting, and Random Forests
   14.4 Support Vector Machines (SVM)
   14.5 Neural Networks
   14.6 The R Package Rattle: A Useful Graphical User Interface for Data Mining
   References
15. Clustering
   15.1 k-Means Clustering
   15.2 Another Way to Look at Clustering: Applying the Expectation-Maximization (EM) Algorithm to Mixtures of Normal Distributions
   15.3 Hierarchical Clustering Procedures
   References
16. Market Basket Analysis: Association Rules and Lift
   16.1 Example 1: Online Radio
   16.2 Example 2: Predicting Income
   References
17. Dimension Reduction: Factor Models and Principal Components
   17.1 Example 1: European Protein Consumption
   17.2 Example 2: Monthly US Unemployment Rates
18. Reducing the Dimension in Regressions with Multicollinear Inputs: Principal Components Regression and Partial Least Squares
   18.1 Three Examples
   References
19. Text as Data: Text Mining and Sentiment Analysis
   19.1 Inverse Multinomial Logistic Regression
   19.2 Example 1: Restaurant Reviews
   19.3 Example 2: Political Sentiment
   Appendix 19.A Relationship Between the Gentzkow Shapiro Estimate of “Slant” and Partial Least Squares
   References
20. Network Data
   20.1 Example 1: Marriage and Power in Fifteenth Century Florence
   20.2 Example 2: Connections in a Friendship Network
   References
Appendix A: Exercises
   Exercise 1
   Exercise 2
   Exercise 3
   Exercise 4
   Exercise 5
   Exercise 6
   Exercise 7
Appendix B: References
Index

PREFACE

This book is about useful methods for data mining and business analytics.
It is written for readers who want to apply these methods so that they can learn about their processes and solve their problems. My objective is to provide a thorough discussion of the most useful data-mining tools that goes beyond the typical “black box” description, and to show why these tools work.

Powerful, accurate, and flexible computing software is needed for data mining, and Excel is of little use. Although excellent data-mining software is offered by various commercial vendors, proprietary products are usually expensive. In this text, I use the R Statistical Software, which is powerful and free. But the use of R comes with start-up costs. R requires the user to write out instructions, and the writing of program instructions will be unfamiliar to most spreadsheet users. This is why I provide R sample programs in the text and on the webpage that is associated with this book. These sample programs should smooth the transition to this very general and powerful computer environment and help keep the start-up costs of using R small.

The text combines explanations of the statistical foundation of data mining with useful software so that the tools can be readily applied and put to use. There are certainly better books that give a deeper description of the methods, and there are also numerous texts that give a more complete guide to computing with R. This book tries to strike a compromise that does justice to both theory and practice, at a level that can be understood by the MBA student interested in quantitative methods. This book can be used in courses on data mining in quantitative MBA programs and in upper-level undergraduate and graduate programs that deal with the analysis and interpretation of large data sets. Students in business, the social and natural sciences, medicine, and engineering should benefit from this book. The majority of the topics can be covered in a one-semester course. But not every covered topic will be useful for all audiences, and for some audiences, the coverage of certain topics will be either too advanced or too basic. By omitting some topics and by expanding on others, one can make this book work for many different audiences.

Certain data-mining applications require an enormous amount of effort to just collect the relevant information, and in such cases, the data preparation takes a lot more time than the eventual modeling. In other applications, the data collection effort is minimal, but often one has to worry about the efficient storage and retrieval of high-volume information (i.e., the “data warehousing”). Although it is very important to know how to acquire, store, merge, and best arrange the information, this text does not cover these aspects very deeply. This book concentrates on the modeling aspects of data mining.

The data sets and the R-code for all examples can be found on the webpage that accompanies this book (http://www.biz.uiowa.edu/faculty/jledolter/DataMining). Supplementary material for this book can also be found by entering ISBN 9781118447147 at booksupport.wiley.com. You can copy and paste the code into your own R session and rerun all analyses. You can experiment with the software by making changes and additions, and you can adapt the R templates to the analysis of your own data sets. Exercises and several large practice data sets are given at the end of this book. The exercises will help instructors when assigning homework problems, and they will give the reader the opportunity to practice the techniques that are discussed in this book. Instructions on how to best use these data sets are given in Appendix A.

This is a first edition. Although I have tried to be very careful in my writing and in the analyses of the illustrative data sets, I am certain that much can be improved. I would very much appreciate any feedback you may have, and I encourage you to write to me at johannes-ledolter@uiowa.edu. Corrections and comments will be posted on the book’s webpage.

ACKNOWLEDGMENTS

I got interested in developing materials for an MBA-level text on Data Mining when I visited the University of Chicago Booth School of Business in 2011. The outstanding University of Chicago lecture materials for the course on Data Mining (BUS41201) taught by Professor Matt Taddy provided the spark to put this text together, and several examples and R-templates from Professor Taddy’s notes have influenced my presentation. Chapter 19 on the analysis of text data draws heavily on his recent research. Professor Taddy’s contributions are most gratefully acknowledged.

Writing a text is a time-consuming task. I could not have done this without the support and constant encouragement of my wife, Lea Vandervelde. Lea, a law professor at the University of Iowa, conducts historical research on the freedom suits of Missouri slaves. She knows first-hand how important and difficult it is to construct data sets for the mining of text data.

[...]

... needed to predict a certain phenomenon) that becomes of central importance. Parsimonious representations are important as simpler models tend to give more insight into a problem. Large models overfitted on training data sets usually turn out to be extremely poor predictors in new situations as unneeded predictor variables increase the prediction error variance. Furthermore, overparameterized models are of little ...
... customer preferences and they use this for targeted advertising; they use recommender systems to direct their ads to areas that are most profitable. Advertising related products that have a good chance of being bought and “cross-selling” of products become more and more important. Data from loyalty programs, from eBay auction histories, and from digital footprints of users clicking on Internet webpages are ...

... the purchasing behavior of individuals. Electronic scanners keep track of purchases, prices, and the presence of promotions. Loyalty programs of retail chains and frequent-flyer programs make it possible to link the purchases to the individual shopper and his/her demographic characteristics and preferences. Innovative marketing firms combine the customer’s purchase decisions with the customer’s exposure to ...

... buyer or the qualitative characteristics of the product such as new or old). Regression methods, regression trees, and nearest neighbor methods are well suited for problems that involve a continuous response. Logistic regression, classification trees, nearest neighbor methods, discriminant analysis (for continuous predictor variables) and naïve Bayes methods (mostly for categorical predictor variables) ...

... several developments have come together over the past few years, making the present period a perfect time to use these methods for solving business problems.

1. More and more data relevant for data mining applications are now being collected.
2. Data is being warehoused and is now readily available for analysis. Much data from numerous sources has already been integrated, and the data is stored in a format ...
... (program) instructions will be unfamiliar to a spreadsheet user, and there will be startup costs to using R. However, the R sample programs in this book and their listing on the book’s webpage should help with the transition to this very general and powerful computer environment.

REFERENCE

Ledolter, J. and Burrill, C.: Statistical Quality Control: Strategies and Tools for Continual Improvement. New York: John ...

... orange juice data set taken from P. Rossi’s bayesm package for R that was used earlier in Montgomery (1987). The three data sets are of suitable size (427,323 records and 13 variables in the 2006 birth data set; 1230 records and 11 variables in the contribution data set; and 28,947 records and 17 variables in the orange juice data set). The data sets include both continuous and categorical variables, have ...

... are “fine-tuned” to the data of the training set, and it is not obvious whether this good performance carries over to other data sets. In this book, we use the R Statistical Software (Version 2.15 as of June 2012). It is powerful and free. One may search for the software on the web and download the system. R is similar to Matlab and requires the user to write out simple instructions. The writing of (program) ...

... they work best; and which medical procedures lead to complications and for which patients. Business analytics and data mining deal with collecting and analyzing data for better decision making in business. Managers and business students can gain a competitive advantage through business analytics and data mining. Most tools and methods for data mining discussed in this book have been around for a very long ...
... team through data mining and business analytics. It is not only business applications of data mining that are important; data mining is also important for applications in the sciences. We have enormous databases on drugs and their side effects, and on medical procedures and their complication rates. This information can be mined to learn which drugs work and under which ...

... shoppers’ previous order histories without ever meeting them in person, and they use the information from previous order histories of their users to develop automatic recommender systems. Credit risk ...

... poor predictors in new situations as unneeded predictor variables increase the prediction error variance. Furthermore, overparameterized models are of little use if it is difficult to collect data ...
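
The point made in the excerpts about overfitted models and unneeded predictor variables can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not code from the book: the simulated data, the sample sizes, and the names make_data, fit_small, fit_big, and mse are all hypothetical. It fits a parsimonious regression and an overparameterized regression to the same training set with lm() and compares their mean squared errors on the training data and on fresh test data.

```r
# Minimal sketch (illustrative assumptions only, not taken from the book):
# compare a parsimonious model with one that carries unneeded predictors.
set.seed(1)

n_train <- 60     # size of the training set (assumed)
n_test  <- 1000   # size of the evaluation set (assumed)
n_noise <- 40     # number of irrelevant predictors (assumed)

# Simulate data in which only x affects the response y.
make_data <- function(n) {
  x <- rnorm(n)
  noise <- matrix(rnorm(n * n_noise), nrow = n)
  colnames(noise) <- paste0("z", seq_len(n_noise))
  y <- 1 + 2 * x + rnorm(n)
  data.frame(y = y, x = x, noise)
}

train <- make_data(n_train)
test  <- make_data(n_test)

fit_small <- lm(y ~ x, data = train)  # parsimonious model
fit_big   <- lm(y ~ ., data = train)  # adds the 40 unneeded predictors

# Mean squared prediction error of a fitted model on a given data set.
mse <- function(fit, data) {
  mean((data$y - predict(fit, newdata = data))^2)
}

result <- c(
  train_small = mse(fit_small, train),
  train_big   = mse(fit_big, train),
  test_small  = mse(fit_small, test),
  test_big    = mse(fit_big, test)
)
print(round(result, 3))
```

Because the small model is nested in the large one, the large model can never fit the training data worse. On the test data the ordering typically reverses, since each unneeded coefficient adds estimation noise to every prediction. Changing n_noise or n_train in the sketch shows how quickly that penalty grows as the number of parameters approaches the number of training records.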

Date posted: 05/05/2014, 13:27
