Wiley - Data Mining with Microsoft SQL Server 2008 (2009)

Data Mining with Microsoft SQL Server 2008

Jamie MacLennan
ZhaoHui Tang
Bogdan Crivat

Wiley Publishing, Inc.

Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com

Copyright © 2009 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-0-470-27774-4
Manufactured in the United States of America

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with
the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the U.S. at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Library of Congress Cataloging-in-Publication Data
MacLennan, Jamie.
Data mining with Microsoft SQL server 2008 / Jamie MacLennan, Bogdan Crivat, ZhaoHui Tang.
p. cm.
Includes index.
ISBN 978-0-470-27774-4 (paper/website)
SQL server. Data mining. I. Crivat, Bogdan. II. Tang, Zhaohui. III. Title.
QA76.9.D343M335 2008
005.75 85 - dc22
2008035467

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Microsoft and SQL Server are registered trademarks of Microsoft Corporation in the United States and/or other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc. is not associated with any product or vendor mentioned in this book.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

To Logan, because he needs it the
most.
- Jamie MacLennan

This book is for Cosmin, with great hope that he will someday find math (and data mining) to be fun and interesting.
- Bogdan Crivat

About the Authors

Jamie MacLennan is the principal development manager of SQL Server Analysis Services at Microsoft. In addition to being responsible for the development and delivery of the Data Mining and OLAP technologies for SQL Server, MacLennan is a proud husband and father of four. He has more than 25 patents and patents pending for his work on SQL Server Data Mining. MacLennan has written extensively on the data mining technology in SQL Server, including many articles in MSDN Magazine, SQL Server Magazine, and postings on SQLServerDataMining.com and his blog at http://blogs.msdn.com/jamiemac. This is his second edition of Data Mining with SQL Server. MacLennan has been a featured and invited speaker at conferences worldwide, including Microsoft TechEd, Microsoft TechEd Europe, SQL PASS, the Knowledge Discovery and Data Mining (KDD) conference, the Americas Conference on Information Systems (AMCIS), and the Data Mining Cup conference.

ZhaoHui Tang is a group program manager at Microsoft adCenter Labs, where he manages a number of research projects related to paid search and content ads. He is the inventor of Microsoft Keyword Services Platform. Prior to adCenter, he spent six years as a lead program manager in the SQL Server Business Intelligence (BI) group, mainly focusing on data mining development. He has written numerous articles for both academic and industrial publications, such as The VLDB Journal and SQL Server Magazine. He is a frequent speaker at business intelligence conferences. He was also a co-author of the previous edition of this book, Data Mining with SQL Server 2005.

Bogdan Crivat is a senior software design engineer in SQL Server Analysis Services at Microsoft, working primarily on the Data
Mining platform. Crivat has written various articles on data mining for MSDN Magazine and Access/VB/SQL Advisor Magazine, as well as numerous postings on the SQLServerDataMining.com website and on the MSDN Forums. He presented at various Microsoft and data mining professional conferences. Crivat also blogs about SQL Server Data Mining at www.bogdancrivat.net/dm.

Contents

  Support
  Probability (Confidence)
  Importance
  Finding Frequent Itemsets
  Generating Association Rules
  Prediction
  Algorithm Parameters
  MINIMUM_SUPPORT
  MAXIMUM_SUPPORT
  MINIMUM_PROBABILITY
  MINIMUM_IMPORTANCE
  MAXIMUM_ITEMSET_SIZE
  MINIMUM_ITEMSET_SIZE
  MAXIMUM_ITEMSET_COUNT
  OPTIMIZED_PREDICTION_COUNT
  AUTODETECT_MINIMUM_SUPPORT
  Summary

Chapter 12 Microsoft Neural Network and Logistic Regression
  Same Principle, Two Algorithms
  Using the Microsoft Neural Network
  Text Classification Models
  Utility Models
  DMX Queries
  Model Content
  Interpreting the Model
  Principles of the Microsoft Neural Network Algorithm
  What Is a Neural Network?
  Combination and Activation
  Backpropagation, Error Function, and Conjugate Gradient
  A Simple Example of Processing a Neural Network
  Normalization and Mapping
  Topology of the Network
  Training the Ending Condition
  Nonlinearly Separable Classes
  Algorithm Parameters
  MAXIMUM_INPUT_ATTRIBUTES
  MAXIMUM_OUTPUT_ATTRIBUTES
  MAXIMUM_STATES
  HOLDOUT_PERCENTAGE
  HOLDOUT_SEED
  HIDDEN_NODE_RATIO
  SAMPLE_SIZE
  Summary

Chapter 13 Mining OLAP Cubes
  Introducing OLAP
  Understanding Star and Snowflake Schemas
  Understanding Dimension and Hierarchy
  Understanding Measures and Measure Groups
  Understanding Cube Processing and Storage
  Using Proactive Caching
  Querying a Cube
  Performing Calculations
  Browsing a Cube
  Understanding Unified Dimension Modeling
  Understanding the Relationship between OLAP and Data Mining
  Mining Aggregated Data
  OLAP Pattern Discovery Needs
  OLAP Mining versus Relational Mining
  Building OLAP Mining Models Using Wizards and Editors
  Using the Data Mining Wizard
  Building the Customer Segmentation Model
  Creating a Market Basket Model
  Creating a Sales Forecast Model
  Using the Data Mining Designer
  Understanding Data Mining Dimensions
  Using MDX within DMX Queries
  Using Analysis Management Objects for the OLAP Mining Model
  Summary

Chapter 14 Data Mining with SQL Server Integration Services
  An Overview of SSIS
  Understanding SSIS Packages
  Task Flow
  Standard Tasks in SSIS
  Containers
  Debugging
  Exploring a Control Flow Example
  Data Flow
  Transformations
  Viewers
  Exploring a Data Flow Example
  Working with SSIS in Data Mining
  Data Mining Tasks
  Data Mining Query Task
  Analysis Services Processing Task
  Analysis Services Execute DDL Task
  Data Mining Transformations
  Data Mining Model Training Destination
  Data Mining Query Transformation
  Example Data Flows
  Using Non-Predictive Data Mining Queries in an Integration Services Pipeline
  Text Mining Transformations
  Term Extraction Transformation
  Term Lookup Transformation
  More Details on the Text Mining Process
  Summary

Chapter 15 SQL Server Data Mining Architecture
  Introducing Analysis Services Architecture
  XML for Analysis
  XMLA APIs
  Discover
  Execute
  XMLA and Analysis Services
  Processing Architecture
  Predictions
  Data Mining Administration
  Server Configuration
  Data Mining Security
  Security Requirements for Creating and Training Mining Objects
  Security for Various Deployment Scenarios
  Local Database and Analysis Services
  Local Analysis Services and a Remote Database
  Intranet Analysis Services and Databases on the Same Server
  Analysis Services and Databases behind an HTTP Endpoint in an Internet Deployment
  Configuring Analysis Services for Use with Data Mining Excel Add-Ins over HTTP
  Summary

Chapter 16 Programming SQL Server Data Mining
  Data Mining APIs
  ADO
  ADO.NET
  ADOMD.NET
  Server ADOMD.NET
  AMO
  Using Analysis Services APIs
  Using Microsoft.AnalysisServices to Create and Manage Mining Models
  AMO Basics
  AMO Applications and Security
  Object Creation
  Creating Data Access Objects
  Creating the Mining Structure
  Creating the Mining Models
  Processing Mining Models
  Deploying Mining Models
  Setting Mining Permissions
  Browsing and Querying Mining Models
  Predicting with ADOMD.NET
  More on Table-Valued Parameters in ADOMD.NET
  Browsing Models
  Stored Procedures
  Writing Stored Procedures
  Stored Procedures and Prepare Invocations
  A Stored Procedure Example
  Executing Queries inside Stored Procedures
  Returning Data Sets from Stored Procedures
  Deploying and Debugging Stored Procedure Assemblies
  Summary

Chapter 17 Extending SQL Server Data Mining
  Plug-in Algorithms
  Plug-in Algorithm Framework
  Lifetime of a Plug-in Algorithm Instance
  Conceptual Overview
  Model Creation and Processing
  Prediction
  Content Navigation
  Custom Functions
  PMML
  Managed vs Native Plug-ins
  Installing Plug-in Algorithms
  Where to Find Out More about Plug-in Algorithms
  Data Mining Viewers
  Interfaces to Be Implemented
  Rendering the Information
  Retrieving Information from Analysis Services
  Registering the Viewer
  Where to Find Out More about Plug-in Viewers
  Summary

Chapter 18 Implementing a Web Cross-Selling Application
  Source Data Description
  Building Your Model
  Identifying the Data Mining Task
  Using Decision Trees for Association
  Using the Association Rules Algorithm
  Comparing the Two Models
  Making Predictions
  Making Batch Prediction Queries
  Using Singleton Prediction Queries
  Integrating Predictions with Web Applications
  Understanding Web Application Architecture
  Setting the Permissions
  Examining Sample Code for the Web Recommendation Application
  Summary

Chapter 19 Conclusion and Additional Resources
  Recapping the Highlights of SQL Server 2008 Data Mining
  State-of-the-Art Algorithms
  Easy-to-Use Tools
  Simple-Yet-Powerful API
  Integration with Sibling BI Technologies
  Exploring New Data Mining Frontiers and Opportunities
  Further Reference
  Microsoft Data Mining
  General Data Mining

Appendix A: Data Sets
  MovieClick Data Set
  Voting Records Data Set
  Wine Sales
  Foodmart
  College Plans Data Set

Appendix B: Supported Functions
  DMX Language Functions
  VBA Functions
  Excel Functions
  ASSprocs Stored Procedures

Index

Foreword

The world is absolutely exploding with digitally born data. Financial transactions, online advertising analytics, consumer preference information, and the results of scientific discovery mean tremendous volumes of data exist in both structured and unstructured stores today. And it is growing faster than ever before, fueled by both technology and a new generation of people adopting and integrating technology into all aspects of their lives.

Business intelligence practitioners struggle to make sense of the data in their charge to help their businesses operate with better understanding of what is influencing results. Trends are evolving and changing more quickly than ever before. It is no longer enough to look at historical data to just determine what happened. Aided by data mining, you can more readily understand why something happened. It can make the difference in whether history, good or bad, repeats itself.

Because trends change at such great speed today, automated analysis and sophisticated algorithms for identifying trends, finding outliers, and predicting future courses quickly can be the difference between winning and just competing. Data mining provides the means to make sense of tremendous volumes of data by automating the processes of categorizing and clustering common elements, identifying trends and anomalies in the data, and predicting what will happen given those factors.

I have had the pleasure to work alongside (and learn directly from) Jamie MacLennan and Bogdan Crivat. They are passionate about the difference that technology can make in our lives, and committed to putting the tools necessary to make sense of the expanding world of data into everyone's hands. In this book, they share their passions with you, clearly explaining
data mining concepts, and how to apply them in common situations using the very algorithms and tools they authored themselves as part of Microsoft SQL Server. This book provides an opportunity for you to learn straight from the source, too. I am sure you will discover that this text is a valuable resource.

Tom Casey
General Manager, SQL Server Business Intelligence
Microsoft Corporation

Introduction

Microsoft SQL Server 2008 is the third version of SQL Server that ships with included data mining technology. Since it was introduced in SQL Server 2000, data mining has become a key feature of the larger product. Data mining has grown from an isolated part of SQL Server Analysis Services with two algorithms, to an intrinsic part of the SQL Server Business Intelligence (BI) platform that is fully integrated with OLAP, Integration Services, and Reporting Services. Other Microsoft applications (such as Microsoft Dynamics CRM and Microsoft PerformancePoint Server) seamlessly integrate SQL Server Data Mining to accentuate their functionality with predictive power.

SQL Server Data Mining has become the most widely deployed data mining server in the industry, with many third-party software and consulting companies building on, specializing, and extending the platform. Enterprise, small and medium business, and even academic and scientific users have all adopted or switched to SQL Server Data Mining because of its scalability, availability, extensive functionality, and ease of use.

This book serves as a guide to SQL Server Data Mining, explaining how it works, providing detailed technical and practical discussions of the SQL Server Data Mining technology, and demonstrating why you should deploy and use SQL Server Data Mining for yourself.

How This Book Is Organized

This book is written to provide you with the knowledge necessary to implement successful data mining solutions using SQL Server, by introducing
the overall space, familiarizing you with the tools, giving depth and breadth on the Microsoft data mining algorithms, and then providing details on various ways to implement data mining solutions.

The book starts with introductory chapters that outline the tools, technologies, and ideas you need to leverage SQL Server Data Mining. Then each of the SQL Server data mining algorithms is described in detail in its own chapter. The subsequent chapters describe how you can integrate SQL Server Data Mining into other parts of the SQL Server BI suite. The latter part of the book deals with architecture and programming issues, and gives examples of some data mining implementation scenarios.

Following is a brief description of the chapters:

Chapter 1: Introduction to Data Mining. This chapter introduces not only the book, but also the technology. It contains a detailed definition of what exactly is meant by the term data mining, and discusses what kinds of problems are addressed by this technology.

Chapter 2: Applied Data Mining Using Office 2007. This chapter provides an overview of the Table Analysis Tools for Office 2007 add-in, which is a rich set of tools for Excel that are usable by any information worker. This chapter explains how and why you use these tools, and provides guidance on how to get the best results.

Chapter 3: Data Mining Concepts and DMX. This chapter is critical to understanding the SQL Server Data Mining platform. It explains the underlying concepts of how you should think about a data mining problem, as well as providing a learn-by-example framework for Data Mining Extensions (DMX) to SQL.

Chapter 4: Using SQL Server Data Mining. This chapter introduces you to building data mining solutions using Business Intelligence Development Studio (BI Dev Studio). In addition to a basic overview, it provides a wide range of tips and tricks that can make the difference between a successful project and a
failed one. This chapter also covers using SQL Server Management Studio to access and secure data mining objects. In addition, it tells you how you can expose your data mining models through SQL Server Reporting Services.

Chapter 5: Implementing a Data Mining Process Using Office 2007. This chapter explores the remaining tools in the Data Mining Add-ins for Office 2007. As described in this chapter, these tools provide more functionality than BI Dev Studio and SQL Server Management Studio alone, but they also have limitations that prevent them from exposing the full functionality of SQL Server Data Mining. In any case, this chapter will allow you to best take advantage of the Microsoft Office tools for data mining.

Chapters 6-12: The algorithm chapters. Each of these chapters is devoted to one or more of the algorithms included with SQL Server Data Mining. In each of the chapters, you will find a basic description of the algorithm, followed by usage scenarios that will help you understand how, when, and where you apply each algorithm. Each chapter describes how you create, train, interpret, and apply models using the specified algorithms. The chapters wrap up with a deeper technical dive into how the algorithms work.

Chapter 13: Mining OLAP Cubes. This chapter provides a brief introduction to Online Analytical Processing (OLAP) and the OLAP functionality of SQL Server Analysis Services. The chapter examines how and when you perform data mining on OLAP cubes. It also includes details on how to implement popular OLAP mining scenarios.

Chapter 14: Data Mining with SQL Server Integration Services. This chapter introduces SQL Server Integration Services (SSIS) and describes its various components. It then details the tasks and transformations that you use to implement data mining solutions in your data integration packages. This chapter also describes how to use the text mining components to prepare
unstructured data for data mining scenarios.

Chapter 15: SQL Server Data Mining Architecture. This is the first chapter that moves away from tools and concepts and starts to delve into the programming and administration aspects of SQL Server Data Mining. This chapter discusses the architecture of a server-based data mining system, including the XML for Analysis (XMLA) protocol that underlies all client-server communication. The chapter also describes the administration of a data mining server, including server properties that are important for SQL Server Data Mining and data mining security roles.

Chapter 16: Programming SQL Server Data Mining. This chapter details the programming interfaces for SQL Server Data Mining, and includes several examples of the programmatic creation, training, and application of data mining objects.

Chapter 17: Extending SQL Server Data Mining. This chapter describes how you can extend SQL Server Data Mining with your own functionality. It shows you how to create stored procedures for adding operations to DMX. It also describes how you can implement your own data mining algorithms to plug into SQL Server Data Mining and exploit its features. Additionally, this chapter describes how you can write your own data mining visualizations to display patterns in either the supplied algorithms or your own algorithm implementations, and embed them in BI Dev Studio and SQL Server Management Studio.

Chapter 18: Implementing a Web Cross-Selling Application. This chapter walks you through a common data mining scenario: implementing a recommendation engine and integrating it into a retail website. It includes sample queries and code to get you started.

Chapter 19: Conclusion and Additional Resources. In addition to wrapping up the book, this chapter provides a list of valuable links where you can find additional information and help with your data mining projects. It also includes references to
some other reading materials that you can refer to if you want to learn more about data mining.

This book also includes two helpful appendixes:

Appendix A: Data Sets. This appendix contains a brief description of the various data sets used in this book.

Appendix B: Supported Functions. This appendix provides, for your reference, a list of all the supported DMX functions. It also contains lists of all Visual Basic for Applications (VBA) and Excel functions that you can call from DMX. It also describes some supplemental stored procedures provided by the authors to assist with the sample queries presented throughout the text.

Who Should Read This Book

This book is primarily designed for the SQL Server user who is curious about data mining. A working knowledge of SQL will be greatly beneficial in understanding DMX and the DMX queries sprinkled throughout the book. However, non-SQL users can still benefit from the Office 2007 and the algorithm chapters. Readers who are interested in programming SQL Server Data Mining should understand .NET and the C# language to apply the relevant chapters.

For those of you who have read the previous edition of this book, Data Mining with SQL Server 2005 (Indianapolis: Wiley, 2005), welcome back!
In this text, you will find comprehensive material on the new functionality of Microsoft SQL Server 2008 Data Mining plus new examples for most algorithms and scenarios described in the text.

Conventions

To help you get the most from the text and keep track of what's happening, a number of conventions are used throughout the book.

NOTE: Notes and other information that is supplemental to the current discussion are offset and placed in italics like this.

Within the main text, the following conventions are used:

  Important words or terms are italicized when they are first introduced in the text.
  Combination keyboard strokes are shown like this: Ctrl+A
  Filenames, URLs, and code within the text are differentiated from the rest of the text with a special font, as shown in this example: persistence.properties

Blocks (or snippets) of code are shown two different ways:

  In code examples, new and important code is highlighted with a gray background.
  The gray highlighting is not used for code that's less important in the present context, or has been shown before.

Tools You Will Need

In order to get the most benefit from this book, you will need access to the SQL Server 2008 Analysis Services software. SQL Server 2008 Analysis Services is included with the Standard, Enterprise, and Developer editions of Microsoft SQL Server 2008. Time-based evaluation versions are available for download at http://www.microsoft.com/sql.

To follow along with Chapters 2 and 5, you will also need Microsoft Office 2007 and SQL Server 2008 Data Mining Add-Ins for Office 2007. Evaluation versions of Microsoft Office 2007 are available at www.microsoft.com/office, and the free download of the Data Mining Add-Ins is available at www.microsoft.com/sql/dm.

You'll also want to have the AdventureWorksDW2008 database installed. Instructions for accessing this database can be found in the ReadMe file on this book's website.

What's on the Website
Most chapters in this book have supplemental materials that you can download from www.wiley.com/go/data mining SQL 2008. As appropriate for the chapter, the site contains SQL Server database backups, SQL Server Analysis Services database backups, project files, DMX query files, and/or source code. Each chapter directory contains a readme file that describes how to use the downloads for that chapter.

This book will launch you into the world of SQL Server Data Mining. After you absorb all the information contained within, you will be well on your way to adding predictive and descriptive analytics to your daily life. With its powerful development environment and APIs, Microsoft SQL Server Data Mining can change how you and every user in your organization view and interact with data. Take the leap and discover the hidden sweets locked away in the data you have been hoarding over the years. One taste and you'll be hooked!

CHAPTER 1
Introduction to Data Mining in SQL Server 2008

It's always necessary to explain exactly what is meant by the term data mining. You would hope that any particular technology has a name that is either absolutely clear as to what it means (such as reporting) or completely devoid of meaning, but catchy, so the association is unique (such as Silverlight). However, this is not the case for data mining. The term data mining has been used to mean anything from ad hoc queries, rules-based notifications, or pivot-chart analysis to evil government domestic-spying programs.

As it is used in this book, data mining is the process of analyzing data to find hidden patterns using automatic methodologies. This type of data mining is often referred to using other terms such as machine learning, knowledge discovery in databases (KDD), or predictive analytics. Although each of these terms has a slightly different connotation, they overlap enough to be
functionally equivalent with data mining in the sense used here.

By far, the trendiest term today is predictive analytics, which many companies ironically are using to differentiate what they do from "data mining." The inherent implication is that data mining is limited to the discovery of patterns, whereas predictive analytics allows the application of the patterns to new data to impute (or predict) unknown values. The motivation behind using the term predictive analytics is precisely this dilution of the meaning of data mining as it has been used in recent years. Predictive analytics, however, is an incomplete term because it ignores the descriptive nature of data mining. Therefore, until a marketing genius comes up with a clever, meaningless name like "Sparky," the term we use will remain data mining.

NOTE: The authors of this book by no means endorse using the term Sparky when referring to SQL Server Data Mining. If you call Microsoft Technical Support about a problem because your Sparky model isn't processing correctly, or because you can't set proper security credentials on your Sparky server, do not expect a rational answer. As with all data mining problems, rational results come from rational inputs.

So, what does data mining do, and why do you need it?
Over the past several years, compute power has increased exponentially according to the well-known Moore's law. However, unbeknownst to most, hard-drive capacity has increased at a rate an order of magnitude greater than that of processor power. That is, the capability to store data has greatly outpaced the capability to process it. As a result, large volumes of data have been generated and persisted in databases. Much of this data comes from business software, such as financial applications, enterprise resource planning (ERP) systems, customer relationship management (CRM) systems, and server logs from web servers, or even the database servers hosting the data.

The result of this unceasing data collection is that organizations have become data-rich and knowledge-poor. The collections of data are so vast that the practical use of these stores of data becomes limited. The main purpose of data mining is to extract knowledge from the data at hand, increasing its intrinsic value and making the data useful.

For example, Figure 1-1 shows a relational table containing a list of high school seniors. For each student, the table records information such as gender, IQ, parental income, and whether or not students were encouraged by their parents to attend college, along with their actual intention to attend college. Using this data, how can you answer the question, "What drives high school graduates to attend college?"

Using traditional methods, you can write queries or slice the data using Online Analytical Processing (OLAP) tools to find out how many male students attend college versus female students. You could also write a query to see the relationship between parental encouragement and attendance plans. But what about male students who are encouraged by their parents? Or, what about female students who are not encouraged by their parents?
You must write dozens of such queries to cover all the possible combinations. Numerical columns such as ParentIncome or IQ are more difficult to analyze. For example, you would need to arbitrarily choose ranges in these numeric values to determine how an income range of $40,000 to $50,000 impacted a decision to attend college.

Even with this fairly simple data set, ad hoc queries and OLAP are not suited to the task. Imagine if there were hundreds of columns in this table. You would quickly end up with an intractable number of possibilities to test in order to answer a basic question about the meaning of your data.
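
The growth in the number of ad hoc queries described above can be made concrete with a short sketch. The Python below is an illustration only and is not taken from the book: the sample rows are invented, the column names Gender, ParentEncouragement, and CollegePlans are assumed (the text itself names only ParentIncome and IQ), and the helpers income_band, plan_rate_by, and count_slices are hypothetical. It shows one grouped query, one arbitrary income banding, and a count of how many distinct filter combinations exist as columns are added.

```python
from collections import defaultdict

# Invented sample rows; column names other than ParentIncome are assumptions.
students = [
    {"Gender": "Male", "ParentEncouragement": "Encouraged", "ParentIncome": 45000, "CollegePlans": True},
    {"Gender": "Male", "ParentEncouragement": "Encouraged", "ParentIncome": 62000, "CollegePlans": True},
    {"Gender": "Male", "ParentEncouragement": "Not Encouraged", "ParentIncome": 41000, "CollegePlans": False},
    {"Gender": "Female", "ParentEncouragement": "Encouraged", "ParentIncome": 48000, "CollegePlans": True},
    {"Gender": "Female", "ParentEncouragement": "Not Encouraged", "ParentIncome": 39000, "CollegePlans": False},
    {"Gender": "Female", "ParentEncouragement": "Not Encouraged", "ParentIncome": 71000, "CollegePlans": True},
]


def income_band(income, width=10_000):
    """Bucket a numeric income into an arbitrarily chosen fixed-width range."""
    low = (income // width) * width
    return f"{low}-{low + width - 1}"


def plan_rate_by(rows, keys):
    """One ad hoc query: share of students planning college per combination of keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [planning, total]
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group][0] += row["CollegePlans"]
        totals[group][1] += 1
    return {group: planning / total for group, (planning, total) in totals.items()}


def count_slices(cardinalities):
    """Count filters where each column is either fixed to one value or left open."""
    combinations = 1
    for distinct_values in cardinalities:
        combinations *= distinct_values + 1  # +1 for "column not constrained"
    return combinations - 1  # drop the query that constrains nothing


if __name__ == "__main__":
    print(plan_rate_by(students, ["Gender", "ParentEncouragement"]))
    print(income_band(45000))
    print(count_slices([2, 2]))         # gender x encouragement: 8
    print(count_slices([2, 2, 10, 8]))  # plus 10 income bands and 8 IQ bands: 890
```

With only gender and parental encouragement there are 8 possible filters. Adding ten income bands and eight IQ bands (both arbitrary choices made for this sketch) raises that to 890, which is the kind of blow-up the text describes for tables with many columns.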
