Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 129

1260 Qingyu Zhang and Richard S. Segall

Fig. 65.26. Concept Links for Term of “statistical” in SAS Text Miner using SASPDF-SYNONYMS text file (Woodfield, 2004)

SAS Text Miner uses the “drag-and-drop” principle: the user drags a selected icon from the tool set and drops it into the workspace. The SAS Text Miner workspace was constructed with a data icon for the selected animal data provided by SAS in its Instructor’s Trainer Kit, as shown in Figure 24. Figure 25 shows the results of using SAS Text Miner, with individual plots for “role by frequency”, “number of documents by frequency”, “frequency by weight”, “attribute by frequency”, and a “number of documents by frequency” scatter plot. Figure 26 shows the “Concept Linking Figure” generated by SAS Text Miner using the SASPDF-SYNONYMS text file.

65.5.2 Megaputer PolyAnalyst

In previous work, the authors (Segall and Zhang, 2006) used Megaputer PolyAnalyst for data mining. The new release, PolyAnalyst version 6.0, includes text mining and, specifically, new features for text OLAP (on-line analytical processing) and taxonomy-based categorization, which is useful when dealing with large collections of unstructured documents, as discussed in Megaputer Intelligence Inc. (2007). The latter notes that taxonomy-based classification is useful for tasks such as tracking the number of known issues in product repair notes and customer support letters.

According to Megaputer Intelligence Inc. (2007), PolyAnalyst “provides simple means for creating, importing, and managing taxonomies, and carries out automated categorization of text records against existing taxonomies.” Megaputer Intelligence Inc. (2007) provides examples of applications for executives, customer support specialists, and analysts. According to Megaputer Intelligence Inc.
(2007), “executives are able to make better business decisions upon viewing a concise report on the distribution of tracked issues during the latest observation period”.

This chapter provides several screen shots of Megaputer PolyAnalyst version 6.0 for text mining: Figure 27 shows the text mining workspace of Megaputer PolyAnalyst, Figure 28 shows the “Suffix Tree Clustering” report for the text cluster (desk; front), and Figure 29 shows the “Link Term” report for hotel customer survey text. Megaputer PolyAnalyst can also provide drill-down text analysis and histogram plots of text analysis.

65 Commercial Data Mining Software 1261

Fig. 65.27. Workspace for Text Mining in Megaputer PolyAnalyst
Fig. 65.28. Clustering Results in Megaputer PolyAnalyst
Fig. 65.29. Link Term Report using Text Analysis in Megaputer PolyAnalyst

65.6 Web Mining Software

Two selected software packages are reviewed and compared in terms of data preparation, data analysis, and results reporting (see Table 4). As the table shows, Megaputer PolyAnalyst has the unique feature of a data and text mining tool integrated with web site data source input, while SPSS Clementine takes a linguistic approach rather than a statistics-based approach. Table 4 summarizes the differences and similarities between the two packages.

Table 65.4. Web Mining Software Features

Category | Feature | Megaputer PolyAnalyst | SPSS Clementine
Data Preparation | Data extraction | x (web site as data source input) | Import server files
Data Preparation | Automatic data cleaning | x | x
Data Analysis | User segmentation | x | x
Data Analysis | Detect users’ sequences | x? | x?
Data Analysis | Understand product and content affinities (link analysis) | x | x
Data Analysis | Predict user propensity to convert, buy, or churn | x? | x?
Data Analysis | Navigation report | x? | x?
Data Analysis | Keyword and search engine | x | x
Results Reporting | Interactive results window | x? | x?
Results Reporting | Support for multiple languages | x | x
Results Reporting | Visual presentation | x | x
Unique features | | Data and text mining tool integrated with web site data source input | Linguistic approach rather than statistics-based approach

Note: “x?” marks a row that carries a single x in the source; which of the two products it belongs to cannot be recovered from the flattened table.

65.6.1 Megaputer PolyAnalyst

Megaputer PolyAnalyst is an enterprise analytical system that integrates Web mining with data and text mining; it does not have a separate module for Web mining. Web pages or sites can be input directly to Megaputer PolyAnalyst as data source nodes. Megaputer PolyAnalyst has the standard data and text mining functionalities, such as categorization, clustering, prediction, link analysis, keyword and entity extraction, pattern discovery, and anomaly detection. These functional nodes can be connected directly to the web data source node to perform web mining analysis.

The Megaputer PolyAnalyst user interface allows the user to develop complex data analysis scenarios without loading data into the system, thus saving the analyst’s time. According to Megaputer (2007), whatever data sources are used, PolyAnalyst provides means for loading and integrating these data. PolyAnalyst can load data from disparate data sources, including all popular databases, statistical systems, and spreadsheet systems. In addition, it can load collections of documents in html, doc, pdf, and txt formats, as well as data from an internet web source. PolyAnalyst offers visual “on-the-fly integration” and merging of data coming from disparate sources to create data marts for further analysis.
It supports incremental data appending and referencing data sets in previously created PolyAnalyst projects.

Figures 30-32 are screen shots illustrating the application of Megaputer PolyAnalyst for web mining to available data sets. Figure 30 shows an expanded view of the PolyAnalyst workspace. Figure 31 shows a screen shot of PolyAnalyst using the website of Arkansas State University (ASU) as the web data source. Figure 32 shows a keyword extraction report from the undergraduate admission web page of the Arkansas State University (ASU) website.

Fig. 65.30. PolyAnalyst workspace with Internet data source
Fig. 65.31. PolyAnalyst using www.astate.edu as web data source
Fig. 65.32. Keyword extraction report

65.6.2 SPSS Clementine

“Web Mining for Clementine is an add-on module that makes it easy for analysts to perform ad hoc predictive Web analysis within Clementine’s intuitive visual workflow interface.” Web Mining for Clementine combines Web analytics and data mining with SPSS analytical capabilities to transform raw Web data into “actionable insights”. It enables business decision makers to take more effective actions in real time. SPSS (2007) cites examples such as automatically discovering user segments, detecting the most significant sequences, understanding product and content affinities, and predicting user intention to convert, buy, or churn.

Fig. 65.33. SPSS Clementine workspace
Fig. 65.34. Decision rules for determining clusters of web data

SPSS (2007) claims four key data mining capabilities: segmentation, sequence detection, affinity analysis, and propensity modeling. Specifically, SPSS (2007) lists six Web analysis application modules within SPSS Clementine: search engine optimization, automated user and visit segmentation, Web site activity and user behavior analysis, home page activity, activity sequence analysis, and propensity analysis.
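As a rough illustration of what affinity analysis over web sessions involves, the sketch below computes support and lift for pairs of pages that occur in the same visit. It is a generic, minimal sketch under stated assumptions and is not how Web Mining for Clementine implements the capability; the function name `pair_affinities`, the `min_support` parameter, and the sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_affinities(sessions, min_support=0.0):
    """Return {(page_a, page_b): (support, lift)} for unordered page pairs.

    support = share of sessions that contain both pages
    lift    = support / (share containing page_a * share containing page_b)
    A lift above 1 suggests the two pages co-occur more often than chance.
    """
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # count each page once per session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))

    result = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        lift = support / ((page_counts[a] / n) * (page_counts[b] / n))
        result[(a, b)] = (support, lift)
    return result


# Hypothetical click-stream sessions (one list of visited pages per visit).
sessions = [
    ["home", "laptops", "cart"],
    ["home", "laptops", "accessories", "cart"],
    ["home", "support"],
    ["laptops", "accessories"],
]
affinities = pair_affinities(sessions)
for pair, (support, lift) in sorted(affinities.items()):
    print(pair, round(support, 2), round(lift, 2))
```

Pairs are keyed in alphabetical order, so the affinity between laptops and accessories is stored under `("accessories", "laptops")`. Raising `min_support` filters out pairs that appear in too few sessions to be meaningful.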
Unlike other platforms used for Web mining that provide only simple frequency counts (e.g., number of visits, ad hits, top pages, total purchase visits, and top click streams), Clementine, according to SPSS (2007), provides more meaningful customer intelligence, such as likelihood to convert by individual visitor, likelihood to respond by individual prospect, content clusters by customer value, missed cross-sell opportunities, and event sequences by outcome.

Fig. 65.35. Decision tree results

Figures 33-35 are screen shots illustrating the application of SPSS Clementine for web mining to available data sets. Figure 33 shows the SPSS Clementine workspace. Different user modes can be defined, including research mode, shopping mode, search mode, evaluation mode, and so on. Decision rules for determining clusters of web data are demonstrated in Figure 34. Figure 35 exhibits decision tree results with classifiers using different model types (e.g., CHAID, logistic, neural).

65.7 Conclusion and Future Research

Each of the software packages selected for this research has its own unique characteristics and properties that can be displayed when applied to the available data sets. As indicated, each package has its own set of algorithm types to which it can be applied.

Comparing the five data mining packages, BioDiscovery GeneSight focuses on cluster analysis and is able to provide a variety of data mining visualization charts and colors. BioDiscovery GeneSight has fewer data mining functions than the other four. SAS Enterprise Miner, Megaputer PolyAnalyst, PASW, and IBM Intelligent Miner each employ the same algorithms, as illustrated in Table 1, except that SAS has a separate product, SAS Text Miner, for text analysis. The regression results obtained using these packages are comparable.
The cluster analysis results for SAS Enterprise Miner, BioDiscovery GeneSight, and Megaputer PolyAnalyst are each represented in a way unique to the package. In conclusion, SAS Enterprise Miner, Megaputer PolyAnalyst, PASW, and IBM Intelligent Miner offer the greatest diversification of data mining algorithms.

This chapter has discussed commercial data mining software that is applicable to supercomputing for 3-D visualization and very large microarray databases. Specifically, it illustrated the applications of supercomputing for data visualization using two selected packages, Avizo and JMP Genomics. Avizo is general supercomputing software, and JMP Genomics is specialized software for genetic data. Supercomputing data mining for 3-D visualization with Avizo is applied to diverse applications, such as visualizing the human skull for medical research and atomic structures for chemical or nuclear applications. We have also presented, using JMP Genomics, the data distributions of condition, patient, frequencies, and characteristics for patient data of adenocarcinoma cancer. The figures of this chapter illustrate the level of visualization that these two packages can provide.

Comparing the two text mining packages, both Megaputer PolyAnalyst and SAS Text Miner have extensive text mining capabilities. SAS Text Miner is an add-on to base SAS Enterprise Miner that inserts an additional Text Miner icon on the SAS Enterprise Miner workspace toolbar. SAS Text Miner tags parts of speech and performs transformations such as singular value decomposition (SVD) of the term-document frequency matrix for viewing in the Text Miner node. Megaputer PolyAnalyst is similarly a package that combines data mining and text mining, but it also includes web mining capabilities. Megaputer also has standalone Text Analyst software for text mining.
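The SVD step mentioned above for SAS Text Miner can be illustrated with a small, self-contained sketch: build a term-document frequency matrix from a few documents, then approximate its largest singular value and singular vectors by power iteration. This is a generic illustration only, not SAS Text Miner's implementation; the helper names `term_document_matrix` and `leading_singular_triplet`, the whitespace tokenizer, and the three sample documents are hypothetical, and the matrix is assumed to be non-zero.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Rows are terms (sorted), columns are documents, entries are raw counts."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def leading_singular_triplet(matrix, iterations=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v, then w = A^T u, then renormalize to get the next v
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Hypothetical mini-corpus.
docs = ["data mining software", "text mining software", "web mining"]
terms, matrix = term_document_matrix(docs)
sigma, u, v = leading_singular_triplet(matrix)

# Terms with the largest weights in u dominate the first latent dimension.
for term, weight in sorted(zip(terms, u), key=lambda pair: -pair[1]):
    print(f"{term:10s} {weight:.3f}")
```

A full decomposition would extract further singular triplets (for example by deflation, or with a numerical library); only the leading one is computed here to keep the sketch short.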
Regarding web mining software, PolyAnalyst can mine web data integrated within a data mining enterprise analytical system and provide visual tools such as link analysis of the critical terms of the text. SPSS Clementine can be used for graphical illustrations of customer web activities as well as for link analysis of different data categories such as campaign, age, gender, and income. The selection of appropriate web mining software should be based on both its available web mining technologies and the type of data to be encountered.

The future direction of the research is to investigate other data, text, web, and supercomputing mining software for analyzing various types of data and to compare the capabilities of these packages with one another. This future research would also include the acquisition of other data sets to perform these new analyses and comparisons.

Acknowledgement. The authors would like to acknowledge the support provided by a 2009 Summer Faculty Research Grant awarded to them by the College of Business of Arkansas State University, without whose program and support this work could not have been done. The authors also want to acknowledge each of the software manufacturers for their support of this research.

References

AAAI (2002), American Association for Artificial Intelligence (AAAI) Spring Symposium on Information Refinement and Revision for Decision Making: Modeling for Diagnostics, Prognostics, and Prediction, Software and Data, retrieved from http://www.cs.rpi.edu/~goebel/ss02/software-and-data.html.
Ceccato, M., M. Marin, K. Mens, L. Moonen, et al., (2006), Applying and combining three different aspect Mining Techniques, Software Quality Journal. 14(3), 209-214.
Chang, J. and Lee, W. (2006), Finding frequent itemsets over online data streams, Information and Software Technology. 48(7), 606-619.
Chou, C., Sinha, A. and Zhao, H. (2008), A text mining approach to Internet abuse detection, Information Systems and eBusiness Management. 6(4), 419-440.
Curry, C., Grossman, R., Locke, D., Vejcik, S., and Bugajski, J. (2007), Detecting changes in large data sets of payment card data: A case study, KDD’07, August 12-15, San Jose, CA.
Data Intelligence Group (1995), An overview of data mining at Dun & Bradstreet, DIG White Paper 95/01, retrieved from http://www.thearling.com.text/wp9501/wp9501.htm.
Davi, A, Dominique Haughton, Nada Nasr, Gaurav Shah, et al (2005), A Review of Two Text-Mining Packages: SAS TextMining and WordStat. The American Statistician. 59(1), 89-104.
Davies, A. (2007), Identification of spurious results generated via data mining using an Internet distributed supercomputer grant, Duquesne University Donahue School of Business, http://www.business.duq.edu/Research/details.asp?id=83
Deshmukah, A. V. (1997), Software review: ModelQuest Expert 1.0, ORMS Today, December 1997, retrieved from http://www.lionhrtpub.com/orms/orms-12-97/software-review.html.
Ducatelle, F., (2006), Software for the data mining course, School of Informatics, The University of Edinburgh, Scotland, UK, retrieved from http://www.inf.ed.ac.uk/teaching/courses/dme/html/software2.html.
Ganapathy, S., Ranganathan, C. and Sankaranarayanan, B. (2004), Visualization strategies and tools for enhancing customer relationship management, Communications of the ACM. 47(11), 92-98.
Grossman, R. (2007), Data grids, data clouds and data webs: a survey of high performance and distributed data mining, HPC Workshop: Hardware and software for large-scale biological computing in the next decade, December 11-14, Okinawa, Japan, http://www.irp.oist.jp/hpc-workshop/slides.html
Hearst, M. A. (2003), What is Data Mining?, http://www.ischool.berkeley.edu/~hearstr/text mining.html
IBM DB2 Intelligent Miner Visualization: Using the Intelligent Miner Visualizers Version 8.2 SH12, Second Edition, August 2004
Kim, S., E James Whitehead Jr and Yi Zhang, (2008), Classifying Software Changes: Clean or Buggy? IEEE Transactions on Software Engineering. 34(2), 181-197.
Lau, K., Lee, K. and Ho, Y. (2005), Text Mining for the Hotel Industry, Cornell Hotel and Restaurant Administration Quarterly. 46(3), 344-363.
Lazarevic A., Fiea T., & Obradovic, Z., (2006), A software system for spatial data analysis and modeling, retrieved from http://www.ist.temple.edu?~zoran/papers/lazarevic00.pdf.
Leung, Y. F. (2004), My microarray software comparison - Data mining software, September 2004, Chinese University of Hong Kong, retrieved from http://www.ihome.cuhk.edu.hk/~b400559/arraysoft mining specific.html.
Megaputer Intelligence Inc. (2007), Data Mining, Text Mining, and Web Mining Software, http://www.megaputer.com
Mesrobian, E., Muntz, R., Shek, E., Mechoso, C. R., Farrara, J. D., Spahr, J. A., Stolorz, P. (1995), Real time data mining, management, and visualization of GCM output, IEEE Computer Society, v.81, http://dml.cs.ucla.edu/~shek/publications/sc 94.ps.gz
Metz, C. (2003), Software: Text Mining, PC Magazine, July 1, http://www.pcmag.com/print article2/0,1217.a=43573,00.asp
National Center for Biotechnology Information (2006), National Library of Medicine, National Institutes of Health, NCBI tools for data mining, retrieved from http://www.ncbi.nlm.nih.gov/Tools/.
Nayak, R. (2008), Data Mining in Web Services Discovery and Monitoring, International Journal of Web Services Research. 5(1), 63-82.
Nisbet, R. A. (2006), Data mining tools: Which one is best for CRM? Part 3, DM Review, March 21, 2006, retrieved from http://www.dmreview.com/editorial/dmreview/print action.cfm?articleId=1049954.
Pabarskaite, Z. and Raudys, A. (2007), A process of knowledge discovery from web log data: Systematization and critical review, Journal of Intelligent Information Systems. 28(1), 79-105.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decomposition, Int. J. Intelligent Systems Technologies and Applications, 4(1):57-78, 2008.
Rokach, L. and Maimon, O., Theory and applications of attribute decomposition, IEEE International Conference on Data Mining, IEEE Computer Society Press, pp. 473–480, 2001.
Rokach, L. and Maimon, O. and Averbuch, M., Information Retrieval System for Medical Narrative Reports, Lecture Notes in Artificial intelligence 3055, page 217-228, Springer-Verlag, 2004.
Sanchez, E. (1996), Speedier: Penn researchers to link supercomputers to community problems, The Compass, v. 43, n. 4, p. 14, September 17, http://www.upenn.edu/pennnews/features/1996/091796/research
Sanchez, M., Moreno, M., Segrera, S. and Lopez, V. (2008), Framework for the development of a personalised recommender system with integrated web-mining functionalities, International Journal of Computer Applications in Technology, 33(4), 312-327.
SAS (2009), JMP Genomics 4.0 Product Brief, http://www.jmp.com/software/genomics/pdf/103112 jmpg4 prodbrief.pdf
Segall, R. and Zhang, Q. (2006), Data visualization and data mining of continuous numerical and discrete nominal-valued microarray databases for biotechnology, Kybernetes: International Journal of Systems and Cybernetics, 35(9/10), 1538-1566.
Seigle, G. (2002), CIA, FBI developing intelligence supercomputer, Global Security.
Sekijima, M. (2007), Application of HPC to the analysis of disease related protein and the design of novel proteins, HPC Workshop: “Hardware and software for large-scale biological computing in the next decade”, December 11-14, Okinawa, Japan, http://www.irp.oist.jp/hpc-workshop/slides.html
SPSS (2009a): PASW Modeler 13: Overview Demo, http://www.spss.com/media/demos/modeler/demo-modeler-overview/index.htm
SPSS (2009b): PASW Modeler Auto Cluster and Cluster Viewer, http://www.spss.com/media/demos/modeler/demo-modeler-autocluster/index.htm
SPSS (2007), Web Mining for Clementine, http://www.spss.com/web mining for clementine, viewed 16 May 2007.
StatSoft, Inc. (2006), Electronic textbook, retrieved from http://www.statsoft.com/textbook/glosa.html.
VSG Visualization Sciences Group (2009), Avizo The 3D visualization software for scientific and industrial data, http://www.vsg3d.com/vsg prod avizo overview.php
Wikipedia (2006), Supercomputers, Retrieved May 19, 2009 from BookRags.com: http://www.bookrags.com/wiki/Supercomputer
Wikipedia (2007), Web mining, http://en.wikipedia.org/wiki/Web mining
Woodfield, Terry (2004), Mining Textual Data Using SAS Text Miner for SAS9 Course Notes, SAS Institute, Inc., Cary, NC.
Zhang, Q. and Segall, R. (2008), Web mining: a survey of current research, techniques, and software, International Journal of Information Technology & Decision Making, 7(4), 683-720.

66 Weka-A Machine Learning Workbench for Data Mining

Eibe Frank¹, Mark Hall¹, Geoffrey Holmes¹, Richard Kirkby¹, Bernhard Pfahringer¹, Ian H. Witten¹, and Len Trigg²

¹ Department of Computer Science, University of Waikato, Hamilton, New Zealand
{eibe, mhall, geoff, rkirkby, bernhard, ihw}@cs.waikato.ac.nz
² Reel Two, P O Box 1538, Hamilton, New Zealand
len@reeltwo.com

Summary. The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools.
The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.

Key words: machine learning software, Data Mining, data preprocessing, data visualization, extensible workbench

66.1 Introduction

Experience shows that no single machine learning method is appropriate for all possible learning problems. The universal learner is an idealistic fantasy. Real datasets vary, and to obtain accurate models the bias of the learning algorithm must match the structure of the domain.

The Weka workbench is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. It is designed so that users can quickly try out existing machine learning methods on new datasets in very flexible ways. It provides extensive support for the whole process of experimental Data Mining, including preparing the input data, evaluating learning schemes statistically, and visualizing both the input data and the result of learning.

This has been accomplished by including a wide variety of algorithms for learning different types of concepts, as well as a wide range of preprocessing methods. This diverse and comprehensive set of tools can be invoked through a common interface, making it possible for users

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_66, © Springer Science+Business Media, LLC 2010

Date posted: 04/07/2014, 06:20
