Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 127

64 NHECD - Nano Health and Environmental Commented Database
Oded Maimon and Abel Browarnik

Taxonomies and ontologies

NHECD uses manually prepared taxonomies at several stages. An ontology of the Nanotox domain could arguably improve the quality of information extraction, whether textual, graphic, or tabular. However, no Nanotox ontology exists yet. Research on ontology learning could use NHECD results and, in turn, the learned ontology could improve information extraction, implementing a kind of bootstrapping process. Data mining on the second NHECD product can have a strong influence on the ontology learning process. As a result, the ontology can be further enhanced.

Part VIII Software

65 Commercial Data Mining Software

Qingyu Zhang and Richard S. Segall

1 Arkansas State University, Department of Computer and Info. Tech., Jonesboro, AR 72467-0130, USA. qzhang@astate.edu
2 Arkansas State University, Department of Computer and Info. Tech., Jonesboro, AR 72467-0130, USA. rsegall@astate.edu

Summary. This chapter discusses selected commercial software for data mining, supercomputing data mining, text mining, and web mining. The selected packages are compared by their features and applied to available data sets. The data mining packages are SAS Enterprise Miner, Megaputer PolyAnalyst 5.0, PASW (formerly SPSS Clementine), IBM Intelligent Miner, and BioDiscovery GeneSight. The supercomputing packages are Avizo by Visualization Science Group and JMP Genomics from SAS Institute. The text mining packages are SAS Text Miner and Megaputer PolyAnalyst 5.0. The web mining packages are Megaputer PolyAnalyst and SPSS Clementine. Background on related literature and software is presented. Screen shots of each selected package are presented, as are conclusions and future directions.

65.1 Introduction

In the data mining community, there are three basic types of mining: data mining, web mining, and text mining (Zhang and Segall, 2008). In addition, there is a special category called supercomputing data mining, which is used today for high-performance data mining and data-intensive computing on large and distributed data sets. Much software has been developed for visualizing data-intensive computing on supercomputers, including software for large-scale parallel data mining.
Data mining primarily deals with structured data. Text mining mostly handles unstructured data and text. Web mining lies in between and copes with semi-structured and/or unstructured data. The mining process includes preprocessing, pattern analysis, and visualization. To mine data effectively, software with sufficient functionality should be used. Many software packages, commercial or free, are currently available on the market. A comprehensive list of mining software is available on the KDnuggets web page (http://www.kdnuggets.com/software/index.html).

This chapter discusses selected software for data mining, supercomputing data mining, text mining, and web mining that is not available as free open source software. The selected software for data mining is SAS Enterprise Miner, Megaputer PolyAnalyst 5.0, PASW (formerly SPSS Clementine), IBM Intelligent Miner, and BioDiscovery GeneSight. The selected software for text mining is SAS Text Miner and Megaputer PolyAnalyst 5.0. The selected software for web mining is Megaputer PolyAnalyst and SPSS Clementine. The software for supercomputing is Avizo by Visualization Science Group and JMP Genomics from SAS Institute. Avizo is 3-D visualization software for scientific and industrial data that can process very large data sets at interactive speed. JMP Genomics from SAS is used for discovering biological patterns in genomics data.

These packages are described and compared with respect to their features and algorithms, and they are applied to different available data sets. Background on related literature and software is also presented. Screen shots of each selected package are reported, as are conclusions and future directions.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_65, © Springer Science+Business Media, LLC 2010
65.2 Literature Review

Data mining is defined by the Data Intelligence Group (1995) as the extraction of hidden predictive information from large databases. According to them, “data mining tools scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.” According to StatSoft (2006), algorithms are operations or procedures that produce a particular outcome through a completely defined set of steps. Heuristics, by contrast, are general recommendations or guides based on theoretical reasoning or statistical evidence, such as “data mining can be a useful tool if used appropriately.” Data mining and its algorithms are widely implemented and are developing rapidly (Kim et al., 2008; Nayak, 2008; Segall and Zhang, 2006).

According to Wikipedia (2009), supercomputers, or HPC (High Performance Computing) systems, are used for highly calculation-intensive tasks such as problems involving quantum mechanical physics, weather forecasting, global warming, molecular modeling, and physical simulations (such as the simulation of airplanes in wind tunnels and of the detonation of nuclear weapons). Sanchez (1996) cited the importance of data mining using supercomputers by stating, “Data mining with these big, superfast computers is a hot topic in business, medicine and research because data mining means creating new knowledge from vast quantities of information, just like searching for tiny bits of gold in a stream bed”. According to Sanchez (1996), The Children’s Hospital of Pennsylvania used supercomputing to take MRI scans of a child’s brain in 17 seconds, a procedure that would otherwise normally require 17 minutes, assuming no movement of the patient.

The increasing availability of textual knowledge applications and online textual sources has caused a boost in text mining and web mining research.
Hearst (2003) defines text mining as “the discovery of new, previously unknown information, by automatically extracting information from different written sources.” Hearst distinguishes text mining from data mining by noting that “in text mining the patterns are extracted from natural language rather than from structured database of facts.” Metz (2003) describes text mining applications as those that “are clever enough to run conceptual searches, locating, say, all the phone numbers and places names buried in a collection of intelligence communiqués.” More impressively, the software can identify relationships, patterns, and trends involving words, phrases, numbers, and other data.

Web mining is the application of data mining techniques to discover patterns from the Web. It can be classified into three types: web content mining, web usage mining, and web structure mining (Pabarskaite and Raudys, 2007; Sanchez et al., 2008). Web content mining is the process of discovering useful information from the content of a web page, which may consist of text, image, audio, or video data. Web usage mining is the application of data mining to analyze and discover interesting patterns in users' usage data on the web. Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site (Wikipedia, 2007). An example of the latter would be discovering the authorities and hubs of any web document, e.g. identifying the most appropriate web links for a web page.

There is a wealth of software today for data, supercomputing, text, and web mining, such as that presented in American Association for Artificial Intelligence (AAAI) (2002) and Ducatelle (2006) for teaching data mining, Nisbet (2006) for CRM (Customer Relationship Management), and the software review of Deshmukah (1997).
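The hubs-and-authorities idea mentioned above for web structure mining can be illustrated with a short sketch. The code below is an assumption-laden illustration in the style of Kleinberg's HITS iteration, which the chapter does not name; the function name `hits`, the toy link graph, and the page names are all hypothetical. A page is a good authority if good hubs link to it, and a good hub if it links to good authorities.

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) scores for a link graph.

    links maps each page to the list of pages it links to.
    This is an illustrative sketch, not any vendor's implementation.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical four-page site: three pages link to "docs".
links = {"portal": ["docs", "blog"], "index": ["docs"], "blog": ["docs"], "docs": []}
hub, auth = hits(links)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```

In this toy graph the page with the most inbound links from well-connected pages ranks as the top authority, and the page that links to the most authoritative pages ranks as the top hub.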
StatSoft (2006) presents screen shots of several software packages that are used for exploratory data analysis and various data mining techniques. Kim et al. (2008) classify software changes in data mining, and Ceccato et al. (2006) combine three mining techniques. Nayak (2008) develops and applies data mining techniques in web services discovery and monitoring. Davi et al. (2005) review two text mining packages, SAS Text Miner and Wordstat. Chou et al. (2008) apply a text mining approach to Internet abuse detection, and Lau et al. (2005) discuss text mining for the hotel industry. Lazarevic et al. (2006) discuss a software system for spatial data analysis and modeling. Leung (2004) compares microarray data mining software.

The National Center for Biotechnology Information (2006), referred to as NCBI, provides tools for data mining, including tools specifically for each of the following categories: nucleotide sequence analysis, protein sequence analysis and proteomics, genome analysis, and gene expression.

Chang and Lee (2006) find frequent itemsets using online data streams. Pabarskaite and Raudys (2007) review the knowledge discovery process from web log data. Sanchez et al. (2008) integrate software engineering and web mining techniques in the development of an e-commerce recommender system capable of predicting the preferences of its users and presenting them with a personalized catalogue. Ganapathy et al. (2004) discuss visualization strategies and tools for enhancing customer relationship management.

Some applications of supercomputers for data mining include those of Davies (2007) using Internet distributed supercomputers, Seigle (2002) for CIA/FBI, Mesrobian et al. (1995) for real-time data mining, and Curry et al. (2007) for detecting changes in large data sets of payment card data. DMMGS06 conducted a workshop on data mining and management on the grid and supercomputers in Nottingham, UK. Grossman (2007) wrote a survey of high performance and distributed data mining.
Sekijima (2007) studied the application of HPC to the analysis of disease-related proteins.

65.3 Data Mining Software

This research compares five selected software packages for data mining: SAS Enterprise Miner, Megaputer PolyAnalyst 5.0, PASW Modeler (formerly SPSS Clementine), IBM Intelligent Miner, and BioDiscovery GeneSight. The data mining algorithms to be performed include those for neural networks, genetic algorithms, clustering, and decision trees. As can be seen in Table 65.1, SAS Enterprise Miner, PolyAnalyst 5, PASW, and IBM Intelligent Miner offer more algorithms than GeneSight.

Table 65.1. Data Mining Software
Software compared (table columns, in order): GeneSight; PolyAnalyst; SAS Enterprise Miner; PASW Modeler/SPSS Clementine; IBM Intelligent Miner
Algorithms (one row each, with the marks recorded for that row):
Statistical Analysis: x x x x x
Neural Networks: x x x (add on) x
Decision Trees: x x x
Regression Analysis: x x x x
Cluster Analysis: x x x x x
Self-Organizing Map (SOM): x x
Link/Association Analysis: x x x x

65.3.1 BioDiscovery GeneSight

GeneSight is a product of BioDiscovery, Inc. of El Segundo, CA. It focuses on cluster analysis, using the two main techniques of hierarchical and partitioning clustering, for data mining of microarray gene expressions.

Figure 65.1 shows the k-means clustering of global variations using the Pearson correlation. This can also be done by self-organizing map (SOM) clustering using the Euclidean distance metric for the first three variables of aspect, slope, and elevation. Figure 65.2 shows the two-dimensional self-organizing map (SOM) for the eleven variables for all of the data using the Chebychev distance metric.

Fig. 65.1. K-means clustering of global variations with the Pearson correlation using GeneSight

Fig. 65.2. Self-organizing map (SOM) with the Chebychev distance metric using GeneSight

65.3.2 Megaputer PolyAnalyst 5.0

PolyAnalyst 5 is a product of Megaputer Intelligence, Inc. of Bloomington, IN and contains sixteen (16) advanced knowledge discovery algorithms.

Figure 65.3 shows the input data window for the forest cover type data in PolyAnalyst 5.0. The link diagram in Figure 65.4 illustrates, for each of the six (6) forest cover types, each of the 5 elevations present for each of the 40 soil types. Figure 65.5 provides the bin selection rule for the variable of selection. The Decision Tree Report indicates a classification probability of 80.19% with a total classification error of 19.81%. According to the PolyAnalyst output, the decision tree has a tree depth of 100 with 210 leaves, a constructed-tree depth of 16, and a classification efficiency of 47.52%.

Fig. 65.3. Input data window for the forest cover type data in PolyAnalyst 5.0

65.3.3 SAS Enterprise Miner

SAS Enterprise Miner is a product of SAS Institute Inc. of Cary, NC and is based on the SEMMA approach, the process of Sampling (S), Exploring (E), Modifying (M), Modeling (M), and Assessing (A) large amounts of data. SAS Enterprise Miner uses a workspace in which data mining models are constructed by dragging and dropping icons. It provides algorithms for decision trees, regression, neural networks, cluster analysis, and association and sequence analysis.
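The correlation-based k-means clustering described for GeneSight in Section 65.3.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GeneSight's actual procedure: the distance is taken as one minus the Pearson correlation, centroids are plain coordinate-wise means (a simplification), the first k profiles serve as initial centroids, and the function names and toy profiles are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    if sd_a == 0 or sd_b == 0:
        return 0.0
    return cov / (sd_a * sd_b)


def correlation_distance(a, b):
    """Assumed distance: 0 for identically shaped profiles, 2 for opposite ones."""
    return 1.0 - pearson(a, b)


def kmeans(profiles, k, iterations=20):
    """Cluster profiles by shape; the first k profiles are the initial centroids."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: nearest centroid under the correlation distance.
        labels = [
            min(range(k), key=lambda c: correlation_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: coordinate-wise mean of each cluster's members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression profiles: rising and falling shapes at different scales.
profiles = [
    [1, 2, 3, 4],
    [4, 3, 2, 1],
    [2, 4, 6, 8],
    [8, 6, 4, 3],
    [10, 20, 30, 41],
]
labels, centroids = kmeans(profiles, k=2)
print(labels)  # profiles with the same shape share a label
```

Because the Pearson correlation ignores scale and offset, profiles are grouped by the shape of their variation rather than by magnitude, which is why this metric is common for gene expression data.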
