64 NHECD - Nano Health and Environmental Commented Database

Oded Maimon and Abel Browarnik
Department of Industrial Engineering, Tel-Aviv University, Ramat-Aviv 69978, Israel, maimon@eng.tau.ac.il

Summary. The impact of nanoparticles on health and the environment is a significant research subject, driving increasing interest from the scientific community, regulatory bodies and the general public. We present a smart repository system with text and data mining for this domain. The growing body of knowledge in this area, consisting of scientific papers and other types of publications (such as surveys and whitepapers), emphasizes the need for a methodology to alleviate the complexity of reviewing all the available information and discovering all the underlying facts, using data mining algorithms and methods.
The European Commission-funded project NHECD (whose full name is “Creation of a critical and commented database on the health, safety and environmental impact of nanoparticles”) converts the unstructured body of knowledge produced by the different groups of users (such as researchers and regulators) into a repository of scientific papers and reviews augmented by layers of information extracted from the papers. Towards this end we use taxonomies built by domain experts and metadata, using advanced methodologies. We implement algorithms for textual information extraction, graph mining and table information extraction. Rating and relevance assessment of the papers are also part of the system. The project is composed of two major layers: a backend consisting of all the above taxonomies, algorithms and methods, and a frontend consisting of a query and navigation system. The frontend has a web interface that addresses the needs (and knowledge) of the different user groups. Documentum, a content management system (CMS), is the backbone of the backend process component. The frontend is a customized application built using an open source CMS. It is designed to take advantage of the taxonomies and metadata for search and navigation, while allowing the user to query the system, taking advantage of the extracted information.

64.1 Introduction

Nanoparticle toxicity (or NanoTox) is currently one of the main concerns for the scientific community, for regulators and for the public. The impact of nanoparticles on health and the environment is a research subject driving increasing interest. This fact is reflected by the number of papers published on the subject, both in scientific journals and in the press.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_64, © Springer Science+Business Media, LLC 2010

The published material (e.g., scientific papers) is essentially unstructured.
It always uses natural language (in the form of text), sometimes accompanied by tables and/or graphs. Usually, when searching a body of unstructured knowledge (such as a corpus of scientific papers), the “search engine” uses a method called “full text search”. Full text search can be done either directly, by scanning all the available text, or by using indexing mechanisms. Direct search is feasible only for small volumes of data. Index-based search applies when the amount of data rules out direct search. There are several indexing mechanisms, the most famous being Google’s PageRank (Brin, 1998). Indexing mechanisms are rated according to the results returned by the search engines using them. In either case, users interact with the search engine (and through it with the indexing mechanism) by means of queries.

Scientific papers are written in natural language, so it could be easier for users to formulate queries in the same natural language. However, understanding natural language is an extremely non-trivial task, and using it for queries would add further complexity to a problem with enough complexity by itself. To avoid this, search engines use different approaches to deal with queries:

1. Keywords: Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
2. Boolean queries: Searches using Boolean operators can dramatically increase the precision of a free text search. The AND operator says, in effect, ”Do not retrieve any document unless it contains both of these terms.” The NOT operator says, in effect, ”Do not retrieve any document that contains this word.” If the retrieval list returns too few documents, the OR operator can be used to increase recall.
3. Phrase search: A phrase search matches only those documents that contain a specified phrase.
4. Concordance search: A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.
5. Proximity search: A proximity search matches only those documents that contain two or more words separated by at most a specified number of words.
6. Regular expression: A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
7. Wildcard search: A search that substitutes one or more characters in a search query with a wildcard character such as an asterisk.

“Skin Deep” 1, a product safety guide dealing with cosmetics, run by the Environmental Working Group, grants public access to a database containing more than 42,000 products with more than 8,300 ingredients from the U.S., nearly a quarter of all products on the market (figures updated to May 2009). The database is based on a link between a collection of personal care product ingredient listings and more than 50 toxicity and regulatory databases. Skin Deep uses a restricted user interface for simple queries. The visitor is asked about a product, ingredient or company (see Figure 64.1). A query for “vitamin a” returned 614 results, matching at least one word. The advanced query screen allows for a much more detailed search (see Figure 64.2). Visitors can ask to find products, ingredients or companies with higher granularity. The results returned by Skin Deep consist of an exhaustive analysis of the substance, as shown in Figure 64.3.

1 The Environmental Working Group Repository of Cosmetics, http://www.cosmeticsdatabase.com/about.php

Fig. 64.1. Skin Deep simple query.
Fig. 64.2. Skin Deep advanced queries.
Fig. 64.3. Skin Deep result example.
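The keyword and Boolean approaches above can be sketched with a minimal inverted index. This is an illustrative toy, not the implementation of any engine discussed in this chapter; the document ids, tokenization scheme and query helper are assumptions made for the sketch.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def boolean_query(index, doc_ids, must=(), any_of=(), must_not=()):
    """Toy Boolean retrieval: AND over `must`, OR over `any_of`, NOT over `must_not`."""
    results = set(doc_ids)
    for term in must:                       # AND: every term is required
        results &= index.get(term, set())
    if any_of:                              # OR: at least one term must match
        results &= set().union(*(index.get(t, set()) for t in any_of))
    for term in must_not:                   # NOT: exclude documents with the term
        results -= index.get(term, set())
    return results

# Tiny invented corpus for illustration (not real paper titles)
docs = {
    1: "titanium dioxide nanoparticle toxicity",
    2: "silver nanoparticle safety assessment",
    3: "carbon nanotube toxicity review",
}
index = build_index(docs)
```

For example, `boolean_query(index, docs, must=("nanoparticle",), any_of=("toxicity", "safety"))` returns documents 1 and 2, and adding `must_not=("titanium",)` narrows the result to document 2 alone, mirroring how OR widens recall while AND and NOT sharpen precision.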
ICON, the International Council on Nanotechnology, from Rice University, uses an approach that constrains the user to formulate a query within a restricted (although very rich) template, together with a “controlled vocabulary” (i.e., a list of predefined values), as shown in Figure 64.4.

Fig. 64.4. ICON database.

Results obtained from ICON are, as stated on the ICON website:

“. . . a quick and thorough synopsis of our Environment, Health and Safety Database using two types of analyses. The first is a Simple Distribution Analysis (pie chart) which compares categories within a specified time range. The second type is a Time Progressive Distribution Analysis (histogram) which compares categories over a specified overall time range and data grouping period. Other useful features include the ability to: 1. Generate and export custom reports in pdf and xls formats. 2. Click on a report result to generate a list of publications meeting your criteria”.

TOXNET - Databases on toxicology, hazardous chemicals, environmental health, and toxic releases, an initiative by the US National Library of Medicine, lets visitors query its network of databases by using keywords, as shown in Figure 64.5.

Fig. 64.5. Toxnet query.

There are several initiatives related to toxicity of nanoparticles, but to date none of them is a real alternative to the existing (and limited) databases. Examples of such initiatives are the Environmental Defense Fund Nanotech section 2 and the NANO Risk Framework. There are also initiatives that aim at mapping current nanotox research. The OECD (Organisation for Economic Co-operation and Development) runs a “Database on Research into the Safety of Manufactured Nanomaterials” 3. As suggested by its name, the database maps research in the area. It uses extensive metadata 4, as seen in Figure 64.6.
2 Environmental Defense Fund, http://www.edf.org/page.cfm?tagID=77
3 http://webnet.oecd.org/NanoMaterials/Pagelet/Front/Default.aspx
4 Metadata: “data about data” (see http://en.wikipedia.org/wiki/Metadata)

Fig. 64.6. OECD NanoTox advanced search.

NIOSH, the U.S. National Institute for Occupational Safety and Health, runs a Nanoparticle Information Library (NIL) 5.

IMPART-Nanotox 6, an EU funded project that ended in 2008, includes a public, web accessible database of nanotox publications. The search can be done by publications’ metadata, as seen in Figure 64.7.

Fig. 64.7. Impart-Nanotox extended search.

5 The U.S. National Institute for Occupational Safety and Health Nanoparticle Information Library (NIL) – http://www.cdc.gov/search.do?action=search&subset=niosh
6 http://www.impart-nanotox.org

SAFENANO 7, another EU funded project, contains a database of publications and metadata searchable on the web (see Figure 64.8).

Fig. 64.8. SAFENANO Publication Search.

7 http://www.safenano.org

Nano Archive 8, another EU FP7 project, has the objective of allowing researchers to share and search information, mainly through metadata exchange (see Figure 64.9).

Fig. 64.9. Nano Archive search.

ObservatoryNano 9, yet another EU FP7 funded project, has an ambitious target: “to create a European Observatory on Nanotechnologies to present reliable, complete and responsible science-based and economic expert analysis, across different technology sectors, establish dialogue with decision makers and others regarding the benefits and opportunities, balanced against barriers and risks, and allow them to take action to ensure that scientific and technological developments are realized as socio-economic benefits.”

8 http://www.nanoarchive.org/information.html
9 http://www.observatorynano.eu/project/search/extendedsearch/

The review above brought us to the conclusion that the following shortcomings should be dealt with:

1. Many efforts are being dedicated to creating repositories of raw metadata of nanotox publications. There is no evidence as to the contribution of such repositories to the advancement of nanotox research and implementation.
2. No significant searchable repository of nanotox data (as compared to metadata) currently exists.
3. The query capabilities of widely used search engines do not include the option to query the text for fact patterns (as well as more complex, derived patterns), such as “what conclusions were reached in scientific papers where <fact X> and <fact Y> occurred in that order?”. These are examples of queries that may help nanotox researchers and regulators, as well as the general public.
4. There is no tool capable of extracting information specific to nanotox.

NHECD 10, an EU FP7 funded project, aims at transforming the emerging body of unstructured knowledge (in the form of scientific papers and other publications) into structured data by means of textual information extraction. It addresses the above shortcomings by:

1. Developing taxonomies for the nanotox domain
2. Developing and implementing algorithms for information extraction from nanotox papers
3. Creating a repository of papers augmented by structured knowledge extracted from the papers
4. Allowing visitors (e.g., nanotox scientists, regulators, general public) to navigate the repository using the taxonomies
5. Letting visitors search the repository using complex patterns (such as facts)
6. Enabling data mining algorithms to predict toxicity based on characteristics extracted by text mining methods. Thus free text can be used for data mining inference.
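To make the fact-pattern shortcoming concrete, the sketch below shows what an ordered fact-pattern query could look like over facts extracted from a paper's text. The regex patterns, fact types and sample sentence are invented for illustration only; NHECD's actual extraction relies on expert-built taxonomies and trained text mining models, not toy regexes.

```python
import re

# Invented toy patterns standing in for real extraction rules.
FACT_PATTERNS = {
    "exposure": re.compile(r"exposed to [\w\s-]+? at [\d.]+\s*\w+/\w+"),
    "effect": re.compile(r"inflammation|oxidative stress|cytotoxicity"),
}

def extract_facts(text):
    """Return (fact_type, character_offset) pairs, ordered by position in text."""
    facts = []
    for fact_type, pattern in FACT_PATTERNS.items():
        for m in pattern.finditer(text.lower()):
            facts.append((fact_type, m.start()))
    return sorted(facts, key=lambda f: f[1])

def facts_in_order(text, first, second):
    """True if a `first` fact occurs before a `second` fact in `text`."""
    facts = extract_facts(text)
    for i, (fact_type, _) in enumerate(facts):
        if fact_type == first:
            return any(f == second for f, _ in facts[i + 1:])
    return False
```

A query such as "papers where an exposure fact precedes an effect fact" then reduces to filtering a corpus with `facts_in_order(paper_text, "exposure", "effect")`, something a plain keyword engine cannot express.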
64.2 The NHECD Model

NHECD is, as suggested by its full name (Nano Health and Environmental Commented Database), an initiative to obtain a database (i.e., structured information that can be queried) from available unstructured information such as scientific papers and other publications. The process of obtaining the structured data involves many resources, from the domain of nanotox and from the areas of information sciences and technologies (IT). The NHECD model is depicted in Figure 64.10.

The process starts with a collection of documents (e.g., scientific papers) gathered by means of a search using criteria given by nanotox experts. The process used to populate the repository is called crawling. The documents are accompanied by the corresponding metadata (e.g., authors, publication dates, journals, keywords supplied by the authors, abstract and more). The process requires nanotox taxonomies. Taxonomies are classification artifacts used at the information extraction stage (taxonomies are also used in NHECD for document navigation). Taxonomy building tasks are “located” at the boundary between the nanotox experts and the IT experts (see Figure 64.10), due to their interdisciplinary nature.

Nanotox experts annotate papers to train the system towards the information extraction stage. This stage is implemented using text mining algorithms. Following the information extraction process, a set of rating algorithms is applied to the documents to provide an additional layer of information (i.e., the rating). The result of the process consists of:

1. A corpus of results, updated on an ongoing, asynchronous basis.
2. A commented collection of scientific papers. By commented we refer to the added layer of metadata, rating and other information extracted from the document.

The whole process can be represented with a block diagram as shown in Figure 64.11.
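The processing chain just described (crawl, attach metadata, classify against taxonomies, rate) can be sketched as a sequence of composed steps. The function names, dictionary layout and rating heuristic below are illustrative assumptions, not the project's actual components.

```python
def crawl(criteria):
    """Stand-in for the crawler: return raw documents matching `criteria`."""
    # A real crawler would query publishers' databases; this is a stub.
    return [{"id": 1, "text": "nanoparticle toxicity in aquatic exposure"}]

def attach_metadata(doc):
    """Attach bibliographic metadata (authors, journal, keywords, ...)."""
    doc["metadata"] = {"authors": [], "journal": None, "keywords": []}
    return doc

def classify(doc, taxonomy):
    """Tag the document with every taxonomy term its text mentions."""
    doc["categories"] = [term for term in taxonomy if term in doc["text"]]
    return doc

def rate(doc):
    """Placeholder rating heuristic: count of matched taxonomy categories."""
    doc["rating"] = len(doc["categories"])
    return doc

def build_repository(criteria, taxonomy):
    """Run the full chain over every crawled document."""
    return [rate(classify(attach_metadata(d), taxonomy)) for d in crawl(criteria)]
```

Each step enriches the same document record, so the repository ends up holding the original text plus the added layers of metadata, categories and rating described above.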
10 http://www.nhecd-fp7.eu – Creation of a critical and commented database on the health, safety and environmental impact of nanoparticles
