Learning to Map Between Schemas Ontologies

Alon Halevy
University of Washington
Joint work with Anhai Doan and Pedro Domingos

Agenda

- Ontology mapping is a key problem in many applications:
  - Data integration
  - Semantic web
  - Knowledge management
  - E-commerce
- LSD:
  - Solution that uses multi-strategy learning
  - We've started with schema matching (i.e., very simple ontologies)
  - Currently extending to more expressive ontologies
  - Experiments show the approach is very promising!

The Structure Mapping Problem

- Types of structures:
  - Database schemas, XML DTDs, ontologies, …
- Input:
  - Two (or more) structures, S1 and S2
  - Data instances for S1 and S2
  - Background knowledge
- Output:
  - A mapping between S1 and S2
  - Should enable translating between data instances
  - Semantics of mapping?

Semantic Mappings between Schemas

- Source schemas = XML DTDs

[Figure: two schemas. One has the elements house, address, contact-info, agent-name, agent-phone, num-baths; the other has house, location, contact, name, phone, full-baths, half-baths. Arrows between them are labeled "1-1 mapping" and "non 1-1 mapping".]

Motivation

- Database schema integration
  - A problem as old as databases themselves
  - Database merging, data warehouses, data migration
- Data integration / information gathering agents
  - On the WWW, in enterprises, large science projects
- Model management:
  - Model matching: key operator in an algebra where models and mappings are first-class objects
  - See [Bernstein et al., 2000] for more
- The Semantic Web
  - Ontology mapping
- System interoperability
  - E-services, application integration, B2B applications, …

Desiderata from Proposed Solutions

- Accuracy, efficiency, ease of use
- Realistic expectations:
  - Unlikely to be fully automated; need user in the loop
- Some notion of semantics for mappings
- Extensibility:
  - Solution should exploit additional background knowledge
- "Memory", knowledge reuse:
  - System should exploit previous manual or automatically generated matchings
  - Key idea behind LSD

LSD Overview

- L(earning) S(ource) D(escriptions)
- Problem: generating semantic mappings between mediated schema and a large set of data source schemas
- Key idea: generate the first mappings manually, and learn from them to generate the rest
- Technique: multi-strategy learning (extensible!)
- Step 1:
  - [SIGMOD, 2001]: 1-1 mappings between XML DTDs
- Current focus:
  - Complex mappings
  - Ontology mapping

Outline

- Overview of structure mapping
- Data integration and source mappings
- LSD architecture and details
- Experimental results
- Current work

Data Integration

[Figure: the query "Find houses with four bathrooms priced under $500,000" is posed against a mediated schema. Query reformulation and optimization translate it to three source schemas, whose wrappers sit over realestate.com, homeseekers.com, and homes.com.]

- Applications: WWW, enterprises, science projects
- Techniques: virtual data integration, warehousing, custom code

Moving Up the Expressiveness Ladder

- Schemas are very simple ontologies
- More expressive power = more domain constraints
- Mappings become more complex, but constraints provide more to learn from
- Non 1-1 mappings:
  - F1(A1,…,Am) = F2(B1,…,Bm)
- Ontologies (of various flavors):
  - Class hierarchy (i.e., containment on unary relations)
  - Relationships between objects
  - Constraints on relationships

Finding Non 1-1 Mappings (current work)

- Given two schemas, find
  - 1-many mappings: address = concat(city, state)
  - many-1: half-baths + full-baths = num-baths
  - many-many: concat(addr-line1, addr-line2) = concat(street, city, state)
- 1-many mappings
  - expressed as query
  - value correspondence expression: room-rate = rate * (1 + tax-rate)
  - relationship: state of tax-rate = state of hotel that has rate
  - special case: 1-many mappings between two relational tables

[Figure: mediated schema with columns address, description, num-baths; source schema with columns city, state, comments, half-baths, full-baths.]

Brute-Force Solution

- Define a set of operators
  - concat, +, -, *, /, etc.
- For each set of mediated-schema columns
  - enumerate all possible mappings
  - evaluate & return best mapping

[Figure: mediated-schema columns and source-schema columns; candidate mappings m1, m2, …, mk; compute similarity using all base learners; result m1.]

Search-Based Solution

- States = columns
  - goal state: mediated-schema column
  - initial states: all source-schema columns
  - use 1-1 matching to reduce the set of initial states
- Operators: concat, +, -, *, /, etc.
- Column-similarity:
  - use all base learners + recognizers

Multi-Strategy Search

- Use a set of expert modules: L1, L2, …, Ln
- Each module
  - applies to only certain types of mediated-schema column
  - searches a small subspace
  - uses a cheap similarity measure to compare columns
- Example
  - L1: text; concat; TF/IDF
  - L2: numeric; +, -, *, /; [Ho et al 2000]
  - L3: address; concat; Naive Bayes
- Search techniques
  - beam search as default
  - specialized; do not have to materialize columns

Multi-Strategy Search (cont'd)

- Apply all applicable expert modules
  - L1: m11, m12, m13, …, m1x
  - L2: m21, m22, m23, …, m2y
  - L3: m31, m32, m33, …, m3z
- Combine modules' predictions & select the best one

[Figure: candidates m11, m12, m21, m22, m31, m32; compute similarity using all base learners; result m11.]

Related Work

[Figure: classification of related systems.
Category labels: Recognizers + Schema + 1-1 Matching; Single Learner + 1-1 Matching; Hybrid + 1-1 Matching; Multi-Strategy Learning, Learners + Recognizers, Schema + Data, 1-1 + non 1-1 Matching; Sophisticated Data-Driven User Interaction; "?".
Systems shown: TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al 98], CUPID [Madhavan et al 01], SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95], DELTA [Clifton et al 97], CLIO [Miller et al 00], [Yan et al 01], LSD [Doan et al 2000, 2001].]

Summary

- LSD:
  - uses multi-strategy learning to semi-automatically generate semantic mappings
  - LSD is extensible and incorporates domain and user knowledge, and previous techniques
  - Experimental results show the approach is very promising
- Future work and issues to ponder:
  - Accommodating more expressive languages: ontologies
  - Reuse of learned concepts from related domains
  - Semantics?
- Data management is a fertile area for Machine Learning research!

Backup Slides

Mapping Maintenance

[Figure: mediated schema M and source schema S connected by mappings m1, m2, m3.]

- Ten months later: are the mappings still correct?

[Figure: mediated schema M' and source schema S' with the same mappings m1, m2, m3.]

Information Extraction from Text

- Extract data fragments from text documents
  - date, location, & victim's name from a news article
- Intensive research on free-text documents
- Many documents have substantial structure
  - XML pages, name card, tables, list
- Each such document = a data source
  - structure forms a schema
  - only one data value per schema element
  - "real" data source has many data values per schema element
- Ongoing research in the IE community

Contribution of Each Component

[Figure: bar chart of average matching accuracy (%) for four domains (Real Estate I, Course Offerings, Faculty Listings, Real Estate II). Series: without Name Learner, without Naive Bayes, without Whirl Learner, without Constraint Handler, the complete LSD system.]

Exploiting Hierarchical Structure

- Existing learners flatten out all structures

[Figure: two example instances. One holds the values "Gail Murphy" and "MAX Realtors"; the other holds the text "Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors".]

- Developed XML learner
  - similar to the Naive Bayes learner
    - input instance = bag of tokens
  - differs in one crucial aspect
    - consider not only text tokens, but also structure tokens

Domain Constraints

- Impose semantic regularities on sources
  - verified using schema or data
- Examples
  - a = address & b = address => a = b
  - a = house-id => a is a key
  - a = agent-info & b = agent-name => b is nested in a
- Can be specified up front
  - when creating mediated schema
  - independent of any actual source schema

The Constraint Handler

Predictions from Meta-Learner:
  area: (address, 0.7), (description, 0.3)
  contact-phone: (agent-phone, 0.9), (description, 0.1)
  extra-info: (address, 0.6), (description, 0.4)

Domain Constraints:
  a = address & b = address => a = b

Candidate mappings and their scores:
  area: address 0.7, contact-phone: agent-phone 0.9, extra-info: address 0.6 -> 0.378
  area: description 0.3, contact-phone: description 0.1, extra-info: description 0.4 -> 0.012
  area: address 0.7, contact-phone: agent-phone 0.9, extra-info: description 0.4 -> 0.252

- Can specify arbitrary constraints
- User feedback = domain constraint
  - ad-id = house-id
- Extended to handle domain heuristics
  - a = agent-phone & b = agent-name => a & b are usually close to each other
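
The scores on the Constraint Handler slide (0.378, 0.012, 0.252) are consistent with multiplying the meta-learner confidences of the chosen labels, then discarding any combination that violates a domain constraint. The Python sketch below reproduces that arithmetic under those assumptions. The function names and the exhaustive enumeration are illustrative choices, not taken from the slides, which do not say how LSD searches the candidate space.

```python
from itertools import product
from math import prod

# Meta-learner predictions copied from the slide:
# source element -> {mediated-schema label: confidence}
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(assignment):
    """Slide constraint: if a = address and b = address, then a = b."""
    return sum(1 for label in assignment.values() if label == "address") <= 1


def best_assignment(predictions, constraints):
    """Return the highest-scoring label assignment that satisfies every constraint.

    Assumption: a combination's score is the product of its confidences,
    and every combination is enumerated (fine for three elements only).
    """
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        assignment = dict(zip(elements, labels))
        if not all(check(assignment) for check in constraints):
            continue  # e.g. the 0.378 candidate maps two elements to address
        score = prod(predictions[e][label] for e, label in assignment.items())
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


best, score = best_assignment(predictions, [at_most_one_address])
print(best, f"{score:.3f}")
```

With the slide's numbers, the unconstrained winner (0.378) is rejected because both area and extra-info map to address, so the 0.252 candidate is selected. User feedback such as ad-id = house-id could be passed as one more predicate in the `constraints` list.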



Table of Contents

Learning to Map Between Schemas Ontologies

Agenda

The Structure Mapping Problem

Semantic Mappings between Schemas

Motivation

Desiderata from Proposed Solutions

LSD Overview

Outline

Data Integration

Slide 10

Semantics (preliminary)

Why Matching is Difficult

Current State of Affairs

Slide 14

The LSD Approach

Example

Multi-Strategy Learning

Base Learners

Training the Base Learners

Entity Recognizers
