2 morphology


Document information

Date posted: 31/12/2018, 14:51

Morphology
• Factoring words
  • Cats → CAT + N(oun) + PL(ural)
• Reading: Speech and Language Processing (chapter on morphology)

Motivation
Used in:
• Traditional NLP applications
• Finding word boundaries (e.g., Latin, Chinese)
• Document retrieval (keyword retrieval)
• Text classification
• …

Morphology
• Morphology is the study of how words are built up from smaller meaningful units called morphemes
  • Ex: disadvantages = dis + advantage + s
• Classes:
  • Inflectional morphology
  • Derivational morphology

Inflectional morphology
• A stem + a grammatical morpheme → a word:
  • of the same class as the stem
  • related to the syntax of the sentence
• Example: subject-verb agreement
  • He hit-s the ball
  • We hit the ball
• Plural and possessive markers
  • Cats, cat's

Derivational morphology
• A stem + a grammatical morpheme → a word:
  • of a different class, e.g., transmit → transmission (Verb to Noun)
  • with an irregular meaning change

  Suffix   Base              Derived form
  -ation   computerize (V)   computerization (N)
  -ee      appoint (V)       appointee (N)
  -er      love (V)          lover (N)
  -ness    fuzzy (Adj)       fuzziness (N)
  -less    clue (N)          clueless (Adj)

Problem
Build a morphological parser to compute the morphology of words:

  Input     Morphological parsed output
  Cats      Cat + N + PL
  Cat       Cat + N + SG
  Cities    City + N + PL
  Goose     Goose + N + SG
  Geese     Goose + N + PL
  Gooses    Goose + V + 3SG
  Merging   Merge + V + PRES-PART
  Caught    (catch + V + PAST-PART) or (catch + V + PAST)

Solution 1: A large dictionary
• Impractical: some languages associate a single meaning with a large number of distinct surface forms (600 billion in Turkish)
• German: Leben+s+versicherung+s+gesellschaft+s+angestellter (life+CmpAug+insurance+CmpAug+company+CmpAug+employee)
• Chinese compounding: about 3000 'words' combine to yield tens of thousands

Solution 2: Look individual morphemes up
• mis + interpret + ation + s → MIS + INTERPRET + noun form + plural
• Unrealistic: we might not find all the pieces in the dictionary, because of interference from the sound system (phonology)
  • Ex: cities → citie + s; cities → citi + es

Basic Terminology & Motivation
• Stem: core meaning unit (morpheme) of a word
• Affixes: pieces that combine with the stem to modify its meaning and grammatical functions
  • Prefix: un-, anti-, etc.
  • Suffix: -ity, -ation, etc.
  • Infix: Tagalog um + hingi → humingi (borrow)

Define the problem
What knowledge do we need?
• What endings follow what roots, and in what order
  • Cat/cats (inflectional)
  • Dog/dogged (derivational)
• Only some endings go on some words, not others
  • Do+er is OK (a class of verbs), but not after be
• Spelling change rules adjust the surface form relative to the lexicon form:
  • Get+er → double the t → getter
  • Fox+s → insert e → foxes
  • Fly+s → insert e → flyes → Y to I → flies

How to do it?
• We want to model pure concatenation
• We need to 'remember' that certain items can only combine with certain other items
• There is a perfect model for this: finite-state automata

Picture of finite-state automata (FSA)
[Figure not recoverable from the source]

How: 2-level machine
[Figure: a finite-state transducer relates the surface form "f l i e s" to the underlying (lexicon) form "F L Y + S"]

Definition of finite-state automaton (FSA)
A (deterministic) finite-state automaton (FSA) is a quintuple (Q, Σ, δ, q0, F) where
• Q is a finite set of states
• Σ is a finite set of terminal symbols, the alphabet
• q0 ∈ Q is the initial state
• F ⊆ Q is the set of final states
• δ is a function from Q × Σ into Q, the transition function

Formal languages & grammars
• A language is a set of strings defined over some alphabet Σ, with some property P:
  L = {x ∈ Σ* | P(x)}
• Suppose Σ = {a, b}. Then we can have: [examples not recoverable from the source]

Plan
• Build FSAs to recognize different stem-endings and prefix-stems
• Build FSAs to recognize spelling changes
• Turn these into parsers by turning the FSAs into finite-state transducers

Using FSAs to build a recognizer for morphophonemic forms
• Build an FSA system for English inflectional morphology
• English derivational morphology FSA
• Use this to recognize a valid word
• Then show how to parse by extending it to a transducer
• Add spelling-change rules

FSA for nominal inflection
• Remember, we don't have to worry about spelling changes
• Classes of words:
  • Regular: cat, table, city: add s
  • Irregular: goose, mouse, sheep (memorize)

Resulting FSA
[Figure not recoverable from the source]

English derivational morphology
• Much more complex than inflectional
• Consider adjectives:
  • Big, bigger, biggest
  • Cool, cooler, coolest, coolly
  • Clear, clearer, clearest, clearly, unclear, unclearly
  • Happy, happier, happiest, happily
  • Unhappy, unhappier, unhappiest, unhappily

Will this FSA work?
[Figure not recoverable from the source]

Will this FSA work? NO!
• It accepts all the adjectives above, but
• It also accepts unbig, realest
• Common problem: overgeneration
• Solution? We need classes of roots that say which roots can occur with which suffixes

Revised picture
[Figure not recoverable from the source]

More English
[Figure not recoverable from the source]

FSA at the level of individual letters
[Figure not recoverable from the source]
• Aardvarks, foxs, …

From recognizer to transducer
• Why: we need to map (put in correspondence) inputs and outputs (e.g., goose-geese)
• A finite-state transducer is a quintuple:
  • Q, a finite set of states
  • Σ, a finite alphabet of complex symbols. Each is an input-output pair i:o, with i from alphabet I and o from alphabet O, so Σ ⊆ I × O. I and O can include the empty symbol ε (or λ)
  • q0, a start state
  • F, the set of final states, F ⊆ Q
  • δ, the transition function between states

FSTs in morphological processing
[Figure not recoverable from the source]

FSA vs FST
• An FSA defines a formal language (a set of strings)
• An FST defines a relation between sets of strings (a set of pairs of strings)

Operations
• Composition: if transducer T1 maps from I1 to O1 and T2 maps from I2 to O2, then T1 ∘ T2 maps from I1 to O2
  • Useful to replace a series of transducers
• Inversion: the inverse T⁻¹ of a transducer T switches input and output labels
  • Useful to convert a parser into a generator

Automaton for singular/plural suffix, call this Tnum
[Figure not recoverable from the source]

Automaton for stems, call this Tstems
[Figure not recoverable from the source]
• (cats#, cat N PL)
• (geese#, goose N PL)

Tlex = Tnum ∘ Tstems
[Figure not recoverable from the source]

Spelling change rules

  Name                                  Description                                          Example
  Consonant doubling (gemination, G)    1-letter consonant doubled before -ing/-ed           beg/begging
  E deletion (elision, EL)              Silent e dropped before -ing, -ed                    make/making
  E insertion (epenthesis, EP)          e added after -s, -z, -x, -ch, -sh before -s         fox/foxes
  Y replacement (Y)                     -y changes to -ie before -s (and to -i before -ed)   try/tries
  I spelling (I)                        i goes to y before vowel                             lie/lying

So another view of the situation is this (see notes2)
[Figure: two machines side by side. FST1 (word classes): the root is always the first 'class'; arcs F/f, O/o, X/x, S/s, #/# lead through the Noun classes C1 and C2. FST2 (spell changes): arcs f:f, o:o, x:x, +:e, s:s, #:#.]

Recognizing 'foxes'
[Figure: FST spelling maps surface "foxes" to underlying "FOX+S". Surface tape: f o x e s #. Underlying tape: F O X + S #. An automaton blocks when input is left over; the successful run ends with the analysis Fox+s, Plural.]

Two-level morphology parsing (analysis) algorithm
1. Initialize the set of paths to P = {}
2. Read input symbols, one at a time
3. At each symbol, generate all lexical symbols possibly corresponding to the empty (0) symbol
4. Prolong all paths in P by all such possible (x:0) pairs
5. Check each new path extension against the phonological FST and the lexical FSA (lexical symbols only); delete impossible path prefixes
6. Repeat 4-5 until the maximum number of consecutive 0s is reached
7. Generate all possible lexical symbols (get them from all FSTs) for the current input symbol, and form pairs
8. Extend all paths from P using all such pairs
9. Check all paths from P (next step in FST/FSA); delete all outright impossible paths
10. Repeat for each input symbol until the end of the input
11. Collect lexical "glosses" from all surviving paths

Generation algorithm
• Do not use the lexicon (well, you have to put the "right" lexical strings together somehow!)
• Start with a lexical string L
• Generate all possible pairs l:s for every symbol in L
• Find all (hopefully only 1!) traversals through the FST which end in a final state
• From all such traversals, print out the sequence of surface letters
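
To make the lexical-to-surface direction concrete, here is a minimal Python sketch that turns a lexical string such as "fox+s" into a surface form. It is a stand-in, not the FST traversal the generation algorithm describes: the spelling change rules from the table are approximated by ordered regular-expression rewrites, and only the first matching rule is applied. The names `SPELLING_RULES` and `generate`, the "+" boundary convention, and the regexes themselves are assumptions made for illustration.

```python
import re

# Ordered (pattern, replacement) rewrites over a lexical string like "fox+s".
# "+" marks the morpheme boundary. These regexes are illustrative assumptions,
# not the transducers drawn on the slides.
SPELLING_RULES = [
    (r"ie\+ing$", r"y+ing"),                  # I spelling: lie+ing -> lying
    (r"([^aeiou])y\+s$", r"\1ie+s"),          # Y replacement: try+s -> tries
    (r"([^aeiou])y\+ed$", r"\1i+ed"),         # Y replacement: try+ed -> tried
    (r"(s|z|x|ch|sh)\+s$", r"\1e+s"),         # E insertion: fox+s -> foxes
    (r"([^aeiou])e\+(ing|ed)$", r"\1+\2"),    # E deletion: make+ing -> making
    # Consonant doubling: beg+ing -> begging (ignores stress, so it overapplies)
    (r"([^aeiou][aeiou])([bdgmnprt])\+(ing|ed|er)$", r"\1\2\2+\3"),
]


def generate(lexical: str) -> str:
    """Map a lexical string (stem+suffix) to a surface spelling."""
    surface = lexical.lower()
    for pattern, replacement in SPELLING_RULES:
        surface, count = re.subn(pattern, replacement, surface)
        if count:
            break  # apply at most one rule, so E deletion cannot feed doubling
    return surface.replace("+", "")


for lexical in ["cat+s", "fox+s", "fly+s", "beg+ing", "make+ing", "lie+ing"]:
    print(lexical, "->", generate(lexical))
```

Unlike the slide's derivation (fly+s → flyes → flies), this sketch rewrites y to ie in a single step. The doubling rule is deliberately crude: it would also double the n in open+ing, which is one reason real systems encode these rules as transducers with richer context.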
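
The quintuple definition of an FSA and the nominal-inflection recognizer from earlier in the deck can be sketched the same way. The state names (`q0`, `q1`, `q2`), the morpheme-class labels, and the tiny `LEXICON` below are assumptions chosen for illustration; the transition function is stored as a partial dictionary, so a missing entry simply means "reject". Spelling changes are ignored, as on the nominal-inflection slide, so the input is a list of morphemes that has already been split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    delta: dict          # transition function: (state, symbol) -> state
    start: str           # q0
    finals: frozenset    # F

    def accepts(self, symbols) -> bool:
        state = self.start
        for symbol in symbols:
            nxt = self.delta.get((state, symbol))
            if nxt is None:  # no transition defined: reject
                return False
            state = nxt
        return state in self.finals


# Hypothetical nominal-inflection automaton over morpheme classes.
NOUN_FSA = FSA(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset({"reg-noun", "irreg-sg-noun", "irreg-pl-noun", "plural-s"}),
    delta={
        ("q0", "reg-noun"): "q1",
        ("q1", "plural-s"): "q2",
        ("q0", "irreg-sg-noun"): "q2",
        ("q0", "irreg-pl-noun"): "q2",
    },
    start="q0",
    finals=frozenset({"q1", "q2"}),
)

# Toy lexicon assigning each morpheme to a class (illustrative only).
LEXICON = {
    "cat": "reg-noun", "table": "reg-noun",
    "goose": "irreg-sg-noun", "mouse": "irreg-sg-noun",
    "geese": "irreg-pl-noun", "mice": "irreg-pl-noun",
    "s": "plural-s",
}


def recognize(morphemes) -> bool:
    """Return True if the morpheme sequence is a valid noun form."""
    try:
        classes = [LEXICON[m] for m in morphemes]
    except KeyError:  # unknown morpheme
        return False
    return NOUN_FSA.accepts(classes)


print(recognize(["cat", "s"]))    # regular noun + plural
print(recognize(["goose", "s"]))  # irregular singular cannot take -s
```

Keeping the automaton over morpheme classes rather than letters is what prevents it from accepting goose+s; the letter-level expansion and the spelling rules are separate layers.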
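
The two FST operations described earlier, composition and inversion, can be illustrated on finite relations. This is only a sketch under a strong simplification: a real FST encodes a possibly infinite relation through its states and arcs, whereas here a "transducer" is just a Python set of (input, output) string pairs. The function names and the intermediate strings such as "fox+s#" are assumptions for illustration.

```python
def invert(relation):
    """Inversion: swap input and output of every pair."""
    return {(out, inp) for inp, out in relation}


def compose(first, second):
    """Composition: feed the outputs of `first` into `second`."""
    return {
        (inp, out)
        for inp, mid in first
        for mid2, out in second
        if mid == mid2
    }


# Toy relations standing in for a spelling transducer and a lexicon transducer.
spelling = {("foxes#", "fox+s#"), ("cats#", "cat+s#")}
lexicon = {("fox+s#", "fox N PL"), ("cat+s#", "cat N PL")}

parser = compose(spelling, lexicon)  # surface form -> analysis
generator = invert(parser)           # analysis -> surface form

print(sorted(parser))
print(sorted(generator))
```

Composing first means the two-step cascade is replaced by a single relation, and inverting that relation is what turns the parser into a generator.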